Process XML file via Apache Pig
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It captures all of the content between a start tag and an end tag and supplies it as a single bytearray field in a Pig tuple.
Pig is a tool for analyzing large amounts of data by representing the analysis as data flows. With the Pig Latin scripting language, operations such as ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing are easy to express. Pig is an abstraction over MapReduce: Pig scripts are internally converted into Map and Reduce tasks that do the actual work. Pig was built to make programming MapReduce applications easier. Before Pig, processing data stored on HDFS generally meant writing MapReduce jobs in Java. Pig was first built at Yahoo! and later became a top-level Apache project.
Sample input file: hadoop_books.xml
There are two ways to parse the XML in Pig:
1. Using Regular Expression
2. Using XPath
Let's discuss them one by one.
1. Using Regular Expression: Here we use XMLLoader() from the Piggybank UDF library to load the XML, so make sure the Piggybank jar is registered. We then use a regular expression to parse the XML.
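A minimal sketch of the regular-expression approach follows. The contents of hadoop_books.xml are not reproduced here, so the element names BOOK, TITLE and AUTHOR, their order inside each record, the jar location and the relation names are all assumptions made for illustration; adjust them to match the real file.

```pig
-- Sketch only: the jar path and the BOOK/TITLE/AUTHOR element names
-- are assumptions, not taken from hadoop_books.xml.
REGISTER piggybank.jar;

-- XMLLoader returns one record per <BOOK>...</BOOK> element, tags included.
books = LOAD 'hadoop_books.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('BOOK')
        AS (x:chararray);

-- Remove line breaks so that '.' in the regex can span the whole record.
one_line = FOREACH books GENERATE REPLACE(x, '[\\r\\n]', '') AS x;

-- One capture group per element we want. REGEX_EXTRACT_ALL must match the
-- entire string, and this pattern assumes TITLE appears before AUTHOR.
fields = FOREACH one_line GENERATE
         FLATTEN(REGEX_EXTRACT_ALL(x,
           '.*<TITLE>([^<]*)</TITLE>.*<AUTHOR>([^<]*)</AUTHOR>.*'))
         AS (title:chararray, author:chararray);

DUMP fields;
```

If the pattern does not match a record in full, REGEX_EXTRACT_ALL yields no fields for that record, so an empty result usually means the tag names or their order in the pattern differ from the file.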
2. Using XPath: This is the second approach to parsing XML in Pig. XPath is a Piggybank function that extracts text from XML. Starting with Pig 0.13, Piggybank ships with XPath support, which makes XML parsing in Pig scripts easier.
A sample script using XPath is shown below.
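The XPath approach can be sketched as follows. As before, the element names BOOK, TITLE and AUTHOR, the jar location and the relation names are assumptions for illustration, because the sample file's contents are not shown; the loader and the XPath UDF are the ones shipped in Piggybank.

```pig
-- Sketch only: the jar path and the BOOK/TITLE/AUTHOR element names
-- are assumptions, not taken from hadoop_books.xml.
REGISTER piggybank.jar;

-- Give the Piggybank XPath UDF a short alias.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- XMLLoader returns one record per <BOOK>...</BOOK> element.
books = LOAD 'hadoop_books.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('BOOK')
        AS (x:chararray);

-- Each XPath call returns the text of the first node matching the expression.
fields = FOREACH books GENERATE
         XPath(x, 'BOOK/TITLE')  AS title,
         XPath(x, 'BOOK/AUTHOR') AS author;

DUMP fields;
```

Compared with the regular-expression version, this does not depend on the order of the child elements and needs no line-break cleanup, since each record is parsed as XML.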
4 Comments
Process Xml files in hadoop | HadoopMinds
October 18, 2016 at 4:59 pm
[…] October 18, 2016 · by Somappa Srinivasan Process xml file via apache pig […]
rahul
June 4, 2017 at 5:55 am
It's a good blog.
admin
June 8, 2017 at 6:20 am
Thanks, Rahul, for your kind words. Please feel free to reach out to us if you need any help or a blog post on any topic.
Mukesh Kumar
June 26, 2017 at 11:41 am
vikas gupta
Computer programming Course kit
computer
955.18
2011-14-07
An indepth course kit about computer programming languages
positive
4.5
130
yash gupta
java programming
computer
1955.18
2012-13-04
An indepth course kit about java programming languages
positive
4.5
130
yashwanth
C++ programming
computer
655.28
2013-16-05
An indepth course kit about c++ programming languages
positive
3.5
95
gupta
xml scripting Course kit
scripting
255.18
2011-14-07
An indepth course kit about xml programming languages
negtive
2.5
45
Hi,
Above is the XML file. I tried to get the book id but was unable to.
Please help me.
xml_file = load '/home/training/Desktop/Bookmark.xml' using org.apache.pig.piggybank.storage.XMLLoader('book') as (x:chararray);
xml_one_line = foreach xml_file generate REPLACE(x,'[\\n]','') as x;
xml_remove_brc = foreach xml_one_line generate REGEX_EXTRACT_ALL(x,'.*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*');
dump xml_remove_brc;
()
()
()
()