Process XML file via Apache Pig
Category : Pig
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It captures all of the content between a start tag and its matching end tag and supplies it as a single bytearray field in a Pig tuple.
Pig is a tool for analyzing large amounts of data by representing the analysis as data flows. With the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be achieved easily. Pig is an abstraction over MapReduce. In other words, Pig scripts are converted internally into Map and Reduce tasks that do the actual work. Pig was built to make programming MapReduce applications easier. Before Pig, processing data stored on HDFS generally meant writing MapReduce jobs in Java. Pig was first built at Yahoo! and later became a top-level Apache project.
Sample input file: hadoop_books.xml
The XML can be parsed in Pig in two ways:
1. Using Regular Expression
2. Using XPath
Let's discuss them one by one.
1. Using Regular Expression: Here I load the XML with the XMLLoader() from Piggybank, so make sure the Piggybank jar is registered. Then I use a regular expression to parse each loaded XML record.
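The regular-expression approach can be sketched roughly as follows. This is a minimal sketch, not a definitive script: the jar path, the record tag BOOK, and the child tags TITLE and AUTHOR are assumptions about the layout of hadoop_books.xml, so adjust them to match the actual file.

```pig
-- Assumption: piggybank.jar is in the working directory; use your own path.
REGISTER piggybank.jar;

-- Load every <BOOK>...</BOOK> block as one record.
-- 'BOOK' is an assumed record tag for hadoop_books.xml.
books = LOAD 'hadoop_books.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('BOOK')
        AS (x:chararray);

-- Pull individual fields out of each record with a regular expression.
-- REGEX_EXTRACT returns the capture group at the given index (1 here).
-- TITLE and AUTHOR are assumed child tags.
parsed = FOREACH books GENERATE
         REGEX_EXTRACT(x, '<TITLE>(.*)</TITLE>', 1)   AS title,
         REGEX_EXTRACT(x, '<AUTHOR>(.*)</AUTHOR>', 1) AS author;

DUMP parsed;
```

Extracting one field per REGEX_EXTRACT call keeps each pattern short and does not depend on the order of the child elements. A single REGEX_EXTRACT_ALL pattern covering the whole record is an alternative when the element order is fixed.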
2. Using XPath: This is the second approach to parsing XML with Pig. XPath is a Piggybank UDF that extracts text from XML. Starting with Pig 0.13, Piggybank comes with XPath support, which makes XML parsing in Pig scripts easier.
A sample script using XPath is shown below.
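A minimal sketch of such a script, under the same assumptions as before: the jar path, the record tag BOOK, and the child tags TITLE and AUTHOR are illustrative guesses about hadoop_books.xml and not taken from the original file.

```pig
-- Assumption: piggybank.jar (Pig 0.13 or later) is in the working directory.
REGISTER piggybank.jar;

-- Short alias for the Piggybank XPath UDF.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Load every <BOOK>...</BOOK> block as one record ('BOOK' is an assumed tag).
books = LOAD 'hadoop_books.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('BOOK')
        AS (x:chararray);

-- Evaluate an XPath expression against each record.
-- TITLE and AUTHOR are assumed child tags.
parsed = FOREACH books GENERATE
         XPath(x, 'BOOK/TITLE')  AS title,
         XPath(x, 'BOOK/AUTHOR') AS author;

DUMP parsed;
```

Compared with the regular-expression version, the XPath expressions name the elements directly, so there are no patterns to maintain when the text inside an element changes shape.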