Process an XML file via Apache Pig


Category : Pig

If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It captures all of the content between a given start and end tag and supplies it as a single bytearray field in a Pig tuple.

Pig is a tool for analyzing large amounts of data by expressing the analysis as data flows. With the Pig Latin scripting language, operations such as ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be written easily. Pig is an abstraction over MapReduce: in MapReduce mode, Pig scripts are converted internally into Map and Reduce tasks. Pig was built to make programming MapReduce applications easier. Before Pig, processing data stored on HDFS typically meant writing MapReduce jobs in Java. Pig was first built at Yahoo! and later became a top-level Apache project.

Sample input file: hadoop_books.xml

<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>

There are two approaches to parsing an XML file in Pig:
1. Using a regular expression
2. Using XPath

Let's discuss them one by one.

1. Using a regular expression: This approach uses XMLLoader() from the Piggybank UDF library to load the XML, so make sure the Piggybank jar is registered. A regular expression then extracts the fields from each BOOK element.

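-- Register Piggybank so that XMLLoader is available.
REGISTER piggybank.jar;
-- Load each <BOOK>...</BOOK> element as one chararray record.
A = LOAD '/user/test/hadoop_books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);
-- Capture the text of each child tag and flatten the resulting tuple into fields.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<BOOK>\\s*<TITLE>(.*)</TITLE>\\s*<AUTHOR>(.*)</AUTHOR>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<COMPANY>(.*)</COMPANY>\\s*<PRICE>(.*)</PRICE>\\s*<YEAR>(.*)</YEAR>\\s*</BOOK>'));
DUMP B;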
When you run this Pig script, you will see the following output on your console:
(Hadoop Defnitive Guide,Tom White,US,CLOUDERA,24.90,2012)
(Programming Pig,Alan Gates,USA,Horton Works,30.90,2013)
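
The tuples above carry no field names, so later statements would have to refer to them by position ($0, $1, and so on). The following is a minimal sketch, assuming the same jar and input path as the script above, that names the captured groups with an AS clause and casts the numeric ones. The aliases C and D, the field names, and the price threshold of 25 are illustrative choices, not part of the original script.

```pig
REGISTER piggybank.jar;

-- Each record is one <BOOK>...</BOOK> element held in a single chararray.
A = LOAD '/user/test/hadoop_books.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);

-- Same pattern as before; the AS clause names the six captured groups.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
      '<BOOK>\\s*<TITLE>(.*)</TITLE>\\s*<AUTHOR>(.*)</AUTHOR>\\s*<COUNTRY>(.*)</COUNTRY>\\s*<COMPANY>(.*)</COMPANY>\\s*<PRICE>(.*)</PRICE>\\s*<YEAR>(.*)</YEAR>\\s*</BOOK>'))
    AS (title:chararray, author:chararray, country:chararray,
        company:chararray, price:chararray, year:chararray);

-- Cast the numeric fields so they can be compared and aggregated.
C = FOREACH B GENERATE title, author, (double)price AS price, (int)year AS year;

-- Illustrative follow-up: keep only books priced above 25.
D = FILTER C BY price > 25.0;
DUMP D;
```

Note that REGEX_EXTRACT_ALL must match the whole record, so a BOOK element whose child tags are missing or in a different order would produce an empty tuple with this pattern.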

2. Using XPath: The second approach to parsing XML in Pig uses XPath, a query language for selecting nodes and extracting text from XML documents. Starting with Pig 0.13, Piggybank ships with an XPath UDF, which simplifies XML parsing in Pig scripts.

A sample script using XPath is shown below.

REGISTER piggybank.jar;
-- Give the Piggybank XPath UDF a short alias.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD '/user/test/hadoop_books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);
-- Extract the author and price text from each BOOK record.
B = FOREACH A GENERATE XPath(x, 'BOOK/AUTHOR'), XPath(x, 'BOOK/PRICE');
DUMP B;
Output:
(Tom White,24.90)
(Alan Gates,30.90)
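
Because each XPath call returns a chararray, the extracted values can be named and cast in the same FOREACH statement. The following is a minimal sketch, assuming the same jar and input path as the script above. The field names, the alias C, and the ordering step are illustrative additions.

```pig
REGISTER piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Each record is one <BOOK>...</BOOK> element held in a single chararray.
A = LOAD '/user/test/hadoop_books.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('BOOK') AS (x:chararray);

-- One XPath call per field; PRICE is cast so it can be sorted numerically.
B = FOREACH A GENERATE
      XPath(x, 'BOOK/TITLE')  AS title,
      XPath(x, 'BOOK/AUTHOR') AS author,
      (double)XPath(x, 'BOOK/PRICE') AS price;

-- Illustrative follow-up: list the most expensive books first.
C = ORDER B BY price DESC;
DUMP C;
```

Compared with the regular expression, each field is extracted independently here, so the script does not depend on the child tags appearing in a fixed order.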

3 Comments

rahul

June 4, 2017 at 5:55 am

It's a good blog.

    admin

    June 8, 2017 at 6:20 am

    Thanks, Rahul, for your kind words. Please feel free to reach out to us if you need any help or would like a blog post on a particular topic.

      Mukesh Kumar

      June 26, 2017 at 11:41 am

      vikas gupta
      Computer programming Course kit
      computer
      955.18
      2011-14-07
      An indepth course kit about computer programming languages
      positive
      4.5
      130

      yash gupta
      java programming
      computer
      1955.18
      2012-13-04
      An indepth course kit about java programming languages
      positive
      4.5
      130

      yashwanth
      C++ programming
      computer
      655.28
      2013-16-05
      An indepth course kit about c++ programming languages
      positive
      3.5
      95

      gupta
      xml scripting Course kit
      scripting
      255.18
      2011-14-07
      An indepth course kit about xml programming languages
      negtive
      2.5
      45

      Hi

      Above is the XML file. I tried to get the book id but was unable to.
      Please help me.

      xml_file = load '/home/training/Desktop/Bookmark.xml' using org.apache.pig.piggybank.storage.XMLLoader('book') as (x:chararray);

      xml_one_line = foreach xml_file generate REPLACE(x,'[\\n]','') as x;

      xml_remove_brc = foreach xml_one_line generate REGEX_EXTRACT_ALL(x,'.*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*(?:)([^<]*).*');

      dump xml_remove_brc;
      ()
      ()
      ()
      ()
