Process hdfs data with Hive

The problem we are trying to solve through this tutorial is to find the frequency of books published each year.

Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

“ISBN”;”Book-Title”;”Book-Author”;”Year-Of-Publication”;”Publisher”;”Image-URL-S”;”Image-URL-M”;”Image-URL-L”
“0195153448”;”Classical Mythology”;”Mark P. O. Morford”;“2002“;”Oxford University Press”;”http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg”;
“http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg”;”http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg”
“0002005018”;”Clara Callan”;”Richard Bruce Wright”;“2001“;”HarperFlamingo Canada”;”http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg”;
“http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg”;”http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg”
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of Books Published each year.
Now as our data is not cleansed and might give us erroneous results due to some serialization support, we clean it by the following command
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed ‘s/&/&/g’ BX-Books.csv | sed -e ‘1d’ |sed ‘s/;/$$$/g’ | sed ‘s/”$$$”/”;”/g’ > BX-BooksCorrected.csv

 

The sed commands help us to remove the delimeters “;” (semicolon) from the content and replace them with $$$. Also, the pattern “&” is replaced with ‘&’ only. It also removes the first line (header line). If we don’t remove the header line, Hive will process it as part of the data, which it isn’t.
All the above steps are required to cleanse the data, and help hive give accurate results of our queries.

“0393045218”;”The Mummies of Urumchi;“;”E. J. W. Barber”;”1999″;”W. W. Norton & Company”; “http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg”                is changed to

“0393045218”;”The Mummies of Urumchi$$$“;”E. J. W. Barber”;”1999″; “W. W. Norton & Company”; “http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg”

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input

Running Hive using the command line:
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet
    >   (ISBN STRING,
    >   BookTitle STRING,
    >   BookAuthor STRING,
    >   YearOfPublication STRING,
    >   Publisher STRING,
    >   ImageURLS STRING,
    >   ImageURLM STRING,
    >   ImageURLL STRING)
    > COMMENT ‘BX-Books Table’
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ‘;’
    > STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH ‘/user/work/input/BX-BooksCorrected.csv’ OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

The username (“work” in our example) in the second query is dependent on the hadoop setup on your machine and the username of the hadoop setup.

Output

The output of the query is shown below:

Output-of-Hive-Query


Leave a Reply