The problem we are trying to solve through this tutorial is to find the frequency of books published each year.
Our input data set (file BX-Books.csv) is a csv file. Some sample rows:
“ISBN”;”Book-Title”;”Book-Author”;”Year-Of-Publication”;”Publisher”;”Image-URL-S”;”Image-URL-M”;”Image-URL-L”
“0195153448”;”Classical Mythology”;”Mark P. O. Morford”;“2002“;”Oxford University Press”;”http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg”;
“http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg”;”http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg”
“0002005018”;”Clara Callan”;”Richard Bruce Wright”;“2001“;”HarperFlamingo Canada”;”http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg”;
“http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg”;”http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg”
…
|
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed ‘s/&/&/g’ BX-Books.csv | sed -e ‘1d’ |sed ‘s/;/$$$/g’ | sed ‘s/”$$$”/”;”/g’ > BX-BooksCorrected.csv
|
“0393045218”;”The Mummies of Urumchi;“;”E. J. W. Barber”;”1999″;”W. W. Norton & Company”; “http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg” is changed to “0393045218”;”The Mummies of Urumchi$$$“;”E. J. W. Barber”;”1999″; “W. W. Norton & Company”; “http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg”; “http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg” |
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input
|
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet
> (ISBN STRING,
> BookTitle STRING,
> BookAuthor STRING,
> YearOfPublication STRING,
> Publisher STRING,
> ImageURLS STRING,
> ImageURLM STRING,
> ImageURLL STRING)
> COMMENT ‘BX-Books Table’
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ‘;’
> STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH ‘/user/work/input/BX-BooksCorrected.csv’ OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
|
1 Comment
Pramod
November 22, 2018 at 5:14 amHi Saurabh,
What if i want to store the result set into HDFS, i.e the output at the end of the query need to be stored in HDFS. I have tried it through the scripts but there is some junk data is coming in HDFS. Please help in putting the clean data in the HDFS.
Sample Data:
0.051730
1.0139
2.0249
3.0513
4.0832
5.03446
6.03036
7.06353
8.09479
9.06306
10.06017