hive – BigData

March 16, 2017

Insert date into hive tables shows null during select

Tags : cast(to_date(from_unixtime(unix_timestamp from_unixtime hive null values in hive ROW FORMAT DELIMITED timestamp to_date unix_timestamp

Category : Hive

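When we create a Hive table over a file (CSV or any other format) and load data into it, select queries may show NULL for the date columns. This usually happens when the date strings in the file are not in the format Hive expects for DATE (yyyy-MM-dd) or TIMESTAMP (yyyy-MM-dd HH:mm:ss) columns.

One way to solve it is to load the column as a STRING and convert it in the query, as in the following session: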

[saurkuma@m1 ~]$ ll

total 584

-rw-r--r-- 1 saurkuma saurkuma 591414 Mar 16 02:31 SalesData01.csv

[saurkuma@m1 ~]$ hive

WARNING: Use “yarn jar” to launch YARN applications.

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,file:/usr/hdp/2.3.4.0-3485/hadoop/lib/hadoop-lzo-0.6.0.2.3.4.0-3485-sources.jar!/ivysettings.xml will be used

Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties

hive> show databases;

default

demo

testhive

Time taken: 3.341 seconds, Fetched: 3 row(s)

hive> use demo;

Time taken: 1.24 seconds

hive> create table salesdata01 (Row_ID INT, Order_ID INT, Order_date String, Order_Priority STRING, Order_Quantity FLOAT, Sales FLOAT, Discount FLOAT, Shipping_Mode STRING, Profit FLOAT, Unit_Price FLOAT) row format delimited fields terminated by ',';

Time taken: 0.782 seconds

hive> select * from salesdata01;

Time taken: 0.721 seconds

hive> load data local inpath '/home/saurkuma/SalesData01.csv' overwrite into table salesdata01;

Loading data to table demo.salesdata01

Table demo.salesdata01 stats: [numFiles=1, totalSize=591414]

Time taken: 1.921 seconds

hive> select * from salesdata01 limit 10;

1 3 13-10-2010 Low 6.0 261.54 0.04 Regular Air -213.25 38.94

49 293 01-10-2012 High 49.0 10123.02 0.07 Delivery Truck 457.81 208.16

50 293 01-10-2012 High 27.0 244.57 0.01 Regular Air 46.71 8.69

80 483 10-07-2011 High 30.0 4965.7593 0.08 Regular Air 1198.97 195.99

85 515 28-08-2010 Not Specified 19.0 394.27 0.08 Regular Air 30.94 21.78

86 515 28-08-2010 Not Specified 21.0 146.69 0.05 Regular Air 4.43 6.64

97 613 17-06-2011 High 12.0 93.54 0.03 Regular Air -54.04 7.3

98 613 17-06-2011 High 22.0 905.08 0.09 Regular Air 127.7 42.76

103 643 24-03-2011 High 21.0 2781.82 0.07 Express Air -695.26 138.14

107 678 26-02-2010 Low 44.0 228.41 0.07 Regular Air -226.36 4.98

Time taken: 0.143 seconds, Fetched: 10 row(s)

hive> select * from salesdata01 where Order_date='01-10-2012' limit 10;

49 293 01-10-2012 High 49.0 10123.02 0.07 Delivery Truck 457.81 208.16

50 293 01-10-2012 High 27.0 244.57 0.01 Regular Air 46.71 8.69

3204 22980 01-10-2012 Not Specified 17.0 224.09 0.0 Regular Air -27.92 12.44

3205 22980 01-10-2012 Not Specified 10.0 56.05 0.06 Regular Air -27.73 4.98

2857 20579 01-10-2012 Medium 16.0 1434.086 0.1 Regular Air -26.25 110.99

145 929 01-10-2012 High 21.0 227.66 0.04 Regular Air -100.16 10.97

146 929 01-10-2012 High 39.0 84.33 0.04 Regular Air -64.29 2.08

859 6150 01-10-2012 Critical 38.0 191.14 0.06 Regular Air 82.65 4.98

Time taken: 0.506 seconds, Fetched: 8 row(s)

hive> select Row_ID, cast(to_date(from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'))) as date) from salesdata01 limit 10;

1 2010-10-13

49 2012-10-01

50 2012-10-01

80 2011-07-10

85 2010-08-28

86 2010-08-28

97 2011-06-17

98 2011-06-17

103 2011-03-24

107 2010-02-26

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') from salesdata01 limit 10;

1 2010-10-13

49 2012-10-01

50 2012-10-01

80 2011-07-10

85 2010-08-28

86 2010-08-28

97 2011-06-17

98 2011-06-17

103 2011-03-24

107 2010-02-26

Time taken: 0.157 seconds, Fetched: 10 row(s)

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy')) from salesdata01 limit 10;

1 2010-10-13 00:00:00

49 2012-10-01 00:00:00

50 2012-10-01 00:00:00

80 2011-07-10 00:00:00

85 2010-08-28 00:00:00

86 2010-08-28 00:00:00

97 2011-06-17 00:00:00

98 2011-06-17 00:00:00

103 2011-03-24 00:00:00

107 2010-02-26 00:00:00

Time taken: 0.09 seconds, Fetched: 10 row(s)

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'dd-MM-yyyy') from salesdata01 limit 10;

1 13-10-2010

49 01-10-2012

50 01-10-2012

80 10-07-2011

85 28-08-2010

86 28-08-2010

97 17-06-2011

98 17-06-2011

103 24-03-2011

107 26-02-2010
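The queries above all follow the same idea: parse the string with one pattern, then print it with another. As a rough illustration outside Hive, that idea can be sketched in Python. The function name reformat_date is made up for this example, and Python's %d-%m-%Y stands in for Hive's dd-MM-yyyy pattern. A value that does not match the input pattern gives None here, which is only an analogy for the NULL that Hive shows.

```python
from datetime import datetime

def reformat_date(value, in_fmt="%d-%m-%Y", out_fmt="%Y-%m-%d"):
    """Parse value with in_fmt and print it with out_fmt.

    Returns None when value does not match in_fmt (hypothetical helper,
    not part of Hive).
    """
    try:
        return datetime.strptime(value, in_fmt).strftime(out_fmt)
    except ValueError:
        return None

# Values taken from the Order_date column shown above.
print(reformat_date("13-10-2010"))
print(reformat_date("01-10-2012"))
# Already in yyyy-MM-dd, so it does not match the dd-MM-yyyy input pattern.
print(reformat_date("2010-10-13"))
```

The first two calls return the dates in yyyy-MM-dd order, and the third returns None because the input pattern does not fit.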

Another example:

Suppose you want to store values in a TIMESTAMP column in Hive, but the source file contains a mix of date-only values and full timestamps.

Sample Data:

[saurkuma@m1 ~]$ cat sample.txt

1,2015-04-15 00:00:00

2,2015-04-16 00:00:00

3,2015-04-17

hive> create table table1 (id int,tsstr string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Time taken: 0.241 seconds

hive> LOAD DATA LOCAL INPATH '/home/saurkuma/sample.txt' INTO TABLE table1;

Loading data to table demo.table1

Table demo.table1 stats: [numFiles=1, totalSize=57]

Time taken: 0.855 seconds

hive> select * from table1;

1 2015-04-15 00:00:00

2 2015-04-16 00:00:00

3 2015-04-17

Time taken: 0.097 seconds, Fetched: 3 row(s)

hive> create table table2 (id int,mytimestamp timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Time taken: 0.24 seconds

hive> INSERT INTO TABLE table2 select id,if(length(tsstr) > 10, tsstr, concat(tsstr,' 00:00:00')) from table1;

Query ID = saurkuma_20170316032711_63d9129a-38c1-4ae8-89f4-e158218d2587

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_1489644687414_0001, Tracking URL = http://m2.hdp22:8088/proxy/application_1489644687414_0001/

Kill Command = /usr/hdp/2.3.4.0-3485/hadoop/bin/hadoop job -kill job_1489644687414_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-03-16 03:27:36,290 Stage-1 map = 0%, reduce = 0%

2017-03-16 03:27:55,806 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec

MapReduce Total cumulative CPU time: 1 seconds 890 msec

Ended Job = job_1489644687414_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://TESTHA/apps/hive/warehouse/demo.db/table2/.hive-staging_hive_2017-03-16_03-27-11_740_404528501642205352-1/-ext-10000

Loading data to table demo.table2

Table demo.table2 stats: [numFiles=1, numRows=3, totalSize=66, rawDataSize=63]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.89 sec HDFS Read: 4318 HDFS Write: 133 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 890 msec

Time taken: 47.687 seconds

hive> select * from table2;

1 2015-04-15 00:00:00

2 2015-04-16 00:00:00

3 2015-04-17 00:00:00

Time taken: 0.119 seconds, Fetched: 3 row(s)
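The INSERT above relies on a simple rule: a value longer than 10 characters is taken to be a full timestamp already, and anything else gets a midnight time appended. A minimal Python sketch of that rule, with a made-up function name, looks like this. It only models the string handling and does not talk to Hive.

```python
def to_timestamp_string(value):
    """Mirror if(length(tsstr) > 10, tsstr, concat(tsstr, ' 00:00:00')).

    Hypothetical helper: date-only strings get ' 00:00:00' appended,
    longer strings are returned unchanged.
    """
    return value if len(value) > 10 else value + " 00:00:00"

# The three tsstr values from sample.txt.
rows = ["2015-04-15 00:00:00", "2015-04-16 00:00:00", "2015-04-17"]
normalized = [to_timestamp_string(row) for row in rows]
for row in normalized:
    print(row)
```

Note that the rule checks only the length, so it assumes every value is either yyyy-MM-dd or yyyy-MM-dd HH:mm:ss, as in the sample file.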

I hope this helps you solve your problem. Feel free to share your feedback or suggestions.

October 19, 2016

Map side join in Hive

Tags : hive map join mapjoin

Category : Hive

We often face a situation where some of the tables in Hive are very small, yet queries that join them take a long time.

Here I am going to explain Map side join and its advantages over the normal join operation in Hive. But before knowing about this, we should first understand the concept of ‘Join’ and what happens internally when we perform the join in Hive.

Join is a clause that combines the records of two tables (or data sets).
Assume that we have two tables A and B. When we perform a join operation on them, it returns records that combine the columns of A and B.

Mapjoin is a little-known feature of Hive. It allows a table to be loaded into memory so that a (very fast) join can be performed entirely within a mapper, without a separate shuffle and reduce phase. If your queries frequently rely on small table joins (e.g. cities or countries, etc.) you might see a very substantial speed-up from using mapjoins.

There are two ways to enable it. First is by using a hint, which looks like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style comment should be placed immediately following the SELECT. It directs Hive to load aliasname (which is a table or alias of the query) into memory.

SELECT /*+ MAPJOIN(c) */ * FROM orders o JOIN cities c ON (o.city_id = c.id);

Another (better, in my opinion) way to turn on mapjoins is to let Hive do it automatically. Simply set hive.auto.convert.join to true in your config, and Hive will automatically use mapjoins for any tables smaller than hive.mapjoin.smalltable.filesize (default is 25MB).

Assume that we have two tables of which one of them is a small table. When we submit a map reduce task, a Map Reduce local task will be created before the original join Map Reduce task which will read data of the small table from HDFS and store it into an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.

In the next stage, when the original join Map Reduce task is running, it moves the data in the hash table file to the Hadoop distributed cache, which populates these files to each mapper’s local disk. So all the mappers can load this persistent hash table file back into the memory and do the join work as before. The execution flow of the optimized map join is shown in the figure below. After optimization, the small table needs to be read just once. Also if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to this machine.

Create two sample tables, load some data into them, and then perform the map join operation.

hive> create table emp(name string,address string, deptid bigint) row format delimited fields terminated by ',';

Time taken: 20.218 seconds

hive> load data local inpath '/root/emp.txt' overwrite into table emp;

Loading data to table test.emp

Table test.emp stats: [numFiles=1, numRows=0, totalSize=56, rawDataSize=0]

Time taken: 0.713 seconds

hive> select * from emp;

Saurabh AA 1

Babu AA 2

Nach BB 2

Jeba CC 1

Abhijit DD 1

Time taken: 20.105 seconds, Fetched: 5 row(s)

hive> create table dept(deptname string, deptid bigint) row format delimited fields terminated by ',';

Time taken: 20.192 seconds

hive> load data local inpath '/root/dept.txt' overwrite into table dept;

Loading data to table test.dept

Table test.dept stats: [numFiles=1, numRows=0, totalSize=13, rawDataSize=0]

Time taken: 20.705 seconds

hive> select * from dept;

IT 1

Infra 2

Time taken: 0.081 seconds, Fetched: 2 row(s)

Without Map join:

hive> select emp.name,dept.deptname from emp join dept on emp.deptid=dept.deptid;

Query ID = root_20161018080320_198dcd9e-7e47-440f-871d-5da56522fced

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1476772397810_0003, Tracking URL = http://m1.hdp22:8088/proxy/application_1476772397810_0003/

Kill Command = /usr/hdp/2.3.0.0-2557/hadoop/bin/hadoop job -kill job_1476772397810_0003

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

2016-10-18 08:04:53,713 Stage-1 map = 0%, reduce = 0%

2016-10-18 08:05:52,017 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.08 sec

2016-10-18 08:06:39,906 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.37 sec

MapReduce Total cumulative CPU time: 4 seconds 370 msec

Ended Job = job_1476772397810_0003

MapReduce Jobs Launched:

Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 4.37 sec HDFS Read: 13044 HDFS Write: 52 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 370 msec

Abhijit IT

Jeba IT

Saurabh IT

Nach Infra

Babu Infra

Time taken: 201.293 seconds, Fetched: 5 row(s)

With Map Join :

hive> select /*+ MAPJOIN(dept) */ emp.name,dept.deptname from emp join dept on emp.deptid=dept.deptid;

Query ID = root_20161018075509_476aa0ce-704a-4e3b-91c5-c2a3444a9fd7

Total jobs = 1

WARNING: Use “yarn jar” to launch YARN applications.

Execution log at: /tmp/root/root_20161018075509_476aa0ce-704a-4e3b-91c5-c2a3444a9fd7.log

2016-10-18 07:55:37 Starting to launch local task to process map join; maximum memory = 1065484288

2016-10-18 07:55:59 Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/root/f9baf1b2-42f4-4f89-9ac8-a48f5e8b0170/hive_2016-10-18_07-55-09_410_1173820569043720345-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable

2016-10-18 07:55:59 Uploaded 1 File to: file:/tmp/root/f9baf1b2-42f4-4f89-9ac8-a48f5e8b0170/hive_2016-10-18_07-55-09_410_1173820569043720345-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (307 bytes)

2016-10-18 07:55:59 End of local task; Time Taken: 22.621 sec.

Execution completed successfully

MapredLocal task succeeded

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_1476772397810_0002, Tracking URL = http://m1.hdp22:8088/proxy/application_1476772397810_0002/

Kill Command = /usr/hdp/2.3.0.0-2557/hadoop/bin/hadoop job -kill job_1476772397810_0002

Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0

2016-10-18 07:57:09,279 Stage-3 map = 0%, reduce = 0%

2016-10-18 07:57:57,324 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec

MapReduce Total cumulative CPU time: 1 seconds 600 msec

Ended Job = job_1476772397810_0002

MapReduce Jobs Launched:

Stage-Stage-3: Map: 1 Cumulative CPU: 1.6 sec HDFS Read: 6415 HDFS Write: 52 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 600 msec

Saurabh IT

Babu Infra

Nach Infra

Jeba IT

Abhijit IT

Time taken: 169.01 seconds, Fetched: 5 row(s)
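To make the mechanism concrete, here is a small Python model of what the map join run above does with the emp and dept rows: build a hash table from the small table once, then stream the big table through it with no reduce step. This is only a sketch of the idea, not Hive's implementation, and the function names are made up for the example.

```python
# Rows copied from the select output of the two sample tables above.
dept_rows = [("IT", 1), ("Infra", 2)]
emp_rows = [
    ("Saurabh", "AA", 1),
    ("Babu", "AA", 2),
    ("Nach", "BB", 2),
    ("Jeba", "CC", 1),
    ("Abhijit", "DD", 1),
]

def build_hash_table(small_rows):
    """Local task: load the small table into an in-memory hash table keyed by deptid."""
    table = {}
    for deptname, deptid in small_rows:
        table.setdefault(deptid, []).append(deptname)
    return table

def map_side_join(big_rows, hash_table):
    """Mapper: probe the hash table for each row of the big table (inner join)."""
    for name, _address, deptid in big_rows:
        for deptname in hash_table.get(deptid, []):
            yield name, deptname

joined = list(map_side_join(emp_rows, build_hash_table(dept_rows)))
for name, deptname in joined:
    print(name, deptname)
```

Because each emp row is joined as it is read, the result comes out in the order of the emp table, which is also the order seen in the map join output above. The regular join returns the rows grouped by department because they pass through a reducer.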

I hope this helps you understand map joins.

September 8, 2016

Change default permission of hive database

Tags : 0777 default hive warehouse

Category : Hive

When you create a database or an internal table from the Hive CLI, it is created with 777 permissions by default. This is the case even if you have a umask set in HDFS. You can change it with the following steps.

1. From the command line on the Ambari server node, edit the file:

vi /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive.py

Search for hive_apps_whs_dir which should go to this block:

params.HdfsResource(params.hive_apps_whs_dir,
                    type="directory",
                    action="create_on_execute",
                    owner=params.hive_user,
                    group=params.user_group,
                    mode=0777
)

2. Modify the value for mode from 0777 to the desired permission, for example 0750. Save and close the file.

3. Restart the Ambari server to propagate the change to all nodes in the cluster:

ambari-server restart

4. From the Ambari UI, restart HiveServer2 to apply the new permission to the warehouse directory. If multiple HiveServer2 instances are configured, any one instance can be restarted.

hive> create database test2;

Time taken: 0.156 seconds

hive> dfs -ls /apps/hive/warehouse;

Found 9 items

drwxrwxrwx - hdpuser hdfs 0 2016-09-08 01:54 /apps/hive/warehouse/test.db

drwxr-xr-x - hdpuser hdfs 0 2016-09-08 02:04 /apps/hive/warehouse/test1.db

drwxr-x--- - hdpuser hdfs 0 2016-09-08 02:09 /apps/hive/warehouse/test2.db
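If you want to check which listing string a given mode value should produce before editing hive.py, the Python standard library can render it. The sketch below is just a helper for reading the listing; the function name is made up, and it uses the Python 3 octal notation (0o750), whereas the mode value in the script above is written in the older 0750 style.

```python
import stat

def dir_mode_string(mode):
    """Render an octal permission value the way a directory listing shows it."""
    return stat.filemode(stat.S_IFDIR | mode)

# The three modes seen in the warehouse listing: default, 0755 and 0750.
for mode in (0o777, 0o755, 0o750):
    print(oct(mode), dir_mode_string(mode))
```

The three strings correspond to test.db, test1.db and test2.db in the listing.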

I hope this will help you to serve your purpose.

September 2, 2016

Enable Debug mode for hive in Ambari

Tags : debug mode hive

Category : Hive

While troubleshooting, we often find that the default logger does not give much information. This post shows how to enable debug mode, either in the log files or on your console.

Case 1: To turn on debug output on the console, start Hive with the following property:
hive -hiveconf hive.root.logger=ALL,console
It logs all messages to the console.

Case 2: If you want detailed logging written to the log file for every step a user or job performs, use the following process.

Log in to the Ambari portal and click on the Hive service.
Go to Configs and search for the hive.root.logger string in the search box.
Once you find it, change the default value (hive.root.logger=INFO,DRFA) to hive.root.logger=DEBUG,DRFA.

Note: Debug logging can fill your log directory quickly, so delete old files regularly or otherwise keep the log directory under control.

July 11, 2016

Analyze your jobs running on top of Tez

Tags : execution hive tez time

Category : Bigdata

Sometimes we need to analyze our jobs in order to tune them or to prepare reports. The following method shows the running time of each step of a job on the Tez execution engine.

You can do this by setting the hive.tez.exec.print.summary=true property.

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080520_e282c377-5607-4cf4-bcda-bd7010918f9c

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

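--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.65 s
--------------------------------------------------------------------------------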

Time taken: 11.027 seconds, Fetched: 1 row(s)

hive> set hive.tez.exec.print.summary=true;

hive> select count(*) from cars_beeline;

Query ID = s0998dnz_20160711080557_28453c83-9e17-4874-852d-c5e13dd97f82

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1468229364042_0003)

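--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.36 s
--------------------------------------------------------------------------------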

Status: DAG finished successfully in 15.36 seconds

METHOD                   DURATION(ms)
parse                               2
semanticAnalyze                   130
TezBuildDag                       229
TezSubmitToRunningDag              13
TotalPrepTime                     979

VERTICES   TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
Map 1                1                0             0             10.64            9,350             299              6               1
Reducer 2            1                0             0              0.41              760               0              1               0

Time taken: 16.478 seconds, Fetched: 1 row(s)
