Category Archives: Hive


Enable Debug mode in beeline

Sometimes you have to troubleshoot a Beeline issue and wonder how to get into debug mode for the Beeline shell, the way you can with the Hive CLI (-hiveconf hive.root.logger=DEBUG,console). That option does not work with Beeline.
Don't worry, the following steps will help you, and the good part is that you do not need to restart HiveServer2.

Step 1: Log in to your server and check whether the beeline-log4j.properties file exists in /etc/hive/conf/. If it does not, copy it from the provided template.

[s0998dnz@m1.hdp22 ~]$ ll /etc/hive/conf/beeline-log4j.properties
ls: cannot access /etc/hive/conf/beeline-log4j.properties: No such file or directory
[s0998dnz@m1.hdp22 ~]$ ll /etc/hive/conf/beeline-log4j.properties.template
-rw-r--r-- 1 root root 1139 Nov 19  2014 /etc/hive/conf/beeline-log4j.properties.template
[s0998dnz@m1.hdp22 ~]$ cp /etc/hive/conf/beeline-log4j.properties.template /etc/hive/conf/beeline-log4j.properties
cp: cannot create regular file `/etc/hive/conf/beeline-log4j.properties': Permission denied
[s0998dnz@m1.hdp22 ~]$ sudo su - hive
[hive@m1.hdp22 ~]$ cp /etc/hive/conf/beeline-log4j.properties.template /etc/hive/conf/beeline-log4j.properties
[hive@m1.hdp22 ~]$ ll /etc/hive/conf/beeline-log4j.properties
-rw-r--r-- 1 hive hadoop 1139 May 31 03:34 /etc/hive/conf/beeline-log4j.properties
[hive@m1.hdp22 ~]$ cat /etc/hive/conf/beeline-log4j.properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

log4j.rootLogger=WARN, console

######## console appender ########
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n
log4j.appender.console.encoding=UTF-8
[hive@m1.hdp22 ~]$

Step 2: Now open /etc/hive/conf/beeline-log4j.properties and change log4j.rootLogger from WARN (or INFO) to DEBUG, console.
log4j.rootLogger=DEBUG, console

Save the changes, run the Beeline client, and the debug output should be displayed.
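
A quick way to confirm the change, assuming HiveServer2 is reachable at m2.hdp22:10000 as in other examples on this blog (adjust the JDBC URL for your cluster):

[s0998dnz@m1.hdp22 ~]$ beeline -u "jdbc:hive2://m2.hdp22:10000/default"

Because the console appender targets System.err, the DEBUG lines show up on the console while Beeline connects and runs statements.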

Please feel free to give your valuable feedback or suggestion.



Could not create http connection to jdbc:hive2:HTTP Response code: 413 (state=08S01,code=0)

If you are using HiveServer2 in HTTP transport mode, the authentication information is sent as part of the HTTP headers. The above error occurs when the default header buffer size is in use and is too small for the authentication payload, which is common when Kerberos is used.

This is a known issue and a bug (https://issues.apache.org/jira/browse/HIVE-11720) has been raised to be addressed in a future release.

Workaround:

To resolve this issue, set the following property to a bigger HTTP header size in the HiveServer2 configuration:
hive.server2.thrift.http.response.header.size = 32768
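
As a hedged sketch, the property could be added to hive-site.xml (or as a custom hive-site property in Ambari) like this; a HiveServer2 restart is typically needed for server-side changes to take effect:

<property>
  <name>hive.server2.thrift.http.response.header.size</name>
  <value>32768</value>
</property>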



How to remove header from csv during loading to hive

Sometimes our data file has a header row that we do not want loaded into the Hive table. If you want Hive to ignore the header, this article will help you.

[saurkuma@m1 ~]$ cat sampledata.csv

id,Name

1,Saurabh

2,Vishal

3,Jeba

4,Sonu

Step 1: Create a table with table properties to ignore it.

hive> create table test(id int,name string) row format delimited fields terminated by ',' tblproperties("skip.header.line.count"="1");

OK

Time taken: 0.233 seconds

hive> show tables;

OK

salesdata01

table1

table2

test

tmp

Time taken: 0.335 seconds, Fetched: 5 row(s)

hive> load data local inpath '/home/saurkuma/sampledata.csv' overwrite into table test;

Loading data to table demo.test

Table demo.test stats: [numFiles=1, totalSize=41]

OK

Time taken: 0.979 seconds

hive> select * from test;

OK

1 Saurabh

2 Vishal

3 Jeba

4 Sonu

Time taken: 0.111 seconds, Fetched: 4 row(s)
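
If the table already exists, the same property can also be added afterwards. A minimal sketch:

hive> alter table test set tblproperties ("skip.header.line.count"="1");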

To remove the header in Pig, filter out the header row (here the first field of the header row is the literal string 'id'):

A = LOAD 'sampledata.csv' USING PigStorage(',');
B = FILTER A BY $0 != 'id';

I hope this helped you do your job in an easy way. Please feel free to give your valuable suggestions or feedback.



Insert date into hive tables shows null during select

When we create a table on top of a file (CSV or any other format) and load data into the Hive table, we may see NULL values for some columns in SELECT queries.

[Screenshot: Hive_null_error]

You can solve it in the following ways:

[saurkuma@m1 ~]$ ll

total 584

-rw-r--r-- 1 saurkuma saurkuma 591414 Mar 16 02:31 SalesData01.csv

[saurkuma@m1 ~]$ hive

WARNING: Use "yarn jar" to launch YARN applications.

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,file:/usr/hdp/2.3.4.0-3485/hadoop/lib/hadoop-lzo-0.6.0.2.3.4.0-3485-sources.jar!/ivysettings.xml will be used

Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties

hive> show databases;

OK

default

demo

testhive

Time taken: 3.341 seconds, Fetched: 3 row(s)

hive> use demo;

OK

Time taken: 1.24 seconds

hive> create table salesdata01 (Row_ID INT, Order_ID INT, Order_date String, Order_Priority STRING, Order_Quantity FLOAT, Sales FLOAT, Discount FLOAT, Shipping_Mode STRING, Profit FLOAT, Unit_Price FLOAT) row format delimited fields terminated by ',';

OK

Time taken: 0.782 seconds

hive> select * from salesdata01;

OK

Time taken: 0.721 seconds

hive> load data local inpath '/home/saurkuma/SalesData01.csv' overwrite into table salesdata01;

Loading data to table demo.salesdata01

Table demo.salesdata01 stats: [numFiles=1, totalSize=591414]

OK

Time taken: 1.921 seconds

hive> select * from salesdata01 limit 10;

OK

1 3 13-10-2010 Low 6.0 261.54 0.04 Regular Air -213.25 38.94

49 293 01-10-2012 High 49.0 10123.02 0.07 Delivery Truck 457.81 208.16

50 293 01-10-2012 High 27.0 244.57 0.01 Regular Air 46.71 8.69

80 483 10-07-2011 High 30.0 4965.7593 0.08 Regular Air 1198.97 195.99

85 515 28-08-2010 Not Specified 19.0 394.27 0.08 Regular Air 30.94 21.78

86 515 28-08-2010 Not Specified 21.0 146.69 0.05 Regular Air 4.43 6.64

97 613 17-06-2011 High 12.0 93.54 0.03 Regular Air -54.04 7.3

98 613 17-06-2011 High 22.0 905.08 0.09 Regular Air 127.7 42.76

103 643 24-03-2011 High 21.0 2781.82 0.07 Express Air -695.26 138.14

107 678 26-02-2010 Low 44.0 228.41 0.07 Regular Air -226.36 4.98

Time taken: 0.143 seconds, Fetched: 10 row(s)

hive> select * from salesdata01 where Order_date='01-10-2012' limit 10;

OK

49 293 01-10-2012 High 49.0 10123.02 0.07 Delivery Truck 457.81 208.16

50 293 01-10-2012 High 27.0 244.57 0.01 Regular Air 46.71 8.69

3204 22980 01-10-2012 Not Specified 17.0 224.09 0.0 Regular Air -27.92 12.44

3205 22980 01-10-2012 Not Specified 10.0 56.05 0.06 Regular Air -27.73 4.98

2857 20579 01-10-2012 Medium 16.0 1434.086 0.1 Regular Air -26.25 110.99

145 929 01-10-2012 High 21.0 227.66 0.04 Regular Air -100.16 10.97

146 929 01-10-2012 High 39.0 84.33 0.04 Regular Air -64.29 2.08

859 6150 01-10-2012 Critical 38.0 191.14 0.06 Regular Air 82.65 4.98

Time taken: 0.506 seconds, Fetched: 8 row(s)

hive> select Row_ID, cast(to_date(from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'))) as date) from salesdata01 limit 10;

OK

1 2010-10-13

49 2012-10-01

50 2012-10-01

80 2011-07-10

85 2010-08-28

86 2010-08-28

97 2011-06-17

98 2011-06-17

103 2011-03-24

107 2010-02-26

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') from salesdata01 limit 10;

OK

1 2010-10-13

49 2012-10-01

50 2012-10-01

80 2011-07-10

85 2010-08-28

86 2010-08-28

97 2011-06-17

98 2011-06-17

103 2011-03-24

107 2010-02-26

Time taken: 0.157 seconds, Fetched: 10 row(s)

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy')) from salesdata01 limit 10;

OK

1 2010-10-13 00:00:00

49 2012-10-01 00:00:00

50 2012-10-01 00:00:00

80 2011-07-10 00:00:00

85 2010-08-28 00:00:00

86 2010-08-28 00:00:00

97 2011-06-17 00:00:00

98 2011-06-17 00:00:00

103 2011-03-24 00:00:00

107 2010-02-26 00:00:00

Time taken: 0.09 seconds, Fetched: 10 row(s)

hive> select Row_ID, from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'dd-MM-yyyy') from salesdata01 limit 10;

OK

1 13-10-2010

49 01-10-2012

50 01-10-2012

80 10-07-2011

85 28-08-2010

86 28-08-2010

97 17-06-2011

98 17-06-2011

103 24-03-2011

107 26-02-2010
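
To persist the converted values you could create a new table that applies the same conversion; a minimal sketch (the table name salesdata01_clean and the column list are only illustrative):

hive> create table salesdata01_clean as select Row_ID, Order_ID, from_unixtime(unix_timestamp(Order_date,'dd-MM-yyyy'),'yyyy-MM-dd') as Order_date, Sales from salesdata01;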

Another example:

Suppose you are trying to store date and timestamp values in a TIMESTAMP column in Hive, and the source file contains plain dates in some rows and full timestamps in others.

Sample Data:

[saurkuma@m1 ~]$ cat sample.txt

1,2015-04-15 00:00:00

2,2015-04-16 00:00:00

3,2015-04-17

hive> create table table1 (id int,tsstr string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

OK

Time taken: 0.241 seconds

hive> LOAD DATA LOCAL INPATH '/home/saurkuma/sample.txt' INTO TABLE table1;

Loading data to table demo.table1

Table demo.table1 stats: [numFiles=1, totalSize=57]

OK

Time taken: 0.855 seconds

hive> select * from table1;

OK

1 2015-04-15 00:00:00

2 2015-04-16 00:00:00

3 2015-04-17

Time taken: 0.097 seconds, Fetched: 3 row(s)

hive> create table table2 (id int,mytimestamp timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

OK

Time taken: 0.24 seconds

hive> INSERT INTO TABLE table2 select id,if(length(tsstr) > 10, tsstr, concat(tsstr,' 00:00:00')) from table1;

Query ID = saurkuma_20170316032711_63d9129a-38c1-4ae8-89f4-e158218d2587

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1489644687414_0001, Tracking URL = http://m2.hdp22:8088/proxy/application_1489644687414_0001/

Kill Command = /usr/hdp/2.3.4.0-3485/hadoop/bin/hadoop job  -kill job_1489644687414_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-03-16 03:27:36,290 Stage-1 map = 0%,  reduce = 0%

2017-03-16 03:27:55,806 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec

MapReduce Total cumulative CPU time: 1 seconds 890 msec

Ended Job = job_1489644687414_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://TESTHA/apps/hive/warehouse/demo.db/table2/.hive-staging_hive_2017-03-16_03-27-11_740_404528501642205352-1/-ext-10000

Loading data to table demo.table2

Table demo.table2 stats: [numFiles=1, numRows=3, totalSize=66, rawDataSize=63]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 1.89 sec   HDFS Read: 4318 HDFS Write: 133 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 890 msec

OK

Time taken: 47.687 seconds

hive> select * from table2;

OK

1 2015-04-15 00:00:00

2 2015-04-16 00:00:00

3 2015-04-17 00:00:00

Time taken: 0.119 seconds, Fetched: 3 row(s)

I hope this helped you to solve your problem and feel free to give your valuable feedback or suggestions.



Hive2 action with Oozie in kerberos Env

One of my friends was trying to run a simple Hive2 action in his Oozie workflow and was getting an error. I decided to replicate it on my cluster and, after a few retries, finally got it working.

If you have the same requirement, where you have to run Hive SQL via Oozie against a kerberized HiveServer2, then this article will help you do your job.

So there are 3 requirements for an Oozie Hive2 action against a kerberized HiveServer2:
1. The "oozie.credentials.credentialclasses" property must be defined in /etc/oozie/conf/oozie-site.xml and must include the value "hive2=org.apache.oozie.action.hadoop.Hive2Credentials" (see the oozie-site.xml sketch below).
2. workflow.xml must include a <credentials><credential>...</credential></credentials> section containing the 2 properties "hive2.server.principal" and "hive2.jdbc.url".
3. The Hive2 action must reference the credential name defined above in the "cred=" attribute of its <action> definition.
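
As a hedged sketch, the oozie-site.xml entry for requirement 1 could look like this (if the property already lists other credential classes on your cluster, append the hive2 entry rather than replacing the value):

<property>
  <name>oozie.credentials.credentialclasses</name>
  <value>hive2=org.apache.oozie.action.hadoop.Hive2Credentials</value>
</property>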

 

Step 1: First create a directory inside HDFS (under your home directory) to keep all the scripts in one place and run the job from there:

[s0998dnz@m1 hive2_action_oozie]$ hadoop fs -mkdir -p /user/s0998dnz/hive2demo/app

Step 2: Now create your workflow.xml and job.properties:

[root@m1 hive_oozie_demo]# cat workflow.xml

<workflow-app name="hive2demo" xmlns="uri:oozie:workflow:0.4">

  <global>

    <job-tracker>${jobTracker}</job-tracker>

    <name-node>${nameNode}</name-node>

  </global>

  <credentials>

    <credential name="hs2-creds" type="hive2">

      <property>

        <name>hive2.server.principal</name>

          <value>${jdbcPrincipal}</value>

      </property>

      <property>

       <name>hive2.jdbc.url</name>

         <value>${jdbcURL}</value>

      </property>

    </credential>

  </credentials>

  <start to="hive2"/>

    <action name="hive2" cred="hs2-creds">

      <hive2 xmlns="uri:oozie:hive2-action:0.1">

        <jdbc-url>${jdbcURL}</jdbc-url>

        <script>${hivescript}</script>

      </hive2>

      <ok to="End"/>

      <error to="Kill"/>

    </action>

    <kill name="Kill">

    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>

  </kill>

  <end name="End"/>

</workflow-app>

[s0998dnz@m1 hive2_action_oozie]$ cat job.properties

# Job.properties file

nameNode=hdfs://HDPINF

jobTracker=m2.hdp22:8050

exampleDir=${nameNode}/user/${user.name}/hive2demo

oozie.wf.application.path=${exampleDir}/app

oozie.use.system.libpath=true

# Hive2 action

hivescript=${oozie.wf.application.path}/hivequery.hql

outputHiveDatabase=default

jdbcURL=jdbc:hive2://m2.hdp22:10000/default

jdbcPrincipal=hive/_HOST@HADOOPADMIN.COM

Step 3: Now create your Hive script:

[s0998dnz@m1 hive2_action_oozie]$ cat hivequery.hql

show databases;

Step 4: Now upload hivequery.hql and workflow.xml to HDFS. For example:

[s0998dnz@m1 hive2_action_oozie]$ hadoop fs -put workflow.xml /user/s0998dnz/hive2demo/app/

[s0998dnz@m1 hive2_action_oozie]$ hadoop fs -put hivequery.hql /user/s0998dnz/hive2demo/app/

Step 5: Run the Oozie job with the properties file (run kinit first to acquire a Kerberos ticket if required):

[s0998dnz@m1 hive2_action_oozie]$ oozie job -oozie http://m2.hdp22:11000/oozie -config job.properties -run

job: 0000008-170221004234250-oozie-oozi-W
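
You can then check the status of the submitted workflow; for example, with the same Oozie URL:

[s0998dnz@m1 hive2_action_oozie]$ oozie job -oozie http://m2.hdp22:11000/oozie -info 0000008-170221004234250-oozie-oozi-W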

I hope this helps you run your Hive2 action in Oozie. Please feel free to give your valuable feedback or suggestions.



Some helpful Tips

1. How to run a Hive query using yesterday's date

Use from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd') in your Hive query.

For example:

select * from sample where date1=from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');

2. How to diff file(s) in HDFS

How to diff a file in HDFS and a file in the local filesystem:
diff <(hadoop fs -cat /path/to/file) /path/to/localfile

How to diff two files in HDFS:
diff <(hadoop fs -cat /path/to/file1) <(hadoop fs -cat /path/to/file2)



GC pool ‘PS MarkSweep’ had collection(s): count=6 time=26445ms

When you create a table while authorization is enforced through Ranger, the table creation can fail and, shortly afterwards, the HiveServer2 process crashes.

0: jdbc:hive2://server1> CREATE EXTERNAL TABLE test (cust_id STRING, ACCOUNT_ID STRING,
 ROLE_ID STRING, ROLE_NAME STRING, START_DATE STRING, END_DATE STRING, PRIORITY STRING, 
ACTIVE_ACCOUNT_ROLE STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE LOCATION '/tmp/testTable' 
TBLPROPERTIES ('serialization.null.format'=''); 
Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)


When you check the hiveserver2 logs you will see a permission denied error:
Caused by: org.apache.hadoop.hive.ql.security.authorization.plugin.HiveAccessControlException: 
Permission denied: user [saurkuma] does not have [READ] privilege on [hdfs://HDPHA/tmp/testTable]
at org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizer.checkPrivileges
(RangerHiveAuthorizer.java:253)

Along with the above errors, hiveserver2.log also shows repetitive GC pauses, and subsequently the
HiveServer2 service crashes:
2016-11-15 12:39:54,428 WARN [org.apache.hadoop.util.JvmPauseMonitor$Monitor@24197b13]:
util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine 
(eg GC): pause of approximately 24000ms GC pool 'PS MarkSweep' had collection(s): 
count=6 time=26445ms

Root Cause: When the process checks for a permission (read or write) on the path given in the query,
Ranger checks for permissions on that directory and all of its children. However, if the directory
does not exist, it falls back to checking the parent directory, then its parent, and so on. Eventually
the table creation fails, and at the same time this operation uses too much memory and causes GC pauses.

In this case, Ranger checks for permission on /tmp/<databasename>, and since it does not exist it
starts checking /tmp/ and all of its child directories, causing the GC pauses and the HiveServer2 service crash.

RESOLUTION:
There is no permanent solution for this issue as of now, but we have the following workaround.

WORKAROUND:
Ensure that the storage location specified in the CREATE TABLE statement already exists in HDFS.
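
For the example above, that simply means creating the location before running the CREATE EXTERNAL TABLE statement:

hadoop fs -mkdir -p /tmp/testTable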


Map side join in Hive

Category : Hive

Many times we have very small tables in Hive, yet queries that join against them still take a long time.

Here I am going to explain the map-side join and its advantages over the normal join operation in Hive. But before that, we should first understand the concept of a join and what happens internally when we perform a join in Hive.

A join is a clause that combines the records of two tables (or data sets).
Assume that we have two tables A and B. When we perform a join operation on them, it returns records that combine the columns of A and B.

Mapjoin is a little-known feature of Hive. It allows a table to be loaded into memory so that a (very fast) join could be performed entirely within a mapper without having to use a Map/Reduce step. If your queries frequently rely on small table joins (e.g. cities or countries, etc.) you might see a very substantial speed-up from using mapjoins.

There are two ways to enable it. First is by using a hint, which looks like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style comment should be placed immediately following the SELECT. It directs Hive to load aliasname (which is a table or alias of the query) into memory.

SELECT /*+ MAPJOIN(c) */ * FROM orders o JOIN cities c ON (o.city_id = c.id);

Another (better, in my opinion) way to turn on mapjoins is to let Hive do it automatically. Simply set hive.auto.convert.join to true in your config, and Hive will automatically use mapjoins for any tables smaller than hive.mapjoin.smalltable.filesize (default is 25MB).
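
For example, in a Hive session (25000000 bytes is the documented 25MB default for the small-table threshold; raise it only if your small tables are larger):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;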

Assume that we have two tables of which one of them is a small table. When we submit a map reduce task, a Map Reduce local task will be created before the original join Map Reduce task which will read data of the small table from HDFS and store it into an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.

In the next stage, when the original join Map Reduce task is running, it moves the data in the hash table file to the Hadoop distributed cache, which populates these files to each mapper’s local disk. So all the mappers can load this persistent hash table file back into the memory and do the join work as before. The execution flow of the optimized map join is shown in the figure below. After optimization, the small table needs to be read just once. Also if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to this machine.

Create two sample tables, insert some data into them, and then perform the map join operation.

hive> create table emp(name string,address string, deptid bigint) row format delimited fields terminated by ',';

OK

Time taken: 20.218 seconds

hive> load data local inpath '/root/emp.txt' overwrite into table emp;

Loading data to table test.emp

Table test.emp stats: [numFiles=1, numRows=0, totalSize=56, rawDataSize=0]

OK

Time taken: 0.713 seconds

hive> select * from emp;

OK

Saurabh AA 1

Babu AA 2

Nach BB 2

Jeba CC 1

Abhijit DD 1

Time taken: 20.105 seconds, Fetched: 5 row(s)

hive> create table dept(deptname string, deptid bigint) row format delimited fields terminated by ',';

OK

Time taken: 20.192 seconds

hive> load data local inpath '/root/dept.txt' overwrite into table dept;

Loading data to table test.dept

Table test.dept stats: [numFiles=1, numRows=0, totalSize=13, rawDataSize=0]

OK

Time taken: 20.705 seconds

hive> select * from dept;

OK

IT 1

Infra 2

Time taken: 0.081 seconds, Fetched: 2 row(s)

 

Without Map join: 

hive> select emp.name,dept.deptname from emp join dept on emp.deptid=dept.deptid;

Query ID = root_20161018080320_198dcd9e-7e47-440f-871d-5da56522fced

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1476772397810_0003, Tracking URL = http://m1.hdp22:8088/proxy/application_1476772397810_0003/

Kill Command = /usr/hdp/2.3.0.0-2557/hadoop/bin/hadoop job  -kill job_1476772397810_0003

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

2016-10-18 08:04:53,713 Stage-1 map = 0%,  reduce = 0%

2016-10-18 08:05:52,017 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.08 sec

2016-10-18 08:06:39,906 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.37 sec

MapReduce Total cumulative CPU time: 4 seconds 370 msec

Ended Job = job_1476772397810_0003

MapReduce Jobs Launched:

Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 4.37 sec   HDFS Read: 13044 HDFS Write: 52 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 370 msec

OK

Abhijit IT

Jeba IT

Saurabh IT

Nach Infra

Babu Infra

Time taken: 201.293 seconds, Fetched: 5 row(s)

 

With Map Join :

hive> select /*+ MAPJOIN(dept) */ emp.name,dept.deptname from emp join dept on emp.deptid=dept.deptid;

Query ID = root_20161018075509_476aa0ce-704a-4e3b-91c5-c2a3444a9fd7

Total jobs = 1

WARNING: Use "yarn jar" to launch YARN applications.

Execution log at: /tmp/root/root_20161018075509_476aa0ce-704a-4e3b-91c5-c2a3444a9fd7.log

2016-10-18 07:55:37 Starting to launch local task to process map join; maximum memory = 1065484288

2016-10-18 07:55:59 Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/root/f9baf1b2-42f4-4f89-9ac8-a48f5e8b0170/hive_2016-10-18_07-55-09_410_1173820569043720345-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable

2016-10-18 07:55:59 Uploaded 1 File to: file:/tmp/root/f9baf1b2-42f4-4f89-9ac8-a48f5e8b0170/hive_2016-10-18_07-55-09_410_1173820569043720345-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (307 bytes)

2016-10-18 07:55:59 End of local task; Time Taken: 22.621 sec.

Execution completed successfully

MapredLocal task succeeded

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1476772397810_0002, Tracking URL = http://m1.hdp22:8088/proxy/application_1476772397810_0002/

Kill Command = /usr/hdp/2.3.0.0-2557/hadoop/bin/hadoop job  -kill job_1476772397810_0002

Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0

2016-10-18 07:57:09,279 Stage-3 map = 0%,  reduce = 0%

2016-10-18 07:57:57,324 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.6 sec

MapReduce Total cumulative CPU time: 1 seconds 600 msec

Ended Job = job_1476772397810_0002

MapReduce Jobs Launched:

Stage-Stage-3: Map: 1   Cumulative CPU: 1.6 sec   HDFS Read: 6415 HDFS Write: 52 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 600 msec

OK

Saurabh IT

Babu Infra

Nach Infra

Jeba IT

Abhijit IT

Time taken: 169.01 seconds, Fetched: 5 row(s)

I hope this helped you understand the map-side join.



sql workbench connection to hadoop

Category : Hive

Many times we do not want to run our Hive queries through Beeline or the Hive CLI, for various reasons. I am not going to get into those reasons here, as that is a big debatable point; instead, in this article I explain the steps to connect SQL Workbench to our Hadoop cluster. I have used the Hortonworks HDP 2.3 stack.

Step 1: Download SQL Workbench from the following URL and run it. 

http://www.sql-workbench.net/downloads.html

Step 2: Download the following drivers from your Hadoop client to your local machine.

FOR HDP 2.2.x: From /usr/hdp/current/hive-client/lib/, sftp or scp the following to your local desktop:

  • hive-*-0.14*.jar
  • httpclient-4.2.5.jar
  • httpcore-4.2.5.jar
  • libthrift-0.9.0.jar
  • libfb303-0.9.0.jar
  • commons-logging-1.1.3.jar

FOR HDP 2.3.x

From /usr/hdp/current/hive-client/lib/, sftp or scp the following to your local desktop:

  • hive-jdbc.jar (you may have symlink so need to download main jar)

FOR BOTH

From /usr/hdp/current/hadoop-client

  • hadoop-common.jar (you may have symlink so need to download main jar)
  • hadoop-auth.jar

From /usr/hdp/current/hadoop-client/lib

  • log4j-1.2.17.jar
  • slf4j-*.jar

Step 3: Go to the File menu and open Manage Drivers to configure your driver:


 

Step 4: Create a new driver with any name (like Hadoop or HiveServer2), upload all the required .jar files, then click OK.


Step 5: Now go to the File menu, click on "Connection Window", select the driver you created, and provide the required parameters (JDBC URL, username, password, etc.):

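For example, a JDBC URL reusing the HiveServer2 host used elsewhere on this blog would be jdbc:hive2://m2.hdp22:10000/default; on a kerberized cluster you would also append the HiveServer2 principal, e.g. jdbc:hive2://m2.hdp22:10000/default;principal=hive/_HOST@HADOOPADMIN.COM (both values are only illustrative, use your own host and realm).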

Step 6: Now click the Test button to test the connection, then click OK to save your configuration.


Now you can run your commands or queries through SQL Workbench. I hope this helps you connect to HDP via SQL Workbench.



Hive Actions with Oozie

Category : Hive

One of my friends was trying to run a Hive .hql script in his Oozie workflow and was getting an error. I decided to replicate it on my cluster and, after a few retries, finally got it working.

If you have the same requirement, where you have to run Hive SQL via Oozie, then this article will help you do your job.

Step 1: First create a directory inside HDFS (under your home directory) to keep all the scripts in one place and run the job from there:

[hdfs@m1 ~]$ hadoop fs -mkdir -p /user/ambari-qa/tutorial/hive-oozie

[root@m1 ]# hadoop fs -mkdir -p /user/ambari-qa/tutorial/hive-input

Step 2: Now create your workflow.xml and job.properties: 

[root@m1 hive_oozie_demo]# cat workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.4" name="hive-wf">

    <start to="hive-node"/>

    <action name="hive-node">

        <hive xmlns="uri:oozie:hive-action:0.2">

        <job-tracker>${jobTracker}</job-tracker>

        <name-node>${nameNode}</name-node>

        <job-xml>hive-site.xml</job-xml>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

            </configuration>

            <script>script.hql</script>

            <param>INPUT_PATH=${inputPath}</param>

        </hive>

        <ok to="end"/>

        <error to="fail"/>

    </action>

    <kill name="fail">

        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>

    </kill>

    <end name="end"/>

</workflow-app>

[root@m1 hive_oozie_demo]# cat job.properties

nameNode=hdfs://HDPINF

jobTracker=192.168.56.42:50300

queueName=default

exampleRoot=example

oozie.use.system.libpath=true

oozie.libpath=/user/oozie/share/lib

oozie.wf.application.path=${nameNode}/user/ambari-qa/${exampleRoot}/hive-oozie

inputPath=${nameNode}/user/ambari-qa/${exampleRoot}/hive-input/*

Step 3: Now create the Hive table in your Hive database:

hive> create table demo(id int, name string);

Step 4: Now create your Hive script:

[root@m1 hive_oozie_demo]# cat script.hql

insert into test.demo select * from test.demo1;
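
Note that the script reads from test.demo1, which is not created in the steps above; it is assumed to already exist and contain data. A minimal sketch to prepare it for this test (the sample row matches the output shown at the end of this post):

hive> create table test.demo1(id int, name string);

hive> insert into test.demo1 values (1,'saurabh');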

Step 5: Now you need to set up your Oozie workflow application folder. One very important file for executing a Hive action through Oozie is hive-site.xml: when Oozie executes a Hive action, it needs Hive's configuration file. You can provide multiple configuration files in a single action. You can find your Hive configuration file at /etc/hive/conf.dist/hive-site.xml (default location). Copy that file and put it inside your workflow application path in HDFS.

[root@m1 hive_oozie_demo]# hadoop fs -put /etc/hive/conf/hive-site.xml /user/ambari-qa/tutorial/hive-oozie/

[root@m1 hive_oozie_demo]# hadoop fs -put script.hql /user/ambari-qa/tutorial/hive-oozie/

[root@m1 hive_oozie_demo]# hadoop fs -put workflow.xml /user/ambari-qa/tutorial/hive-oozie/

[root@m1 hive_oozie_demo]# hadoop fs -lsr /user/ambari-qa/tutorial/hive-oozie

lsr: DEPRECATED: Please use 'ls -R' instead.

-rw-r--r--   3 root hdfs      19542 2016-10-08 04:36 /user/ambari-qa/tutorial/hive-oozie/hive-site.xml

-rw-r--r--   3 root hdfs         65 2016-10-08 04:36 /user/ambari-qa/tutorial/hive-oozie/script.hql

-rw-r--r--   3 root hdfs        878 2016-10-08 04:38 /user/ambari-qa/tutorial/hive-oozie/workflow.xml

Look at the <job-xml> tag: since I am putting hive-site.xml in my application path, I am passing just the file name, not the whole location. If you want to keep that file in some other location in your HDFS, you can pass the full HDFS path there too. In older versions of Hive, the user had to provide the hive-default.xml file using the property key oozie.hive.defaults while running an Oozie Hive action, but from Hive 0.8 onwards it is not required anymore.

Step 6: Now you need to submit the Oozie job to run it:

[ambari-qa@m1 ~]$ oozie job -oozie http://m2.hdp22:11000/oozie -config job.properties -run

job: 0000004-161008041417432-oozie-oozi-W

Now you can check your Oozie workflow status via the Oozie web UI or the command line:

[ambari-qa@m1 ~]$ oozie job -oozie http://m2.hdp22:11000/oozie -info 0000004-161008041417432-oozie-oozi-W

Job ID : 0000004-161008041417432-oozie-oozi-W

————————————————————————————————————————————

Workflow Name : hive-wf

App Path      : hdfs://HDPINF/user/ambari-qa/tutorial/hive-oozie

Status        : SUCCEEDED

Run           : 0

User          : ambari-qa

Group         : –

Created       : 2016-10-08 11:02 GMT

Started       : 2016-10-08 11:02 GMT

Last Modified : 2016-10-08 11:02 GMT

Ended         : 2016-10-08 11:02 GMT

CoordAction ID: –

Actions

————————————————————————————————————————————

ID                                                                            Status    Ext ID                 Ext Status Err Code

————————————————————————————————————————————

0000004-161008041417432-oozie-oozi-W@:start:                                  OK                              OK                  

————————————————————————————————————————————

0000004-161008041417432-oozie-oozi-W@hive-node                                OK        job_1475917713796_0007 SUCCEEDED           

————————————————————————————————————————————

0000004-161008041417432-oozie-oozi-W@end                                      OK                              OK                  

————————————————————————————————————————————

If it succeeds, you can query your table; it should now contain the data.

hive> select * from demo;

OK

1 saurabh

Time taken: 0.328 seconds, Fetched: 1 row(s)

I hope this article will help you to run your hive sql in oozie workflow. Please feel free to reach out to me in case of any suggestion or doubt.

Common issues:

Issue 1: You may see a NameNode issue if you have hard-coded a NameNode URI in your job.properties and the NameNode you specified is in standby:

[ambari-qa@m1 ~]$ oozie job -oozie http://m2.hdp22:11000 -config job.properties -run

Error: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 1. Exception = Could not authenticate, Authentication failed, status: 404, message: Not Found

[ambari-qa@m1 ~]$ oozie job -oozie http://m2.hdp22:11000/oozie -config job.properties -run

Error: E0501 : E0501: Could not perform authorization operation, Operation category READ is not supported in state standby  at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)  at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1786)  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1305)  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3851)  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1011)  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2081)  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2077)  at java.security.AccessController.doPrivileged(Native Method)  at javax.security.auth.Subject.doAs(Subject.java:415)  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2075)

Resolution: Use your HA nameservice ID instead of a hard-coded NameNode URI in job.properties.

[root@m1 hive_oozie_demo]# cat job.properties

nameNode=hdfs://HDPINF

Issue 2: If you see an exit code [40000] error with the following message, then you need to look into your job.properties.
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if hdfs://HDPINF/apps/hive/warehouse/test.db/demo is encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://HDPINF/apps/hive/warehouse/test.db/demo, expected: hdfs://HDPINF:8020
Intercepting System.exit(40000)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [40000]

Resolution: You need to change your NameNode URI, i.e. remove the port from the end of the URI.
Please keep nameNode=hdfs://HDPINF instead of nameNode=hdfs://HDPINF:8020.

Issue 3: If you see the following error, you may need to change your Hive SQL or your approach, because this issue seems to be unresolved.

FAILED: SemanticException [Error 10028]: Line 1:17 Path is not legal ''hdfs://HDPINF:8020/user/ambari-qa/tutorial/hive-input/*'': 
Move from: hdfs://HDPINF:8020/user/ambari-qa/tutorial/hive-input/* to: hdfs://HDPINF/apps/hive/warehouse/test.db/demo is not valid.
 Please check that values for params "default.fs.name" and "hive.metastore.warehouse.dir" do not conflict.
Intercepting System.exit(10028)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10028]

Resolution : https://issues.apache.org/jira/browse/HIVE-8147