Pig script with HCatLoader on Hive ORC table
Category: Pig
Sometimes you need to run Pig statements against Hive ORC tables. This article shows how to do that with HCatLoader.
Step 1: Create a Hive ORC table:
hive> CREATE TABLE orc_table(col1 BIGINT, col2 STRING) CLUSTERED BY (col1) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC TBLPROPERTIES ('transactional'='true');
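Note: a transactional table only works if ACID support is enabled in Hive. The exact configuration depends on your cluster (this is a minimal sketch assuming Hive 1.2 on HDP; check hive-site.xml in your environment), but the settings typically involved are:
hive> SET hive.support.concurrency=true;
hive> SET hive.enforce.bucketing=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- and on the metastore side (hive-site.xml), compaction should be enabled:
-- hive.compactor.initiator.on=true
-- hive.compactor.worker.threads=1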
Step 2: Insert some data into the table:
hive> insert into orc_table values(122342,'test');
hive> insert into orc_table values(231232,'rest');
hive> select * from orc_table;
OK
122342 test
231232 rest
Time taken: 1.663 seconds, Fetched: 2 row(s)
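Because the table is transactional, each INSERT writes a new delta directory under the table's HDFS location. You can verify this from the command line; the warehouse path below is an assumption (it is the HDP default, controlled by hive.metastore.warehouse.dir):
[user1@server1 ~]$ hdfs dfs -ls /apps/hive/warehouse/test1.db/orc_table
# expect one delta_* subdirectory per insert transaction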
Step 3: Create the Pig script:
[user1@server1 ~]$ cat myscript.pig
A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
Dump A;
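HCatLoader reads the table schema from the Hive metastore, so columns can be referenced by name in later statements. A short sketch of what that looks like (the filter value and the extra aliases are only for illustration):
A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
DESCRIBE A;                      -- Hive BIGINT/STRING arrive in Pig as long/chararray
B = FILTER A BY col1 > 200000;   -- column names come from Hive in lowercase
C = FOREACH B GENERATE col2;
Dump C;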
Step 4: Run the Pig script with the -useHCatalog option, which puts the HCatalog jars on Pig's classpath:
[user1@server1 ~]$ pig -useHCatalog -f myscript.pig
WARNING: Use "yarn jar" to launch YARN applications.
16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/09/16 03:31:02 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-09-16 03:31:02,440 [main] INFO org.apache.pig.Main – Apache Pig version 0.15.0.2.3.4.0-3485 (rexported) compiled Dec 16 2015, 04:30:33
2016-09-16 03:31:02,440 [main] INFO org.apache.pig.Main – Logging error messages to: /home/user1/pig_1474011062438.log
2016-09-16 03:31:03,233 [main] INFO org.apache.pig.impl.util.Utils – Default bootup file /home/user1/.pigbootup not found
2016-09-16 03:31:03,386 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://HDPINFHA
2016-09-16 03:31:04,269 [main] INFO org.apache.pig.PigServer – Pig Script ID for the session: PIG-myscript.pig-eb253b46-2d2e-495c-9149-ef305ee4e408
2016-09-16 03:31:04,726 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/
2016-09-16 03:31:04,726 [main] INFO org.apache.pig.backend.hadoop.ATSService – Created ATS Hook
2016-09-16 03:31:05,618 [main] INFO hive.metastore – Trying to connect to metastore with URI thrift://server2:9083
2016-09-16 03:31:05,659 [main] INFO hive.metastore – Connected to metastore.
2016-09-16 03:31:06,209 [main] INFO org.apache.pig.tools.pigstats.ScriptState – Pig features used in the script: UNKNOWN
2016-09-16 03:31:06,247 [main] INFO org.apache.pig.data.SchemaTupleBackend – Key [pig.schematuple] was not set… will not generate code.
2016-09-16 03:31:06,284 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer – {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2016-09-16 03:31:06,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler – File concatenation threshold: 100 optimistic? false
2016-09-16 03:31:06,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size before optimization: 1
2016-09-16 03:31:06,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size after optimization: 1
2016-09-16 03:31:06,576 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/
2016-09-16 03:31:06,758 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState – Pig script settings are added to the job
2016-09-16 03:31:06,762 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-09-16 03:31:06,999 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – This job cannot be converted run in-process
2016-09-16 03:31:07,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-metastore-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp428549735/hive-metastore-1.2.1.2.3.4.0-3485.jar
2016-09-16 03:31:07,329 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libthrift-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp568922300/libthrift-0.9.2.jar
2016-09-16 03:31:07,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-exec-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1007595209/hive-exec-1.2.1.2.3.4.0-3485.jar
2016-09-16 03:31:07,577 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libfb303-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp-1039107423/libfb303-0.9.2.jar
2016-09-16 03:31:07,609 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-1473630461/tmp-1375931436/jdo-api-3.0.1.jar
2016-09-16 03:31:07,642 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-893657730/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar
2016-09-16 03:31:07,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1850340790/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar
2016-09-16 03:31:07,705 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp58999520/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar
2016-09-16 03:31:07,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/pig-0.15.0.2.3.4.0-3485-core-h2.jar to DistributedCache through /tmp/temp-1473630461/tmp-422634726/pig-0.15.0.2.3.4.0-3485-core-h2.jar
2016-09-16 03:31:07,808 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1473630461/tmp1167068812/automaton-1.11-8.jar
2016-09-16 03:31:07,840 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1473630461/tmp708151030/antlr-runtime-3.4.jar
2016-09-16 03:31:07,882 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Setting up single store job
2016-09-16 03:31:07,932 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 1 map-reduce job(s) waiting for submission.
2016-09-16 03:31:08,248 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader – No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2016-09-16 03:31:08,351 [JobControl] INFO org.apache.hadoop.hive.ql.log.PerfLogger – <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
2016-09-16 03:31:08,355 [JobControl] INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – ORC pushdown predicate: null
2016-09-16 03:31:08,416 [JobControl] INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – FooterCacheHitRatio: 0/0
2016-09-16 03:31:08,416 [JobControl] INFO org.apache.hadoop.hive.ql.log.PerfLogger – </PERFLOG method=OrcGetSplits start=1474011068351 end=1474011068416 duration=65 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
2016-09-16 03:31:08,421 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1
2016-09-16 03:31:08,514 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter – number of splits:1
2016-09-16 03:31:08,612 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter – Submitting tokens for job: job_1472564332053_0029
2016-09-16 03:31:08,755 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner – Job jar is not present. Not adding any jar to the list of resources.
2016-09-16 03:31:08,947 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl – Submitted application application_1472564332053_0029
2016-09-16 03:31:08,989 [JobControl] INFO org.apache.hadoop.mapreduce.Job – The url to track the job: http://server2:8088/proxy/application_1472564332053_0029/
2016-09-16 03:31:08,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – HadoopJobId: job_1472564332053_0029
2016-09-16 03:31:08,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Processing aliases A
2016-09-16 03:31:08,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – detailed locations: M: A[1,4] C: R:
2016-09-16 03:31:09,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 0% complete
2016-09-16 03:31:09,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]
2016-09-16 03:31:28,133 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 50% complete
2016-09-16 03:31:28,133 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]
2016-09-16 03:31:29,251 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/
2016-09-16 03:31:29,258 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate – Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-09-16 03:31:30,186 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.1.2.3.4.0-3485 0.15.0.2.3.4.0-3485 s0998dnz 2016-09-16 03:31:06 2016-09-16 03:31:30 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1472564332053_0029 1 0 5 5 5 5 0 0 0 0 A MAP_ONLY hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076,
Input(s):
Successfully read 2 records (28587 bytes) from: "test1.orc_table"
Output(s):
Successfully stored 2 records (32 bytes) in: "hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076"
Counters:
Total records written : 2
Total bytes written : 32
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1472564332053_0029
2016-09-16 03:31:30,822 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Success!
2016-09-16 03:31:30,825 [main] WARN org.apache.pig.data.SchemaTupleBackend – SchemaTupleBackend has already been initialized
2016-09-16 03:31:30,834 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1
2016-09-16 03:31:30,834 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1
(122342,test)
(231232,rest)
2016-09-16 03:31:30,984 [main] INFO org.apache.pig.Main – Pig script completed in 28 seconds and 694 milliseconds (28694 ms)
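To write results back into Hive instead of dumping them to the console, HCatStorer can be used. A hedged sketch: the target table must already exist in Hive before the script runs, and because HCatalog does not support storing into bucketed/transactional tables, a plain (non-transactional) ORC table is assumed here as the destination; the table name orc_table_copy, the script name and the filter value are only illustrative.
hive> CREATE TABLE orc_table_copy(col1 BIGINT, col2 STRING) STORED AS ORC;
[user1@server1 ~]$ cat myscript2.pig
A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY col1 > 200000;
STORE B INTO 'test1.orc_table_copy' USING org.apache.hive.hcatalog.pig.HCatStorer();
[user1@server1 ~]$ pig -useHCatalog -f myscript2.pig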