Run Pig Script in Nifi

Category : Nifi , Pig

NiFi can interface directly with Hive, HDFS, HBase, Flume, and Phoenix, and I can also trigger Spark and Flink through Kafka and Site-to-Site. Sometimes I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for some smart processing. You can easily add this piece to a larger pipeline or process.

Pig Setup

I like to use Ambari to install the HDP 2.5 clients on my NiFi box to have access to all the tools I may need.

Then I can just do:

$ yum install pig


ExecuteProcess

We call a shell script that wraps the Pig script.

The output of the script is stored in HDFS: hdfs dfs -ls /nifi-logs

Shell Script

$ export JAVA_HOME=/opt/jdk1.8.0_101/

$ pig -x local -l /tmp/pig.log -f /opt/demo/pigscripts/test.pig

You can run Pig in different execution modes such as local, mapreduce, and tez. You can also pass parameters to the script, as shown below.
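For example, a hedged sketch (the paths and parameter names are placeholders) of running the script in Tez mode with parameters:

$ pig -x tez -l /tmp/pig.log \
      -param INPUT=/opt/demo/logs/nifi-app.log \
      -param OUTPUT=/tmp/warns.out \
      -f /opt/demo/pigscripts/test.pig

Inside the script the values are referenced as $INPUT and $OUTPUT.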

Pig Script

messages = LOAD '/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log';

warns = FILTER messages BY $0 MATCHES '.*WARN+.*';

DUMP warns;

STORE warns INTO 'warns.out';

This is a basic example from the internet, with the NiFi 1.0 log used as the source.

As an aside, I run a daily script with the schedule 1 * * * * ? to clean up my logs.

Simply: /bin/rm -rf /opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/*2016*

PutHDFS

Hadoop Configuration: /etc/hadoop/conf/core-site.xml

Pick a directory and store away.

Result:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.3.2.5.0.0-1245 0.16.0.2.5.0.0-1245 root 2016-11-03 19:53:57 2016-11-03 19:53:59 FILTER

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_local72884441_0001 1 0 n/a n/a n/a n/a 0 0 0 0 messages,warns MAP_ONLY file:/tmp/temp1540654561/tmp-600070101,

Input(s):

Successfully read 30469 records from: "/opt/demo/HDF/centos7/tars/nifi/nifi-1.0.0.2.0.0.0-579/logs/nifi-app.log"

Output(s):

Successfully stored 1347 records in: "file:/tmp/temp1540654561/tmp-600070101"

Counters:

Total records written : 1347

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_local72884441_0001

 

ref : https://community.hortonworks.com/articles/64844/running-apache-pig-scripts-from-apache-nifi-and-st.html



Top Hadoop Interview Questions

1. What are the Side Data Distribution Techniques?

Side data refers to the extra, small, static data required by a MapReduce job. The main challenge is making the side data available on the node where a map task will run. Hadoop provides two side-data distribution techniques.

Using the Job Configuration

An arbitrary key-value pair can be set in the job configuration and read back inside the tasks; the second technique is the distributed cache, covered in question 4. A sketch of the first technique follows.
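A minimal sketch (the JAR, class, and property names are placeholders), assuming the driver uses ToolRunner/GenericOptionsParser so that -D properties land in the job configuration:

$ hadoop jar my-analytics.jar com.example.MyDriver \
      -D lookup.currency=USD \
      /data/input /data/output

Inside a mapper or reducer the value is read back with context.getConfiguration().get("lookup.currency").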

2. What is shuffling in MapReduce?

Once map tasks start to complete, the reducers begin fetching their output: each map's output is sent to the reducers that will process it, while the data nodes may still be running other tasks. This transfer of the mappers' output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of identifying which reducer instance will receive a given piece of mapper output. Before a mapper emits a (key, value) pair, it determines the reducer that will receive it; every value for a given key, no matter which mapper generated it, must end up at the same reducer.

4. What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives, and JARs, which an application can use to improve performance. The application registers the files to cache with the JobConf object, and the framework copies the cached files to the local disk of each node before any task runs there. A hedged sketch follows.
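A hedged sketch (the JAR, class, and file names are placeholders), assuming the driver uses ToolRunner/GenericOptionsParser, whose -files, -archives, and -libjars options place the listed files in the distributed cache:

$ hadoop jar my-analytics.jar com.example.JoinDriver \
      -files /local/path/lookup.txt \
      -archives /local/path/dictionaries.zip \
      /data/input /data/output

Tasks can then open lookup.txt from their working directory.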

5. What is a job tracker?

The JobTracker is a background service running on the master node that is responsible for submitting and tracking jobs; a job in Hadoop terminology means a MapReduce job. The JobTracker breaks a job into tasks, which are deployed to the data nodes holding the required data. In a Hadoop cluster the JobTracker is the master and the TaskTrackers act as its workers, executing tasks and reporting progress back to the JobTracker through heartbeats.

6. How to set which framework would be used to run map reduce program?

The mapreduce.framework.name property controls this. It can be set to:

  1. local
  2. classic
  3. yarn

7. What is replication factor for Job’s JAR?

The job's JAR is one of the most heavily used resources during task execution, so it is copied with a higher replication factor; the default is 10 (controlled by the mapred.submit.replication property).

8. mapred.job.tracker property is used for?

The mapred.job.tracker property is used by the job runner to determine the JobTracker mode. If it is set to local, the runner submits the job to an in-process job tracker running in a single JVM; otherwise the job is sent to the JobTracker at the address given in the property.

9. Difference between Job.submit() or waitForCompletion() ?

Job.submit() internally creates a submitter instance and submits the job, returning immediately, while waitForCompletion() submits the job and then polls its progress at a regular interval of one second. If the job executes successfully, it displays a success message on the console; otherwise it displays a relevant error message.

 

10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

Only the behaviour of the DROP TABLE command differentiates managed and external tables: dropping a managed table removes its data as well as its metadata, while dropping an external table removes only the metadata. Otherwise, both types of table are very similar. A hedged example follows.
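A hedged example (the table and path names are placeholders):

$ hive -e "CREATE TABLE managed_logs (id BIGINT, msg STRING);"

$ hive -e "CREATE EXTERNAL TABLE ext_logs (id BIGINT, msg STRING) LOCATION '/data/ext_logs';"

DROP TABLE removes both data and metadata for managed_logs, but only the metadata for ext_logs; the files under /data/ext_logs are left in place.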

11. Does Hive support record level Insert, delete or update?

Hive does not provide record-level update, insert, or delete, and hence does not provide transactions either. However, users can use CASE statements and Hive's built-in functions to approximate these DML operations, so a complex update query that is simple in an RDBMS may need many lines of code in Hive.

12. What kind of datawarehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) When the data is not changing rapidly.

Hive doesn't provide the crucial features required for OLTP (Online Transaction Processing); it is closer to an OLAP (Online Analytical Processing) tool. So Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in hive be written to a file?

By using awk command in shell, the output from HiveQL (Describe) can be written to a file.

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

14. CONCAT function in Hive with Example?

CONCAT function will concat the input strings. You can specify any number of strings separated by comma.

Example:

CONCAT ('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:

Hive-performs-good-in-Hadoop

So every time you have to delimit the strings with '-'. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS ('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

REPEAT function will repeat the input string n times specified in the command.

Example:

REPEAT('Hadoop',3);

Output:

HadoopHadoopHadoop.

Note: You can add a space with the input string also.

16. How Pig integrate with Mapreduce to process data?

Pig makes these jobs easier to write. A programmer writes a Pig Latin script to analyze the data sets, and the Pig compiler converts the script into MapReduce jobs. The Pig engine submits these MR jobs, MapReduce processes the data and generates the output, and the results are stored directly in HDFS rather than being returned to Pig.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.
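A hedged sketch (the path and alias names are placeholders): the EXPLAIN operator prints the logical, physical, and MapReduce plans that Pig builds for an alias.

$ cat > explain_demo.pig <<'EOF'
logs  = LOAD '/data/nifi-app.log';
warns = FILTER logs BY $0 MATCHES '.*WARN+.*';
EXPLAIN warns;
EOF

$ pig -x local -f explain_demo.pig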

18. How many ways we can run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both the local and MapReduce modes of execution; hedged examples follow.
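Hedged examples (the script name is a placeholder):

$ pig -x mapreduce wordcount.pig      # script (batch) method

$ pig -x local                        # opens the interactive Grunt shell

Embedded mode means invoking Pig from a Java application through the PigServer API.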

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • The Ctrl-E key combination moves the cursor to the end of the line.
  • Grunt remembers command history and can recall lines in the history buffer using the up and down cursor keys.
  • Grunt supports an auto-completion mechanism that tries to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution runs in a single JVM; all files are read from and written to the local host's file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster; it is the default mode.

21. What are the main difference between local mode and MapReduce mode?

Local mode:

There is no need to start or install Hadoop; the Pig scripts run on the local system, and by default Pig stores data on the local file system. The commands are identical in local and MapReduce modes, so nothing needs to change.

MapReduce Mode:

It is mandatory to start Hadoop; Pig scripts run against data stored in HDFS. In both modes, Java and Pig installations are mandatory.

22. Can we process vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

23. Does Pig support multi-line commands?

Yes

24. Hive doesn’t support multi-line commands, what about Pig?

Pig supports both single-line and multi-line comments.

Single-line comments:

DUMP B; -- it executes and displays the data, but does not store it in the file system.

Multi-line comments:

STORE B INTO '/output'; /* it stores/persists the data in HDFS or the local file system. At production level the STORE command is used most often. */

25. Difference Between Pig and SQL ?

  • Pig is procedural; SQL is declarative.
  • Pig uses a nested relational data model; SQL uses a flat relational model.
  • In Pig the schema is optional; in SQL a schema is required.
  • Pig is geared toward OLAP-style workloads; SQL supports both OLAP and OLTP.
  • Pig offers limited query optimization; SQL engines have significant opportunity for query optimization.

 



Pig script with HCatLoader on Hive ORC table

Category : Pig

Sometimes we have to run Pig commands on Hive ORC tables; this article will help you do that.

Step 1: First create hive orc table:

hive> CREATE TABLE ORC_Table(COL1 BIGINT, COL2 STRING) CLUSTERED BY (COL1) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC TBLPROPERTIES ('transactional'='true');

Step 2: Now insert data to this table:

hive> insert into orc_table values(122342,'test');

hive> insert into orc_table values(231232,'rest');

hive> select * from orc_table;

OK

122342 test

231232 rest

Time taken: 1.663 seconds, Fetched: 2 row(s)

Step 3: Now create a Pig script:

[user1@server1 ~]$ cat  myscript.pig

A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

Dump A;

Step 4: Now you have to run your pig script:

[user1@server1 ~]$ pig -useHCatalog -f myscript.pig

WARNING: Use “yarn jar” to launch YARN applications.

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

16/09/16 03:31:02 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2016-09-16 03:31:02,440 [main] INFO  org.apache.pig.Main – Apache Pig version 0.15.0.2.3.4.0-3485 (rexported) compiled Dec 16 2015, 04:30:33

2016-09-16 03:31:02,440 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/user1/pig_1474011062438.log

2016-09-16 03:31:03,233 [main] INFO  org.apache.pig.impl.util.Utils – Default bootup file /home/user1/.pigbootup not found

2016-09-16 03:31:03,386 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://HDPINFHA

2016-09-16 03:31:04,269 [main] INFO  org.apache.pig.PigServer – Pig Script ID for the session: PIG-myscript.pig-eb253b46-2d2e-495c-9149-ef305ee4e408

2016-09-16 03:31:04,726 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:04,726 [main] INFO  org.apache.pig.backend.hadoop.ATSService – Created ATS Hook

2016-09-16 03:31:05,618 [main] INFO  hive.metastore – Trying to connect to metastore with URI thrift://server2:9083

2016-09-16 03:31:05,659 [main] INFO  hive.metastore – Connected to metastore.

2016-09-16 03:31:06,209 [main] INFO  org.apache.pig.tools.pigstats.ScriptState – Pig features used in the script: UNKNOWN

2016-09-16 03:31:06,247 [main] INFO  org.apache.pig.data.SchemaTupleBackend – Key [pig.schematuple] was not set… will not generate code.

2016-09-16 03:31:06,284 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer – {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}

2016-09-16 03:31:06,384 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler – File concatenation threshold: 100 optimistic? false

2016-09-16 03:31:06,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size before optimization: 1

2016-09-16 03:31:06,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size after optimization: 1

2016-09-16 03:31:06,576 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:06,758 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState – Pig script settings are added to the job

2016-09-16 03:31:06,762 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2016-09-16 03:31:06,999 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – This job cannot be converted run in-process

2016-09-16 03:31:07,292 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-metastore-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp428549735/hive-metastore-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,329 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libthrift-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp568922300/libthrift-0.9.2.jar

2016-09-16 03:31:07,542 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-exec-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1007595209/hive-exec-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,577 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/libfb303-0.9.2.jar to DistributedCache through /tmp/temp-1473630461/tmp-1039107423/libfb303-0.9.2.jar

2016-09-16 03:31:07,609 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-1473630461/tmp-1375931436/jdo-api-3.0.1.jar

2016-09-16 03:31:07,642 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-893657730/hive-hbase-handler-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,674 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp-1850340790/hive-hcatalog-core-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,705 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/hive-hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar to DistributedCache through /tmp/temp-1473630461/tmp58999520/hive-hcatalog-pig-adapter-1.2.1.2.3.4.0-3485.jar

2016-09-16 03:31:07,775 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/pig-0.15.0.2.3.4.0-3485-core-h2.jar to DistributedCache through /tmp/temp-1473630461/tmp-422634726/pig-0.15.0.2.3.4.0-3485-core-h2.jar

2016-09-16 03:31:07,808 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1473630461/tmp1167068812/automaton-1.11-8.jar

2016-09-16 03:31:07,840 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Added jar file:/usr/hdp/2.3.4.0-3485/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-1473630461/tmp708151030/antlr-runtime-3.4.jar

2016-09-16 03:31:07,882 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Setting up single store job

2016-09-16 03:31:07,932 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 1 map-reduce job(s) waiting for submission.

2016-09-16 03:31:08,248 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader – No job jar file set.  User classes may not be found. See Job or Job#setJar(String).

2016-09-16 03:31:08,351 [JobControl] INFO  org.apache.hadoop.hive.ql.log.PerfLogger – <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

2016-09-16 03:31:08,355 [JobControl] INFO  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – ORC pushdown predicate: null

2016-09-16 03:31:08,416 [JobControl] INFO  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat – FooterCacheHitRatio: 0/0

2016-09-16 03:31:08,416 [JobControl] INFO  org.apache.hadoop.hive.ql.log.PerfLogger – </PERFLOG method=OrcGetSplits start=1474011068351 end=1474011068416 duration=65 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>

2016-09-16 03:31:08,421 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1

2016-09-16 03:31:08,514 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter – number of splits:1

2016-09-16 03:31:08,612 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter – Submitting tokens for job: job_1472564332053_0029

2016-09-16 03:31:08,755 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner – Job jar is not present. Not adding any jar to the list of resources.

2016-09-16 03:31:08,947 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl – Submitted application application_1472564332053_0029

2016-09-16 03:31:08,989 [JobControl] INFO  org.apache.hadoop.mapreduce.Job – The url to track the job: http://server2:8088/proxy/application_1472564332053_0029/

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – HadoopJobId: job_1472564332053_0029

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Processing aliases A

2016-09-16 03:31:08,990 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – detailed locations: M: A[1,4] C:  R:

2016-09-16 03:31:09,007 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 0% complete

2016-09-16 03:31:09,007 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]

2016-09-16 03:31:28,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 50% complete

2016-09-16 03:31:28,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Running jobs are [job_1472564332053_0029]

2016-09-16 03:31:29,251 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

2016-09-16 03:31:29,258 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate – Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server

2016-09-16 03:31:30,186 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://server2:8188/ws/v1/timeline/

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.1.2.3.4.0-3485 0.15.0.2.3.4.0-3485 s0998dnz 2016-09-16 03:31:06 2016-09-16 03:31:30 UNKNOWN

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_1472564332053_0029 1 0 5 5 5 5 0 0 0 0 A MAP_ONLY hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076,

Input(s):

Successfully read 2 records (28587 bytes) from: "test1.orc_table"

Output(s):

Successfully stored 2 records (32 bytes) in: "hdfs://HDPINFHA/tmp/temp-1473630461/tmp1899757076"

Counters:

Total records written : 2

Total bytes written : 32

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_1472564332053_0029

2016-09-16 03:31:30,822 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Success!

2016-09-16 03:31:30,825 [main] WARN  org.apache.pig.data.SchemaTupleBackend – SchemaTupleBackend has already been initialized

2016-09-16 03:31:30,834 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1

2016-09-16 03:31:30,834 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1

(122342,test)

(231232,rest)

2016-09-16 03:31:30,984 [main] INFO  org.apache.pig.Main – Pig script completed in 28 seconds and 694 milliseconds (28694 ms)
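If you also need to write results back into a Hive table, HCatStorer (the companion to HCatLoader) can be used. A hedged sketch (the target table name is a placeholder and must already exist with a matching schema):

$ cat > store_copy.pig <<'EOF'
A = LOAD 'test1.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY col1 > 200000;
STORE B INTO 'test1.orc_table_copy' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF

$ pig -useHCatalog -f store_copy.pig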



Run Pig script through Oozie

Category : Bigdata

If you have a requirement to read some file through Pig and you want to schedule your Pig script via Oozie, then this article will help you do the job.

Step 1: First create a directory inside HDFS (under your home directory):

$ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest

Step 2: Create your workflow.xml and job.properties:

$ vi job.properties

#*************************************************

#  job.properties

#*************************************************

nameNode=hdfs://HDPTSTHA

jobTracker=<RM_Server>:8050

queueName=default

oozie.libpath=${nameNode}/user/oozie/share/lib

oozie.use.system.libpath=true

oozie.wf.rerun.failnodes=true

examplesRoot=oozie-scripts

examplesRootDir=/user/${user.name}/${examplesRoot}

appPath=${examplesRootDir}/PigTest

oozie.wf.application.path=${appPath}

$ vi workflow.xml

<!--******************************************-->

<!--workflow.xml                              -->

<!--******************************************-->

<workflow-app name="WorkFlowForPigAction" xmlns="uri:oozie:workflow:0.1">

<start to="pigAction"/>

<action name="pigAction">

        <pig>

            <job-tracker>${jobTracker}</job-tracker>

            <name-node>${nameNode}</name-node>

            <prepare>

                <delete path="${nameNode}/${examplesRootDir}/PigTest/temp"/>

            </prepare>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

                <property>

                    <name>mapred.compress.map.output</name>

                    <value>true</value>

                </property>

            </configuration>

            <script>pig_script_file.pig</script>

          </pig>

        <ok to="end"/>

        <error to="end"/>

    </action>

<end name="end"/>

</workflow-app>

Step 3: Create your pig script :

$ cat pig_script_file.pig

lines = LOAD '/user/demouser/file.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

wordcount = FOREACH grouped GENERATE group, COUNT(words);

store wordcount into '/user/demouser/pigOut1';

-- DUMP wordcount;

Step 4: Now copy your workflow.xml and pig script to your hdfs location:

$ hadoop fs -put pig_script_file.pig /user/demouser/oozie-scripts/PigTest/

$ hadoop fs -put workflow.xml /user/demouser/oozie-scripts/PigTest/

Step 5: Now you can schedule or submit the Oozie job:

$ oozie job -oozie http://<Oozie_server>:11000/oozie -config job.properties -run
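The submit command prints a workflow ID; a hedged example of checking its status and logs with the Oozie CLI (the workflow ID is a placeholder):

$ oozie job -oozie http://<Oozie_server>:11000/oozie -info <workflow_id>

$ oozie job -oozie http://<Oozie_server>:11000/oozie -log <workflow_id>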

Now you can see your output in hdfs :

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -ls /user/demouser/pigOut1

Found 2 items

-rw-r--r--   demouser hdfs          0 2016-05-03 05:36 /user/demouser/pigOut1/_SUCCESS

-rw-r--r--   demouser hdfs         30 2016-05-03 05:36 /user/demouser/pigOut1/part-r-00000

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -cat /user/demouser/pigOut1/part-r-00000

pig 4

test 1

oozie 1

sample 1

Conclusion: I hope this article helps you. If you have any feedback or questions, please feel free to write to me in the comments.