Run a Pig script through Oozie
Category: Bigdata
If you need to read a file through Pig and want to schedule your Pig script via Oozie, this article will help you get the job done.
Step 1: Create a directory in HDFS (under your home directory) to hold the workflow files:
$ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest
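You can confirm the directory was created before moving on:
$ hadoop fs -ls /user/<user_id>/oozie-scripts/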
Step 2: Create your workflow.xml and job.properties:
$ vi job.properties
#*************************************************
# job.properties
#*************************************************
nameNode=hdfs://HDPTSTHA
jobTracker=<RM_Server>:8050
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
examplesRoot=oozie-scripts
examplesRootDir=/user/${user.name}/${examplesRoot}
appPath=${examplesRootDir}/PigTest
oozie.wf.application.path=${appPath}
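A quick note on one of these properties: oozie.wf.rerun.failnodes=true tells Oozie that when you rerun this workflow, only the failed nodes should be re-executed. A typical rerun command looks like this (the job ID is a placeholder for the ID Oozie prints at submission):
$ oozie job -oozie http://<Oozie_server>:11000/oozie -rerun <job_id> -config job.properties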
$ vi workflow.xml
<!--******************************************-->
<!-- workflow.xml -->
<!--******************************************-->
<workflow-app name="WorkFlowForPigAction" xmlns="uri:oozie:workflow:0.1">
  <start to="pigAction"/>
  <action name="pigAction">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}${examplesRootDir}/PigTest/temp"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapred.compress.map.output</name>
          <value>true</value>
        </property>
      </configuration>
      <script>pig_script_file.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
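Before uploading the workflow, you can let the Oozie client check it against the workflow schema (assuming the oozie client is installed on the node where you created the file):
$ oozie validate workflow.xml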
Step 3: Create your Pig script:
$ cat pig_script_file.pig
lines = LOAD '/user/demouser/file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
STORE wordcount INTO '/user/demouser/pigOut1';
-- DUMP wordcount;
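If you want to catch Pig syntax errors before involving Oozie, the -check flag parses the script without executing it:
$ pig -check pig_script_file.pig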
Step 4: Copy your workflow.xml and Pig script to your HDFS location:
$ hadoop fs -put pig_script_file.pig /user/demouser/oozie-scripts/PigTest/
$ hadoop fs -put workflow.xml /user/demouser/oozie-scripts/PigTest/
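Note that job.properties stays on the local filesystem; the Oozie client reads it locally at submission time. You can verify that both files landed in HDFS:
$ hadoop fs -ls /user/demouser/oozie-scripts/PigTest/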
Step 5: Now you can submit the Oozie job:
$ oozie job -oozie http://<Oozie_server>:11000/oozie -config job.properties -run
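The submit command prints a workflow ID. You can use it to monitor the job and pull its logs (the job ID below is a placeholder):
$ oozie job -oozie http://<Oozie_server>:11000/oozie -info <job_id>
$ oozie job -oozie http://<Oozie_server>:11000/oozie -log <job_id>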
Once the workflow succeeds, you can see the output in HDFS:
[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -ls /user/demouser/pigOut1
Found 2 items
-rw-r--r-- 3 demouser hdfs 0 2016-05-03 05:36 /user/demouser/pigOut1/_SUCCESS
-rw-r--r-- 3 demouser hdfs 30 2016-05-03 05:36 /user/demouser/pigOut1/part-r-00000
[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -cat /user/demouser/pigOut1/part-r-00000
pig 4
test 1
oozie 1
sample 1
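One thing to keep in mind before rerunning the workflow: the STORE in the Pig script will fail if the output directory already exists, so remove it first:
$ hadoop fs -rm -r /user/demouser/pigOut1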
Conclusion: I hope this article helps you. If you have any feedback or questions, please feel free to write to me in the comments.
1 Comment
Varun
July 5, 2018 at 10:28 am
Hi,
I am getting the following error in Oozie when I run the Pig job:
JA017: Unknown hadoop job [job_local621991076_0006] associated with action [0000005-180704170132795-oozie-hdus-W@pig-node]. Failing this action!
This is my workflow:
<action name="pig-node">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
    </configuration>
    <script>pig_script_file.pig</script>
    <param>INPUT=/${nameNode}/user/hduser1/oozie-scripts/pigtest</param>
    <param>OUTPUT=/${nameNode}/user/hduser1/${examplesRoot}/output-data/pig</param>
  </pig>
</action>
<kill name="fail">
  <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
This is my job properties:
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
queueName=default
examplesRoot=oozie-scripts
oozie.libpath=${nameNode}/user/hduser1/share/lib
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hduser1/${examplesRoot}/pigtest
Can you please look into this?