Run Pig script through Oozie

Category : Bigdata

If you need to read a file with Pig and want to run that Pig script through Oozie, this article walks you through the steps.

Step 1: Create a directory in HDFS for the workflow files, preferably under your home directory:

$ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest

Step 2: Create your workflow.xml and job.properties:

$ vi job.properties

#*************************************************

#  job.properties

#*************************************************

nameNode=hdfs://HDPTSTHA

jobTracker=<RM_Server>:8050

queueName=default

oozie.libpath=${nameNode}/user/oozie/share/lib

oozie.use.system.libpath=true

oozie.wf.rerun.failnodes=true

examplesRoot=oozie-scripts

examplesRootDir=/user/${user.name}/${examplesRoot}

appPath=${examplesRootDir}/PigTest

oozie.wf.application.path=${appPath}
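Before submitting, it can help to confirm that job.properties defines the keys the workflow relies on. The sketch below is only an illustration: check_job_properties is a hypothetical helper (not part of Oozie), and the key list assumes the workflow.xml from this article, which references nameNode, jobTracker and queueName, plus oozie.wf.application.path.

```shell
# check_job_properties FILE
# Hypothetical helper (not part of Oozie): prints each expected key that is
# absent from FILE and returns non-zero if any key is missing.
check_job_properties() {
    file="$1"
    status=0
    for key in nameNode jobTracker queueName oozie.wf.application.path; do
        # Compare the text before the first "=" exactly, so dots stay literal.
        if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file"; then
            echo "missing: $key"
            status=1
        fi
    done
    return "$status"
}
```

Usage: check_job_properties job.properties && echo "job.properties looks complete"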

$ vi workflow.xml

<!--******************************************-->

<!--workflow.xml                              -->

<!--******************************************-->

<workflow-app name="WorkFlowForPigAction" xmlns="uri:oozie:workflow:0.1">

<start to="pigAction"/>

<action name="pigAction">

        <pig>

            <job-tracker>${jobTracker}</job-tracker>

            <name-node>${nameNode}</name-node>

            <prepare>

                <delete path="${nameNode}${examplesRootDir}/PigTest/temp"/>

            </prepare>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

                <property>

                    <name>mapred.compress.map.output</name>

                    <value>true</value>

                </property>

            </configuration>

            <script>pig_script_file.pig</script>

          </pig>

        <ok to="end"/>

        <error to="end"/>

    </action>

<end name="end"/>

</workflow-app>
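A mistyped transition target (for example, a start node pointing at an action name that does not exist) is an easy mistake in a hand-written workflow.xml. The following is a rough, line-based sanity check and not a real XML parser or schema validator. check_transitions is a hypothetical helper, and it assumes the file uses plain ASCII double quotes around attribute values.

```shell
# check_transitions FILE
# Rough line-based check (not an XML parser): treats every name="..."
# attribute as a node name and every to="..." attribute as a transition
# target, then reports targets that match no node name.
check_transitions() {
    awk '
        {
            line = $0
            while (match(line, /(name|to)="[^"]*"/)) {
                attr = substr(line, RSTART, RLENGTH)
                line = substr(line, RSTART + RLENGTH)
                eq = index(attr, "=")
                key = substr(attr, 1, eq - 1)
                # Strip the surrounding double quotes from the value.
                value = substr(attr, eq + 2, length(attr) - eq - 2)
                if (key == "name") names[value] = 1
                else targets[value] = 1
            }
        }
        END {
            bad = 0
            for (t in targets)
                if (!(t in names)) { print "unknown target: " t; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

Usage: check_transitions workflow.xml && echo "all transitions point at defined nodes"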

Step 3: Create your Pig script:

$ cat pig_script_file.pig

lines = LOAD '/user/demouser/file.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

wordcount = FOREACH grouped GENERATE group, COUNT(words);

store wordcount into '/user/demouser/pigOut1';

-- DUMP wordcount;
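To get a feel for what the script computes before running it on the cluster, you can approximate the same word count locally on a small copy of the input. This is only a sketch: local_wordcount is a hypothetical helper, and it assumes TOKENIZE is used with its default delimiters (space, double quote, comma, parentheses and asterisk), as in the script above.

```shell
# local_wordcount FILE
# Rough local approximation of the Pig script: split each line on the
# default TOKENIZE delimiters, count each word, and print "word<TAB>count"
# sorted by word.
local_wordcount() {
    awk '
        {
            n = split($0, parts, /[ ",()*]+/)
            for (i = 1; i <= n; i++)
                if (parts[i] != "")
                    count[parts[i]]++
        }
        END {
            for (word in count)
                printf "%s\t%d\n", word, count[word]
        }
    ' "$1" | sort
}
```

Usage: local_wordcount file.txt, where file.txt is a local copy of the input. Unlike the Pig output, the result here is sorted by word.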

Step 4: Copy workflow.xml and the Pig script to your HDFS directory (job.properties stays on the local machine, since it is passed to the oozie command with -config):

$ hadoop fs -put pig_script_file.pig /user/demouser/oozie-scripts/PigTest/

$ hadoop fs -put workflow.xml /user/demouser/oozie-scripts/PigTest/

Step 5: Submit and run the Oozie job:

$ oozie job -oozie http://<Oozie_server>:11000/oozie -config job.properties -run
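On a successful submission the Oozie CLI prints a line of the form "job: <job-id>", and that ID can be passed to "oozie job -info" to check the workflow status. The sketch below shows one way to chain the two. extract_job_id and submit_and_check are hypothetical helper names, and the Oozie URL is whatever you used in the command above.

```shell
# extract_job_id
# Reads the output of "oozie job ... -run" on stdin and prints the job ID
# taken from the "job: <id>" line.
extract_job_id() {
    sed -n 's/^job: *//p'
}

# submit_and_check OOZIE_URL
# Hypothetical wrapper: submits job.properties from the current directory,
# then asks Oozie for the status of the job it just created.
submit_and_check() {
    oozie_url="$1"
    job_id=$(oozie job -oozie "$oozie_url" -config job.properties -run | extract_job_id)
    [ -n "$job_id" ] || { echo "submission failed" >&2; return 1; }
    oozie job -oozie "$oozie_url" -info "$job_id"
}
```

Usage: submit_and_check http://<Oozie_server>:11000/oozie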

Once the job finishes, you can see the output in HDFS:

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -ls /user/demouser/pigOut1

Found 2 items

-rw-r--r--   demouser hdfs          0 2016-05-03 05:36 /user/demouser/pigOut1/_SUCCESS

-rw-r--r--   demouser hdfs         30 2016-05-03 05:36 /user/demouser/pigOut1/part-r-00000

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -cat /user/demouser/pigOut1/part-r-00000

pig 4

test 1

oozie 1

sample 1

Conclusion: I hope this article helps you. If you have any feedback or questions, please feel free to leave a comment.

