Category Archives: YARN

Script to kill a YARN application if it is running for more than x minutes

Sometimes we get a situation where we have to list all long-running YARN applications and, based on a threshold, kill them. Sometimes we also need to do this for a specific YARN queue. In such situations the following script will help you do the job.

[root@m1.hdp22 ~]$ vi kill_application_after_some_time.sh

#!/bin/bash

if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <max_life_in_mins>"
  exit 1
fi

# List the RUNNING applications of the target queue (replace <queue_name> with your YARN queue)
yarn application -list 2>/dev/null | grep "<queue_name>" | grep RUNNING | awk '{print $1}' > job_list.txt

for jobId in `cat job_list.txt`
do

# A non-zero Finish-Time means the application has already finished, so skip it
finish_time=`yarn application -status $jobId 2>/dev/null | grep "Finish-Time" | awk '{print $NF}'`
if [ "$finish_time" -ne 0 ]; then
  echo "App $jobId is not running"
  continue
fi

# Start-Time is reported in milliseconds, so divide it by 1000 before subtracting it from the current epoch time
time_diff=`date +%s`-`yarn application -status $jobId 2>/dev/null | grep "Start-Time" | awk '{print $NF}' | sed 's!$!/1000!'`
time_diff_in_mins=`echo "($time_diff)/60" | bc`

echo "App $jobId is running for $time_diff_in_mins min(s)"

if [ "$time_diff_in_mins" -gt "$1" ]; then
  echo "Killing app $jobId"
  yarn application -kill $jobId
else
  echo "App $jobId should continue to run"
fi

done

[yarn@m1.hdp22 ~]$ ./kill_application_after_some_time.sh 30 (pass the threshold x in minutes)

App application_1487677946023_5995 is running for 0 min(s)

App application_1487677946023_5995 should continue to run
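
If you want this check to run on a schedule rather than by hand, you could call the script from cron. This is a minimal sketch, assuming the script is installed at /home/yarn/kill_application_after_some_time.sh, a 30-minute threshold, and a log file location of your choice; adjust the paths and interval to your environment.

# Hypothetical crontab entry for the yarn user: check every 10 minutes and append the output to a log
*/10 * * * * /home/yarn/kill_application_after_some_time.sh 30 >> /var/log/kill_long_running_apps.log 2>&1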

I hope this helps, but please feel free to share your valuable feedback or suggestions.


Process XML files via MapReduce

When you have a requirement to process data in Hadoop that does not match one of the default input formats, this article will help you. Hadoop ships with default input formats such as TextInputFormat, NLineInputFormat and KeyValueTextInputFormat; when you get a different type of file to process, you have to create your own custom input format for the MapReduce job. Here I am going to show you how to process XML files in a MapReduce job by creating a custom XmlInputFormat (an XML input format for Hadoop).

For example, if you have the following XML input file and want to process it, you can do so with the following steps.

<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>

 

Step 1: Create XmlInputFormat.java

package xmlparsing.demo;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlInputFormat extends TextInputFormat {

    // Default record delimiters; they match the <BOOK> elements of the sample catalog above
    public static final String START_TAG_KEY = "<BOOK>";
    public static final String END_TAG_KEY = "</BOOK>";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new XmlRecordReader();
    }

    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private DataOutputBuffer buffer = new DataOutputBuffer();
        private LongWritable key = new LongWritable();
        private Text value = new Text();

        @Override
        public void initialize(InputSplit is, TaskAttemptContext tac)
                throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) is;

            // The driver can override the tags via the job configuration (START_TAG_KEY / END_TAG_KEY)
            startTag = tac.getConfiguration().get("START_TAG_KEY", START_TAG_KEY).getBytes("utf-8");
            endTag = tac.getConfiguration().get("END_TAG_KEY", END_TAG_KEY).getBytes("utf-8");

            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path file = fileSplit.getPath();

            FileSystem fs = file.getFileSystem(tac.getConfiguration());
            fsin = fs.open(fileSplit.getPath());
            fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Look for the next start tag within this split, then copy everything up to the matching end tag
            if (fsin.getPos() < end) {
                if (readUntilMatch(startTag, false)) {
                    try {
                        buffer.write(startTag);
                        if (readUntilMatch(endTag, true)) {
                            value.set(buffer.getData(), 0, buffer.getLength());
                            key.set(fsin.getPos());
                            return true;
                        }
                    } finally {
                        buffer.reset();
                    }
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return (fsin.getPos() - start) / (float) (end - start);
        }

        @Override
        public void close() throws IOException {
            fsin.close();
        }

        private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();

                // End of file: no more records
                if (b == -1)
                    return false;

                // While inside a record, keep every byte we read
                if (withinBlock)
                    buffer.write(b);

                // Check whether we are still matching the tag byte by byte
                if (b == match[i]) {
                    i++;
                    if (i >= match.length)
                        return true;
                } else
                    i = 0;

                // Stop looking for a start tag once we have passed the end of the split
                if (!withinBlock && i == 0 && fsin.getPos() >= end)
                    return false;
            }
        }
    }
}

Step 2: Create the driver XMLDriver.java

package xmlparsing.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class XMLDriver {

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            String[] arg = new GenericOptionsParser(conf, args).getRemainingArgs();

            // Tell XmlInputFormat which element marks one record in the sample catalog
            conf.set("START_TAG_KEY", "<BOOK>");
            conf.set("END_TAG_KEY", "</BOOK>");

            Job job = Job.getInstance(conf, "XML Processing");
            job.setJarByClass(XMLDriver.class);
            job.setMapperClass(MyMapper.class);

            // Map-only job: the mapper writes the parsed records directly
            job.setNumReduceTasks(0);
            job.setInputFormatClass(XmlInputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path(arg[0]));
            FileOutputFormat.setOutputPath(job, new Path(arg[1]));

            job.waitForCompletion(true);
        } catch (Exception e) {
            System.err.println("Driver Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Step 3: Create the mapper MyMapper.java

package xmlparsing.demo;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final Log LOG = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Each input value is one complete <BOOK>...</BOOK> fragment produced by XmlInputFormat
            InputStream is = new ByteArrayInputStream(value.toString().getBytes());
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(is);

            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("BOOK");

            for (int temp = 0; temp < nList.getLength(); temp++) {
                Node nNode = nList.item(temp);

                if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element eElement = (Element) nNode;

                    // Pull the fields we want out of the parsed BOOK element
                    String title = eElement.getElementsByTagName("TITLE").item(0).getTextContent();
                    String author = eElement.getElementsByTagName("AUTHOR").item(0).getTextContent();
                    String price = eElement.getElementsByTagName("PRICE").item(0).getTextContent();

                    // Emit one comma-separated record per book; no reducer is needed
                    context.write(new Text(title + "," + author + "," + price), NullWritable.get());
                }
            }
        } catch (Exception e) {
            LOG.error("Mapper Error: " + e.getMessage(), e);
        }
    }
}
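
To try the job end to end, you can package the three classes into a jar and run it against the sample file on HDFS. The sketch below is only an outline; the jar name, file names, and HDFS paths are assumptions, so adjust them to your environment.

# Put the sample catalog on HDFS (file name and paths are placeholders)
hdfs dfs -mkdir -p /user/hdfs/xml_input
hdfs dfs -put catalog.xml /user/hdfs/xml_input/

# Run the map-only job; the output directory must not exist yet
hadoop jar xmlparsing-demo.jar xmlparsing.demo.XMLDriver /user/hdfs/xml_input /user/hdfs/xml_output

# Inspect the parsed records (one comma-separated line per BOOK element)
hdfs dfs -cat /user/hdfs/xml_output/part-m-*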

ref: thinkbigdataanalytics.com

Update your Capacity Scheduler through the REST API

Sometimes you want to change your Capacity Scheduler through the REST API, or you have a requirement to change your Capacity Scheduler configuration frequently via a script. In that case this article will help you do your work.

You can achieve this with the following command.

[root@sandbox conf.server]# curl -v -u admin:admin -H "Content-Type: application/json" -H "X-Requested-By: ambari" -X PUT http://172.16.162.133:8080/api/v1/views/CAPACITY-SCHEDULER/versions/0.3.0/instances/CS_1/resources/scheduler/configuration --data '{
  "Clusters": {
    "desired_config": [
      {
        "type": "capacity-scheduler",
        "tag": "version14534007568115",
        "service_config_version_note": "To test",
        "properties": {
          "yarn.scheduler.capacity.maximum-am-resource-percent": 0.2,
          "yarn.scheduler.capacity.maximum-applications": 10000,
          "yarn.scheduler.capacity.node-locality-delay": 40,
          "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator",
          "yarn.scheduler.capacity.queue-mappings-override.enable": false,
          "yarn.scheduler.capacity.root.acl_administer_queue": "*",
          "yarn.scheduler.capacity.root.capacity": 100,
          "yarn.scheduler.capacity.root.queues": "default",
          "yarn.scheduler.capacity.root.accessible-node-labels": "*",
          "yarn.scheduler.capacity.root.default.acl_submit_applications": "*",
          "yarn.scheduler.capacity.root.default.maximum-capacity": 100,
          "yarn.scheduler.capacity.root.default.user-limit-factor": 0.5,
          "yarn.scheduler.capacity.root.default.state": "RUNNING",
          "yarn.scheduler.capacity.root.default.capacity": 100
        }
      }
    ]
  }
}'

Note: In the above command you have to change the following parameters.

  1. The Ambari server hostname or IP address
  2. The version of your Capacity Scheduler view (you can find it in the web URL when visiting the view)
  3. The name of your view instance (you can find it in the web URL when visiting the view)
After applying the change you can refresh your queues with the following command.

yarn rmadmin -refreshQueues
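
To verify that the new settings are active, you can also read the scheduler information back from the ResourceManager REST API. A minimal sketch, assuming the ResourceManager web UI listens on its default port 8088 (the hostname is a placeholder):

# Dump the current scheduler and queue configuration as JSON from the ResourceManager
curl -s "http://<resourcemanager-host>:8088/ws/v1/cluster/scheduler"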

I hope this will help you to manage your Capacity Scheduler very easily.