Bigdata – BigData

July 19, 2021

Does file format matter in big data technology?

Category : Bigdata

Yes, it matters a lot. By choosing the right file format for your use case, you can achieve the following:

1. Less storage:
A suitable file format combined with a compatible compression technique requires less storage.

2. Faster processing of data:
Choosing the right layout for the use case gives better performance when processing the data: row-based formats suit workloads that read or write whole records, while column-based formats suit analytical queries that read only a few columns.

3. Reduced disk I/O cost:
When processing is efficient and the compression method is well chosen, less data is read from and written to disk, so I/O cost drops as well.
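The storage point can be illustrated with a minimal sketch. It uses plain gzip on a small CSV file only because both are available on almost any machine; the file name, the row contents and the row count are hypothetical, and real savings depend on the data and on the format and codec you actually use.

```shell
#!/bin/bash
# Hypothetical demo: build a small row-based CSV and compare raw vs gzip size.
workdir=$(mktemp -d)
raw="$workdir/events.csv"

# 1000 rows with highly repetitive values, similar to typical log data.
for i in $(seq 1 1000); do
  echo "$i,2021-07-19,INFO,user_$((i % 10))"
done > "$raw"

# Compress a copy and keep the original for comparison.
gzip -c "$raw" > "$raw.gz"

# $(( ... )) strips any padding that wc may add around the number.
raw_bytes=$(( $(wc -c < "$raw") ))
gz_bytes=$(( $(wc -c < "$raw.gz") ))
echo "raw=$raw_bytes bytes, gzip=$gz_bytes bytes"

rm -rf "$workdir"
```

Keep in mind that a gzip-compressed text file cannot be split across parallel tasks, so a smaller file is not automatically a faster one to process.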

There are also several factors to consider when selecting a file format for your use case:
• whether the file is splittable
• schema evolution support
• predicate pushdown / filter pushdown
• compression technique
• row-based or column-based layout
• support for serialization/deserialization
• support for metadata
• whether the file format is supported by the source and target systems
• support for column types
• ingestion and latency requirements

February 23, 2017

Script to kill a YARN application if it runs for more than x minutes

Tags : kill_application yarn application -list

Category : Bigdata , HDFS , YARN

Sometimes we need to list all long-running applications and kill those that have exceeded a time threshold. Sometimes we also need to do this for a specific YARN queue only. In such situations the following script will do the job.

[root@m1.hdp22~]$ vi kill_application_after_some_time.sh

#!/bin/bash

# Usage: kill_application_after_some_time.sh <max_life_in_mins>
if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <max_life_in_mins>"
  exit 1
fi

# Collect the IDs of all RUNNING applications in the given queue.
yarn application -list 2>/dev/null | grep "<queue_name>" | grep "RUNNING" | awk '{print $1}' > job_list.txt

for jobId in `cat job_list.txt`
do
  finish_time=`yarn application -status "$jobId" 2>/dev/null | grep "Finish-Time" | awk '{print $NF}'`

  # A non-zero Finish-Time means the application has already ended; skip it.
  if [ $finish_time -ne 0 ]; then
    echo "App $jobId is not running"
    continue
  fi

  # Start-Time is reported in milliseconds, so divide it by 1000 to get seconds.
  time_diff=`date +%s`-`yarn application -status "$jobId" 2>/dev/null | grep "Start-Time" | awk '{print $NF}' | sed 's!$!/1000!'`
  time_diff_in_mins=`echo "("$time_diff")/60" | bc`
  echo "App $jobId is running for $time_diff_in_mins min(s)"

  if [ $time_diff_in_mins -gt "$1" ]; then
    echo "Killing app $jobId"
    yarn application -kill "$jobId"
  else
    echo "App $jobId should continue to run"
  fi
done

[yarn@m1.hdp22 ~]$ ./kill_application_after_some_time.sh 30   # pass the threshold x in minutes

App application_1487677946023_5995 is running for 0 min(s)

App application_1487677946023_5995 should continue to run
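The elapsed-time arithmetic in the script can be tried without a cluster by isolating it in small functions. This is only a sketch: the function names `elapsed_minutes` and `exceeds_threshold` are hypothetical helpers, not part of the script above, and they assume that Start-Time is an epoch timestamp in milliseconds, as the division by 1000 in the script implies.

```shell
#!/bin/bash
# elapsed_minutes START_MS [NOW_S]
#   START_MS: application start time in epoch milliseconds
#   NOW_S:    current time in epoch seconds (defaults to `date +%s`)
# Prints the whole number of minutes elapsed since the start time.
elapsed_minutes() {
  local start_ms=$1
  local now_s=${2:-$(date +%s)}
  echo $(( (now_s - start_ms / 1000) / 60 ))
}

# exceeds_threshold START_MS MAX_MINS [NOW_S]
# Succeeds (exit status 0) only if the application has been running
# for strictly more than MAX_MINS minutes.
exceeds_threshold() {
  [ "$(elapsed_minutes "$1" "$3")" -gt "$2" ]
}

# Example: an application started at 1487677946023 ms, checked 1800 seconds later.
elapsed_minutes 1487677946023 1487679746
```

The optional second argument makes the calculation repeatable: with a fixed "now", the same input always gives the same answer, which is convenient when checking the threshold logic before pointing the script at real applications.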

I hope this helps. Please feel free to share your feedback or suggestions.

November 29, 2016

Find CPU-intensive processes on your server

Tags : 100% CPU CPU

Category : Bigdata

When you start using your cluster heavily, you may see 100% CPU utilization on a specific server. Because many jobs and processes are running on that server at the time, it can be very hard to identify the culprit process that is causing the issue. It is like finding a needle in a haystack.

I have faced this scenario in my own work, so I created the following script. It helps you find the culprit, and you can then kill the process or deal with it however you want. All you have to do is schedule the script in cron.

[hdfs@m1.hdp22 ~]$ cat cpu_Usage.sh

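#!/bin/bash

dateTime=$(date +"%Y-%m-%d")

# Take 20 samples, 10 seconds apart, of the most CPU-hungry processes.
for (( i=1; i <= 20; i++ ))
do
  # Sort numerically on the %CPU column, highest first, and keep the top 5 lines.
  ps -eo pcpu,pid,user,start,etime,args | sort -k1,1nr | head -5 >> /hdptmp/Metrics/CPU_Usage_$dateTime.log
  sleep 10
done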

Schedule the script in cron as shown below:

[hdfs@m1.hdp22 ~]$ crontab -l

##CPU issue script

20 11 * * * /home/hdfs/cpu_Usage.sh >>/hdptmp/error.log 2>&1

You will get an output file like the one below:

[hdfs@m1.hdp22 ~]$ cat /hdptmp/Metrics/CPU_Usage_2016-08-30.log
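
Once the log holds a few samples, it can help to count how often each process shows up near the top. The following is a minimal sketch, assuming the log lines keep the column order of the ps command above (%CPU first, then PID, then user). The function name `summarize_cpu_log` and the sample lines are hypothetical and are not real output from the cluster.

```shell
#!/bin/bash
# summarize_cpu_log LOGFILE
# Prints one line per process: "samples max_cpu pid user", busiest first.
summarize_cpu_log() {
  awk '$1 ~ /^[0-9.]+$/ {            # skip header lines such as "%CPU PID ..."
         key = $2 " " $3             # group by PID and user
         count[key]++
         if ($1 + 0 > max[key]) max[key] = $1 + 0
       }
       END {
         for (k in count) print count[k], max[k], k
       }' "$1" | sort -k1,1nr -k2,2nr
}

# Hypothetical sample in the same column order as the ps command.
sample=$(mktemp)
cat > "$sample" <<'EOF'
%CPU   PID USER      STARTED     ELAPSED COMMAND
98.5 12345 yarn     11:20:01       00:05 java -Xmx4g ExampleJob
 3.2   987 hdfs     09:00:00    02:20:06 java DataNode
99.1 12345 yarn     11:20:01       00:15 java -Xmx4g ExampleJob
 1.0   987 hdfs     09:00:00    02:20:16 java DataNode
97.0 12345 yarn     11:20:01       00:25 java -Xmx4g ExampleJob
EOF

summarize_cpu_log "$sample"
```

For this sample the first output line is `3 99.1 12345 yarn`: PID 12345 appeared in three samples with a peak of 99.1% CPU, which makes it the first candidate to investigate.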