Category Archives: Bigdata


Does file format matter in big data technology?

Category : Bigdata

Yes, it matters a lot. By choosing the right file format for your use case you can achieve the following:

1. Less storage:
A suitable file format combined with a compatible compression technique requires less storage.

2. Faster processing of data:
Choosing the right layout for the use case (row-based or column-based) gives better performance while processing the data.

3. Reduced disk I/O cost:
When less data has to be read and written, thanks to efficient processing and good compression, the I/O cost goes down as well.

There are also several factors to consider when selecting a file format for your use case:
• whether the file is splittable
• schema evolution support
• predicate pushdown / filter pushdown
• compression technique
• row-based or column-based layout
• support for serialization/deserialization
• support for metadata
• whether the file format is supported by the source and target systems
• support for column types
• ingestion and latency requirements
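
To get a rough feel for the storage point, the sketch below compares the raw and gzip-compressed size of a small delimited text file. It only illustrates compression on plain text; it says nothing about columnar formats. The function names and the sample data are made up for this example.

```shell
# Sketch: compare raw vs gzip-compressed size of a delimited text file.
# Function names and sample rows are hypothetical, for illustration only.

# make_sample FILE ROWS: writes ROWS lines of repetitive CSV data to FILE.
make_sample() {
  file="$1"
  rows="$2"
  i=1
  : > "$file"
  while [ "$i" -le "$rows" ]; do
    echo "$i,user_$((i % 10)),2016-08-30,ACTIVE" >> "$file"
    i=$((i + 1))
  done
}

# compare_sizes FILE: prints "<raw_bytes> <gzip_bytes>" for FILE.
compare_sizes() {
  # tr strips the padding some wc implementations add to their output
  raw=$(wc -c < "$1" | tr -d ' ')
  gz=$(gzip -c "$1" | wc -c | tr -d ' ')
  echo "$raw $gz"
}
```

For example, `make_sample /tmp/sample.csv 1000` followed by `compare_sizes /tmp/sample.csv` prints the two byte counts side by side.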



script to kill yarn application if it is running more than x mins

Sometimes we need to list all long-running YARN applications and kill those that have exceeded a threshold, and sometimes we need to do this only for a specific YARN queue. In such situations the following script will do the job.

[root@m1.hdp22~]$ vi


#!/bin/bash
# Kill RUNNING YARN applications in a queue that have been running longer than <max_life_in_mins>.

if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <max_life_in_mins>"
  exit 1
fi

# Collect the IDs of RUNNING applications (replace <queue_name> with your queue).
yarn application -list 2>/dev/null | grep "<queue_name>" | grep RUNNING | awk '{print $1}' > job_list.txt

for jobId in `cat job_list.txt`
do
  finish_time=`yarn application -status $jobId 2>/dev/null | grep Finish-Time | awk '{print $NF}'`
  if [ $finish_time -ne 0 ]; then
    echo "App $jobId is not running"
    exit 1
  fi

  # Start-Time is in milliseconds, so divide by 1000 before subtracting from the current epoch seconds.
  time_diff=`date +%s`-`yarn application -status $jobId 2>/dev/null | grep Start-Time | awk '{print $NF}' | sed 's!$!/1000!'`
  time_diff_in_mins=`echo "("$time_diff")/60" | bc`
  echo "App $jobId is running for $time_diff_in_mins min(s)"

  if [ $time_diff_in_mins -gt $1 ]; then
    echo "Killing app $jobId"
    yarn application -kill $jobId
  else
    echo "App $jobId should continue to run"
  fi
done



[yarn@m1.hdp22 ~]$ ./ 30 (pass the threshold x in minutes)

App application_1487677946023_5995 is running for 0 min(s)

App application_1487677946023_5995 should continue to run
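
The time arithmetic in the script is easy to get wrong because YARN reports Start-Time in milliseconds while `date +%s` returns seconds. The following sketch isolates that calculation in two small helpers so it can be checked without a cluster. The function names are hypothetical and are not part of the script above.

```shell
# elapsed_minutes START_MS NOW_S
#   START_MS: application start time in milliseconds since the epoch (the Start-Time value)
#   NOW_S:    current time in seconds since the epoch (for example from `date +%s`)
# Prints the whole number of minutes the application has been running.
elapsed_minutes() {
  start_ms="$1"
  now_s="$2"
  echo $(( (now_s - start_ms / 1000) / 60 ))
}

# should_kill ELAPSED_MINS MAX_MINS
# Succeeds (exit status 0) only when the elapsed time is strictly above the threshold,
# the same -gt comparison the script uses.
should_kill() {
  [ "$1" -gt "$2" ]
}
```

A call such as `elapsed_minutes "$start_ms" "$(date +%s)"` gives the value the script compares against the threshold passed on the command line.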

I hope this helps. Please feel free to give your valuable feedback or suggestions.


Top Hadoop Interview Questions

1. What are the Side Data Distribution Techniques?

Side data is the small, static, extra data that a MapReduce job needs in order to run. The main challenge is making that data available on every node where a map task executes. Hadoop provides two side data distribution techniques: the job configuration and the distributed cache (see question 4).

Using Job Configuration

An arbitrary key-value pair can be set in the job configuration.

2. What is shuffling in MapReduce?

Once map tasks start to complete, the reducers begin fetching the map output they need to process, while other nodes may still be running map tasks. This transfer of mapper output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of deciding which reducer instance receives each piece of mapper output. Before a mapper emits a key-value pair, it identifies the reducer that will receive it. All records with the same key, no matter which mapper generated them, must go to the same reducer.
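
The idea of "same key, same reducer" can be sketched as a hash of the key taken modulo the number of reducers. The helper below is only an illustration: it uses the POSIX `cksum` CRC as a stand-in hash, whereas Hadoop hashes the key object in Java, so the actual partition numbers will differ. The function name is hypothetical.

```shell
# partition_for KEY NUM_REDUCERS
# Prints the reducer index (0 .. NUM_REDUCERS-1) chosen for KEY.
# cksum is a stand-in hash for illustration; it is not the hash Hadoop uses.
partition_for() {
  key="$1"
  num_reducers="$2"
  # cksum prints "<crc> <byte_count>" for stdin; keep only the CRC
  hash=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo $(( hash % num_reducers ))
}
```

Because the result depends only on the key and the reducer count, every mapper that emits the same key picks the same reducer.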

4. What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives, and JARs that an application uses to improve performance. The application tells the JobConf object which files to cache, and the MapReduce framework then copies those files to each worker node before any task of the job runs on that node.

5. What is a job tracker?

The JobTracker is a background service that runs on a master node (often alongside the NameNode) and is used for submitting and tracking jobs. In Hadoop terminology, a job means a MapReduce job. The JobTracker breaks each job into tasks, which are deployed to the data nodes that hold the required data. In a Hadoop cluster the JobTracker is the master and the TaskTrackers act as its workers: they run the tasks and report progress back to the JobTracker through heartbeats.

6. How do you set which framework is used to run a MapReduce program?

Set the mapreduce.framework.name property. It can be:

  1. Local
  2. Classic
  3. Yarn

7. What is the replication factor for a job’s JAR?

The job JAR is one of the most critical resources, used repeatedly by the job's tasks, so its replication factor defaults to 10.

8. What is the mapred.job.tracker property used for?

The mapred.job.tracker property is used by the job runner to find the job tracker. If it is set to local, the runner submits the job to a local job tracker running in a single JVM; otherwise the job is sent to the address given in the property.

9. What is the difference between Job.submit() and waitForCompletion()?

Job.submit() internally creates a submitter instance, submits the job, and returns immediately, while waitForCompletion() submits the job (if needed) and then polls its progress at a regular interval of one second. If the job completes successfully it displays a success message on the console; otherwise it displays a relevant error message.


10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

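The main difference shows up with the DROP TABLE command: dropping a managed table deletes both the metadata and the data, while dropping an external table deletes only the metadata and leaves the data in place. Otherwise, both types of tables are very similar.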

11. Does Hive support record level Insert, delete or update?

Historically, Hive did not provide record-level update, insert, or delete, and consequently did not provide transactions either. (Releases from Hive 0.14 onward add INSERT ... VALUES, UPDATE, and DELETE for tables configured as transactional.) Without those features, users can fall back on CASE statements and Hive's built-in functions to achieve the same effect, so a complex update query in an RDBMS may need many lines of code in Hive.

12. What kind of datawarehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) The data is not changing rapidly.

Hive doesn’t provide the crucial features required for OLTP (Online Transaction Processing). It is closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in hive be written to a file?

By using the awk command in the shell, the output of the HiveQL DESCRIBE statement can be written to a file.

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output.
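
For partitioned tables, DESCRIBE also prints a "# Partition Information" section that repeats the partition columns, so the plain awk filter can produce comment lines and duplicates. The sketch below is one way to filter those out; it reads the DESCRIBE output on standard input, and the function name is hypothetical.

```shell
# column_names: reads `describe <table>` output on stdin and prints each column name once.
# Skips blank lines and the "# ..." header lines Hive prints for partitioned tables.
column_names() {
  awk 'NF > 0 && $1 !~ /^#/ && !seen[$1]++ { print $1 }'
}
```

It would be used in place of the awk stage, for example `hive -S -e "describe table_name;" | column_names`.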

14. CONCAT function in Hive with Example?

The CONCAT function concatenates the input strings. You can specify any number of strings, separated by commas.

Example:

CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');

Output: Hive-performs-good-in-Hadoop

Here you have to repeat the '-' delimiter between every pair of strings. When the delimiter is the same for all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times specified in the call.

Example:

REPEAT('Hadoop',3);

Output: HadoopHadoopHadoop

Note: You can also include a space in the input string.

16. How does Pig integrate with MapReduce to process data?

Pig makes this easier. When a programmer writes a script to analyze data sets, the Pig compiler converts the program into a form MapReduce understands, and the Pig engine runs it as MapReduce jobs. MapReduce processes the data and generates the output. MapReduce does not return the output to Pig; it is stored directly in HDFS.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

18. In how many ways can we run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both Local and Mapreduce modes of execution.

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • The Ctrl-E key combination moves the cursor to the end of the line.
  • Grunt remembers command history and can recall lines in the history buffer using the up or down cursor keys.
  • Grunt supports auto-completion, which tries to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution in a single JVM; all files are installed and run using the local host and file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster. It is the default mode.

21. What are the main differences between local mode and MapReduce mode?

Local mode:

There is no need to start or install Hadoop. Pig scripts run on the local system, and by default Pig stores data in the local file system. The commands are exactly the same in local and MapReduce mode, so nothing needs to change.

MapReduce Mode:

Hadoop must be running. Pig scripts run on the cluster and data is stored in HDFS. In both modes, Java and Pig must be installed.

22. Can we process vast amounts of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

23. Does Pig support multi-line commands?

Yes.

24. Hive doesn’t support multi-line commands, what about Pig?

Pig supports both single-line and multi-line commands, as well as single-line and multi-line comments.

Single-line comment:

Dump B; -- It executes the statement but does not store the result in the file system.

Multi-line comment:

Store B into '/output'; /* It stores (persists) the data in HDFS or the local file system. In production, the Store command is the one used most often. */

25. Difference Between Pig and SQL ?

  • Pig is procedural; SQL is declarative.
  • Pig uses a nested relational data model; SQL uses a flat relational model.
  • In Pig the schema is optional; in SQL the schema is required.
  • Pig supports OLAP workloads; SQL supports OLAP and OLTP workloads.
  • Pig has limited query optimization; SQL offers significant opportunity for query optimization.



Check high CPU Intensive process on your server


Category : Bigdata

When you start utilizing your cluster heavily, you may see CPU utilization hit 100% on a specific server. With many jobs and processes running on that server at the time, it is very hard to identify the culprit process that is causing the issue. It is like finding a needle in a haystack.

I have faced this scenario in my job, so I created the following script to help find the culprit; you can then kill the process or handle it however you want. All you have to do is schedule the script in cron.

[hdfs@m1.hdp22 ~]$ cat

#!/bin/bash
# Every 10 seconds, 20 times, append the top CPU consumers to a dated log file.
dateTime=$(date +"%Y-%m-%d")
for (( i=1; i <= 20; i++ ))
do
  ps -eo pcpu,pid,user,start,etime,args | sort -k 1 -r | head -5 >> /hdptmp/Metrics/CPU_Usage_$dateTime.log
  sleep 10
done


Cron your job like below: 

[hdfs@m1.hdp22 ~]$ crontab -l

##CPU issue script

20 11 * * * /home/hdfs/ >>/hdptmp/error.log 2>&1

You will get an output file like the one below:

[hdfs@m1.hdp22 ~]$ cat /hdptmp/Metrics/CPU_Usage_2016-08-30.log


94.5 61100 hdpbatch 11:19:59       00:02 gzip -d 14-prod_2016-08-29.tsv.gz

78.5 60220 hdpbatch 11:19:52       00:09 bzip2 20-mowprod_2016-08-29.tsv

77.2 60221 hdpbatch 11:19:52       00:09 bzip2 21-mowprod_2016-08-29.tsv

77.0 60216 hdpbatch 11:19:52       00:09 bzip2 16-mowprod_2016-08-29.tsv


84.9 60220 hdpbatch 11:19:52       00:19 bzip2 20-mowprod_2016-08-29.tsv

84.9 60216 hdpbatch 11:19:52       00:19 bzip2 16-mowprod_2016-08-29.tsv

84.8 60218 hdpbatch 11:19:52       00:19 bzip2 18-mowprod_2016-08-29.tsv

84.3 60219 hdpbatch 11:19:52       00:19 bzip2 19-mowprod_2016-08-29.tsv


89.0 62082 root     11:20:17       00:05 xz -1 /var/spool/abrt/pyhook-2016-08-30-11:20:10-61697/sosreport-corpadmin-20160830112011.tar

81.7 60220 hdpbatch 11:19:52       00:30 bzip2 20-mowprod_2016-08-29.tsv

81.5 60218 hdpbatch 11:19:52       00:30 bzip2 18-mowprod_2016-08-29.tsv

81.3 60222 hdpbatch 11:19:52       00:30 bzip2 22-mowprod_2016-08-29.tsv


94.0 62886 root     11:20:30       00:02 xz -1 /var/spool/abrt/pyhook-2016-08-30-11:20:22-62093/sosreport-corpadmin-20160830112023.tar

85.1 60218 hdpbatch 11:19:52       00:40 bzip2 18-mowprod_2016-08-29.tsv

85.0 60220 hdpbatch 11:19:52       00:40 bzip2 20-mowprod_2016-08-29.tsv

84.9 60213 hdpbatch 11:19:52       00:40 bzip2 13-mowprod_2016-08-29.tsv


88.5 60220 hdpbatch 11:19:52       00:51 bzip2 20-mowprod_2016-08-29.tsv

88.3 60213 hdpbatch 11:19:52       00:51 bzip2 13-mowprod_2016-08-29.tsv

88.1 60218 hdpbatch 11:19:52       00:51 bzip2 18-mowprod_2016-08-29.tsv

88.0 60214 hdpbatch 11:19:52       00:51 bzip2 14-mowprod_2016-08-29.tsv
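
Since the log repeats the same processes every 10 seconds, it can help to reduce it to one line per PID showing that process's peak CPU. The sketch below does this with awk. It assumes the column order written by the script (pcpu, pid, user, start, etime, args) and that the start column is a single HH:MM:SS token, which is not the case for processes started on an earlier day. The function name is hypothetical.

```shell
# peak_cpu_by_pid LOGFILE
# Prints "<peak_pcpu> <pid> <user> <command...>" once per PID, highest peak first.
# Assumes columns: pcpu pid user start etime args, with start as one token.
peak_cpu_by_pid() {
  awk '
    # Only lines whose first field is numeric; this skips blanks and the ps header.
    $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
      pid = $2
      if (!(pid in peak) || $1 + 0 > peak[pid]) {
        peak[pid] = $1 + 0
        user[pid] = $3
        cmd = $6
        for (i = 7; i <= NF; i++) cmd = cmd " " $i
        args[pid] = cmd
      }
    }
    END {
      for (pid in peak) printf "%.1f %s %s %s\n", peak[pid], pid, user[pid], args[pid]
    }
  ' "$1" | sort -k1,1 -nr
}
```

For example, `peak_cpu_by_pid /hdptmp/Metrics/CPU_Usage_2016-08-30.log | head` lists the heaviest processes seen during the sampling window.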

I hope it will help you to find the culprit. Please feel free to give your feedback for any improvement.


Hadoop Admin most lovable commands

If you are working on Hadoop and want to inspect or control your cluster, the following commands should be handy. In this article I explain a few commands that will help a lot with your day-to-day work.

  1. hdfs dfsadmin -report : It gives you a summarized view of your Hadoop cluster, such as its size, the live nodes, and their utilization.

[hdfs@m1]$ hdfs dfsadmin -report

Configured Capacity: 51886964736 (48.32 GB)

Present Capacity: 27887029262 (25.97 GB)

DFS Remaining: 24417319950 (22.74 GB)

DFS Used: 3469709312 (3.23 GB)

DFS Used%: 12.44%

Under replicated blocks: 2

Blocks with corrupt replicas: 0

Missing blocks: 0

Missing blocks (with replication factor 1): 2


Live datanodes (3):


2. hdfs dfsadmin -safemode get|enter|leave : It tells you whether your NameNode (NN) is in safe mode. If the NN is in safe mode, you can use the leave option with the same command to bring it out.

[hdfs@m1]$ hdfs dfsadmin -safemode get

Safe mode is OFF in m1.hdp22/

Safe mode is OFF in m2.hdp22/

3. hadoop version : It will help you to get which hadoop version you are using:

[hdfs@m1]$ hadoop version


Subversion -r ef0582ca14b8177a3cbb6376807545272677d730

Compiled by jenkins on 2015-12-16T03:01Z

Compiled with protoc 2.5.0

From source with checksum cf48a4c63aaec76a714c1897e2ba8be6

This command was run using /usr/hdp/

4. hadoop classpath : This command prints your Hadoop classpath, which helps you locate the Hadoop JARs and the required libraries:

[hdfs@m1 ~]$ hadoop classpath


5. hadoop queue : This command will help you to get information about your yarn queue :

Usage: hadoop queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]

[hdfs@m1 ~]$ hadoop queue -list

DEPRECATED: Use of this script to execute mapred command is deprecated.

Instead use the mapred command for it.

16/08/09 05:44:35 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/08/09 05:44:36 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2


Queue Name : batch

Queue State : running

Scheduling Info : Capacity: 30.000002, MaximumCapacity: 60.000004, CurrentCapacity: 0.0


Queue Name : default

Queue State : running

Scheduling Info : Capacity: 30.000002, MaximumCapacity: 90.0, CurrentCapacity: 0.0


Queue Name : user

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 40.0, CurrentCapacity: 0.0


    Queue Name : ado

    Queue State : running

    Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0


    Queue Name : aodp

    Queue State : running

    Scheduling Info : Capacity: 40.0, MaximumCapacity: 40.0, CurrentCapacity: 0.0


    Queue Name : di

    Queue State : running

    Scheduling Info : Capacity: 20.0, MaximumCapacity: 23.0, CurrentCapacity: 0.0

Or you can get information about a specific queue. 

[hdfs@m1 ~]$ hadoop queue -info ado

DEPRECATED: Use of this script to execute mapred command is deprecated.

Instead use the mapred command for it.

16/08/09 05:49:14 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/08/09 05:49:15 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2


Queue Name : ado

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0

6. mapred job -kill <job_id> : It will kill your running MapReduce job:

mapred job -kill job_1462173172032_31967

Alternatively, you can kill the running application with the following command:

yarn application -kill application_1462173172032_31967

7. hadoop distcp : It copies files or directories recursively within a cluster or from one cluster to another:

[hdfs@m1 ~]$ hadoop distcp hdfs://HDPINFHA/user/s0998dnz/input.txt hdfs://HDPTSTHA/tmp/

Note: HDPINFHA and HDPTSTHA are both NameNode high-availability nameservice IDs.

8. hadoop archive -archiveName <your_archive_name>.har -p <parent_path> <dir_to_be_archived> <destination> : This creates a Hadoop archive (HAR) of your HDFS files.

[hdfs@m1 ~]$ hadoop archive -archiveName testing.har -p /user saurabh /test

It will run a MapReduce job and archive your directory.

[hdfs@m1 ~]$ hadoop fs -ls /test/

Found 1 items

drwxr-xr-x   – hdfs hdfs          0 2016-08-09 06:09 /test/testing.har

If you want to list what is inside the archive directory, use a recursive listing with -lsr (deprecated in favor of ls -R), as shown below:

[hdfs@m1 ~]$ hadoop fs -lsr /test/testing.har

lsr: DEPRECATED: Please use ‘ls -R’ instead.

-rw-r–r–   3 hdfs hdfs          0 2016-08-09 06:09 /test/testing.har/_SUCCESS

-rw-r–r–   5 hdfs hdfs        565 2016-08-09 06:09 /test/testing.har/_index

-rw-r–r–   5 hdfs hdfs         23 2016-08-09 06:09 /test/testing.har/_masterindex

-rw-r–r–   3 hdfs hdfs   20710951 2016-08-09 06:09 /test/testing.har/part-0

9. hadoop fsck / : The fsck command checks the health of the HDFS file system. Different arguments can be passed to this command to produce different reports.

[hdfs@m1 ~]$ hadoop fsck /

Connecting to namenode via http://m1.hdp22:50070/fsck?ugi=hdfs&path=%2F

FSCK started by hdfs (auth:SIMPLE) from / for path / at Tue Aug 09 06:23:02 EDT 2016


…………………………………………………………………………..Status: HEALTHY

Total size: 1161798713 B (Total open files size: 2242 B)

Total dirs: 11729

Total files: 1086

Total symlinks: 0 (Files currently being written: 4)

Total blocks (validated): 1056 (avg. block size 1100188 B) (Total open file blocks (not validated): 4)

Minimally replicated blocks: 1056 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 4 (0.37878788 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 3

Average block replication: 2.9734848

Corrupt blocks: 0

Missing replicas: 18 (0.569981 %)

Number of data-nodes: 3

Number of racks: 1

FSCK ended at Tue Aug 09 06:23:05 EDT 2016 in 2764 milliseconds

The filesystem under path ‘/’ is HEALTHY

10. hadoop fsck / -files : It displays all the files in HDFS while checking.

11. hadoop fsck / -files -blocks : It displays all the blocks of the files while checking.

12. hadoop fsck / -files -blocks -locations : It displays the locations of all the file blocks while checking.

13. hadoop fsck / -files -blocks -locations -racks : This command displays the network topology for the data-node locations.

14. hadoop fsck / -delete : This command deletes the corrupted files in HDFS.

15. hadoop fsck / -move : This command moves the corrupted files to the /lost+found directory.

16. hadoop dfsadmin -metasave file_name.txt : This command saves the NameNode's primary data structures (metadata) to file_name.txt in the directory specified by the hadoop.log.dir property on the NameNode host.

17. hadoop dfsadmin -refreshNodes : This command refreshes the list of data nodes that are allowed to connect to the name node.

18. hadoop fs -count -q /mydir : It shows the quota and the remaining quota for the specified directory or file.

19. hadoop dfsadmin -setSpaceQuota 10M /mydir : This command sets the space quota for a particular directory. Here we set the directory quota to 10 MB; you can then check it with hadoop fs -count -q /mydir.

20. hadoop dfsadmin -clrSpaceQuota /mydir : This command clears the space quota allocated to a particular directory in HDFS. After clearing the quota set previously, check the quota again with the same -count -q command.
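
The fsck report shown under command 9 is long, so for monitoring it can be convenient to reduce it to a few values. The sketch below reads fsck output on standard input and relies only on the line formats visible in that sample report ("Under-replicated blocks:", "Corrupt blocks:" and the final "The filesystem under path ... is ..." line); other Hadoop versions may format these lines differently. The function name is hypothetical.

```shell
# fsck_summary: reads `hadoop fsck` output on stdin and prints
# "<status> <under_replicated_blocks> <corrupt_blocks>".
# Relies on the line formats in the sample report above; other versions may differ.
fsck_summary() {
  awk '
    /^The filesystem under path/ { status = $NF }
    /Under-replicated blocks:/   { under = $3 }
    /Corrupt blocks:/            { corrupt = $3 }
    END { print status, under + 0, corrupt + 0 }
  '
}
```

A typical call would be `hadoop fsck / 2>/dev/null | fsck_summary`, whose first field can then be compared against HEALTHY in a cron job.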


I hope all the above commands will help you to control your cluster. Please feel free to give your feedback.


Rack Awareness on Hadoop

Category : Bigdata

If you have a Hadoop cluster of more than 30-40 nodes, it is better to configure it with rack awareness, because communication between two data nodes on the same rack is more efficient than communication between two nodes on different racks.

It also helps reduce network traffic while reading and writing HDFS files: the NameNode chooses data nodes that are on the same rack as, or a rack near to, the client node making the read/write request.

The NameNode obtains this rack information by maintaining the rack ID of each data node. This concept of choosing closer data nodes based on rack information is called Rack Awareness in Hadoop.

Note : A default Hadoop installation assumes all the nodes belong to the same rack.

In this article I explain how to make your cluster rack aware.

Step 1: Create a topology data file anywhere on the master node (i.e. the NN) and list every data node's IP address together with its corresponding rack.

[root@m1 ~]# vi

[root@m1 ~]# cat 01 02 01 02 01 02

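Step 2: Now create the topology script that reads the above data file.

[root@m1 ~]# vi

[root@m1 ~]# cat

#!/bin/bash
# Adjust/Add the property "net.topology.script.file.name"
# to core-site.xml with the "absolute" path to this
# file.  ENSURE the file is "executable".

# Supply appropriate rack prefix
RACK_PREFIX=default   # assumed value; the original setting is not shown

# Assumed locations; point these at the data file created in Step 1.
CTL_FILE=${CTL_FILE:-"rack_topology.data"}
HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}

# To test, supply a hostname as script input:
if [ $# -gt 0 ]; then

  if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
    echo -n "/$RACK_PREFIX/rack "
    exit 0
  fi

  while [ $# -gt 0 ] ; do
    nodeArg=$1
    exec< ${HADOOP_CONF}/${CTL_FILE}
    result=""
    while read line ; do
      ar=( $line )
      if [ "${ar[0]}" = "$nodeArg" ] ; then
        result="${ar[1]}"
      fi
    done
    shift
    if [ -z "$result" ] ; then
      echo -n "/$RACK_PREFIX/rack "
    else
      echo -n "/$RACK_PREFIX/rack_$result "
    fi
  done

else
  echo -n "/$RACK_PREFIX/rack "
fi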

Step 3: Add the topology script property (net.topology.script.file.name in Hadoop 2.x), set to the absolute path of the script, to core-site.xml, either by editing the file directly or through Ambari.
Step 4: Now restart the HDFS service for the change to take effect.
I hope this article helps you make your cluster rack aware. Please feel free to give your feedback.


Namenode installation issue

When you install HDP and something goes wrong with the HDFS components (such as the NameNode) during installation, you may see the following errors.

File "/usr/lib/python2.6/site-packages/resource_management/core/", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/", line 291, in _call
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'yes Y | hdfs --config /usr/hdp/current/hadoop-client/conf namenode -format' returned 127.
/usr/hdp/current/hadoop-client/bin/hdfs: line 18: /usr/hdp/ No such file or directory
yes: standard output: Broken pipe
yes: write error
stdout: /var/lib/ambari-agent/data/output-594.txt

The packages were not installed correctly during cluster installation. There were many files that were missing under /usr/hdp/<HDP_VERSION>/hadoop-hdfs/bin.


  • Run the command below to check which package owns the missing file:
    $ rpm -qf /usr/hdp/
  • Re-install the package by running the command below:
    $ yum reinstall hadoop_2_3_2_0_2950-hdfs-

I hope it will help you to solve your namenode issue. Please feel free to give your feedback.



How to debug distcp jobs


Category : Bigdata

Sometimes when you run distcp jobs on a cluster you see failures or poor performance and want to debug them. You can do so with the following commands.

To turn on debug mode at the job level, issue the following command before executing the distcp job:

[root@m1.hdp22] export HADOOP_ROOT_LOGGER=DEBUG,console

To turn on debug mode at the mapper level, run distcp with the mapper debug option as follows:

[root@m1.hdp22] hadoop distcp"-Xmxyyyy -Dhadoop.root.logger=DEBUG,console"


How to check contents of a JAR file


Category : Bigdata

Many times we need to check which packages and classes are included in a JAR file, but because a JAR is a black box (just a single archive file) it is not obvious how to look inside.

The following commands let you check it.

jar tf <PATH_TO_JAR>

If you are looking for a specific class or package, you can use the following command.

jar tf <PATH_TO_JAR> | grep -i <PARTIAL_NAME_OF_CLASS>


If you delete /hdp/apps/ dir from hdfs

If you accidentally delete /hdp/apps/ with -skipTrash, you will be in trouble and other services will be impacted: you will not be able to run Hive, MapReduce, or Sqoop commands, and you will get the following error.

[root@m1 ranger-hdfs-plugin]# hadoop fs -rmr -skipTrash /hdp/apps/

rmr: DEPRECATED: Please use ‘rm -r’ instead.

Deleted /hdp/apps/

Now when I try to start Hive, it throws the error below.

[root@m1 admin]# hive

WARNING: Use “yarn jar” to launch YARN applications.

16/07/27 22:05:04 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist

Logging initialized using configuration in file:/etc/hive/

Exception in thread “main” java.lang.RuntimeException: File does not exist: /hdp/apps/

at org.apache.hadoop.hive.ql.session.SessionState.start(


Resolution: Don’t worry, you can resolve this issue by following the steps given below.

Note: Set $BUILD to the version of your HDP installation in every command.

Step 1: First create the following required directories:

hdfs dfs -mkdir -p /hdp/apps/$BUILD/mapreduce
hdfs dfs -mkdir -p /hdp/apps/$BUILD/hive
hdfs dfs -mkdir -p /hdp/apps/$BUILD/tez
hdfs dfs -mkdir -p /hdp/apps/$BUILD/sqoop
hdfs dfs -mkdir -p /hdp/apps/$BUILD/pig

Step 2: Now copy the required tarballs into the related directories.

hdfs dfs -put /usr/hdp/$BUILD/hadoop/mapreduce.tar.gz /hdp/apps/$BUILD/mapreduce/
hdfs dfs -put /usr/hdp/$BUILD/hive/hive.tar.gz /hdp/apps/$BUILD/hive/
hdfs dfs -put /usr/hdp/$BUILD/tez/lib/tez.tar.gz /hdp/apps/$BUILD/tez/
hdfs dfs -put /usr/hdp/$BUILD/sqoop/sqoop.tar.gz /hdp/apps/$BUILD/sqoop/
hdfs dfs -put /usr/hdp/$BUILD/pig/pig.tar.gz /hdp/apps/$BUILD/pig/

Step 3: Now change the directory owner and then the permissions:

hdfs dfs -chown -R hdfs:hadoop /hdp
hdfs dfs -chmod -R 555 /hdp/apps/$BUILD
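
Since the three steps repeat the same pattern for each component, they can be generated from a short list. The sketch below only prints the commands for a given version string so that they can be reviewed before anything touches HDFS. The function name is hypothetical, and the component-to-tarball mapping is copied from Step 2.

```shell
# print_restore_cmds BUILD
# Prints (does not run) the hdfs commands from Steps 1-3 for the given HDP version string.
print_restore_cmds() {
  build="$1"
  # Each entry is "<component>:<tarball path relative to /usr/hdp/<build>>", as in Step 2.
  for entry in \
    "mapreduce:hadoop/mapreduce.tar.gz" \
    "hive:hive/hive.tar.gz" \
    "tez:tez/lib/tez.tar.gz" \
    "sqoop:sqoop/sqoop.tar.gz" \
    "pig:pig/pig.tar.gz"
  do
    component=${entry%%:*}
    tarball=${entry#*:}
    echo "hdfs dfs -mkdir -p /hdp/apps/$build/$component"
    echo "hdfs dfs -put /usr/hdp/$build/$tarball /hdp/apps/$build/$component/"
  done
  echo "hdfs dfs -chown -R hdfs:hadoop /hdp"
  echo "hdfs dfs -chmod -R 555 /hdp/apps/$build"
}
```

After checking the printed commands, they can be run one by one or piped to a shell as the hdfs user.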

Now you will be able to start your hive CLI or other jobs.

[root@m1 ~]# hive

WARNING: Use “yarn jar” to launch YARN applications.

16/07/27 23:33:42 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist

Logging initialized using configuration in file:/etc/hive/


I hope it will help you to restore your cluster. Please feel free to give your suggestion.