Hadoop Admin most lovable commands

If you are working on Hadoop and want to inspect or control your cluster, the following commands should be handy for you. In this article I have tried to explain a few commands that will help you a lot in your day-to-day work.

1. hdfs dfsadmin -report : It will give you a summarized view of your Hadoop cluster, such as its size, live nodes, and their utilization.

[hdfs@m1]$ hdfs dfsadmin -report

Configured Capacity: 51886964736 (48.32 GB)

Present Capacity: 27887029262 (25.97 GB)

DFS Remaining: 24417319950 (22.74 GB)

DFS Used: 3469709312 (3.23 GB)

DFS Used%: 12.44%

Under replicated blocks: 2

Blocks with corrupt replicas: 0

Missing blocks: 0

Missing blocks (with replication factor 1): 2

-------------------------------------------------

Live datanodes (3):

-------------------------------------------------
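On Hadoop 2.6 and later you should also be able to filter the report to show only live or dead datanodes:

[hdfs@m1]$ hdfs dfsadmin -report -live

[hdfs@m1]$ hdfs dfsadmin -report -dead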

2. hdfs dfsadmin -safemode get|enter|leave : It will tell you whether your NameNode (NN) is in safemode or not. If the NN is in safemode, you can use the leave option to bring it out.

[hdfs@m1]$ hdfs dfsadmin -safemode get

Safe mode is OFF in m1.hdp22/192.168.56.41:8020

Safe mode is OFF in m2.hdp22/192.168.56.42:8020
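To toggle safemode manually (for example before maintenance), use the enter and leave options shown in the usage above:

[hdfs@m1]$ hdfs dfsadmin -safemode enter

[hdfs@m1]$ hdfs dfsadmin -safemode leave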

3. hadoop version : It will help you find out which Hadoop version you are using:

[hdfs@m1]$ hadoop version

Hadoop 2.7.1.2.3.4.0-3485

Subversion git@github.com:hortonworks/hadoop.git -r ef0582ca14b8177a3cbb6376807545272677d730

Compiled by jenkins on 2015-12-16T03:01Z

Compiled with protoc 2.5.0

From source with checksum cf48a4c63aaec76a714c1897e2ba8be6

This command was run using /usr/hdp/2.3.4.0-3485/hadoop/hadoop-common-2.7.1.2.3.4.0-3485.jar

4. hadoop classpath : This command will help you to know your Hadoop classpath, i.e. the Hadoop jars and the required libraries:

[hdfs@m1 ~]$ hadoop classpath

/usr/hdp/2.3.4.0-3485/hadoop/conf:/usr/hdp/2.3.4.0-3485/hadoop/lib/*:/usr/hdp/2.3.4.0-3485/hadoop/.//*:/usr/hdp/2.3.4.0-3485/hadoop-hdfs/./:/usr/hdp/2.3.4.0-3485/hadoop-hdfs/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-hdfs/.//*:/usr/hdp/2.3.4.0-3485/hadoop-yarn/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-yarn/.//*:/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/lib/*:/usr/hdp/2.3.4.0-3485/hadoop-mapreduce/.//*::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/2.3.4.0-3485/tez/*:/usr/hdp/2.3.4.0-3485/tez/lib/*:/usr/hdp/2.3.4.0-3485/tez/conf
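A handy use of this output is putting Hadoop's jars on the classpath of your own Java program; a small sketch (MyTool and mytool.jar are hypothetical names of your own, not part of Hadoop):

[hdfs@m1 ~]$ java -cp "$(hadoop classpath):mytool.jar" MyTool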

5. hadoop queue : This command will help you to get information about your YARN queues:

Usage: hadoop queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]

[hdfs@m1 ~]$ hadoop queue -list

DEPRECATED: Use of this script to execute mapred command is deprecated.

Instead use the mapred command for it.

16/08/09 05:44:35 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/08/09 05:44:36 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

======================

Queue Name : batch

Queue State : running

Scheduling Info : Capacity: 30.000002, MaximumCapacity: 60.000004, CurrentCapacity: 0.0

======================

Queue Name : default

Queue State : running

Scheduling Info : Capacity: 30.000002, MaximumCapacity: 90.0, CurrentCapacity: 0.0

======================

Queue Name : user

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 40.0, CurrentCapacity: 0.0

======================

Queue Name : ado

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0

======================

Queue Name : aodp

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 40.0, CurrentCapacity: 0.0

======================

Queue Name : di

Queue State : running

Scheduling Info : Capacity: 20.0, MaximumCapacity: 23.0, CurrentCapacity: 0.0

Or you can get information about a specific queue:

[hdfs@m1 ~]$ hadoop queue -info ado

DEPRECATED: Use of this script to execute mapred command is deprecated.

Instead use the mapred command for it.

16/08/09 05:49:14 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/

16/08/09 05:49:15 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

======================

Queue Name : ado

Queue State : running

Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0
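Since the hadoop queue script is deprecated, the same information should also be available through the mapred command, for example:

[hdfs@m1 ~]$ mapred queue -list

[hdfs@m1 ~]$ mapred queue -info ado -showJobs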

6. mapred job -kill <job_id> : It will help you to kill your running MapReduce job (the older hadoop job -kill form is deprecated):

mapred job -kill job_1462173172032_31967 or you can kill a running application with the following command.

yarn application -kill application_1462173172032_31967
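If you do not know the application id, you can list the running applications first and pick the id from there:

[hdfs@m1 ~]$ yarn application -list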

7. hadoop distcp : It will help you to copy files or directories recursively within a cluster or from one cluster to another:

[hdfs@m1 ~]$ hadoop distcp hdfs://HDPINFHA/user/s0998dnz/input.txt hdfs://HDPTSTHA/tmp/

Note: HDPINFHA and HDPTSTHA are both NameNode high-availability nameservice IDs.
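distcp itself runs as a MapReduce job. Two commonly used options are -update (copy only files that are missing or changed on the target) and -m (limit the number of map tasks); a sketch against the same clusters:

[hdfs@m1 ~]$ hadoop distcp -update -m 20 hdfs://HDPINFHA/user/s0998dnz/ hdfs://HDPTSTHA/tmp/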

8. hadoop archive -archiveName <your_archive_name>.har -p <parent_path> <dir_to_archive> <destination> : This will help you to archive your HDFS files (here the directory saurabh under the parent /user is archived into /test).

[hdfs@m1 ~]$ hadoop archive -archiveName testing.har -p /user saurabh /test

It will run a MapReduce job and archive your directory.

[hdfs@m1 ~]$ hadoop fs -ls /test/

Found 1 items

drwxr-xr-x   - hdfs hdfs          0 2016-08-09 06:09 /test/testing.har

If you want to list the contents of the archive file, you cannot read it with a normal ls command. You have to use -lsr as below:

[hdfs@m1 ~]$ hadoop fs -lsr /test/testing.har

lsr: DEPRECATED: Please use 'ls -R' instead.

-rw-r--r--   3 hdfs hdfs          0 2016-08-09 06:09 /test/testing.har/_SUCCESS

-rw-r--r--   5 hdfs hdfs        565 2016-08-09 06:09 /test/testing.har/_index

-rw-r--r--   5 hdfs hdfs         23 2016-08-09 06:09 /test/testing.har/_masterindex

-rw-r--r--   3 hdfs hdfs   20710951 2016-08-09 06:09 /test/testing.har/part-0
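You can also address the archive through the har:// filesystem scheme, which lets you list or read the original files inside it:

[hdfs@m1 ~]$ hadoop fs -ls har:///test/testing.har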

9. hadoop fsck / : fsck command is used to check the HDFS file system. There are different arguments that can be passed with this command to emit different results.

[hdfs@m1 ~]$ hadoop fsck /

Connecting to namenode via http://m1.hdp22:50070/fsck?ugi=hdfs&path=%2F

FSCK started by hdfs (auth:SIMPLE) from /192.168.56.41 for path / at Tue Aug 09 06:23:02 EDT 2016

....................................................................................................

..........................................................................Status: HEALTHY

Total size: 1161798713 B (Total open files size: 2242 B)

Total dirs: 11729

Total files: 1086

Total symlinks: 0 (Files currently being written: 4)

Total blocks (validated): 1056 (avg. block size 1100188 B) (Total open file blocks (not validated): 4)

Minimally replicated blocks: 1056 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 4 (0.37878788 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 3

Average block replication: 2.9734848

Corrupt blocks: 0

Missing replicas: 18 (0.569981 %)

Number of data-nodes: 3

Number of racks: 1

FSCK ended at Tue Aug 09 06:23:05 EDT 2016 in 2764 milliseconds

The filesystem under path '/' is HEALTHY

10. hadoop fsck / -files : It displays all the files in HDFS while checking. 

11. hadoop fsck / -files -blocks : It displays all the blocks of the files while checking.

12. hadoop fsck / -files -blocks -locations : It displays all the files' block locations while checking.

13. hadoop fsck / -files -blocks -locations -racks : This command is used to display the networking topology for data-node locations (see the combined example below).
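These fsck options can also be scoped to a sub-path instead of the whole filesystem; for example, to inspect block placement under a single directory (the path here is just illustrative):

[hdfs@m1 ~]$ hadoop fsck /user/saurabh -files -blocks -locations -racks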

14. hadoop fsck / -delete : This command will delete the corrupted files in HDFS.

15. hadoop fsck / -move : This command is used to move the corrupted files to the /lost+found directory.

16. hdfs dfsadmin -metasave file_name.txt : This command is used to save the metadata held by the NameNode (blocks being replicated, datanode status, and so on) to a file; note that the file is written under the NameNode's local log directory, not into HDFS.

17. hdfs dfsadmin -refreshNodes : This command is used to re-read the include/exclude files and refresh the set of data nodes that are allowed to connect to the name node.

18. hadoop fs -count -q /mydir : Checks the name quota and space quota for the specified directory or file.
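For reference, the eight columns -count -q prints are QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME (as documented upstream; verify the exact layout on your version).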

19. hdfs dfsadmin -setSpaceQuota 10M /dir_name : This command is used to set the space quota for a particular directory. Here we set the directory quota to 10 MB and can then verify it using the command hadoop fs -count -q /mydir.

20. hdfs dfsadmin -clrSpaceQuota /mydir : This command is used to clear the quota allocated to a particular directory in HDFS. Now we will clear the quota which we created earlier and check it again, as in the walk-through below.
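Putting items 18-20 together, a quick walk-through on a scratch directory (the directory name is just illustrative):

[hdfs@m1 ~]$ hadoop fs -mkdir /mydir

[hdfs@m1 ~]$ hdfs dfsadmin -setSpaceQuota 10M /mydir

[hdfs@m1 ~]$ hadoop fs -count -q /mydir

[hdfs@m1 ~]$ hdfs dfsadmin -clrSpaceQuota /mydir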

 

I hope all the above commands help you to control your cluster. Please feel free to give your feedback.


6 Comments

Sankar

August 24, 2016 at 6:43 am

Great work sir

    admin

    August 28, 2016 at 11:43 am

    Thanks Sankar. Please feel free to reach out to me anytime for any doubts.

Raghu

September 6, 2016 at 7:17 pm

Great Blog for Hadoop admins who are beginners like me. Appreciate your sharing the knowledge and experience. Like to see more on Hadoop Monitoring tips/tricks.

    admin

    September 7, 2016 at 5:40 am

    Thanks Raghu. I will keep posting more articles, or if you need anything specific please feel free to post your requirement here.

bibhu

October 16, 2017 at 3:35 pm

Sir, I want to join a Hadoop admin course, so what are the prerequisites for Hadoop admin training:
Linux admin, basic Linux commands, or any other programming language?
I know basic Linux commands and SQL as well.

    admin

    October 17, 2017 at 4:16 pm

    Hello Bibhu,

    Welcome to the Hadoop world.
    To become a good Hadoop admin, it would be good to have knowledge of the following:
    1. Good knowledge of Unix
    2. Basics of SQL
    3. Software architecture knowledge

    For more details you can have a look at the following page:
    http://www.hadoopadmin.co.in/faq/
