DistCp between High Availability enabled clusters


Category : Bigdata

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This impacted the total availability of the HDFS cluster in two major ways:

  • In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
  • Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.

The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list. Its MapReduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model. A basic invocation that copies a single file from one cluster's NameNode to another looks like this:

$ hadoop distcp hdfs://nn1:8020/user/test/sample.txt hdfs://nn2:8020/user/test/
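Since each map task copies a partition of the source file list, the copy can be tuned with standard DistCp options; for example, -m caps the number of map tasks and -update copies only files that are missing or changed at the target:

$ hadoop distcp -m 10 -update hdfs://nn1:8020/user/test/ hdfs://nn2:8020/user/test/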

The easiest way to use DistCp between two HA clusters would be to identify the current active NameNode of each cluster and run DistCp as you would between two clusters without HA:

$ hadoop distcp hdfs://active1:8020/user/test/sample.txt hdfs://active2:8020/user/test/

But identifying the active NameNode by hand is tedious, and it is not practical when automating a batch job. The alternative is to configure the client with both clusters' service IDs and make it aware of how to identify the active NameNode of each.
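
If you do want to check by hand, the hdfs haadmin utility reports the state of each NameNode; nn1 and nn2 here are the NameNode IDs configured for the nameservice:

$ hdfs haadmin -getServiceState nn1
active
$ hdfs haadmin -getServiceState nn2
standby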

Note: You need to add both clusters' HA properties to hdfs-site.xml on your source cluster. For example, if you want to copy data from cluster1 to cluster2, you have to edit hdfs-site.xml on cluster1 and add cluster2's HA properties to that file.

  1. Add both service ID values to the following property:
    <property>
    <name>dfs.nameservices</name>
    <value>serviceId1,serviceId2</value>
    </property>
  2. Add the following properties for cluster2 to cluster1's hdfs-site.xml:

<!-- serviceId2 properties -->
<property>
  <name>dfs.client.failover.proxy.provider.serviceId2</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.namenodes.serviceId2</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.serviceId2.nn1</name>
  <value>m1.hdp22:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.serviceId2.nn1</name>
  <value>m1.hdp22:54321</value>
</property>
<property>
  <name>dfs.namenode.http-address.serviceId2.nn1</name>
  <value>m1.hdp22:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.serviceId2.nn1</name>
  <value>m1.hdp22:50470</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.serviceId2.nn2</name>
  <value>m2.hdp22:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.serviceId2.nn2</name>
  <value>m2.hdp22:54321</value>
</property>
<property>
  <name>dfs.namenode.http-address.serviceId2.nn2</name>
  <value>m2.hdp22:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.serviceId2.nn2</name>
  <value>m2.hdp22:50470</value>
</property>
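
One caveat when sharing hdfs-site.xml: if the cluster's DataNodes read the same file, listing the remote service ID in dfs.nameservices can cause them to attempt to register with the remote cluster's NameNodes as well. On Hadoop 2.6 and later this can be avoided by pinning the local nameservice via dfs.internal.nameservices (a minimal sketch, assuming serviceId1 is the local cluster):

<property>
  <name>dfs.internal.nameservices</name>
  <value>serviceId1</value>
</property>

With the client configured this way, DistCp can address either cluster by its nameservice ID and will resolve the active NameNode automatically: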

[s0998dnz@m1.hdp22 ~]$ hadoop distcp hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt hdfs://HDPTSTHA/user/s0998dnz/
17/04/11 05:04:37 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt], targetPath=hdfs://HDPTSTHA/user/s0998dnz, targetPathExists=true, filtersFile='null'}
17/04/11 05:04:38 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:38 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Build file listing completed.
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:39 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:39 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: number of splits:1
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491850038017_0011
17/04/11 05:04:39 INFO impl.YarnClientImpl: Submitted application application_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: The url to track the job: http://m1.hdp22:8088/proxy/application_1491850038017_0011/
17/04/11 05:04:39 INFO tools.DistCp: DistCp job-id: job_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: Running job: job_1491850038017_0011
17/04/11 05:04:46 INFO mapreduce.Job: Job job_1491850038017_0011 running in uber mode : false
17/04/11 05:04:46 INFO mapreduce.Job:  map 0% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job:  map 100% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job: Job job_1491850038017_0011 completed successfully
17/04/11 05:04:53 INFO mapreduce.Job: Counters: 33
File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=153385
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=408
    HDFS: Number of bytes written=46
    HDFS: Number of read operations=17
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=4
Job Counters
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=3837
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=3837
    Total vcore-milliseconds taken by all map tasks=3837
    Total megabyte-milliseconds taken by all map tasks=3929088
Map-Reduce Framework
    Map input records=1
    Map output records=0
    Input split bytes=118
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=99
    CPU time spent (ms)=1350
    Physical memory (bytes) snapshot=283774976
    Virtual memory (bytes) snapshot=3770925056
    Total committed heap usage (bytes)=372244480
File Input Format Counters
    Bytes Read=244
File Output Format Counters
    Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
    BYTESCOPIED=46
    BYTESEXPECTED=46
    COPY=1
[s0998dnz@m1.hdp22 ~]$

[s0998dnz@m1.hdp22 ~]$ hadoop fs -ls hdfs://HDPTSTHA/user/s0998dnz/
Found 19 items
drwxr-x---   - s0998dnz hdfs          0 2017-04-11 04:38 hdfs://HDPTSTHA/user/s0998dnz/.Trash
drwxr-x---   - s0998dnz hdfs          0 2016-01-21 00:59 hdfs://HDPTSTHA/user/s0998dnz/.hiveJars
drwxr-xr-x   - s0998dnz hdfs          0 2016-04-29 07:13 hdfs://HDPTSTHA/user/s0998dnz/.sparkStaging
drwx------   - s0998dnz hdfs          0 2016-11-21 02:42 hdfs://HDPTSTHA/user/s0998dnz/.staging
drwxr-x---   - s0998dnz hdfs          0 2015-12-28 12:06 hdfs://HDPTSTHA/user/s0998dnz/_sqoop
drwxr-x---   - s0998dnz hdfs          0 2016-12-01 02:42 hdfs://HDPTSTHA/user/s0998dnz/custTable_orc_sql
drwxr-x---   - s0998dnz hdfs          0 2016-11-29 05:27 hdfs://HDPTSTHA/user/s0998dnz/export_sql_new
drwxr-x---   - s0998dnz hdfs          0 2016-11-30 02:09 hdfs://HDPTSTHA/user/s0998dnz/export_table
drwxrwxrwx   - s0998dnz hdfs          0 2016-04-14 02:42 hdfs://HDPTSTHA/user/s0998dnz/falcon
-rw-r--r--   3 s0998dnz hdfs         34 2016-05-03 05:30 hdfs://HDPTSTHA/user/s0998dnz/file.txt
drwxr-x---   - s0998dnz hdfs          0 2016-11-17 06:00 hdfs://HDPTSTHA/user/s0998dnz/import_sql
-rw-r-----   3 s0998dnz hdfs        144 2016-10-24 06:18 hdfs://HDPTSTHA/user/s0998dnz/input.txt
drwxr-x---   - s0998dnz hdfs          0 2017-01-24 02:36 hdfs://HDPTSTHA/user/s0998dnz/oozie-oozi
drwxr-xr-x   - s0998dnz hdfs          0 2016-05-03 05:31 hdfs://HDPTSTHA/user/s0998dnz/oozie-scripts
drwxr-xr-x   - s0998dnz hdfs          0 2016-07-01 06:25 hdfs://HDPTSTHA/user/s0998dnz/pigOut1
-rw-r-----   3 s0998dnz hdfs         46 2017-04-11 05:04 hdfs://HDPTSTHA/user/s0998dnz/sampleTest.txt
drwxr-x---   - hdfs     hdfs          0 2015-12-28 07:41 hdfs://HDPTSTHA/user/s0998dnz/sanity_test
drwxr-x---   - s0998dnz hdfs          0 2016-04-14 06:55 hdfs://HDPTSTHA/user/s0998dnz/summary_table
-rw-r-----   3 s0998dnz hdfs        991 2016-11-16 08:53 hdfs://HDPTSTHA/user/s0998dnz/workflow.xml
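
As an extra sanity check, you can compare file checksums on both sides; provided the two clusters use the same block size and checksum configuration, matching values confirm an identical copy:

$ hadoop fs -checksum hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt
$ hadoop fs -checksum hdfs://HDPTSTHA/user/s0998dnz/sampleTest.txt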

Post your feedback, experience, or questions if you have any.

