Monthly Archives: March 2016


Encrypt the password used by Sqoop to import or export data from a database

Sqoop has matured a great deal and become one of the most popular tools in the Hadoop ecosystem. When we import or export data from a database through Sqoop, we have to supply the database password either on the command line or in a file. Neither of these is a fully secure way to handle a password. In this post, I will cover the ways to supply database passwords to Sqoop in a simple and secure manner.

The following are the common ways to pass database passwords to Sqoop:

Option 1: Give the password in the command itself.

sqoop import --connect jdbc:mysql://servera.com:3306/bigdb --username dbuser --password <password> --table sqoop_import -m 1 --target-dir /user/root/sanity_test/ --append

Option 2: Keep the password in a file and pass that file to the command.

sqoop import --connect jdbc:mysql://servera.com:3306/bigdb --username dbuser --password-file /home/root/.mysql.password --table sqoop_import -m 1 --target-dir /user/root/sanity_test/ --append

However, storing the password in a clear-text file is still not considered secure, even if we restrict the file's permissions.
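If you do use this option, the file should contain only the password and be readable only by its owner. A minimal sketch of creating and locking down such a file, assuming the /home/root/.mysql.password path from the example above ('MySecretPassword' is just a placeholder):

## Write the password without a trailing newline, since Sqoop reads the file contents verbatim.
echo -n 'MySecretPassword' > /home/root/.mysql.password
## Restrict the file so that only its owner can read it.
chmod 400 /home/root/.mysql.password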

As of version 1.4.5, Sqoop supports reading the password from a Java keystore, where it is stored encrypted, so that you no longer need to keep it in clear text in a file. This is achieved through the Hadoop credential provider API (a JCEKS keystore).

Option 3: Store the password in a Java keystore. First, generate the credential:

Note: At the prompt, enter the password that will be used to access the database.

[root@m1.hdp22 ~]$ hadoop credential create mydb.password.alias -provider jceks://hdfs/user/root/mysql.password.jceks
Enter password:
Enter password again:
mydb.password.alias has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.
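To confirm that the alias was stored, you can list the aliases held by the provider using the same hadoop credential tool:

[root@m1.hdp22 ~]$ hadoop credential list -provider jceks://hdfs/user/root/mysql.password.jceks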

## Now you can use the mydb.password.alias alias in your sqoop command as shown below.
## Add -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks to the command and replace the password option with --password-alias, as follows:

sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks --connect jdbc:mysql://server:3306/sqoop --username sqoop --password-alias mydb.password.alias --table Test --delete-target-dir --target-dir /apps/hive/warehouse/test --split-by name

## or
sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks --connect jdbc:mysql://servera.com:3306/bigdb --table sqoop_import --username dbuser --password-alias mydb.password.alias -m 1 --target-dir /user/root/sanity_test/ --append

This way the password is stored encrypted inside jceks://hdfs/user/root/mysql.password.jceks and nobody can read it in clear text.

Note: "mydb.password.alias" is the alias we pass to Sqoop when running the command, so no clear-text password is needed.
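Since the keystore itself is a file on HDFS, it is also worth restricting who can read it. A small sketch using standard HDFS permissions on the path created above:

hadoop fs -chmod 400 /user/root/mysql.password.jceks
hadoop fs -ls /user/root/mysql.password.jceks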

I hope you enjoyed reading this article and that it helps you protect your passwords. Please share your valuable feedback.



DistCp between High Availability enabled clusters

Category : Bigdata

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This impacted the total availability of the HDFS cluster in two major ways:

  • In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
  • Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.

The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its MapReduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

$ hadoop distcp hdfs://nn1:8020/user/test/sample.txt hdfs://nn2:8020/user/test/

The easiest way to use distcp between two HA clusters would be to identify the current active NameNode and run distcp like you would with two clusters without HA:

$ hadoop distcp hdfs://<active-nn-cluster1>:8020/user/test/sample.txt hdfs://<active-nn-cluster2>:8020/user/test/
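For reference, you can check which NameNode of an HA cluster is currently active with hdfs haadmin; a quick sketch, assuming the NameNode IDs nn1 and nn2 used later in this post (each command prints either "active" or "standby"):

$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2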

However, identifying the active NameNode by hand is not easy, and it is not practical when automating a batch job. The alternative is to configure the client with both nameservice IDs so that it can determine the active NameNode of each cluster on its own.

Note: You need to add the HA properties of both clusters to hdfs-site.xml on your source cluster. For example, if you want to copy data from cluster1 to cluster2, then you have to edit hdfs-site.xml on cluster1 and add cluster2's HA properties to that file.

  1. Both nameservice IDs should be added to the following property:
    <property>
      <name>dfs.nameservices</name>
      <value>serviceId1,serviceId2</value>
    </property>
  2. Add the following properties for cluster2 (serviceId2) to cluster1's hdfs-site.xml:

<!-- serviceId2 properties -->
<property>
  <name>dfs.client.failover.proxy.provider.serviceId2</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.namenodes.serviceId2</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.serviceId2.nn1</name>
  <value>m1.hdp22:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.serviceId2.nn1</name>
  <value>m1.hdp22:54321</value>
</property>
<property>
  <name>dfs.namenode.http-address.serviceId2.nn1</name>
  <value>m1.hdp22:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.serviceId2.nn1</name>
  <value>m1.hdp22:50470</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.serviceId2.nn2</name>
  <value>m2.hdp22:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.serviceId2.nn2</name>
  <value>m2.hdp22:54321</value>
</property>
<property>
  <name>dfs.namenode.http-address.serviceId2.nn2</name>
  <value>m2.hdp22:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.serviceId2.nn2</name>
  <value>m2.hdp22:50470</value>
</property>
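Once these properties are in place, the client on cluster1 can resolve the remote nameservice by its ID alone. A quick sanity check before running DistCp (the actual nameservice IDs in the run below are HDPINFHA and HDPTSTHA):

$ hdfs getconf -confKey dfs.nameservices
$ hadoop fs -ls hdfs://HDPTSTHA/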

[s0998dnz@m1.hdp22 ~]$ hadoop distcp hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt hdfs://HDPTSTHA/user/s0998dnz/
17/04/11 05:04:37 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt], targetPath=hdfs://HDPTSTHA/user/s0998dnz, targetPathExists=true, filtersFile='null'}
17/04/11 05:04:38 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:38 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Build file listing completed.
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:39 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:39 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: number of splits:1
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491850038017_0011
17/04/11 05:04:39 INFO impl.YarnClientImpl: Submitted application application_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: The url to track the job: http://m1.hdp22:8088/proxy/application_1491850038017_0011/
17/04/11 05:04:39 INFO tools.DistCp: DistCp job-id: job_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: Running job: job_1491850038017_0011
17/04/11 05:04:46 INFO mapreduce.Job: Job job_1491850038017_0011 running in uber mode : false
17/04/11 05:04:46 INFO mapreduce.Job:  map 0% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job:  map 100% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job: Job job_1491850038017_0011 completed successfully
17/04/11 05:04:53 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=153385
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=408
HDFS: Number of bytes written=46
HDFS: Number of read operations=17
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3837
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3837
Total vcore-milliseconds taken by all map tasks=3837
Total megabyte-milliseconds taken by all map tasks=3929088
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=118
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=99
CPU time spent (ms)=1350
Physical memory (bytes) snapshot=283774976
Virtual memory (bytes) snapshot=3770925056
Total committed heap usage (bytes)=372244480
File Input Format Counters
Bytes Read=244
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=46
BYTESEXPECTED=46
COPY=1
[s0998dnz@m1.hdp22 ~]$

[s0998dnz@m1.hdp22 ~]$ hadoop fs -ls hdfs://HDPTSTHA/user/s0998dnz/
Found 19 items
drwxr-x---   - s0998dnz hdfs          0 2017-04-11 04:38 hdfs://HDPTSTHA/user/s0998dnz/.Trash
drwxr-x---   - s0998dnz hdfs          0 2016-01-21 00:59 hdfs://HDPTSTHA/user/s0998dnz/.hiveJars
drwxr-xr-x   - s0998dnz hdfs          0 2016-04-29 07:13 hdfs://HDPTSTHA/user/s0998dnz/.sparkStaging
drwx------   - s0998dnz hdfs          0 2016-11-21 02:42 hdfs://HDPTSTHA/user/s0998dnz/.staging
drwxr-x---   - s0998dnz hdfs          0 2015-12-28 12:06 hdfs://HDPTSTHA/user/s0998dnz/_sqoop
drwxr-x---   - s0998dnz hdfs          0 2016-12-01 02:42 hdfs://HDPTSTHA/user/s0998dnz/custTable_orc_sql
drwxr-x---   - s0998dnz hdfs          0 2016-11-29 05:27 hdfs://HDPTSTHA/user/s0998dnz/export_sql_new
drwxr-x---   - s0998dnz hdfs          0 2016-11-30 02:09 hdfs://HDPTSTHA/user/s0998dnz/export_table
drwxrwxrwx   - s0998dnz hdfs          0 2016-04-14 02:42 hdfs://HDPTSTHA/user/s0998dnz/falcon
-rw-r--r--   3 s0998dnz hdfs         34 2016-05-03 05:30 hdfs://HDPTSTHA/user/s0998dnz/file.txt
drwxr-x---   - s0998dnz hdfs          0 2016-11-17 06:00 hdfs://HDPTSTHA/user/s0998dnz/import_sql
-rw-r-----   3 s0998dnz hdfs        144 2016-10-24 06:18 hdfs://HDPTSTHA/user/s0998dnz/input.txt
drwxr-x---   - s0998dnz hdfs          0 2017-01-24 02:36 hdfs://HDPTSTHA/user/s0998dnz/oozie-oozi
drwxr-xr-x   - s0998dnz hdfs          0 2016-05-03 05:31 hdfs://HDPTSTHA/user/s0998dnz/oozie-scripts
drwxr-xr-x   - s0998dnz hdfs          0 2016-07-01 06:25 hdfs://HDPTSTHA/user/s0998dnz/pigOut1
-rw-r-----   3 s0998dnz hdfs         46 2017-04-11 05:04 hdfs://HDPTSTHA/user/s0998dnz/sampleTest.txt
drwxr-x---   - hdfs     hdfs          0 2015-12-28 07:41 hdfs://HDPTSTHA/user/s0998dnz/sanity_test
drwxr-x---   - s0998dnz hdfs          0 2016-04-14 06:55 hdfs://HDPTSTHA/user/s0998dnz/summary_table
-rw-r-----   3 s0998dnz hdfs        991 2016-11-16 08:53 hdfs://HDPTSTHA/user/s0998dnz/workflow.xml

Post your feedback, experience, or questions if you have any.