Category Archives: Bigdata


How do I change an existing Ambari DB Postgres to MySQL?

Category : Ambari , Bigdata

By default, the Ambari server runs on a PostgreSQL database. If at some point you need to move it to the database your organization prefers (such as MySQL), follow the steps below.

Step 1: Stop your Ambari server and take a backup of the Postgres Ambari DB (the default password is 'bigdata'):

$ ambari-server stop

$ pg_dump -U ambari ambari > /temp/ambari.sql

Step 2: Set up MySQL on one of your nodes, then install the MySQL JDBC connector on the Ambari server host:

$ yum install mysql-connector-java

Step 3: Confirm that the .jar is in the Java share directory and make sure it has the appropriate permissions (644):

$ ls /usr/share/java/mysql-connector-java.jar
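Optionally, you can also point Ambari at the driver explicitly; recent Ambari versions support registering it during setup (flag names may differ slightly by version, so treat this as a sketch):

$ ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar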

Step 4: Create a user for Ambari and grant it permissions.

For example, using the MySQL database admin utility:

# mysql -u root -p

CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';

GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'%';

CREATE USER 'ambari'@'localhost' IDENTIFIED BY 'bigdata';

GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'localhost';

CREATE USER 'ambari'@'hdpm1.com' IDENTIFIED BY 'bigdata';

GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'hdpm1.com';

FLUSH PRIVILEGES;

Step 5: Now load/restore the Ambari Server database schema into MySQL.

mysql -u ambari -p

CREATE DATABASE ambaridb;

USE ambaridb;

SOURCE /temp/ambari.sql; (the backup taken from Postgres in Step 1)
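As a quick sanity check before reconfiguring Ambari, you can confirm that the tables were restored (still inside the same mysql session); the exact table names depend on your Ambari version, so this is just an illustration:

SHOW TABLES;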

Step 6: Finally, update the Ambari server configuration to reference the MySQL instance:

  • On the ambari-server node, run the Ambari setup:

$ ambari-server setup

Enter advanced database configuration [y/n] (n)?

Choose "y", then follow the prompts to set the database to MySQL (option 3), pointing it at the ambaridb database and the ambari user created above. Do not change any other settings during this setup.

Once setup is complete for the MySQL instance, you can start Ambari:

$ ambari-server start
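To double-check what Ambari is now pointing at, the JDBC settings live in ambari.properties (property names can vary between Ambari versions, so treat the grep pattern as a rough filter):

$ grep jdbc /etc/ambari-server/conf/ambari.properties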

So now you have successfully migrated the Ambari database from Postgres to MySQL.



Error: java.io.IOException: java.lang.RuntimeException: serious problem (state=,code=0)

Category : Bigdata

If you run Hive queries on ORC tables in HDP 2.3.4, you may hit this issue. It happens because ORC split generation runs on a global thread pool and doAs is not propagated to that pool; threads in the pool are created on demand at execute time and therefore execute as whichever random users happened to be active at the time.

This is a known issue and is fixed by HIVE-13120 (https://issues.apache.org/jira/browse/HIVE-13120).

Intermittently, ODBC users get an error saying that another user does not have permissions on the table; HiveServer2 appears to check permissions against the wrong user. For example, if you run a job as user 'user1', the error message looks something like this:

WARN [HiveServer2-Handler-Pool: Thread-587]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) - Error fetching results:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: serious problem
Caused by: java.io.IOException: java.lang.RuntimeException: serious problem

Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1059)

Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode="/apps/hive/warehouse/xixitb":xixi:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)

Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=haha, access=READ_EXECUTE, inode="/apps/hive/warehouse/xixitb":xixi:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
Note that user 'haha' is not querying 'xixitb' at all.

Resolution:
As a workaround, set the following property at run time to turn off the local fetch task for HiveServer2:

set hive.fetch.task.conversion=none

0: jdbc:hive2://localhost:8443/default> set hive.fetch.task.conversion=none;

No rows affected (0.033 seconds)

0: jdbc:hive2://localhost:8443/default> select * from database1.table1 where lct_nbr=2413 and ivo_nbr in (17469,18630);

INFO  : Tez session hasn’t been created yet. Opening session

INFO  : Dag name: select * from ldatabase1.table1…(17469,18630)(Stage-1)

INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id application_1462173172032_65644)

INFO  : Map 1: -/-

INFO  : Map 1: 0/44

INFO  : Map 1: 0(+1)/44
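If you want this workaround to apply to every session instead of setting it per query, the same property can be added to hive-site.xml (for example through the Hive configs in Ambari); a minimal sketch:

<property>
     <name>hive.fetch.task.conversion</name>
     <value>none</value>
</property>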

Please feel free to give any feedback or suggestion for any improvements.



Ranger User sync does not work due to ERROR UserGroupSync [UnixUserSyncThread]

Category : Bigdata

If AD/LDAP user sync is enabled in Ranger and you see the error below, follow these steps to resolve it.

LdapUserGroupBuilder [UnixUserSyncThread] – Updating user count: 148, userName:, groupList: [test, groups]
09 Jun 2016 09:04:34 ERROR UserGroupSync [UnixUserSyncThread] – Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details:
javax.naming.PartialResultException: Unprocessed Continuation Reference(s); remaining name 'dc=companyName,dc=com'
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2866)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2840)
at com.sun.jndi.ldap.LdapNamingEnumeration.getNextBatch(LdapNamingEnumeration.java:147)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMoreImpl(LdapNamingEnumeration.java:216)
at com.sun.jndi.ldap.LdapNamingEnumeration.hasMore(LdapNamingEnumeration.java:189)
at org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder.updateSink(LdapUserGroupBuilder.java:318)
at org.apache.ranger.usergroupsync.UserGroupSync.run(UserGroupSync.java:58)
at java.lang.Thread.run(Thread.java:745)

Root Cause: When Ranger usersync is configured with ranger.usersync.ldap.referral = ignore, the LDAP search fails prematurely when it encounters additional referrals.

Resolution:

  1. Change the search base to dc=companyName,dc=com from cn=Users,dc=companyName,dc=com.
  2. Change ranger.usersync.ldap.referral from ignore to follow (see the sketch below).
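For reference, a minimal sketch of how these two settings might look in the Ranger usersync configuration (the search-base property name here is an assumption and can differ between Ranger versions, so verify it in your ugsync configuration):

ranger.usersync.ldap.searchBase = dc=companyName,dc=com
ranger.usersync.ldap.referral = follow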

These two changes resolve the above issue. I hope this helped you solve yours easily.

Please feel free to give feedback or suggestion for any improvements.



How to enable Node Label in your cluster

Node Label:

Here we describe how to use node labels to run YARN applications on cluster nodes that have a specified node label. Node labels can be set as exclusive or shareable:

  • Exclusive: access is restricted to applications running in queues associated with the node label.
  • Shareable: if idle capacity is available on the labeled node, resources are shared with all applications in the cluster.

Note: (Queues without Node Labels) If no node label is assigned to a queue, the applications submitted by the queue can run on any node without a node label, and on nodes with shareable node labels if idle resources are available.

Preemption: Labeled applications that request labeled resources preempt non-labeled applications on labeled nodes. If a labeled resource is not explicitly requested, the normal rules of preemption apply. Non-labeled applications cannot preempt labeled applications running on labeled nodes.

Configuring Node Labels: To enable node labels, make the following configuration changes on the YARN Resource Manager hosts.

Step 1: Create a Label Directory in HDFS:

Use the following commands to create a “node-labels” directory in which to store the node labels in HDFS.

$ sudo su hdfs

$ hadoop fs -mkdir -p /yarn/node-labels

$ hadoop fs -chown -R yarn:yarn /yarn

$ hadoop fs -chmod -R 700 /yarn

 

Note:  -chmod -R 700 specifies that only the yarn user can access the “node-labels” directory.

You can then use the following command to confirm that the directory was created in HDFS.

$ hadoop fs -ls /yarn

The new node-labels directory should appear in the listing. The owner should be yarn, and the permissions should be drwx------:

Found 1 items

drwx------   - yarn yarn          0 2014-11-24 13:09 /yarn/node-labels

Use the following commands to create a /user/yarn directory that is required by the distributed shell.

$ hadoop fs -mkdir -p /user/yarn

$ hadoop fs -chown -R yarn:yarn /user/yarn

$ hadoop fs -chmod -R 700 /user/yarn

 

Step 2 : Configure YARN for Node Labels

Add the following properties to the /etc/hadoop/conf/yarn-site.xml file on the ResourceManager host.

Set the following property to enable node labels:

<property>

     <name>yarn.node-labels.enabled</name>

     <value>true</value>

</property>

Set the following property to reference the HDFS node label directory:

<property>

     <name>yarn.node-labels.fs-store.root-dir</name>

     <value>hdfs://<namenode-host>:8020/yarn/node-labels/</value>

</property>

 

Step 3: Start or Restart the YARN ResourceManager : In order for the configuration changes in the yarn-site.xml file to take effect, you must stop and restart the YARN ResourceManager if it is running, or start the ResourceManager if it is not running. To start or stop the ResourceManager, use the applicable commands in the “Controlling HDP Services Manually” section of the HDP Reference Guide.

Step 4: Add Node Labels : Use the following command format to add node labels. You should run these commands as the yarn user. Node labels must be added before they can be assigned to nodes and associated with queues.

$ sudo su yarn

$ yarn rmadmin -addToClusterNodeLabels "<label1>(exclusive=<true|false>),<label2>(exclusive=<true|false>)"

Note: If exclusive is not specified, the default value is true. For example, the following commands add the node label spark_nl1 as exclusive, and spark_nl2 and spark_nl3 as shareable (non-exclusive).

$ sudo su yarn

$ yarn rmadmin -addToClusterNodeLabels "spark_nl1(exclusive=true),spark_nl2(exclusive=false),spark_nl3(exclusive=false)"

 

You can use the yarn cluster --list-node-labels command to confirm that node labels have been added:

[yarn@localhost ~]$ yarn cluster --list-node-labels

16/04/27 03:19:01 INFO impl.TimelineClientImpl: Timeline service address: http://localhost:8188/ws/v1/timeline/

16/04/27 03:19:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

Node Labels: <spark_nl1:exclusivity=true>,<spark_nl2:exclusivity=false>,<spark_nl3:exclusivity=false>

 

Note: You can use the following command format to remove node labels. You cannot remove a node label if it is associated with a queue:

$ yarn rmadmin -removeFromClusterNodeLabels "<label1>,<label2>"

 

Step 5 : Assign Node Labels to Cluster Nodes : 

Use the following command format to add or replace node label assignments on cluster nodes:

$ yarn rmadmin -replaceLabelsOnNode "<node1>:<port>=<label1> <node2>:<port>=<label2>"

For example, the following commands assign node label spark_nl1 to localhost1 and spark_nl2 to localhost2; the last command then replaces the label on localhost1 with spark_nl3.

$ sudo su yarn

$ yarn rmadmin -replaceLabelsOnNode "localhost1=spark_nl1 localhost2=spark_nl2"

$ yarn rmadmin -replaceLabelsOnNode "localhost1=spark_nl3"

 

Note: You can only assign one node label to each node. Also, if you do not specify a port, the node label change will be applied to all NodeManagers on the host.
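For example, a hedged sketch of labeling a single NodeManager by including its port (the port shown is just a placeholder for your NodeManager address port):

$ yarn rmadmin -replaceLabelsOnNode "localhost1:45454=spark_nl1"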

To remove node label assignments from a node, use -replaceLabelsOnNode, but do not specify any labels. For example, you would use the following commands to remove the label from localhost1:

$ sudo su yarn

$ yarn rmadmin -replaceLabelsOnNode "localhost1"

 

Step 6: Associating Node Labels with Queues: Now that we have created node labels, we can associate them with queues in the /etc/hadoop/conf/capacity-scheduler.xml file.

You must specify capacity on each node label of each queue, and also ensure that the sum of capacities of each node-label of direct children of a parent queue at every level is equal to 100%. Node labels that a queue can access (accessible node labels of a queue) must be the same as, or a subset of, the accessible node labels of its parent queue.

 Example:

Assume that a cluster has a total of 8 nodes. The first 3 nodes (n1-n3) have node label=x, the next 3 nodes (n4-n6) have node label=y, and the final 2 nodes (n7, n8) do not have any node labels. Each node can run 10 containers.

The queue hierarchy is as follows:

[Queue hierarchy diagram: root has two child queues, batch and user; the user queue has child queues ado, adop, and di]

 

The batch queue can access node labels x and y; the user queue can access only node label y.

capacity(batch) = 30, capacity(batch, label=x) = 100, capacity(batch, label=y) = 50; capacity(user) =40, capacity(user, label=y) = 50

ado, adop and di: capacity(user.ado) = 40, capacity(user.ado, label=x) = 40, capacity(user.adop) = 60, capacity(user.adop, label=x) = 40, capacity(user.di) = 20, capacity(user.di, label=x) = 10
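For reference, the batch queue part of this layout might look roughly like the following in capacity-scheduler.xml (property names follow the standard yarn.scheduler.capacity.<queue-path> pattern; treat this as a sketch, not a complete scheduler configuration):

<property>
     <name>yarn.scheduler.capacity.root.batch.capacity</name>
     <value>30</value>
</property>
<property>
     <name>yarn.scheduler.capacity.root.batch.accessible-node-labels</name>
     <value>x,y</value>
</property>
<property>
     <name>yarn.scheduler.capacity.root.batch.accessible-node-labels.x.capacity</name>
     <value>100</value>
</property>
<property>
     <name>yarn.scheduler.capacity.root.batch.accessible-node-labels.y.capacity</name>
     <value>50</value>
</property>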

In this way you can configure Node label with Capacity scheduler queues.

Test cases: You can use the sample jobs below to test your node labels and Capacity Scheduler queues.

Example 1: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue batch /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

Example 2: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue ado /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

Example 3: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue adospark /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

Example 4: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue di /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

Example 5: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue dispark --num-executors 5 --conf spark.executor.memory=5g --conf spark.driver.memory=2g --conf spark.driver.cores=2 --conf spark.executor.cores=2 --conf spark.executor.instances=2 /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

Example 6: /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --queue ea --num-executors 5 /usr/hdp/2.3.4.0-3485/spark/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar 10

 

I hope this helps you understand node labels and the Capacity Scheduler. If you have any suggestions or feedback, please feel free to write to me.



Run a Pig script through Oozie

Category : Bigdata

If you have a requirement to read a file through Pig and want to schedule your Pig script via Oozie, this article will help you do that.

Step 1: First create a directory inside HDFS (under your home directory):

$ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest

Step 2: Create your workflow.xml and job.properties:

$ vi job.properties

#*************************************************

#  job.properties

#*************************************************

nameNode=hdfs://HDPTSTHA

jobTracker=<RM_Server>:8050

queueName=default

oozie.libpath=${nameNode}/user/oozie/share/lib

oozie.use.system.libpath=true

oozie.wf.rerun.failnodes=true

examplesRoot=oozie-scripts

examplesRootDir=/user/${user.name}/${examplesRoot}

appPath=${examplesRootDir}/PigTest

oozie.wf.application.path=${appPath}

$ vi workflow.xml

<!--******************************************-->

<!--workflow.xml                              -->

<!--******************************************-->

<workflow-app name="WorkFlowForPigAction" xmlns="uri:oozie:workflow:0.1">

<start to="pigAction"/>

<action name="pigAction">

        <pig>

            <job-tracker>${jobTracker}</job-tracker>

            <name-node>${nameNode}</name-node>

            <prepare>

                <delete path="${nameNode}/${examplesRootDir}/PigTest/temp"/>

            </prepare>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

                <property>

                    <name>mapred.compress.map.output</name>

                    <value>true</value>

                </property>

            </configuration>

            <script>pig_script_file.pig</script>

          </pig>

        <ok to="end"/>

        <error to="end"/>

    </action>

<end name="end"/>

</workflow-app>
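Before uploading, you can optionally sanity-check the workflow XML with the Oozie client's validate sub-command (available in recent Oozie versions; options may vary):

$ oozie validate workflow.xml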

Step 3: Create your pig script :

$ cat pig_script_file.pig

lines = LOAD '/user/demouser/file.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

wordcount = FOREACH grouped GENERATE group, COUNT(words);

store wordcount into '/user/demouser/pigOut1';

-- DUMP wordcount;
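If the input data is also available on the local filesystem, you can optionally dry-run the script in Pig's local mode first (you would need to point the LOAD/STORE paths at local files for this to work):

$ pig -x local pig_script_file.pig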

Step 4: Now copy your workflow.xml and pig script to your hdfs location:

$ hadoop fs -put pig_script_file.pig /user/demouser/oozie-scripts/PigTest/

$ hadoop fs -put workflow.xml /user/demouser/oozie-scripts/PigTest/

Step 5: Now you can schedule or submit the Oozie job:

$ oozie job -oozie http://<Oozie_server>:11000/oozie -config job.properties -run
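The submit command prints a job ID, which you can use to track the workflow, for example:

$ oozie job -oozie http://<Oozie_server>:11000/oozie -info <job_id>

$ oozie job -oozie http://<Oozie_server>:11000/oozie -log <job_id>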

Now you can see your output in hdfs :

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -ls /user/demouser/pigOut1

Found 2 items

-rw-r--r--   demouser hdfs          0 2016-05-03 05:36 /user/demouser/pigOut1/_SUCCESS

-rw-r--r--   demouser hdfs         30 2016-05-03 05:36 /user/demouser/pigOut1/part-r-00000

[demouser@<nameNode_server> pig_oozie_demo]$ hadoop fs -cat /user/demouser/pigOut1/part-r-00000

pig 4

test 1

oozie 1

sample 1

Conclusion: I hope this article helps you. If you have any feedback or questions, please feel free to write to me in the comments.



How To Set Up Master Slave Replication in MySQL

MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having them copied automatically from a master to a slave database. This can be helpful for many reasons, including facilitating a backup of the data, providing a way to analyze it without touching the main database, or simply as a means to scale out.

This tutorial covers a very simple example of MySQL replication: one master sends information to a single slave. For the process to work you will need two IP addresses: one for the master server and one for the slave.

This section explains how to configure MySQL replication with the Master-Slave model (M-S); the same approach is the basis for the Master-Master model in Active-Passive mode (M-M-A-P).

  • A lot of the following instructions assume that you are using InnoDB for all your MySQL tables. If you are using some other storage engine, other changes might be required and it would be best to verify with a DBA. InnoDB is the recommended storage engine for storing Hive metadata.
  • Assume we want to use server-A.my.example.com as the primary master and server-B.my.example.com as the slave (or secondary passive master in the M-M-A-P case). Let’s say we have installed Hive so as to use server-A as its metastore database, as one would when installing with the HDP installer, and we’ve simply installed MySQL (using yum or otherwise) on server-B, but done nothing else.

Master-Slave Replication

  1. On server-A.my.example.com (the master), shut down the mysql server process if it is running.

           /etc/init.d/mysqld stop

  2. Edit the my.cnf file with the following values (vi /etc/my.cnf):

      log_bin=mysql-bin

          binlog_format=ROW

          server_id=10

         #innodb_flush_logs_at_trx_commit=1

          innodb_support_xa=1

  3. Bring up the mysql server process again. This step may vary based on your installation; in a typical RHEL setup, you can use the system service startup for mysql as follows:

        /etc/init.d/mysqld start

  4. To verify that the server is now logging bin-logs, use the SQL command SHOW MASTER STATUS; it should show you a binlog filename and a position.
  5. Make sure that your current user is able to do a dump of the MySQL database, by running the following as a mysql-root-capable user (for example, on a default installation, root):

 GRANT ALL ON *.* TO 'root'@'server-A.my.example.com';

Also, create a replication user that will be used to conduct future replications:

 GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'replication'@'server-B.my.example.com' IDENTIFIED BY 'r3plpwd';

  6. Run mysqldump to dump all tables from the master and load them onto the slave as follows:

 mysqldump -u root -p --single-transaction --all-databases --master-data=1 > dump.out

  7. Copy this dump.out file over to server-B.

         scp dump.out w1.hdp22:

Then, on server-B.my.example.com:

  1. Shut down the mysql server process if it is running.

        /etc/init.d/mysqld stop

  2. Edit the my.cnf file with the following values (vi /etc/my.cnf):

     log_bin = mysql-bin

       binlog_format = ROW

       server_id = 11

       relay_log = mysql-relay-bin

       log_slave_updates = 1

       read_only = 1

  3. Bring up the mysql server process.

       /etc/init.d/mysqld start

  4. Make sure that your current user is able to load the prepared dump of the MySQL database, by running the following as a mysql-root-capable user (for example, on a default installation, root):

mysql> create user 'root'@'w1.hdp22' identified by 'hadoop';

         grant all on *.* to 'root'@'w1.hdp22';

 

  5. Load the dump that was produced by mysqldump by running the following:

 mysql -u root -p < dump.out

  • Verify that the metastore database was transferred over by running a 'SHOW DATABASES' call on MySQL.
  • Look through the MySQL dump file and locate the values for MASTER_LOG_FILE and MASTER_LOG_POS. We will need to specify these to start replication on the slave. Assuming the file was 'mysql-bin.000001' and the position was 4507, run the following to copy new entries from the master:

mysql> CHANGE MASTER TO MASTER_HOST='m2.hdp22', MASTER_USER='replication', MASTER_PASSWORD='hadoop', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4507;

Note that these values can also be obtained from the master by running ‘SHOW MASTER STATUS’ on it.

  • Restart the mysql server.

 /etc/init.d/mysqld restart

  • Check that the replication is correctly configured by running

SHOW SLAVE STATUS;

Note: It should show the values set previously for Master_User and Master_Host. If the slave is caught up to the master, Seconds_Behind_Master will show 0.
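A trimmed example of what a healthy slave typically reports (exact output varies by MySQL version):

mysql> SHOW SLAVE STATUS\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0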

 

And with that, you now have M-S replication set up.



HDFS balancer fails after 30 minutes when you run it through Ambari

There is a bug in Ambari 2.2.0: whenever you run the balancer through Ambari and it has many TBs of data to balance, it fails after 30 minutes due to a timeout.

You will see an error like the following in your logs:

resource_management.core.exceptions.Fail: Execution of ‘ambari-sudo.sh su hdfs -l -s /bin/bash -c ‘export PATH='”‘”‘/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/usr/hdp/current/hadoop-client/bin'”‘”‘ ; hdfs –config /usr/hdp/current/hadoop-client/conf balancer -threshold 10” returned 252. 16/03/08 08:42:03 INFO balancer.Balancer: Using a threshold of 10.0
16/03/08 08:42:03 INFO balancer.Balancer: namenodes = [hdfs://HDPDEVHA]
16/03/08 08:42:03 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
16/03/08 08:42:03 INFO balancer.Balancer: included nodes = []
16/03/08 08:42:03 INFO balancer.Balancer: excluded nodes = []
16/03/08 08:42:03 INFO balancer.Balancer: source nodes = []
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
16/03/08 08:42:04 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/03/08 08:42:04 INFO block.BlockTokenSecretManager: Setting block keys
16/03/08 08:42:04 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running.. Exiting …
Mar 8, 2016 8:42:04 AM Balancing took 1.27 seconds
Last login: Tue Mar 8 08:12:09 EST 2016

 

To resolve this error, you have to increase the agent task timeout in the ambari.properties file on the Ambari server node, then restart the Ambari server and run the balancer from Ambari again.

$ vi /etc/ambari-server/conf/ambari.properties

agent.task.timeout=7200
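After saving the change, restart the Ambari server so the new timeout takes effect, and then trigger the balancer from Ambari again:

$ ambari-server restart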

 



Encrypt password used by Sqoop to import or export data from database.

Sqoop has developed a lot and become a very popular tool across the Hadoop ecosystem. When we import or export data from a database through Sqoop, we have to supply the password either on the command line or in a file, and neither is a fully secure way to handle a password. In this post, I will cover ways to pass database passwords to Sqoop in a simple and more secure way.

The following are the common ways to pass database passwords to Sqoop:

Option 1: We can give password in command itself.

sqoop import --connect jdbc:mysql://servera.com:3306/bigdb --username dbuser --password <password> --table sqoop_import -m 1 --target-dir /user/root/sanity_test/ --append

Option 2: We can keep the password in a file and pass that file in the command.

sqoop import --connect jdbc:mysql://servera.com:3306/bigdb --username dbuser --password-file /home/root/.mysql.password --table sqoop_import -m 1 --target-dir /user/root/sanity_test/ --append

However, storing password in a text file is still considered not secure even though we have set the permissions.
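If you do use the password-file approach, the file should at least be readable only by its owner; a minimal sketch (note that, depending on your Sqoop version, --password-file may resolve the path against HDFS, in which case you would create and chmod it there with the equivalent hadoop fs commands instead):

$ echo -n '<db_password>' > /home/root/.mysql.password

$ chmod 400 /home/root/.mysql.password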

As of Sqoop version 1.4.5, Sqoop supports storing passwords in a Java KeyStore (JKS), where they are kept encrypted, so you do not need to store passwords in clear text in a file.

Option 3: Generate a credential alias in a Java KeyStore:

Note: At the prompt, enter the password that will be used to access the database.

[root@m1.hdp22 ~]$ hadoop credential create mydb.password.alias -provider jceks://hdfs/user/root/mysql.password.jceks
Enter password:
Enter password again:
mydb.password.alias has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

## Now you can use the mydb.password.alias alias in your sqoop command like below:
## Add -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks to the command and replace the password option as follows:

sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks --connect jdbc:mysql://server:3306/sqoop --username sqoop --password-alias mydb.password.alias --table Test --delete-target-dir --target-dir /apps/hive/warehouse/test --split-by name

## or
sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/mysql.password.jceks --connect 'jdbc:mysql://servera.com:3306/bigdb' --table sqoop_import --username dbuser --password-alias mydb.password.alias -m 1 --target-dir /user/root/sanity_test/ --append

This way the password is hidden inside jceks://hdfs/user/root/mysql.password.jceks and no one can see it in clear text.

Note: the “mydb.password.alias” is the alias that we can use to pass to Sqoop when running the command, so that no password is needed.

I hope you enjoyed reading this article and that it helps you protect your passwords. Please give your valuable feedback.



Distcp between High Availability enabled cluster

Category : Bigdata

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This impacted the total availability of the HDFS cluster in two major ways:

  • In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
  • Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.

The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its MapReduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

$ hadoop distcp hdfs://nn1:8020/user/test/sample.txt hdfs://nn2:8020/user/test/

The easiest way to use distcp between two HA clusters would be to identify the current active NameNode and run distcp like you would with two clusters without HA:

$ hadoop distcp hdfs://<cluster1-active-nn>:8020/user/test/sample.txt hdfs://<cluster2-active-nn>:8020/user/test/

But identifying the active NameNode is not easy, and it is not practical when automating a batch job. The alternative is to configure the client with both nameservice IDs and make it aware of how to identify the active NameNode of both clusters.

Note: You need to add the following HA properties for both clusters to hdfs-site.xml on your source cluster. For example, if you want to copy data from cluster1 to cluster2, you change hdfs-site.xml on cluster1 and add cluster2's HA properties to that file.

  1. Both nameservice ID values should be added to the following property:
    <property>
    <name>dfs.nameservices</name>
    <value>serviceId1,serviceId2</value>
    </property>
  2. Add the following properties for cluster2 (serviceId2) to cluster1's hdfs-site.xml:

<!– serviceId2 properties –>

<property>

    <name>dfs.client.failover.proxy.provider.serviceId2</name>

    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

  </property>

  <property>

    <name>dfs.ha.namenodes.serviceId2</name>

    <value>nn1,nn2</value>

  </property>

  <property>

    <name>dfs.namenode.rpc-address.serviceId2.nn1</name>

    <value>m1.hdp22:8020</value>

  </property>

  <property>

    <name>dfs.namenode.servicerpc-address.serviceId2.nn1</name>

    <value>m1.hdp22:54321</value>

  </property>

  <property>

    <name>dfs.namenode.http-address.serviceId2.nn1</name>

    <value>m1.hdp22:50070</value>

  </property>

  <property>

    <name>dfs.namenode.https-address.serviceId2.nn1</name>

    <value>m1.hdp22:50470</value>

  </property>

  <property>

    <name>dfs.namenode.rpc-address.serviceId2.nn2</name>

    <value>m2.hdp22:8020</value>

  </property>

  <property>

    <name>dfs.namenode.servicerpc-address.serviceId2.nn2</name>

    <value>m2.hdp22:54321</value>

  </property>

  <property>

    <name>dfs.namenode.http-address.serviceId2.nn2</name>

    <value>m2.hdp22:50070</value>

  </property>

  <property>

    <name>dfs.namenode.https-address.serviceId2.nn2</name>

    <value>m2.hdp22:50470</value>

</property>

[s0998dnz@m1.hdp22 ~]$ hadoop distcp hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt hdfs://HDPTSTHA/user/s0998dnz/
17/04/11 05:04:37 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile=’null’, copyStrategy=’uniformsize’, preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs://HDPINFHA/user/s0998dnz/sampleTest.txt], targetPath=hdfs://HDPTSTHA/user/s0998dnz, targetPathExists=true, filtersFile=’null’}
17/04/11 05:04:38 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:38 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
17/04/11 05:04:38 INFO tools.SimpleCopyListing: Build file listing completed.
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:38 INFO tools.DistCp: Number of paths in the copy list: 1
17/04/11 05:04:39 INFO impl.TimelineClientImpl: Timeline service address: http://m1.hdp22:8188/ws/v1/timeline/
17/04/11 05:04:39 INFO client.AHSProxy: Connecting to Application History server at m1.hdp22/127.0.0.1:10200
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: number of splits:1
17/04/11 05:04:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491850038017_0011
17/04/11 05:04:39 INFO impl.YarnClientImpl: Submitted application application_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: The url to track the job: http://m1.hdp22:8088/proxy/application_1491850038017_0011/
17/04/11 05:04:39 INFO tools.DistCp: DistCp job-id: job_1491850038017_0011
17/04/11 05:04:39 INFO mapreduce.Job: Running job: job_1491850038017_0011
17/04/11 05:04:46 INFO mapreduce.Job: Job job_1491850038017_0011 running in uber mode : false
17/04/11 05:04:46 INFO mapreduce.Job:  map 0% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job:  map 100% reduce 0%
17/04/11 05:04:52 INFO mapreduce.Job: Job job_1491850038017_0011 completed successfully
17/04/11 05:04:53 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=153385
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=408
HDFS: Number of bytes written=46
HDFS: Number of read operations=17
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3837
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3837
Total vcore-milliseconds taken by all map tasks=3837
Total megabyte-milliseconds taken by all map tasks=3929088
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=118
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=99
CPU time spent (ms)=1350
Physical memory (bytes) snapshot=283774976
Virtual memory (bytes) snapshot=3770925056
Total committed heap usage (bytes)=372244480
File Input Format Counters
Bytes Read=244
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=46
BYTESEXPECTED=46
COPY=1
[s0998dnz@m1.hdp22 ~]$

[s0998dnz@m1.hdp22 ~]$ hadoop fs -ls hdfs://HDPTSTHA/user/s0998dnz/
Found 19 items
drwxr-x—   – s0998dnz hdfs          0 2017-04-11 04:38 hdfs://HDPTSTHA/user/s0998dnz/.Trash
drwxr-x—   – s0998dnz hdfs          0 2016-01-21 00:59 hdfs://HDPTSTHA/user/s0998dnz/.hiveJars
drwxr-xr-x   – s0998dnz hdfs          0 2016-04-29 07:13 hdfs://HDPTSTHA/user/s0998dnz/.sparkStaging
drwx——   – s0998dnz hdfs          0 2016-11-21 02:42 hdfs://HDPTSTHA/user/s0998dnz/.staging
drwxr-x—   – s0998dnz hdfs          0 2015-12-28 12:06 hdfs://HDPTSTHA/user/s0998dnz/_sqoop
drwxr-x—   – s0998dnz hdfs          0 2016-12-01 02:42 hdfs://HDPTSTHA/user/s0998dnz/custTable_orc_sql
drwxr-x—   – s0998dnz hdfs          0 2016-11-29 05:27 hdfs://HDPTSTHA/user/s0998dnz/export_sql_new
drwxr-x—   – s0998dnz hdfs          0 2016-11-30 02:09 hdfs://HDPTSTHA/user/s0998dnz/export_table
drwxrwxrwx   – s0998dnz hdfs          0 2016-04-14 02:42 hdfs://HDPTSTHA/user/s0998dnz/falcon
-rw-r–r–   3 s0998dnz hdfs         34 2016-05-03 05:30 hdfs://HDPTSTHA/user/s0998dnz/file.txt
drwxr-x—   – s0998dnz hdfs          0 2016-11-17 06:00 hdfs://HDPTSTHA/user/s0998dnz/import_sql
-rw-r—–   3 s0998dnz hdfs        144 2016-10-24 06:18 hdfs://HDPTSTHA/user/s0998dnz/input.txt
drwxr-x—   – s0998dnz hdfs          0 2017-01-24 02:36 hdfs://HDPTSTHA/user/s0998dnz/oozie-oozi
drwxr-xr-x   – s0998dnz hdfs          0 2016-05-03 05:31 hdfs://HDPTSTHA/user/s0998dnz/oozie-scripts
drwxr-xr-x   – s0998dnz hdfs          0 2016-07-01 06:25 hdfs://HDPTSTHA/user/s0998dnz/pigOut1
-rw-r—–   3 s0998dnz hdfs         46 2017-04-11 05:04 hdfs://HDPTSTHA/user/s0998dnz/sampleTest.txt
drwxr-x—   – hdfs     hdfs          0 2015-12-28 07:41 hdfs://HDPTSTHA/user/s0998dnz/sanity_test
drwxr-x—   – s0998dnz hdfs          0 2016-04-14 06:55 hdfs://HDPTSTHA/user/s0998dnz/summary_table
-rw-r—–   3 s0998dnz hdfs        991 2016-11-16 08:53 hdfs://HDPTSTHA/user/s0998dnz/workflow.xml

Post your feedback, experiences, or questions if you have any.



How to fix corrupted or under replicated blocks issue

Category : Bigdata

To find out whether the Hadoop HDFS filesystem has corrupt blocks, and to fix them, we can use the steps below:

[hdfs@m1 ~]$ hadoop fsck /

or

[hdfs@m1 ~]$ hadoop fsck hdfs://192.168.56.41:8020/

The end of the fsck output summarizes corrupt, missing, and under-replicated blocks, for example:

Total size: 4396621856 B (Total open files size: 249 B)

Total dirs: 11535

Total files: 841

Total symlinks: 0 (Files currently being written: 4)

Total blocks (validated): 844 (avg. block size 5209267 B) (Total open file blocks (not validated): 3)

Minimally replicated blocks: 844 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 2 (0.23696682 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 3

Average block replication: 3.0

Corrupt blocks: 0

Missing replicas: 14 (0.5498822 %)

Number of data-nodes: 3

Number of racks: 1

FSCK ended at Mon Feb 22 11:04:21 EST 2016 in 5505 milliseconds

The filesystem under path ‘/’ is HEALTHY

Next, find which files have missing or corrupted blocks with the following command:

hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica

This will list the affected files, and also files that might currently have under-replicated blocks (which isn't necessarily an issue). The output will include something like this for each affected file:
/path/to/filename.fileextension: CORRUPT blockpool BP-1016133662-10.29.100.41-1415825958975 block blk_1073904305
/path/to/filename.fileextension: MISSING 1 blocks of total size 15620361 B

Now determine the importance of the file: can it simply be removed, or does it contain data that needs to be regenerated?
If it can be removed, delete the corrupted file from your Hadoop cluster; this command moves the file to the trash:

[hdfs@m1 ~]$ hdfs dfs -rm /path/to/filename.ext
[hdfs@m1 ~]$ hdfs dfs -rm hdfs://ip.or.hostname.of.namenode:8020/path/to/filename.ext

If the data needs to be recovered instead, the first step is to gather information on the file's locations and blocks:
[hdfs@m1 ~]$ hdfs fsck /path/to/filename.fileextension -locations -blocks -files

Now you can track down the nodes where the corruption is. On those nodes, look through the logs and determine what the issue is: a replaced disk, I/O errors on the server, and so on.
If you can recover on that machine and bring the partition holding the blocks back online, the DataNode will report back to Hadoop and the file will be healthy again. If that isn't possible, you will unfortunately have to regenerate the data some other way.
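For the under-replicated blocks reported in the fsck summary (as opposed to corrupt ones), HDFS normally re-replicates them on its own over time; if you want to push a specific file along, you can reset its replication factor, for example:

[hdfs@m1 ~]$ hdfs dfs -setrep -w 3 /path/to/filename.ext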