Big Data

Every digital process and social media exchange produces it!

Hadoop

Businesses can start thinking big again when it comes to Hadoop!

Hadoop Admin

You can’t lead your troops if your troops do not trust you!

Latest Blog

Do you think file format matters in Big Data technology?

Yes, it matters a lot, for the following main reasons. By using the correct file format for your use case you can achieve the following. 1. Less storage: if we select a proper file format with a compatible compression technique, it requires less storage. 2. Faster processing of data: based on our use case, if
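As a quick illustration of the storage point, here is a minimal, hypothetical Hive sketch (table names are made up; the warehouse path assumes a default HDP layout) that stores the same data as ORC with ZLIB compression and compares the on-disk footprint:

hive> CREATE TABLE sales_orc (id INT, amount DOUBLE) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
hive> INSERT INTO TABLE sales_orc SELECT id, amount FROM sales_text;
$ hdfs dfs -du -s -h /apps/hive/warehouse/sales_orc /apps/hive/warehouse/sales_text   # the ORC copy is usually much smaller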

Read More

Install and configure Spark History Server (SHS) on Kubernetes (K8s)

We often struggle with how to install and configure SHS on Kubernetes with a GCS event log, so here is your solution. Create a shs-gcs.yaml deployment file, which will be used to deploy the SHS service. pvc: enablePVC: false existingClaimName: nfs-pvc eventsDir: "/" nfs: enableExampleNFS: false pvName: nfs-pv pvcName: nfs-pvc gcs: enableGCS: true secret: history-secrets key:

Read More

Install Airflow on your local MacBook

****************************** Step 1 ***************************** Create a new airflow directory anywhere on your laptop (base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % cd ~/Documents (base) saurabhkumar@Saurabhs-MacBook-Pro Documents % mkdir airflow-tutorial (base) saurabhkumar@Saurabhs-MacBook-Pro Documents % cd airflow-tutorial ************************** Step 2 ******************************* Create a Python virtual env (base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % conda create --name airflow-tutorial1 python=3.7 Collecting package metadata (current_repodata.json): done

Read More

Google Container Registry (GCR) with Minikube or K8s

When you use Google Container Registry (GCR) and see the dreaded ImagePullBackOff status on your pods in Minikube/K8s, this article can help you solve that error. Error: (base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl describe pod airflow-postgres-694899d6fd-lqp2c -n airflow Events: Type Reason Age From Message —- —— —- —- ——- Normal Scheduled 56s default-scheduler

Read More

Insert overwrite query Failed with exception Unable to move source

If you have explicitly set hive.exec.stagingdir to some location like /tmp/ or elsewhere, then whenever you run an insert overwrite statement you will get the following error. ERROR exec.Task (SessionState.java:printError(989)) – Failed with exception Unable to move source hdfs://clustername/apps/finance/nest/nest_audit_log_final/.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000 to destination hdfs://clustername/apps/finance/nest/nest_audit_log_final Example: INSERT OVERWRITE TABLE nest.nest_audit_log_final SELECT project_name , application , module_seq_num ,

Read More

last access time of a table is showing zero

If you have many hundreds or thousands of tables and you want to know when your Hive table was last accessed, you can run the following MySQL query under the hive database. mysql> use hive; mysql> select TBL_NAME,LAST_ACCESS_TIME from TBLS where DB_ID=<db_id>; +—————————————————————————————————-+——————+ | TBL_NAME | LAST_ACCESS_TIME | +—————————————————————————————————-+——————+ | df_nov_4 | 0 |
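If you do not know the DB_ID up front, a variant of the same metastore query can look it up by database name (a sketch against the standard TBLS/DBS metastore tables; verify the schema on your metastore version):

mysql> use hive;
mysql> SELECT d.NAME, t.TBL_NAME, t.LAST_ACCESS_TIME FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID WHERE d.NAME = 'default';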

Read More

kill hive query where application id was not created

Sometimes when you run Hive queries, the query does not launch an application or gets hung due to resource constraints or some other reason. In this case you have to kill the query to resubmit it. Please use the following steps to kill the Hive query itself. hive> select * from table1; Query ID = mapr_201804547_2ad87f0f5627

Read More

Purging history/old data in oozie database

After some period of time your Oozie DB will grow large, and it may start throwing space issues or showing slowness during Oozie UI loads. There are some properties which will help you purge your Oozie data, but sometimes the Oozie purge service does not function as expected. It results in a

Read More

Attempt to add *.jar multiple times to the distributed cache

When we submit a Spark2 action via Oozie, we may see the following exception in the logs and the job will fail: exception: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache. java.lang.IllegalArgumentException: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache. The above error occurs because the same jar file exists in both (/user/oozie/share/lib/lib_20171129113304/oozie/ and /user/oozie/share/lib/lib_20171129113304/spark2/) the
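A common workaround (a hedged sketch, not necessarily this article's exact fix) is to drop the duplicate jar from one of the two sharelib directories and refresh the sharelib; the paths below are taken from the error above, and the Oozie host is a placeholder:

$ hadoop fs -rm /user/oozie/share/lib/lib_20171129113304/spark2/aws-java-sdk-core-1.10.6.jar
$ oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate   # reload the sharelib without restarting Oozie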

Read More

hive jdbc in zeppelin throwing permission error to anonymous user

When users run a Hive query in Zeppelin via the JDBC interpreter, it runs as an anonymous user rather than the actual user. INFO [2017-11-02 03:18:20,405] ({pool-2-thread-2} RemoteInterpreter.java[pushAngularObjectRegistryToRemote]:546) – Push local angular object registry from ZeppelinServer to remote interpreter group 2CNQZ1ES5:shared_process WARN [2017-11-02 03:18:21,825] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2058) – Job 20171031-075630_2029577092 is finished, status: ERROR, exception: null, result:

Read More

Namenode may keep crashing due to excessive logging

The NameNode may keep crashing even if you restart all services and have enough heap, and you see the following errors in the logs. java.io.IOException: IPC’s epoch 197 is less than the last promised epoch 198 or 2017-09-28 09:16:11,371 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(851)) – Local service NameNode at m1.hdp22 entered state: SERVICE_NOT_RESPONDING Root Cause: In my case

Read More

ERROR : Failed with exception org.apache.hadoop.security.AccessControlException: Permission denied. user=user1 is not the owner of inode=test_copy_1

Users may complain that they are not able to load data into Hive tables via Beeline. While loading data into a Hive table using load data inpath ‘/tmp/test’ into table sampledb.sample1, they get the following error: load data inpath ‘/tmp/test’ into table adodevdb.sample1; INFO : Loading data to table adodevdb.sample1 from hdfs://m1.hdp22/tmp/test ERROR : Failed with
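Since the error says user1 is not the owner of the inode, one typical first check (a sketch, not necessarily the article's full resolution) is to make the loading user the owner of the source path before re-running the load:

$ hdfs dfs -chown -R user1:hdfs /tmp/test
$ # then re-run: load data inpath '/tmp/test' into table adodevdb.sample1;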

Read More

Select returns no rows with the MR execution engine but returns rows with Tez via Beeline

After setting hive.execution.engine=mr, a select statement returns no rows in Beeline, but when I run it with Tez it returns results. 0: jdbc:hive2://m1.hdp22:10001/default> select * from test_db.table1 limit 25; +————————+————————-+————————-+—————————+—————————+—————————+————————-+————————-+————————-+——————————-+————————-+–+ | cus_id  | prx_nme  | fir_nme  | mid_1_nme  | mid_2_nme  | mid_3_nme 

Read More

Knox is not starting, failing with error Gateway SSL Certificate is Expired

When you try to start Knox and it fails with the following error, don’t worry; this article will help you solve the problem. INFO hadoop.gateway (JettySSLService.java: logAndValidateCertificate(122)) – The Gateway SSL certificate is valid between:  FATAL hadoop.gateway (GatewayServer.java:main (120)) – Failed to start gateway: org.apache.hadoop.gateway.services. ServiceLifecycleException: Gateway SSL Certificate is Expired. Root cause: It

Read More

Hive metastore critical alerts with ExecutionFailed: Execution of ‘export HIVE_CONF_DIR=’/usr/hdp/current/hive-metastore/conf

When you install and configure Atlas, you may see the following alert in the Ambari Hive service. Once you check the alert details, you will see the following error: Metastore on m1.hdp22 failed (Traceback (most recent call last): File “/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py”, line 200, in execute timeout_kill_strategy=TerminateStrategy.KILL_PROCESS_TREE, File “/usr/lib/python2.6/site-packages/resource_management/core/base.py”, line 155, in __init__ self.env.run() File “/usr/lib/python2.6/site-packages/resource_management/core/environment.py”,

Read More

Sqoop import is failing after enabling atlas with ERROR security.InMemoryJAASConfiguration: Unable to add JAAS configuration

When you run a Sqoop import against Teradata or MySQL/Oracle, it might fail after installing and enabling Atlas in your cluster with the following error. 17/08/10 04:31:56 ERROR security.InMemoryJAASConfiguration: Unable to add JAAS configuration for client [KafkaClient] as it is missing param [atlas.jaas.KafkaClient.loginModuleName]. Skipping JAAS config for [KafkaClient] 17/08/10 04:31:58 INFO checking on the exit code

Read More

/usr/hdp/2.6.1.0-129/atlas/hook-bin/import-hive.sh is failing with Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Bytes

When you have installed Atlas on top of your cluster and you want to sync your Hive data to Atlas via the following method, you may see the following error some time (~20-30 minutes) after running your command. [hive@m1.hdp22 ~]$ export HADOOP_CLASSPATH=`hadoop classpath` [hive@m1.hdp22 ~]$ export HIVE_CONF_DIR=/etc/hive/conf [hive@m1.hdp22 ~]$ /usr/hdp/2.6.1.0-129/atlas/hook-bin/import-hive.sh Using Hive configuration directory [/etc/hive/conf] Log file for

Read More

Spark job runs successfully in client mode but fails in cluster mode

You may build a PySpark application which runs successfully in both local and yarn-client modes; however, when you try to run it in cluster mode, you may receive the following errors: Error 1: Exception: (“You must build Spark with Hive. Export ‘SPARK_HIVE=true’ and run build/sbt assembly”, Py4JJavaError(u’An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n’, JavaObject id=o52))

Read More

Unable to view OS host information in the Ambari dashboard (No Data Available)

On the Ambari dashboard, the Memory Usage, Network Usage, CPU Usage and Cluster Load information are missing. The dashboard displays the following error: Root Cause: This issue occurs when there are some temporary files present in the AMS collector folder. Solution: You need to stop the AMS service via Ambari and then remove all temp files.

Read More

Beeline java.lang.OutOfMemoryError: Requested array size exceeds VM limit

When we run Beeline jobs very heavily, we can sometimes see the following error: Root Cause: By default, the history file is located under ~/.beeline/history for the user facing this issue, and Beeline will load the latest 500 rows into memory. If those queries are very big, containing lots of characters, it
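A quick hedged workaround while troubleshooting is to archive (or truncate) the oversized history file before launching Beeline; the JDBC URL below is illustrative:

$ mv ~/.beeline/history ~/.beeline/history.bak    # or: > ~/.beeline/history to truncate in place
$ beeline -u "jdbc:hive2://<hs2-host>:10000/default"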

Read More

Run all service checks in bulk

In this blog I try to explain how you can use the Ambari API to trigger all service checks with a single command. In order to check the status and stability of any service in your cluster you need to run the service checks that are included in Ambari. Usually each service provides its own
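For a single service, the request looks roughly like this (a sketch using the standard Ambari REST pattern; host, credentials and cluster name are placeholders), which presumably gets looped over all services:

$ curl -u admin:admin -H "X-Requested-By:ambari" -X POST \
  -d '{"RequestInfo":{"context":"HDFS Service Check","command":"HDFS_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"HDFS"}]}' \
  http://<ambari-host>:8080/api/v1/clusters/<cluster_name>/requests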

Read More

Enable Debug mode in beeline

Sometimes you have to troubleshoot a Beeline issue and wonder how to get into debug mode for the Beeline shell, as you can in Hive (-hiveconf hive.root.logger=DEBUG,console). The same is not going to work with Beeline, so don’t worry: the following steps will help you, and the good part is you do not need

Read More

hadoop cluster Benchmarking and Stress Testing

When we install a cluster, we should do some benchmarking or stress testing. In this article I explain the inbuilt TestDFSIO functionality, which will help you perform stress testing on your configured cluster. The Hadoop distribution comes with a number of benchmarks, which are bundled in hadoop-*test*.jar and hadoop-*examples*.jar. The TestDFSIO benchmark is
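A typical TestDFSIO write/read cycle looks like this (a sketch; the exact jar path and version vary by distribution, so adjust to your install):

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean   # remove the benchmark files afterwards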

Read More

Atlas Metadata Server error HTTP 503 response from http://localhost:21000/api/atlas/admin/status in 0.000s (HTTP Error 503: Service Unavailable)

In case you are not able to access your Atlas portal, or you see the following error in your browser or logs: HTTP 503 response from http://localhost:21000/api/atlas/admin/status in 0.000s (HTTP Error 503: Service Unavailable) Then please check the application.log file in /var/log/atlas, and if you see the following error in the logs then do not worry; following the given

Read More

extend your VirtualBox image size

When you first use your HDP sandbox in VirtualBox, by default it assigns 20GB of your hard disk to the sandbox. Later this may not be enough and you will want to extend the size. This article will help you extend your VirtualBox image size. Step 1: Right click

Read More

Could not create http connection to jdbc:hive2:HTTP Response code: 413 (state=08S01,code=0)

If you are using HiveServer2 in HTTP transport mode, the authentication information is sent as part of the HTTP headers. The above error occurs when the default buffer size is in effect and the HTTP header size is insufficient, particularly when Kerberos is used. This is a known issue and a bug (https://issues.apache.org/jira/browse/HIVE-11720) has been raised

Read More

Error: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions

If you try to connect to the Phoenix server from HBase, or you run some service checks, and you face the following error, do not worry; relax, as here you will find the solution to this problem. Error : SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/hdp/2.3.4.0-3485/phoenix/phoenix-4.4.0.2.3.4.0-3485-client.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in

Read More

Ambari is showing “Add Service Wizard in Progress” or “Move Master Wizard In Progress”

If you are using Ambari 2.4.1 or 2.4.2, you may see the following message on your Ambari page, and you will not get any “Service Action” option to restart or do anything to any service. Root Cause: If more than one Ambari admin user is present, and one of the admin users

Read More

java.lang.IllegalArgumentException: stream exceeds limit [2,048]

When we run an Oozie job with an SSH action and we use capture-output, it may fail with the following error. java.lang.IllegalArgumentException: stream exceeds limit [2,048] at org.apache.oozie.util.IOUtils.getReaderAsString(IOUtils.java:84) at org.apache.oozie.servlet.CallbackServlet.doPost(CallbackServlet.java:117) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at org.apache.oozie.servlet.JsonRestServlet.service(JsonRestServlet.java:304) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.oozie.servlet.HostnameFilter.doFilter(HostnameFilter.java:86) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)

Read More

hadoop snapshots

HDFS snapshots protect important enterprise data sets from user or application errors. HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system and are: To demonstrate the functionality of snapshots, we will create a directory in HDFS, will create

Read More

SSH action with Oozie

When you want to run your shell script via Oozie, the following article will help you do the job in an easy way. You need the following steps to set up an Oozie workflow using the ssh-action: 1. Configure job.properties Example: 2. Configure workflow.xml Example: 3. Write a sample sampletest.sh script Example: 4. Upload workflow.xml to ${appPath} defined in job.properties

Read More

How to remove header from csv during loading to hive

Sometimes we have a header in our data file and we do not want that header loaded into our Hive table, or we want to ignore the header; this article will help you. [saurkuma@m1 ~]$ cat sampledata.csv id,Name 1,Saurabh 2,Vishal 3,Jeba 4,Sonu Step 1: Create a table with table properties to ignore it. hive>
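The table property in question is likely skip.header.line.count (a standard Hive feature); a minimal sketch with the sample file above (the local path is assumed from the prompt shown):

hive> CREATE TABLE sample (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' TBLPROPERTIES ("skip.header.line.count"="1");
hive> LOAD DATA LOCAL INPATH '/home/saurkuma/sampledata.csv' INTO TABLE sample;
hive> SELECT * FROM sample;   -- the id,Name header row is skipped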

Read More

Insert date into hive tables shows null during select

When we create a table on files (CSV or any other format) and load data into the Hive table, we may see that select queries return NULL values. You can solve it in the following ways: [saurkuma@m1 ~]$ ll total 584 -rw-r--r-- 1 saurkuma saurkuma 591414 Mar 16 02:31 SalesData01.csv [saurkuma@m1

Read More

Useful Unix commands

Sometimes we need a user who can do everything on our server, as root does. So we may do the following: create a new user with the same privileges as root, or grant the same privileges to an existing user as root. Case 1: Let's say we need to add a new user and grant him root
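For Case 1, a minimal sketch on a RHEL/CentOS-style box (the username is illustrative; prefer visudo over editing /etc/sudoers directly):

# useradd newadmin
# passwd newadmin
# usermod -aG wheel newadmin      # or add via visudo: newadmin ALL=(ALL) ALL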

Read More

Oozie server failing with error “cannot load JDBC driver class ‘com.mysql.jdbc.Driver'”

Issue: The Oozie server is failing with the following error: FATAL Services:514 – SERVER[m2.hdp22] E0103: Could not load service classes, Cannot load JDBC driver class ‘com.mysql.jdbc.Driver’ org.apache.oozie.service.ServiceException: E0103: Could not load service classes, Cannot load JDBC driver class ‘com.mysql.jdbc.Driver’ at org.apache.oozie.service.Services.loadServices(Services.java:309) at org.apache.oozie.service.Services.init(Services.java:213) at org.apache.oozie.servlet.ServicesLoader.contextInitialized(ServicesLoader.java:46) at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4210) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4709) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:802) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:583)

Read More

script to kill yarn application if it is running more than x mins

Sometimes we have to get a list of all long-running applications and, based on a threshold, kill them; sometimes we need to do this for a specific YARN queue. In such situations the following script will help you do the job. [root@m1.hdp22~]$ vi kill_application_after_some_time.sh #!/bin/bash if [ "$#" -lt
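The core logic is along these lines (a hedged sketch, not the article's full script; the argument layout is hypothetical and the Start-Time parsing may need adjusting for your Hadoop version):

#!/bin/bash
queue=$1; threshold_min=$2
now_ms=$(date +%s%3N)                        # GNU date, epoch milliseconds
yarn application -list -appStates RUNNING 2>/dev/null | grep "$queue" | awk '{print $1}' | \
while read app; do
  start_ms=$(yarn application -status "$app" 2>/dev/null | awk -F' : ' '/Start-Time/{print $2}')
  age_min=$(( (now_ms - start_ms) / 60000 ))
  [ "$age_min" -gt "$threshold_min" ] && yarn application -kill "$app"
done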

Read More

Hive2 action with Oozie in kerberos Env

One of my friends was trying to run a simple hive2 action in their Oozie workflow and was getting an error. I decided to replicate it on my cluster, and finally got it working after some retries. If you have the same requirement, where you have to run Hive SQL via Oozie, then this article

Read More

Enable GUI for Centos 6 on top of command line

If you have installed CentOS 6.5 and you just have a terminal with a black background, and you want to enable a GUI, this article is for you. A desktop environment is not necessary for server usage, but sometimes installing or using an application requires a desktop environment; in that case, build the desktop environment

Read More

Encrypt Database and LDAP Passwords for Ambari-Server

By default the passwords to access the Ambari database and the LDAP server are stored in a plain text configuration file. To have those passwords encrypted, you need to run a special setup command. [root@m1 ~]# cd /etc/ambari-server/conf/ [root@m1 conf]# ls -ltrh total 52K -rw-r--r-- 1 root root 2.8K Mar 31  2015 ambari.properties.rpmsave.20161004015858 -rwxrwxrwx 1

Read More

Cannot retrieve repository metadata (repomd.xml) for repository

When you upgrade your HDP cluster through a satellite server or local repository, and you start your cluster via Ambari or add some new services to the cluster, you may see the following error. resource_management.core.exceptions.Fail: Execution of ‘/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector’ returned 1. Error: Cannot retrieve repository metadata (repomd.xml) for repository: HDP-2.3.0.0-2557.

Read More

Unable to initialize Falcon Client object. Cause : Could not authenticate, Authentication failed

If you upgrade to or install HDP 2.5.0 or later without first installing the Berkeley DB file, you will get the error “Unable to initialize Falcon Client object. Cause : Could not authenticate, Authentication failed”, or HTTP ERROR: 503 Problem accessing /index.html. Reason: SERVICE_UNAVAILABLE, or the Falcon UI is unavailable. From the Falcon logs: java.lang.RuntimeException: org.apache.falcon.FalconException: Unable

Read More

Enable logging for client connections and running queries with Phoenix Query Server

Phoenix Query Server (PQS) does not log details about client connections and the queries run at the default log level of INFO. You must modify the log4j configuration for certain classes to obtain such logs. To enable logging of such messages by PQS, perform the following: On the node that runs the PQS service, edit

Read More

Some helpful Tips

1. How to run a Hive query using yesterday's date: use from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd') in your Hive query. For example: select * from sample where date1=from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd'); 2. How to diff file(s) in HDFS. How to diff a file in HDFS and a file in the local filesystem: diff <(hadoop fs -cat /path/to/file) /path/to/localfile How to diff two files in HDFS: diff <(hadoop fs -cat /path/to/file1)

Read More

Run Pig Script in Nifi

NiFi can interface directly with Hive, HDFS, HBase, Flume and Phoenix, and it can also trigger Spark and Flink through Kafka and Site-to-Site. Sometimes I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for some smart processing. You can easily augment and

Read More

Exception in thread “main” org.apache.spark.SparkException: Application

When you run a Python script on top of Hive, it may fail with the following error: $ spark-submit --master yarn --deploy-mode cluster --queue ado --num-executors 60 --executor-memory 3G --executor-cores 5 --py-files argparse.py,load_iris_2.py --driver-memory 10G load_iris.py -p ado_secure.iris_places -s ado_secure.iris_places_stg -f /user/admin/iris/places/2016-11-30-place.csv Exception in thread “main” org.apache.spark.SparkException: Application application_1476997468030_142120 finished with failed status at org.apache.spark.deploy.yarn.Client.run(Client.scala:974)

Read More

HDFS disk space vs NameNode heap size

In HDFS, data and metadata are decoupled. Data files are split into block files that are stored, and replicated on DataNodes across the cluster. The filesystem namespace tree and associated metadata are stored on the NameNode. Namespace objects are file inodes and blocks that point to block files on the DataNodes. These namespace objects are

Read More

GC pool ‘PS MarkSweep’ had collection(s): count=6 time=26445ms

When you create a table while authorization is enforced using Ranger, the CREATE TABLE may fail, and after that the HiveServer2 process crashes. 0: jdbc:hive2://server1> CREATE EXTERNAL TABLE test (cust_id STRING, ACCOUNT_ID STRING, ROLE_ID STRING, ROLE_NAME STRING, START_DATE STRING, END_DATE STRING, PRIORITY STRING, ACTIVE_ACCOUNT_ROLE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED

Read More

Datanode doesn’t start with error “java.net.BindException: Address already in use”

In many real scenarios we have seen the error “java.net.BindException: Address already in use” when we start the DataNode. You can observe the following things during this issue: 1. The DataNode doesn’t start, with an error saying “address already in use”. 2. “netstat -anp | grep 50010” shows no result. ROOT CAUSE: There are 3 ports

Read More

Top Hadoop interview questions

1. What are the side data distribution techniques? Side data refers to extra static small data required by MapReduce to perform a job. The main challenge is the availability of side data on the node where the map will be executed. Hadoop provides two side data distribution techniques. Using the job configuration: an arbitrary key-value pair

Read More

Installing grafana and it is failing with resource_management.core.exceptions.Fail: Ambari Metrics Grafana data source creation failed. POST request status: 401 Unauthorized

When we do a fresh install of Grafana in Ambari 2.4 and start it, it may fail with the following error. stderr: /var/lib/ambari-agent/data/errors-14517.txt Traceback (most recent call last): File “/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/ metrics_grafana.py”, line 67, in <module> AmsGrafana().execute() File “/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py”, line 280, in execute method(env) File “/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py”, line 725, in restart self.start(env) File “/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/

Read More

Standby NameNode is failing and only one is running

The standby NameNode is unable to start up; or, once the standby NameNode is brought up, the active NameNode soon goes down, leaving only one live NameNode. The NameNode log shows: FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) – Error: flush failed for required journal (JournalAndStream(mgr=QJM to )) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. ROOT CAUSE:

Read More

Map side join in Hive

Many times we face a situation where we have very small tables in Hive, but queries against them take a long time. Here I am going to explain the map-side join and its advantages over the normal join operation in Hive. But before that, we should first understand the concept of

Read More

sql workbench connection to hadoop

Many times we do not want to run our Hive queries through Beeline or the Hive CLI, for many reasons. I am not going to discuss the reasons here, as that is a big debatable point; in this article I explain the steps to connect SQL Workbench to our Hadoop cluster. In this article

Read More

Hive Actions with Oozie

One of my friends was trying to run a Hive .hql file in their Oozie workflow and was getting an error. I decided to replicate it on my cluster, and finally got it working after some retries. If you have the same requirement, where you have to run Hive SQL via Oozie, then this article will help

Read More

Process xml file via apache pig

If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique and captures all of the content between a start and end tag and supplies it as a single bytearray field in a Pig tuple.

Read More

Process xml file via mapreduce

When you have a requirement to process data via Hadoop that is not in a default input format, this article will help you. Hadoop provides default input formats like TextInputFormat, NLineInputFormat, KeyValueInputFormat etc.; when you get a different type of file for processing, you have to create your own custom input format using

Read More

How to use Hive Query result in a variable for other query

Many times we want to store one query's result in a variable and then use that variable in some other query. This is possible in your favorite Hadoop ecosystem tool, i.e. Hive, and with the help of this article you can achieve it. [root@m1 etc]# hive 16/10/04 02:40:45 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist

Read More

“INSERT OVERWRITE” functional details

If the OVERWRITE keyword is used, the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred to by filepath will be added to the table. Note that if the target table (or partition) already has a file whose name collides with

Read More

Ranger admin install fails with “007-updateBlankPolicyName.sql import failed”

If you see the following error during the Ranger install, there is no need to worry, as you can solve it with just one step. 2016-03-18 16:10:44,048 [JISQL] /usr/jdk64/jdk1.8.0_60/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/hdp/current/ranger-admin/jisql/lib/* org.apache.util.sql.Jisql -driver mysqlconj -cstring jdbc:mysql://mysqldb/ranger -u ‘user’ -p ‘********’ -noheader -trim -c \; -input /usr/hdp/current/ranger-admin/db/mysql/patches/007-updateBlankPolicyName.sql Resolution: SET GLOBAL log_bin_trust_function_creators = 1, then reinstall the Ranger service.

Read More

Enable ‘Job Error Log’ in oozie

In the Oozie UI, ‘Job Error Log’ is a tab which was introduced in HDP v2.3 on Oozie v4.2. By default it is disabled, and with the help of the following steps you can enable it.

Read More

After upgrading ambari it is not coming up (hostcomponentdesiredstate.admin_state)

If you upgrade Ambari and you see the following error, you should not worry; the following steps will help you bring your cluster back into a running state. Issue: Once you upgrade your cluster, if after restarting you don’t see any services or their metrics in Ambari, then you need the following steps. You

Read More

Hadoop Archive Files – HAR

Hadoop archive files, or HAR files, are a facility to pack HDFS files into archives. This is the best option for storing a large number of small files in HDFS, as storing a large number of small files directly in HDFS is not very efficient. The advantage of HAR files is that these files can be

Read More

Falcon MQ log files location

Sometimes we see that Falcon uses 90-100% of / space, as shown in the following example. [user1@server localhost]$ du -sh /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB 67M     /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB [users1@server localhost]$ du -sh /hadoop/falcon/embeddedmq/data/localhost/KahaDB/ 849M   /hadoop/falcon/embeddedmq/data/localhost/KahaDB/ This is because we have installed Falcon in embedded mode and have set falcon.embeddedmq.data to that location. The Falcon server starts an embedded ActiveMQ whenever we

Read More

Pig script with HCatLoader on Hive ORC table

Sometimes we have to run pig commands on Hive ORC tables; this article will help you do that. Step 1: First create a Hive ORC table: hive> CREATE TABLE ORC_Table(COL1 BIGINT, COL2 STRING) CLUSTERED BY (COL1) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC TBLPROPERTIES ('TRANSACTIONAL'='TRUE'); Step 2:

Read More

hive date time issue

Many times when we load data into Hive tables, if we have a date & time field in our data, we may see an issue when reading that field back. To solve this issue I have created this article and explained the steps in detail. I have the following sample input file (a.txt) a,20-11-2015

Read More

Compression in Hadoop

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop. 1.

Read More

Change default permission of hive database

When you create a database or internal tables in the Hive CLI, by default they are created with 777 permissions. Even if you have a umask in HDFS, they will still get the same permissions. But you can change this with the help of the following steps. 1. From the command line on the Ambari server node, edit

Read More

Update your Capacity Scheduler through REST API

Sometimes you want to change your Capacity Scheduler through the REST API, or you have a requirement to change your Capacity Scheduler configuration frequently via some script; this article will help you do that. You can achieve it via the following command. [root@sandbox conf.server]# curl -v -u admin:admin -H "Content-Type: application/json" -H "X-Requested-By:ambari" -X PUT

Read More

Start Namenode manually

Sometimes we do not want to start all HDFS services at once, or we just want to start only the NN, DN or SNN via the command line; this article will help you do this in a very simple manner. 1. Kill the current operation if a NameNode startup is already in progress from Ambari. 2. set hadoop.root.logger=DEBUG,console

Read More

Enable Debug mode for hive in Ambari

Many times during troubleshooting we do not find much information with just the default logger. So no worries: I will guide you on how to enable debug mode in the logs or on your console. Case 1: Use the following command to start Hive: set the following property to turn on debug mode

Read More

How to integrate Ambari with ldap

By default, Ambari uses an internal database as the user store for authentication and authorization. If you wish to add LDAP external authentication for Ambari Web, you need to make some edits to the Ambari properties file. Collect the following information: ldap.primaryUrl=<ldap_server_name>:389 ldap.useSSL=false ldap.usernameAttribute=sAMAccountName ldap.baseDn=cn=Users,dc=<sreach_dir>,dc=com ldap.bindAnonymously=false ldap.managerDn=cn=ambari,cn=users,dc=<sreach_dir>,dc=com ldap.managerPassword=/etc/ambari-server/conf/ldap-password.dat ldap.userObjectClass=user ldap.groupObjectClass=group ldap.groupMembershipAttr=memberOf ldap.groupNamingAttr=cn ldap.referral=ignore
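After collecting these values, the setup itself is driven by the standard Ambari commands (a sketch; run them on the Ambari server host):

# ambari-server setup-ldap        # prompts for the properties collected above
# ambari-server restart
# ambari-server sync-ldap --all   # import LDAP users and groups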

Read More

Check high CPU Intensive process on your server

When you start utilizing your cluster heavily, you may encounter 100% CPU utilization on a specific server. As you may have many jobs and processes running on that server at the time, it can be very tough to identify the culprit process causing the issue. It is like finding a
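As a starting point, these standard commands list the heaviest CPU consumers first:

$ ps -eo pid,user,pcpu,etime,args --sort=-pcpu | head -15   # top CPU processes with runtime
$ top -b -n 1 | head -20                                    # one-shot snapshot, handy for logging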

Read More

Tez job fails with ‘vertex failure’ error

When you run your Hive job on the Tez execution engine, you may see the job fail due to a ‘vertex failure’ error, or you may see the following error in your logs. Vertex failed, vertexName=Reducer 34, vertexId=vertex_1424999265634_0222_1_23, diagnostics=[Task failed, taskId=task_1424999265634_01422_1_23_000008, diagnostics=[AttemptID:attempt_1424999265634_01422_1_23_000008_0 Info:Error: java.lang.RuntimeException: java.lang.RuntimeException: Reduce operator initialization failed  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:188) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307) at org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:564) at java.security.AccessController.doPrivileged(Native Method) at

Read More

heap size issue in Hive Metastore

Sometimes while your jobs are running you may see failures due to heap size; it might be a metastore heap issue. The metastore is encountering OutOfMemory errors, or is known to be insufficient to handle the cluster workload. Resolution: To fix this issue you have to increase the heap size for the metastore in hive-env.sh (or hive-env.cmd)

Read More

Hadoop Admin most lovable commands

If you are working on Hadoop and you want to know about your cluster, or you want to control it, the following commands should be handy. In this article I have tried to explain a few commands which will help you a lot in your day-to-day work. hdfs dfsadmin
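A few standard HDFS/YARN CLI commands of this kind, for quick reference:

$ hdfs dfsadmin -report          # capacity and live/dead DataNodes
$ hdfs dfsadmin -safemode get    # check (or enter/leave) safemode
$ hdfs fsck / -files -blocks     # filesystem health
$ yarn node -list                # NodeManager status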

Read More

Rack Awareness on Hadoop

If you have a Hadoop cluster of more than 30-40 nodes, it is better to configure it with rack awareness, because communication between two DataNodes on the same rack is more efficient than between two nodes on different racks. It also helps us improve network traffic while reading/writing HDFS files; the NameNode

Read More

Namenode installation issue

When you install HDP and something goes wrong during installation with the HDFS components (like the NameNode), you may see the following errors. File “/usr/lib/python2.6/site-packages/resource_management/core/shell.py”, line 140, in _call_wrapper result = _call(command, **kwargs_copy) File “/usr/lib/python2.6/site-packages/resource_management/core/shell.py”, line 291, in _call raise Fail(err_msg) resource_management.core.exceptions.Fail: Execution of ‘yes Y | hdfs --config /usr/hdp/current/hadoop-client/conf namenode -format’ returned 127. /usr/hdp/current/hadoop-client/bin/hdfs: line 18:

Read More

How to debug distcp jobs

Sometimes when you run distcp jobs on the cluster and you see failures or performance issues, you want to debug them; you can do so using the following commands. To turn on debug mode at the job level, issue the following command before executing the distcp job: To turn on debug mode at the mapper level,
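For reference, the two levels typically look like this (a sketch; source and destination paths are placeholders):

$ export HADOOP_ROOT_LOGGER=DEBUG,console                                        # client/job level
$ hadoop distcp -Dmapreduce.map.log.level=DEBUG hdfs://src/path hdfs://dst/path  # mapper level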

Read More

How to check contents of a JAR file

Many times we have to check which packages and classes are included in a jar file, but because it is a black box (just a simple jar) we have trouble checking. With the help of the following commands you can check it. jar tf <PATH_TO_JAR> But if you are looking for a specific class or package
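For example (jar names and class names are illustrative):

$ jar tf myapp.jar                                   # list everything in the jar
$ jar tf myapp.jar | grep -i 'hbase/util/Bytes'      # look for a specific class
$ for j in /usr/hdp/current/hadoop-client/lib/*.jar; do jar tf "$j" | grep -q 'SomeClass' && echo "$j"; done   # find which jar ships a class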

Read More

If you delete /hdp/apps/ dir from hdfs

There are situations when, unfortunately and unknowingly, you delete /hdp/apps/2.3.4.0-3485 with skipTrash; then you will be in trouble and other services will be impacted. You will not be able to run hive, mapreduce or sqoop commands, and you will get the following error. [root@m1 ranger-hdfs-plugin]# hadoop fs -rmr -skipTrash /hdp/apps/2.3.4.0-3485 rmr: DEPRECATED: Please use ‘rm -r’ instead. Deleted /hdp/apps/2.3.4.0-3485 So

Read More

Application Timeline Server (ATS) issue error code: 500, message: Internal Server Error

I have seen an issue with the Application Timeline Server (ATS). The ATS uses a LevelDB database, which is stored in the location specified by yarn.timeline-service.leveldb-timeline-store.path in yarn-site.xml. All metadata is stored in *.sst files under the specified location. Due to this we may face a space issue, but it is not good practice to delete *.sst files directly. An *.sst file is a

Read More

Real time use cases of Hadoop

As data continues to grow, businesses now have access to (or generate) more data than ever before, much of which goes unused. How can you turn this data into a competitive advantage? In this article, we explore different ways businesses are capitalizing on data. We keep hearing statistics about the growth of data. For instance: Data

Read More

How to change knox heap size

Sometimes, due to heavy load, you may have a requirement to increase your Knox JVM size to handle more requests and respond in time. In that case you can change your Knox JVM size in the following way: go to /usr/hdp/current/knox-server/bin/gateway.sh and search for the APP_MEM_OPTS string. Once you find it, you can change
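The change itself is a one-liner (heap values are illustrative; pick sizes appropriate to your load):

APP_MEM_OPTS="-Xms2g -Xmx4g"    # in /usr/hdp/current/knox-server/bin/gateway.sh

Then restart Knox, e.g. via Ambari, for the new JVM options to take effect.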

Read More

Analyze your jobs running on top of Tez

Sometimes we have to analyze our jobs to tune them or to prepare reports. We can use the following method to get the running time for each step of a job on the Tez execution engine. You can achieve it by setting the hive.tez.exec.print.summary=true property. hive> select count(*) from cars_beeline; Query ID = s0998dnz_20160711080520_e282c377-5607-4cf4-bcda-bd7010918f9c Total

Read More

Import & Export in Hive

When we work on Hive, there are lots of scenarios where we need to move data (i.e. tables) from one cluster to another. For example, sometimes we need to copy a production table from one cluster to another. We now have very good functionality in Hive which gives us two

Read More

Ambari shows all services down though hadoop services running

We have seen many times that our Hadoop services are up and running, but when we open Ambari it shows them all as down. Basically this means the services do not have any issue; it is a problem with the ambari-agent. The Ambari server typically learns about service availability from the Ambari agent, and using the

Read More

How to enable debug logging for HDFS

I have seen many times that an error does not give a clear picture of the issue and can mislead us, and we waste a lot of time investigating it. I have found that enabling debug mode is an easy way to troubleshoot any Hadoop problem, as it gives us a detailed
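The quickest client-side switch is the standard HADOOP_ROOT_LOGGER variable:

$ export HADOOP_ROOT_LOGGER=DEBUG,console
$ hdfs dfs -ls /tmp     # the client now prints DEBUG-level detail to the console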

Read More

Backup and Restore of Postgres Database

How to back up a Postgres database. 1. Back up a single Postgres database. This example backs up the erp database that belongs to user geekstuff to the file mydb.sql: $ pg_dump -U geekstuff erp -f mydb.sql It prompts for a password; after authentication, mydb.sql is created with create table, alter table and copy commands for all the tables in
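For completeness, the matching restore (and an all-databases backup) use the standard PostgreSQL tools:

$ psql -U geekstuff -d erp -f mydb.sql   # restore the dump into an existing database
$ pg_dumpall > alldb.sql                 # back up every database in the cluster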

Read More

Hive Cross Cluster replication

Hive cross-cluster replication: here I try to explain cross-cluster replication with a Feed entity. This is a simple way to enforce disaster recovery policies or aggregate data from multiple clusters to a single cluster for enterprise reporting. To further illustrate Apache Falcon’s capabilities, we will use an HCatalog/Hive table as the Feed entity. Step 1:

Read More

How to read compressed data from hdfs through hadoop command

Sometimes we have a requirement to read compressed data from HDFS through the hdfs command, and there are many compression algorithms (.gz, .snappy, .lzo, .bz2 etc.). I have tried to explain how we can achieve this with the following steps: Step 1: Copy any compressed file to your HDFS
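For gzip/bzip2 the simplest approach is hdfs dfs -text, which picks a codec from the file extension and decompresses on the fly (snappy/LZO need the codec installed on the client); file names are illustrative:

$ hdfs dfs -put sample.txt.gz /tmp/
$ hdfs dfs -text /tmp/sample.txt.gz | head    # decompressed output; compare with -cat, which prints raw bytes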

Read More

How do I change an existing Ambari DB Postgres to MySQL?

By default, when you configure your Ambari server it runs on a Postgres database. If after some time you need to change it to your organization's preferred DB (like MySQL), you need the following steps. Step 1: Stop your Ambari server and then take a backup of the Postgres ambari DB (the default password

Read More

Error: java.io.IOException: java.lang.RuntimeException: serious problem (state=,code=0)

If you run your Hive query on ORC tables in HDP 2.3.4, you may encounter this issue; it occurs because ORC split generation runs on a global threadpool and doAs is not propagated to that threadpool. Threads in the threadpool are created on demand at execute time and thus execute as random users that

Read More

Ranger User sync does not work due to ERROR UserGroupSync [UnixUserSyncThread]

If we have enabled AD/LDAP user sync in Ranger and we get the error below, we need to follow the given steps to resolve it. LdapUserGroupBuilder [UnixUserSyncThread] – Updating user count: 148, userName:, groupList: [test, groups] 09 Jun 2016 09:04:34 ERROR UserGroupSync [UnixUserSyncThread] – Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details:

Read More

How to enable Node Label in your cluster

Node labels: here we describe how to use node labels to run YARN or other applications on cluster nodes that have a specified node label. Node labels can be set as exclusive or shareable: Exclusive— access is restricted to applications running in queues associated with the node label. Shareable— if idle capacity is available on the labeled node, resources are
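Once node labels are enabled in yarn-site.xml, the labels themselves are managed with yarn rmadmin (a sketch; label and host names are made up):

$ yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=true),ssd(exclusive=false)"
$ yarn rmadmin -replaceLabelsOnNode "node1.example.com=gpu"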

Read More

Run pig script though Oozie

If you have a requirement to read some file through Pig and you want to schedule your Pig script via Oozie, this article will help you do the job. Step 1: First create a directory inside HDFS (under your home directory is good). $ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest Step 2:

Read More

How To Set Up Master Slave Replication in MySQL

MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having it copied automatically from a master to a slave database. This can be helpful for many reasons, including facilitating a backup of the data, providing a way to analyze it without using the main database, or simply as

Read More

HDFS balancer fails after every 30 minutes when you run it through Ambari

There is a bug in Ambari 2.2.0: whenever you run the balancer through Ambari and it has to balance many TBs of data, it fails after 30 minutes due to a timeout. You can see the following error in your logs: resource_management.core.exceptions.Fail: Execution of ‘ambari-sudo.sh su hdfs -l -s /bin/bash -c ‘export PATH=’”‘”‘/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/usr/hdp/current/hadoop-client/bin’”‘”‘ ; hdfs

Read More

Encrypt password used by Sqoop to import or export data from database.

Sqoop has developed a lot and become a very popular, much-loved tool in the Hadoop ecosystem. When we import or export data from a database through Sqoop, we have to give the password on the command line or in a file. I feel this is not a fully secure way to

Read More

Distcp between High Availability enabled cluster

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine. This impacted the

Read More

How to fix corrupted or under replicated blocks issue

To find out whether the HDFS filesystem has corrupt blocks, and to fix them, we can use the steps below: [hdfs@m1 ~]$ hadoop fsck / or [hdfs@m1 ~]$ hadoop fsck hdfs://192.168.56.41:50070/ If you see any corrupt or missing blocks at the end of the output, like below: Total size: 4396621856 B (Total open files
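Useful follow-up commands once fsck reports problems (standard HDFS CLI; paths are placeholders):

[hdfs@m1 ~]$ hdfs fsck / -list-corruptfileblocks               # list files with corrupt blocks
[hdfs@m1 ~]$ hdfs fsck /path/to/file -files -blocks -locations # inspect one file in detail
[hdfs@m1 ~]$ hdfs fsck / -delete                               # last resort: delete corrupt files
[hdfs@m1 ~]$ hadoop fs -setrep 3 /path/to/file                 # re-replicate an under-replicated file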

Read More

How to disable ‘Kill Application’ button in Resource Manager web UI

In the Resource Manager UI there is an option to kill an application, and because of that all users can kill jobs. If you want to disable it, use the following steps: log in to Ambari and go to the YARN Configs page. Search for yarn.resourcemanager.webapp.ui-actions.enabled. If it exists, change the value to false. If it does not exist, clear

Read More

Rolling Upgrade HDP 2.2 to HDP 2.3

Use this procedure to perform a rolling upgrade from HDP 2.2 to HDP 2.3. It is highly recommended that you validate these steps in a test environment to adjust and account for any special configurations for your cluster. Before upgrading to HDP 2.3, you must first upgrade to Ambari 2.1. Make sure Ambari is upgraded and the cluster is

Read More

Why Learn Big Data and Hadoop?

In my experience, people who do things in their career that they are excited about and have a passion for, can go farther and faster with the self-motivation than if they did something that they didn’t like, but felt like they needed to do it for other reasons.  You are awesome in already taking initiative

Read More

How to start learning hadoop

The easiest way to get started with Hadoop is the Sandbox with VMware Player or VirtualBox. It is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. The Sandbox includes many of the most exciting developments from the latest CDH/HDP distribution, packaged up in a virtual environment. You can start working on Hadoop

Read More

Hello world!

Welcome to hadoopadmin.co.in. This is the welcome post.

Read More
