"You can have data without information, but you cannot have information without Big data."
Every digital process and social media exchange produces it !
Business can start thinking big again when it comes with hadoop !
You can’t lead your troops if your troops do not trust you !
Yes, it matters a lot, for the following main reasons. By using the correct file format for your use case you can achieve the following: 1. Less storage: if we select a proper file format with a compatible compression technique, it requires less storage. 2. Faster processing of data: based on our use case, if …
We always struggle with how to install and configure the Spark History Server (SHS) on Kubernetes with a GCS event log, so here is your solution. Create a shs-gcs.yaml deployment file which will be used to deploy the SHS service. pvc: enablePVC: false existingClaimName: nfs-pvc eventsDir: "/" nfs: enableExampleNFS: false pvName: nfs-pv pvcName: nfs-pvc gcs: enableGCS: true secret: history-secrets key: …
****************************** Step 1 ***************************** Create a new airflow directory anywhere on your laptop (base) saurabhkumar@Saurabhs-MacBook-Pro spark-3.1.1-bin-hadoop2.7 % cd ~/Documents (base) saurabhkumar@Saurabhs-MacBook-Pro Documents % mkdir airflow-tutorial (base) saurabhkumar@Saurabhs-MacBook-Pro Documents % cd airflow-tutorial ************************** Step 2 ******************************* Create a Python virtual env (base) saurabhkumar@Saurabhs-MacBook-Pro airflow-tutorial % conda create --name airflow-tutorial1 python=3.7 Collecting package metadata (current_repodata.json): done
When you use Google Container Registry (GCR) and see the dreaded ImagePullBackOff status on your pods in minikube/K8s, this article can help you solve that error. Error: (base) saurabhkumar@Saurabhs-MacBook-Pro ~ % kubectl describe pod airflow-postgres-694899d6fd-lqp2c -n airflow Events: Type Reason Age From Message Normal Scheduled 56s default-scheduler …
If you have explicitly set hive.exec.stagingdir to some location like /tmp/ or another location, then whenever you run an INSERT OVERWRITE statement you will get the following error. ERROR exec.Task (SessionState.java:printError(989)) – Failed with exception Unable to move source hdfs://clustername/apps/finance/nest/nest_audit_log_final/.hive-staging_hive_2017-12-12_19-15-30_008_33149322272174981-1/-ext-10000 to destination hdfs://clustername/apps/finance/nest/nest_audit_log_final Example: INSERT OVERWRITE TABLE nest.nest_audit_log_final SELECT project_name, application, module_seq_num, …
If you have many hundreds or thousands of tables and you want to know when a Hive table was last accessed, you can run the following MySQL query against the Hive metastore database. mysql> use hive; mysql> select TBL_NAME, LAST_ACCESS_TIME from TBLS where DB_ID=<db_id>; Sample output: | TBL_NAME | LAST_ACCESS_TIME | | df_nov_4 | 0 | …
Sometimes when you run Hive queries, the query does not launch an application or gets hung due to some resource shortage or other reason. In this case you have to kill the query and resubmit it. Use the following steps to kill the Hive query itself. hive> select * from table1; Query ID = mapr_201804547_2ad87f0f5627 …
After some period of time your Oozie database will grow large, and it may start throwing space issues or slowness during Oozie UI load. There are some properties which will help you purge your Oozie data, but sometimes the Oozie purge service does not function as expected. It results in a …
When we submit a Spark2 action via Oozie we may see the following exception in the logs and the job will fail: exception: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache. java.lang.IllegalArgumentException: Attempt to add (hdfs://m1:8020/user/oozie/share/lib/lib_20171129113304/oozie/aws-java-sdk-core-1.10.6.jar) multiple times to the distributed cache. The above error occurs because the same jar file exists in both /user/oozie/share/lib/lib_20171129113304/oozie/ and /user/oozie/share/lib/lib_20171129113304/spark2/ …
When users run a Hive query in Zeppelin via the JDBC interpreter, it runs as an anonymous user rather than the actual user. INFO [2017-11-02 03:18:20,405] ({pool-2-thread-2} RemoteInterpreter.java[pushAngularObjectRegistryToRemote]:546) – Push local angular object registry from ZeppelinServer to remote interpreter group 2CNQZ1ES5:shared_process WARN [2017-11-02 03:18:21,825] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2058) – Job 20171031-075630_2029577092 is finished, status: ERROR, exception: null, result: …
The NameNode may keep crashing even if you restart all services and have enough heap size, and you see the following error in the logs. java.io.IOException: IPC's epoch 197 is less than the last promised epoch 198 or 2017-09-28 09:16:11,371 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(851)) – Local service NameNode at m1.hdp22 entered state: SERVICE_NOT_RESPONDING Root Cause: In my case …
Users may complain that they are not able to load data into Hive tables via Beeline. While loading data into a Hive table using load data inpath '/tmp/test' into table sampledb.sample1, they get the following error: load data inpath '/tmp/test' into table adodevdb.sample1; INFO : Loading data to table adodevdb.sample1 from hdfs://m1.hdp22/tmp/test ERROR : Failed with …
When I ran a select statement after setting set hive.execution.engine=mr;, select * from the table returned no rows in Beeline, but when I ran it on Tez it returned results. 0: jdbc:hive2://m1.hdp22:10001/default> select * from test_db.table1 limit 25; | cus_id | prx_nme | fir_nme | mid_1_nme | mid_2_nme | mid_3_nme …
When you try to start Knox and it fails with the following error, don't worry; this article will help you solve the problem. INFO hadoop.gateway (JettySSLService.java:logAndValidateCertificate(122)) – The Gateway SSL certificate is valid between: FATAL hadoop.gateway (GatewayServer.java:main(120)) – Failed to start gateway: org.apache.hadoop.gateway.services.ServiceLifecycleException: Gateway SSL Certificate is Expired. Root cause: It …
When you install and configure Atlas, you may see the following alert for the Hive service in Ambari. Once you check the alert details, you will see the following error: Metastore on m1.hdp22 failed (Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py", line 200, in execute timeout_kill_strategy=TerminateStrategy.KILL_PROCESS_TREE, File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__ self.env.run() File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", …
When you run a Sqoop import against Teradata or MySQL/Oracle, it might fail after installing and enabling Atlas in your cluster with the following error. 17/08/10 04:31:56 ERROR security.InMemoryJAASConfiguration: Unable to add JAAS configuration for client [KafkaClient] as it is missing param [atlas.jaas.KafkaClient.loginModuleName]. Skipping JAAS config for [KafkaClient] 17/08/10 04:31:58 INFO checking on the exit code …
When you have installed Atlas on top of your cluster and you want to sync your Hive data to Atlas via the following method, you may see the following error after your command has been running for some time (~20-30 mins). [hive@m1.hdp22 ~]$ export HADOOP_CLASSPATH=`hadoop classpath` [hive@m1.hdp22 ~]$ export HIVE_CONF_DIR=/etc/hive/conf [hive@m1.hdp22 ~]$ /usr/hdp/2.6.1.0-129/atlas/hook-bin/import-hive.sh Using Hive configuration directory [/etc/hive/conf] Log file for …
You may build a PySpark application which runs successfully in both local and yarn-client modes. However, when you try to run it in cluster mode, you may receive the following errors: Error 1: Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o52)) …
On the Ambari dashboard, the Memory Usage, Network Usage, CPU Usage and Cluster Load information are missing. The dashboard displays the following error: Root Cause: This issue occurs when there are some temporary files present in the AMS collector folder. Solution: You need to stop the AMS service via Ambari and then remove all temp files.
When we run Beeline jobs very heavily, we can sometimes see the following error: Root Cause: By default, the history file is located under ~/.beeline/history for the user who is facing the issue, and Beeline will load the latest 500 rows into memory. If those queries are very big, containing lots of characters, it …
In this blog I explain how you can use the Ambari API to trigger all service checks with a single command. In order to check the status and stability of any service in your cluster you need to run the service checks that are included in Ambari. Usually each service provides its own …
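For reference, a minimal sketch of triggering a single service check through the Ambari REST API is shown below; the host, credentials and cluster name are placeholders, and the exact payload can vary slightly between Ambari versions.
$ curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
    -d '{"RequestInfo":{"context":"HDFS Service Check","command":"HDFS_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"HDFS"}]}' \
    http://<ambari-host>:8080/api/v1/clusters/<cluster_name>/requests
Looping over the installed service names in a small shell loop is one way to fire every check with a single command.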
Sometimes you have to troubleshoot a Beeline issue and wonder how to get into debug mode for the Beeline shell, as you can in Hive (-hiveconf hive.root.logger=DEBUG,console). The same approach does not work with Beeline, so don't worry: the following steps will help you, and the good part is you do not need …
When we install a cluster we should do some benchmarking or stress testing. In this article I have explained the built-in TestDFSIO functionality, which will help you perform stress testing on your configured cluster. The Hadoop distribution comes with a number of benchmarks, which are bundled in hadoop-*test*.jar and hadoop-*examples*.jar. The TestDFSIO benchmark is …
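As a hedged illustration, a typical TestDFSIO write/read/cleanup cycle looks like the following; the jar path follows the usual HDP layout and may differ in your distribution.
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean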
In case you are not able to access your Atlas portal, or you see the following error in your browser or logs: HTTP 503 response from http://localhost:21000/api/atlas/admin/status in 0.000s (HTTP Error 503: Service Unavailable). Then check the application.log file in /var/log/atlas, and if you see the following error in the logs, do not worry; follow the given …
When you first use your HDP sandbox in VirtualBox, by default it assigns 20GB of your hard disk to the sandbox. Later this may not be enough, and you will want to extend the size. This article will help you extend your VBox disk size. Step 1: Right click …
If you are using HiveServer2 in HTTP transport mode, the authentication information is sent as part of the HTTP headers. The above error occurs when the default buffer size is set and the HTTP header size is insufficient while Kerberos is used. This is a known issue, and a bug (https://issues.apache.org/jira/browse/HIVE-11720) has been raised …
If you try to connect to the Phoenix server from HBase, or you run some service checks, and you face the following error, then do not worry and relax, as here you will find the solution to this problem. Error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/hdp/2.3.4.0-3485/phoenix/phoenix-4.4.0.2.3.4.0-3485-client.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in …
If you are using Ambari 2.4.1 or 2.4.2, you may see the following message on your Ambari page, and you will not get any "Service Action" option to restart or do anything to any service. Root Cause: This happens when there is more than one Ambari admin user present. Then if one of the admin users …
When we run an Oozie job with an SSH action and use capture output, it may fail with the following error. java.lang.IllegalArgumentException: stream exceeds limit [2,048] at org.apache.oozie.util.IOUtils.getReaderAsString(IOUtils.java:84) at org.apache.oozie.servlet.CallbackServlet.doPost(CallbackServlet.java:117) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at org.apache.oozie.servlet.JsonRestServlet.service(JsonRestServlet.java:304) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.oozie.servlet.HostnameFilter.doFilter(HostnameFilter.java:86) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
HDFS snapshots protect important enterprise data sets from user or application errors. HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system and are: To demonstrate the functionality of snapshots, we will create a directory in HDFS, will create …
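A minimal sketch of the snapshot workflow, assuming a test directory /tmp/snapshot_demo and HDFS administrator rights:
$ hdfs dfs -mkdir /tmp/snapshot_demo
$ hdfs dfsadmin -allowSnapshot /tmp/snapshot_demo
$ hdfs dfs -createSnapshot /tmp/snapshot_demo snap1
$ hdfs dfs -ls /tmp/snapshot_demo/.snapshot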
When you want to run your shell script via Oozie, the following article will help you do the job in an easy way. You need the following steps to set up an Oozie workflow using the ssh-action (a minimal sketch follows below): 1. Configure job.properties Example: 2. Configure workflow.xml Example: 3. Write a sample sampletest.sh script Example: 4. Upload workflow.xml to ${appPath} defined in job.properties …
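As a hedged sketch of step 2, a minimal workflow.xml with an ssh-action could look like the following; ${sshUser}, ${sshHost} and the transition names are illustrative and must match your own job.properties.
<workflow-app name="ssh-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="ssh-node"/>
  <action name="ssh-node">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
      <host>${sshUser}@${sshHost}</host>
      <command>sampletest.sh</command>
      <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>SSH action failed</message></kill>
  <end name="end"/>
</workflow-app>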
Sometimes we have a header in our data file and we do not want that header to be loaded into our Hive table, i.e. we want to ignore the header; this article will help you. [saurkuma@m1 ~]$ cat sampledata.csv id,Name 1,Saurabh 2,Vishal 3,Jeba 4,Sonu Step 1: Create a table with table properties to ignore it (see the sketch below). hive> …
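For example, a minimal sketch (table and column names are illustrative) using the skip.header.line.count table property:
hive> CREATE TABLE sample (id INT, name STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > TBLPROPERTIES ("skip.header.line.count"="1");
hive> LOAD DATA LOCAL INPATH '/home/saurkuma/sampledata.csv' INTO TABLE sample;
hive> SELECT * FROM sample;   -- the id,Name header row is no longer returned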
When we create a table on files (CSV or any other format) and load data into the Hive table, we may see that select queries show null values. You can solve it in the following ways: [saurkuma@m1 ~]$ ll total 584 -rw-r--r-- 1 saurkuma saurkuma 591414 Mar 16 02:31 SalesData01.csv [saurkuma@m1 …
Sometimes we need a user who can do everything on our server, as root does. We may do one of the following: create a new user with the same privileges as root, or grant the same privileges to an existing user (a hedged sketch follows below). Case 1: Let's say we need to add a new user and grant him root …
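A hedged sketch of both cases on a typical RHEL/CentOS box; the user names are illustrative, and sudoers edits are safer through visudo.
# Case 1: new user with root-equivalent sudo rights
[root@m1 ~]# useradd opsadmin
[root@m1 ~]# passwd opsadmin
[root@m1 ~]# echo 'opsadmin ALL=(ALL) ALL' >> /etc/sudoers   # safer: add this line via visudo
# Case 2: grant an existing user the same rights via the wheel group
[root@m1 ~]# usermod -aG wheel existinguser                  # assumes %wheel is enabled in /etc/sudoers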
Issue: The Oozie server is failing with the following error: FATAL Services:514 – SERVER[m2.hdp22] E0103: Could not load service classes, Cannot load JDBC driver class 'com.mysql.jdbc.Driver' org.apache.oozie.service.ServiceException: E0103: Could not load service classes, Cannot load JDBC driver class 'com.mysql.jdbc.Driver' at org.apache.oozie.service.Services.loadServices(Services.java:309) at org.apache.oozie.service.Services.init(Services.java:213) at org.apache.oozie.servlet.ServicesLoader.contextInitialized(ServicesLoader.java:46) at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4210) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4709) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:802) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:583)
Sometimes we get a situation where we have to list all long-running applications and, based on a threshold, kill them; sometimes we need to do this for a specific YARN queue. In such situations the following script will help you do the job (an illustrative sketch is also shown below). [root@m1.hdp22~]$ vi kill_application_after_some_time.sh #!/bin/bash if [ "$#" -lt …
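A hedged sketch of the idea; the queue name and threshold are passed as arguments, and the awk field positions assume the default output format of 'yarn application -list' and 'yarn application -status'.
#!/bin/bash
# Usage: ./kill_long_running.sh <queue> <threshold_seconds>
QUEUE=$1
THRESHOLD=$2
NOW=$(date +%s)
for app in $(yarn application -list 2>/dev/null | awk -v q="$QUEUE" '$1 ~ /^application_/ && $0 ~ q {print $1}'); do
  start_ms=$(yarn application -status "$app" 2>/dev/null | awk -F' : ' '/Start-Time/ {print $2}')
  runtime=$(( NOW - start_ms / 1000 ))
  if [ "$runtime" -gt "$THRESHOLD" ]; then
    echo "Killing $app (running for ${runtime}s in queue $QUEUE)"
    yarn application -kill "$app"
  fi
done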
One of my friends was trying to run a simple hive2 action in their Oozie workflow and was getting an error. I decided to replicate it on my cluster and finally got it working after some retries. If you have the same requirement, where you have to run Hive SQL via Oozie, then this article …
If you have installed CentOS 6.5 and you just have a terminal with a black background, and you want to enable the GUI, then this article is for you. A desktop environment is not necessary for server usage, but sometimes installing or using an application requires a desktop environment; then build the desktop environment …
By default the passwords to access the Ambari database and the LDAP server are stored in a plain-text configuration file. To have those passwords encrypted, you need to run a special setup command. [root@m1 ~]# cd /etc/ambari-server/conf/ [root@m1 conf]# ls -ltrh total 52K -rw-r--r-- 1 root root 2.8K Mar 31 2015 ambari.properties.rpmsave.20161004015858 -rwxrwxrwx 1 …
When you upgrade your HDP cluster through a Satellite server or local repository and you start your cluster via Ambari, or add some new services to your cluster, you may see the following error. resource_management.core.exceptions.Fail: Execution of '/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector' returned 1. Error: Cannot retrieve repository metadata (repomd.xml) for repository: HDP-2.3.0.0-2557. …
If you upgrade to or install HDP 2.5.0 or later without first installing the Berkeley DB file, you will get the error "Unable to initialize Falcon Client object. Cause: Could not authenticate, Authentication failed", or HTTP ERROR: 503 Problem accessing /index.html. Reason: SERVICE_UNAVAILABLE, or the Falcon UI is unavailable. From the Falcon logs: java.lang.RuntimeException: org.apache.falcon.FalconException: Unable …
The Phoenix Query Server (PQS) does not log details about client connections and the queries they run at the default log level of INFO. You need to modify the log4j configuration for certain classes to obtain such logs. To enable logging of such messages by PQS, perform the following: on the node that runs the PQS service, edit …
1. How to run a Hive query using yesterday's date: use from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd') in your Hive query. For example: select * from sample where date1=from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd'); 2. How to diff file(s) in HDFS. How to diff a file in HDFS against a file in the local filesystem: diff <(hadoop fs -cat /path/to/file) /path/to/localfile How to diff two files in HDFS: diff <(hadoop fs -cat /path/to/file1) …
NiFi can interface directly with Hive, HDFS, HBase, Flume and Phoenix. And I can also trigger Spark and Flink through Kafka and Site-to-Site. Sometimes I need to run some Pig scripts. Apache Pig is very stable and has a lot of functions and tools that make for some smart processing. You can easily augment and …
When you run a Python script on top of Hive, it may fail with the following error: $ spark-submit --master yarn --deploy-mode cluster --queue ado --num-executors 60 --executor-memory 3G --executor-cores 5 --py-files argparse.py,load_iris_2.py --driver-memory 10G load_iris.py -p ado_secure.iris_places -s ado_secure.iris_places_stg -f /user/admin/iris/places/2016-11-30-place.csv Exception in thread "main" org.apache.spark.SparkException: Application application_1476997468030_142120 finished with failed status at org.apache.spark.deploy.yarn.Client.run(Client.scala:974) …
In HDFS, data and metadata are decoupled. Data files are split into block files that are stored, and replicated, on DataNodes across the cluster. The filesystem namespace tree and associated metadata are stored on the NameNode. Namespace objects are file inodes and blocks that point to block files on the DataNodes. These namespace objects are …
When you create a table while Ranger is enforcing authorization, the table creation fails and afterwards the HiveServer2 process crashes. 0: jdbc:hive2://server1> CREATE EXTERNAL TABLE test (cust_id STRING, ACCOUNT_ID STRING, ROLE_ID STRING, ROLE_NAME STRING, START_DATE STRING, END_DATE STRING, PRIORITY STRING, ACTIVE_ACCOUNT_ROLE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED …
In many real-world scenarios we have seen the error "java.net.BindException: Address already in use" when we start a DataNode. You can observe the following during this issue: 1. The DataNode doesn't start, with an error saying "address already in use". 2. "netstat -anp | grep 50010" shows no result. ROOT CAUSE: There are 3 ports …
1. What are the side data distribution techniques? Side data refers to extra static small data required by MapReduce to perform a job. The main challenge is the availability of side data on the node where the map will be executed. Hadoop provides two side data distribution techniques. Using the job configuration: an arbitrary key-value pair …
When we do a fresh install of Grafana in Ambari 2.4 and start it, it may fail with the following error. stderr: /var/lib/ambari-agent/data/errors-14517.txt Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_grafana.py", line 67, in <module> AmsGrafana().execute() File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute method(env) File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 725, in restart self.start(env) File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/ …
The standby NameNode is unable to start up. Or, once you bring up the standby NameNode, the active NameNode goes down soon after, leaving only one live NameNode. The NameNode log shows: FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) – Error: flush failed for required journal (JournalAndStream(mgr=QJM to )) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. ROOT CAUSE: …
Many times we have very small tables in Hive, yet queries against them take a long time. Here I am going to explain the map-side join and its advantages over the normal join operation in Hive. But before learning about this, we should first understand the concept of …
Many times we do not want to run our Hive query through Beeline or the Hive CLI, for many reasons. I am not going to talk about the reasons here, as that is a big debatable point; instead, in this article I have explained the steps to connect SQL Workbench to our Hadoop cluster. In this article …
One of my friends was trying to run a Hive .hql file in their Oozie workflow and was getting an error. I decided to replicate it on my cluster and finally got it working after some retries. If you have the same requirement, where you have to run Hive SQL via Oozie, then this article will help …
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique and captures all of the content between a start and end tag, supplying it as a single bytearray field in a Pig tuple.
When you have a requirement to process data via Hadoop that is not in a default input format, this article will help you. Hadoop provides default input formats like TextInputFormat, NLineInputFormat, KeyValueInputFormat etc.; when you get different types of files for processing, you have to create your own custom input format for processing using …
Many times we want to store one query result in a variable and then use this variable in another query. This is possible in your favorite Hadoop ecosystem component, i.e. Hive, and with the help of this article you can achieve it (a hedged shell-based sketch is shown below). [root@m1 etc]# hive 16/10/04 02:40:45 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist …
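One hedged way to do it is to capture the result in a shell variable and pass it back via --hivevar; the database, table and column names below are illustrative.
$ max_id=$(hive -S -e "select max(id) from sales.orders")
$ hive -S --hivevar max_id="${max_id}" -e 'select * from sales.order_items where order_id = ${hivevar:max_id}'
The single quotes around the second -e argument keep the shell from expanding ${hivevar:max_id}, so Hive substitutes it instead.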
If the OVERWRITE keyword is used, the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred to by filepath will be added to the table. Note that if the target table (or partition) already has a file whose name collides with …
If you see the following error during a Ranger install, there is no need to worry, as you can solve it with just one step. 2016-03-18 16:10:44,048 [JISQL] /usr/jdk64/jdk1.8.0_60/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/hdp/current/ranger-admin/jisql/lib/* org.apache.util.sql.Jisql -driver mysqlconj -cstring jdbc:mysql://mysqldb/ranger -u 'user' -p '********' -noheader -trim -c \; -input /usr/hdp/current/ranger-admin/db/mysql/patches/007-updateBlankPolicyName.sql Resolution: run SET GLOBAL log_bin_trust_function_creators = 1; in MySQL, then reinstall the Ranger service.
In the Oozie UI, 'Job Error Log' is a tab which was introduced in HDP 2.3 with Oozie 4.2. By default it is disabled, and with the help of the following steps you can enable it.
If you upgrade Ambari and you see the following error, you should not worry; the following steps will help you bring your cluster back into a running state. Issue: Once you upgrade your cluster and restart it, you don't see any service or their metrics in Ambari; then you need the following steps. You …
Hadoop archive files, or HAR files, are a facility to pack HDFS files into archives. This is the best option for storing a large number of small files in HDFS, as storing a large number of small files directly in HDFS is not very efficient. The advantage of HAR files is that these files can be …
Sometimes we see that Falcon uses 90-100% of the / space, as in the following example. [user1@server localhost]$ du -sh /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB 67M /hadoop/falcon/hadoop/falcon/embeddedmq/data/localhost/KahaDB [users1@server localhost]$ du -sh /hadoop/falcon/embeddedmq/data/localhost/KahaDB/ 849M /hadoop/falcon/embeddedmq/data/localhost/KahaDB/ This is because we have installed Falcon in embedded mode and we have set falcon.embeddedmq.data to that location. The Falcon server starts an embedded ActiveMQ whenever we …
Sometimes we have to run Pig commands on Hive ORC tables; this article will help you do that (a hedged HCatalog-based sketch follows below). Step 1: First create a Hive ORC table: hive> CREATE TABLE ORC_Table(COL1 BIGINT, COL2 STRING) CLUSTERED BY (COL1) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC TBLPROPERTIES ('TRANSACTIONAL'='TRUE'); Step 2: …
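A hedged sketch of reading the ORC table through HCatalog; it assumes the table lives in the default database and that -useHCatalog puts the HCatalog jars on the Pig classpath (on older HCatalog releases the loader class name differs).
$ pig -useHCatalog
grunt> A = LOAD 'default.orc_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> B = LIMIT A 10;
grunt> DUMP B;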
Many times when we load data into Hive tables and we have a date & time field in our data, we may see an issue with reading that date field. To solve this issue I have created this article and explained the steps in detail. I have the following sample input file (a.txt): a,20-11-2015 …
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop. 1. …
When you create a database or internal tables in the Hive CLI, by default they are created with 777 permissions. Even if you have a umask set in HDFS, the permissions will still be the same. You can change this with the help of the following steps. 1. From the command line on the Ambari server node, edit …
Sometimes you want to change your Capacity Scheduler through the REST API, or you have a requirement to change your Capacity Scheduler configuration frequently via a script; this article will help you do that. You can achieve it via the following command. [root@sandbox conf.server]# curl -v -u admin:admin -H "Content-Type: application/json" -H "X-Requested-By:ambari" -X PUT …
Sometimes we do not want to start all HDFS services at once, or we just want to start the NN, DN or SNN only via the command line; this article will help you do this in a very simple manner. 1. Kill the current operation if one is already going on from Ambari for NameNode startup. 2. Set hadoop.root.logger=DEBUG,console …
Many times during troubleshooting we do not find much information if we are using just the default logger. No worries: I will guide you on how to enable debug mode in the logs or on your console. Case 1: Use the following command to start Hive and set the following property to turn on debug mode (a sketch is shown below) …
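For example, a minimal sketch for case 1:
$ hive --hiveconf hive.root.logger=DEBUG,console
hive> select current_database();   -- every statement now prints DEBUG output on the console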
By default, Ambari uses an internal database as the user store for authentication and authorization. If you wish to add LDAP external authentication for Ambari Web, you need to make some edits to the Ambari properties file. Collect the following information: ldap.primaryUrl=<ldap_server_name>:389 ldap.useSSL=false ldap.usernameAttribute=sAMAccountName ldap.baseDn=cn=Users,dc=<search_dir>,dc=com ldap.bindAnonymously=false ldap.managerDn=cn=ambari,cn=users,dc=<search_dir>,dc=com ldap.managerPassword=/etc/ambari-server/conf/ldap-password.dat ldap.userObjectClass=user ldap.groupObjectClass=group ldap.groupMembershipAttr=memberOf ldap.groupNamingAttr=cn ldap.referral=ignore
When you start utilizing your cluster heavily, you may encounter 100% CPU utilization on a specific server. But as you may have many jobs and processes running on that server at the time, it can be very tough to identify the culprit process which is causing this issue. It is like finding a …
When you run your Hive job on the Tez execution engine, you may see job failures due to a 'vertex failure' error, or you may see the following error in your logs. Vertex failed, vertexName=Reducer 34, vertexId=vertex_1424999265634_0222_1_23, diagnostics=[Task failed, taskId=task_1424999265634_01422_1_23_000008, diagnostics=[AttemptID:attempt_1424999265634_01422_1_23_000008_0 Info:Error: java.lang.RuntimeException: java.lang.RuntimeException: Reduce operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:188) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:307) at org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:564) at java.security.AccessController.doPrivileged(Native Method) at …
Sometimes while your job is running you may see a job failure due to heap size. It might be because of a metastore heap issue: the metastore is encountering OutOfMemory errors, or is known to be insufficient to handle the cluster workload. Resolution: To fix this issue you have to increase the heap size for the metastore in hive-env.sh (or hive-env.cmd) …
If you are working on Hadoop and you want to know about your cluster, or you want to control your Hadoop cluster, the following commands should be handy for you. In this article I have tried to explain a few commands which will help you a lot in your day-to-day work (a few examples are shown below). hdfs dfsadmin …
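A few commonly used examples; all are standard hdfs commands, and the output of course varies by cluster.
$ hdfs dfsadmin -report                 # capacity, live/dead DataNodes, block counts
$ hdfs dfsadmin -safemode get           # check whether the NameNode is in safe mode
$ hdfs dfsadmin -refreshNodes           # re-read the include/exclude host files
$ hdfs fsck / -files -blocks            # filesystem health and block layout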
If you have Hadoop clusters of more than 30-40 nodes, it is better to configure them with rack awareness, because communication between two DataNodes on the same rack is more efficient than communication between two nodes on different racks. It also helps us improve network traffic while reading/writing HDFS files; the NameNode …
When you install HDP and something goes wrong during installation with the HDFS components (like the NameNode), you may see the following errors. File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper result = _call(command, **kwargs_copy) File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call raise Fail(err_msg) resource_management.core.exceptions.Fail: Execution of 'yes Y | hdfs --config /usr/hdp/current/hadoop-client/conf namenode -format' returned 127. /usr/hdp/current/hadoop-client/bin/hdfs: line 18: …
Sometimes when you run distcp jobs on a cluster and you see failures or performance issues, you want to debug them; you can do so with the following commands. To turn on debug mode at the job level, issue the following command before executing the distcp job. To turn on debug mode at the mapper level, …
Many times we have to check which packages and classes are included in a jar file, but because a jar is a black box, it can be troublesome to check. You can check it in the following ways (examples below). jar tf <PATH_TO_JAR> But if you are looking for a specific class or package …
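For example (the jar name, directory and package below are illustrative):
$ jar tf myapp.jar                                   # list everything in the jar
$ jar tf myapp.jar | grep -i HiveMetaStoreClient     # look for a specific class
$ for j in /usr/hdp/current/hive-client/lib/*.jar; do
>   jar tf "$j" | grep -q 'org/apache/hadoop/hive/metastore' && echo "$j"
> done                                               # find which jar in a directory contains a package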
There are situations when, unfortunately and unknowingly, you delete /hdp/apps/2.3.4.0-3485 with skipTrash; then you will be in trouble and other services will be impacted. You will not be able to run hive, mapreduce or sqoop commands, and you will get the following error. [root@m1 ranger-hdfs-plugin]# hadoop fs -rmr -skipTrash /hdp/apps/2.3.4.0-3485 rmr: DEPRECATED: Please use 'rm -r' instead. Deleted /hdp/apps/2.3.4.0-3485 So …
I have seen an issue with the Application Timeline Server (ATS). The ATS uses a LevelDB database which is stored in the location specified by yarn.timeline-service.leveldb-timeline-store.path in yarn-site.xml. All metadata is stored in *.sst files under the specified location, and because of this we may face a space issue. But it is not good practice to delete *.sst files directly. An *.sst file is a …
As data continues to grow, businesses now have access to (or generate) more data than ever before, much of which goes unused. How can you turn this data into a competitive advantage? In this article, we explore different ways businesses are capitalizing on data. We keep hearing statistics about the growth of data. For instance: Data …
Sometimes, due to heavy load, you may need to increase your Knox JVM size to handle more requests and respond in time. In that case you can change your Knox JVM size in the following way: go to /usr/hdp/current/knox-server/bin/gateway.sh and search for the APP_MEM_OPTS string. Once you find it, you can change …
Sometimes we have to analyze our jobs in order to tune them or to prepare reports. We can use the following method to get the running time for each and every step of a job on the Tez execution engine. By setting the hive.tez.exec.print.summary=true property you can achieve it. hive> select count(*) from cars_beeline; Query ID = s0998dnz_20160711080520_e282c377-5607-4cf4-bcda-bd7010918f9c Total …
When we work on Hive, there are lots of scenarios where we need to move data (i.e. tables) from one cluster to another. For example, sometimes we need to copy a production table from one cluster to another. Hive now has very good functionality (a hedged EXPORT/IMPORT sketch is shown below) which gives us two …
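A hedged sketch of one such approach, Hive EXPORT/IMPORT combined with distcp; the database, table, NameNode addresses and paths are illustrative.
-- On the source cluster:
hive> EXPORT TABLE sales.orders TO '/tmp/orders_export';
-- Copy the export directory between clusters (shell):
$ hadoop distcp hdfs://cluster1-nn:8020/tmp/orders_export hdfs://cluster2-nn:8020/tmp/orders_export
-- On the target cluster:
hive> IMPORT TABLE sales.orders FROM '/tmp/orders_export';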
We have often seen that our Hadoop services are up and running, but when we open Ambari it shows them all as down. Basically this means the services themselves do not have any issue; it is a problem with the ambari-agent. The Ambari server typically gets to know about service availability from the Ambari agent and using the …
I have seen many times that an error does not give a clear picture of the issue and can mislead us, and we waste a lot of time investigating it. I have found that enabling debug mode (a hedged example is shown below) is an easy way to troubleshoot any Hadoop problem, as it gives us a detailed …
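For example, for any Hadoop client command the root logger can be raised to DEBUG for just that shell session:
$ export HADOOP_ROOT_LOGGER=DEBUG,console
$ hadoop fs -ls /tmp        # now prints DEBUG details (RPC calls, retries, classpath resolution)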
How to back up a Postgres database. 1. Back up a single Postgres database. This example will back up the erp database that belongs to user geekstuff to the file mydb.sql: $ pg_dump -U geekstuff erp -f mydb.sql It prompts for a password; after authentication, mydb.sql is created with create table, alter table and copy commands for all the tables in …
Hive Cross-Cluster Replication. Here I have tried to explain cross-cluster replication with a Feed entity. This is a simple way to enforce disaster recovery policies or aggregate data from multiple clusters to a single cluster for enterprise reporting. To further illustrate Apache Falcon's capabilities, we will use an HCatalog/Hive table as the Feed entity. Step 1: …
Sometimes we have a requirement to read compressed data from HDFS through an HDFS command, and there are many compression formats (.gz, .snappy, .lzo, .bz2 etc.). I have tried to explain how we can achieve this requirement in the following ways (a hedged example follows below). Step 1: Copy any compressed file to your HDFS …
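For instance, a hedged example for a gzip file; the paths are illustrative, and 'hadoop fs -text' picks the codec from the file extension for the common formats.
$ hadoop fs -put data.gz /tmp/compressed/
$ hadoop fs -text /tmp/compressed/data.gz | head
$ hadoop fs -cat /tmp/compressed/data.gz | gunzip | head    # equivalent pipeline for gzip specifically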
By default, when you configure your Ambari server it runs on a Postgres database. If after some time you need to change it to a database you are more comfortable with and your organization prefers (like MySQL), you need to use the following steps. Step 1: Stop your Ambari server and then take a backup of the Postgres ambari database (the default password …
If you run Hive queries on ORC tables in HDP 2.3.4, you may encounter this issue. It occurs because ORC split generation runs on a global threadpool and doAs is not propagated to that threadpool. Threads in the threadpool are created on demand at execute time and thus execute as random users that …
If we have enabled AD/LDAP user sync in Ranger and we get the error below, we need to follow the given steps to resolve it. LdapUserGroupBuilder [UnixUserSyncThread] – Updating user count: 148, userName:, groupList: [test, groups] 09 Jun 2016 09:04:34 ERROR UserGroupSync [UnixUserSyncThread] – Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details: …
Node Labels: Here we describe how to use node labels to run YARN/other applications on cluster nodes that have a specified node label. Node labels can be set as exclusive or shareable. Exclusive: access is restricted to applications running in queues associated with the node label. Shareable: if idle capacity is available on the labeled node, resources are …
If you have a requirement where you have to read some file through Pig and you want to schedule your Pig script via Oozie, this article will help you do the job. Step 1: First create a directory inside HDFS (under your home directory is good). $ hadoop fs -mkdir -p /user/<user_id>/oozie-scripts/PigTest Step 2: …
MySQL replication is a process that allows you to easily maintain multiple copies of MySQL data by having them copied automatically from a master to a slave database. This can be helpful for many reasons, including facilitating a backup of the data, providing a way to analyze it without using the main database, or simply as …
There is actually still a bug in Ambari 2.2.0: whenever you run the balancer through Ambari and it has to balance many TBs of data, it fails after 30 minutes due to a timeout. You can see the following error in your logs: resource_management.core.exceptions.Fail: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/usr/hdp/current/hadoop-client/bin'"'"' ; hdfs …
Sqoop has become very popular and a darling tool for the industry, and it has developed a lot within the Hadoop ecosystem. When we import or export data from a database through Sqoop, we have to give the password on the command line or in a file. I feel this is not a fully secure way to … (a hedged sketch of the more secure options follows below).
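Two of the safer options are a restricted password file on HDFS or a credential provider alias (the latter needs Sqoop 1.4.5+); a hedged sketch, with connection string, paths and alias names illustrative:
$ hadoop fs -put .mydb_password /user/sqoop/.mydb_password
$ sqoop import --connect jdbc:mysql://dbhost/sales --username sqoop_user \
    --password-file /user/sqoop/.mydb_password --table orders
$ hadoop credential create mydb.alias -provider jceks://hdfs/user/sqoop/mydb.jceks
$ sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/user/sqoop/mydb.jceks \
    --connect jdbc:mysql://dbhost/sales --username sqoop_user \
    --password-alias mydb.alias --table orders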
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine. This impacted the …
To find out whether the HDFS filesystem has corrupt blocks, and to fix them, we can use the steps below (see also the examples that follow): [hdfs@m1 ~]$ hadoop fsck / or [hdfs@m1 ~]$ hadoop fsck hdfs://192.168.56.41:50070/ If you get any corrupted or missing blocks at the end of the output, like below: Total size: 4396621856 B (Total open files …
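Some useful fsck options for locating and cleaning up the affected files (use -move/-delete with care, as they discard the corrupt data):
$ hdfs fsck / -list-corruptfileblocks                 # list files with corrupt or missing blocks
$ hdfs fsck /path/to/file -files -blocks -locations   # show which DataNodes hold each block
$ hdfs fsck / -move                                   # move corrupt files to /lost+found
$ hdfs fsck / -delete                                 # delete corrupt files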
In the ResourceManager UI there is an option to kill an application, and because of that all users can kill jobs. If you want to disable it, use the following steps: log in to Ambari and go to the YARN Configs page, then search for yarn.resourcemanager.webapp.ui-actions.enabled. If it exists, change the value to false (a config sketch follows below). If it does not exist, clear …
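If you manage yarn-site.xml by hand instead of through Ambari, the equivalent property is:
<property>
  <name>yarn.resourcemanager.webapp.ui-actions.enabled</name>
  <value>false</value>
</property>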
Use this procedure to perform a rolling upgrade from HDP 2.2 to HDP 2.3. It is highly recommended that you validate these steps in a test environment to adjust and account for any special configurations in your cluster. Before upgrading to HDP 2.3, you must first upgrade to Ambari 2.1. Make sure Ambari is upgraded and the cluster is …
In my experience, people who do things in their career that they are excited about and have a passion for can go farther and faster with that self-motivation than if they did something they didn't like but felt they needed to do for other reasons. You are awesome for already taking the initiative …
The easiest way to get started with Hadoop is the sandbox with VM Player or VirtualBox. It is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. The sandbox includes many of the most exciting developments from the latest CDH/HDP distribution, packaged up in a virtual environment. You can start working on Hadoop …
Read MoreNo data on air propellers was available, but we had always understood that it was not a difficult matter to secure an efficiency of 50% with marine propellers.