Ask Questions

Ask questions, get answers, help others and connect with people who have similar interests.


189 Comments

Srinivasa Rao

February 26, 2016 at 11:28 pm

I am a technical recruiter and would like to learn a technology to start a new career. Because I have no development experience, I am worried about learning. Please guide me on what is easy to learn so that I can start my career.

    admin

    February 27, 2016 at 5:41 am

    Hello Srinivasa,

    If you want to start your career in big data, you can start learning through my site, and whenever you have a doubt or question, I am always here to help you.

Brenton Ellinger

March 13, 2016 at 10:54 pm

To drive more value out of your big data, you have to start with the right questions. Here are six strategies for improving the quality of the questions you ask big data.

Nilesh

June 28, 2016 at 1:59 pm

Hi Saurabh,

How do I restore Hive from a backup? The restoration test cases are:

a. Metadata is lost – Here Embedded PostgreSQL is used.

b. Data is lost – from /user/hive/warehouse.

c. Both metadata and data are lost, in case a table is dropped.

    admin

    June 29, 2016 at 7:46 am

    Hi Nilesh,

    Please find my comments:
    a. Metadata is lost – Here Embedded PostgreSQL is used.
    You can set up a backup cluster and use Hive replication through Falcon to replicate your data from one cluster to another. Then, if you lose the metadata, you can get it back from the backup cluster.
    b. Data is lost – from /user/hive/warehouse.
    I would suggest enabling HDFS snapshots on this directory or database so that you can recover the data.
    c. Both metadata and data are lost in case a table is dropped.
    I think the answer to (a) covers this case as well.

      Nilesh

      June 29, 2016 at 9:38 am

      Thank you for reply, Saurabh

      a. Metadata is lost – Here Embedded PostgreSQL is used.
      Is it possible to take metastore backup and restore it?

        admin

        June 29, 2016 at 9:55 am

        Yes Nilesh, it is possible. You can use the following methods to back up and restore.
        1. Stop postgres and back up /var/lib/pgsql/data (the default location), or dump the database with the help of the following link:
        http://www.hadoopadmin.co.in/bigdata/backup-and-restore-of-postgres-database/

        Also, for Hive replication you can refer to the following link.
        http://www.hadoopadmin.co.in/bigdata/hive-cross-cluster-replication/

          Nilesh

          June 29, 2016 at 10:08 am

          Thank you for sharing links, Saurabh

          If I restore the snapshot of /user/hive/warehouse and also the Hive metadata, will I get all the data back?

          admin

          June 30, 2016 at 6:29 am

          It should be recoverable with the MSCK REPAIR command.
          But I will try to replicate this use case in my environment and let you know.

Nilesh

June 30, 2016 at 7:36 am

Thank you for your continuous support and guidance. I am eagerly waiting for the results.

    admin

    July 1, 2016 at 9:54 am

    Hi Nilesh,

    You can try the following steps to restore your Hive databases.

    1) Install Hive on the new Hadoop server/cluster.
    2) DistCp your backup data to the new server under the same directory structure (the directory structure must be the same).
    3) Restore the PostgreSQL dump that you took with the following command.

    Take the backup:
    sudo -u postgres pg_dump > /tmp/mydir/backup_.sql

    Restore the backup:
    sudo -u postgres psql < /tmp/mydir/backup_.sql

    Note that pg_dump without a database name dumps the database named after the connecting user (unless PGDATABASE is set), so check that the dump actually contains the metastore database.

    You may need to run the MSCK REPAIR TABLE command to sync partitioned data.
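As a rough sketch of the restore sequence above, the snippet below only builds the command lines and Hive statements; it does not run anything. The database name `hive`, the table names and the helper function names are assumptions for illustration, and the dump path is the one from the reply. It names the database explicitly and uses `-f` instead of shell redirection, which is a variation on the commands above rather than what the reply used.

```python
from typing import List


def metastore_backup_cmd(db_name: str, dump_path: str) -> List[str]:
    """Argument list for dumping one PostgreSQL database to a SQL file."""
    # Naming the database avoids pg_dump falling back to the user's name.
    # With -f the file is written by the postgres user, so that user needs
    # write access to the target directory.
    return ["sudo", "-u", "postgres", "pg_dump", "-f", dump_path, db_name]


def metastore_restore_cmd(db_name: str, dump_path: str) -> List[str]:
    """Argument list for replaying a SQL dump into an existing database."""
    return ["sudo", "-u", "postgres", "psql", "-d", db_name, "-f", dump_path]


def msck_repair_statements(tables: List[str]) -> List[str]:
    """One MSCK REPAIR TABLE statement per partitioned table."""
    return [f"MSCK REPAIR TABLE {table};" for table in tables]


# "hive", "sales" and "events" are hypothetical names.
dump = "/tmp/mydir/backup_.sql"
print(" ".join(metastore_backup_cmd("hive", dump)))
print(" ".join(metastore_restore_cmd("hive", dump)))
for statement in msck_repair_statements(["sales", "events"]):
    print(statement)
```

The argument lists can be passed to `subprocess.run` once the names have been checked against the real environment.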

      NIlesh

      July 8, 2016 at 2:12 pm

      Hi Saurabh,
      It is done using the Export and Import commands in Hive. They back up the data as well as the metadata for Hive.
      I tried using pg_dump and restoring with psql, but it failed twice.

      Thank you for your support.

        admin

        July 8, 2016 at 2:24 pm

        Hi Nilesh,

        I am happy that you have achieved it.
        Yes, Hive import/export is a nice feature that came in Hive 0.8. I wanted to explain it to you but thought you did not want to go that way.
        Anyway, the problem is resolved now, so it's good. Please let me know if you need any further assistance.

Nilesh

July 10, 2016 at 5:11 pm

Saurabh,
I need your assistance to resolve the error below. Export and import run successfully on a small table, but on a large table the following error occurred.

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.CopyTask. /hadoop/tools/DistCpOptions.

    admin

    July 11, 2016 at 10:05 am

    @Nilesh,

    Are you getting this error during distcp or at some other step?
    Can you compare your steps with the following article?
    http://www.hadoopadmin.co.in/bigdata/import-export-in-hive/

      Nilesh

      July 12, 2016 at 9:31 am

      @Saurabh,
      I am getting the error while running the export command in Hive on the same cluster. The syntax is below.
      export table to 'hdfspath'
      I can export and import a small table. The error comes on a big table.

      Following is the error message,

      java.lang.NoClassDefFoundError: org/apache/hadoop/tools/DistCpOptions
      at org.apache.hadoop.hive.shims.Hadoop23Shims.runDistCp(Hadoop23Shims.java:1142)
      at org.apache.hadoop.hive.common.FileUtils.copy(FileUtils.java:554)
      at org.apache.hadoop.hive.ql.exec.CopyTask.execute(CopyTask.java:82)
      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1774)
      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1531)
      at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1311)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1120)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1108)
      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:218)
      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:170)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:381)
      at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:773)
      at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:691)
      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:626)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.tools.DistCpOptions
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      … 22 more
      FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.CopyTask. org/apache/hadoop/tools/DistCpOptions

        admin

        July 21, 2016 at 10:59 am

        @Nilesh,

        Sorry for the delay. Are you still getting this error?

Gaurav

July 23, 2016 at 6:34 pm

Hello team,

I have undergone Hadoop admin training and practiced a little by setting up clusters, but I would like some exposure to the way Hadoop works in a production environment. Please advise. Conceptually I have a rough idea; my concern is the next step of learning and working in production.

Eagerly waiting for a positive response.

Regards,
Gaurav

    admin

    July 24, 2016 at 5:16 am

    Hello Gaurav,

    You can start by setting up a multi-node cluster in VirtualBox and then pull data from your database with the help of Sqoop. Once you have data in your cluster, you can start processing it with the many Hadoop ecosystem tools (Pig, Hive, Spark).
    If you want, you can start working on the following use case.

    Use case: Configure a five-node cluster with the basic Hadoop components. Once the cluster is running, configure NameNode HA, ResourceManager HA and HiveServer2 HA. Also test scheduling some jobs via Falcon and Oozie. You can also configure the Capacity Scheduler or the Fair Scheduler to get a better understanding of tuning your cluster.

    I hope the above explanation helps, but if you still have any doubts, please feel free to post your concerns or questions.

      Gaurav

      July 24, 2016 at 10:03 am

      Thanks, Saurabh, for the insight. Cluster setup and MRv2 are done on the setup side; using Sqoop to test various components is a good idea, and I will implement scheduling as well. I have a query on Kerberos. What does an admin have to do on the Kerberos side? Is it just about granting permissions? I am keen to explore roles and responsibilities around Hadoop security and user management. If you can provide some links, docs or test cases that would help me as an admin, that can be my next step.

      To sum up the Hadoop admin side, what are the things that need to be taken care of? I have a basic understanding of the v1 and v2 concepts, but when I look for the real-time scenarios an admin faces, I draw a blank.

        admin

        July 24, 2016 at 11:47 am

        @Gaurav,

        As an admin you need a complete understanding of Kerberos (planning for the KDC, the Kerberos architecture, how to set it up, how to grant access, how to troubleshoot Kerberos issues, etc.).
        Coming to your second point, the responsibilities of a Hadoop admin, I would suggest you read the following link. http://www.hadoopadmin.co.in/hadoop-administration-and-maintenance/

        User management is just a small part of an admin's responsibilities. As an admin you have to tune your cluster and understand new Hadoop components and their configuration, so that when a new requirement comes from the business you are able to install the component on your cluster.

          Gaurav

          July 25, 2016 at 9:12 am

          I don’t think there is an option to attach files here, so I have emailed you the error screenshots.

Gaurav

July 24, 2016 at 7:05 pm

Thanks Saurabh. I was going through the HDP platform and trying to set it up, and it seems interesting. But during cluster setup it asks for the SSH key details (which I provided) and the FQDN. For the FQDN I used nscheckup ip and provided the result for my machines (m1 & m2), but it throws an error: Registration with server failed. Host checks were skipped on 1 hosts that failed to register. Can you please suggest a fix?

    admin

    July 25, 2016 at 6:15 am

    Is it possible to attach an error screenshot or send it to me via email?

      Gaurav

      July 25, 2016 at 9:13 am

      I don’t think there is an option to attach files here, so I have emailed you the error screenshots.

        admin

        July 25, 2016 at 4:50 pm

        @Gaurav,
        I have got your email and soon I will get back to you.

          admin

          July 25, 2016 at 4:59 pm

          @Gaurav,

          Can you add the hostnames of the nodes, one entry per line? You can get the hostname by logging in to each node and running the command hostname -f.
          Then select the private key used while setting up passwordless SSH, and the username for which the private key was created.

          admin

          July 25, 2016 at 5:16 pm

          I tested with that command and got the name below. I then provided only that name, without the IP, in the list, and it worked after a few warnings.
          [root@w3 .ssh]# hostname -f
          w3.hdp22

          Also, I have sent a few attachments that you can refer to for the next steps.

Gaurav

July 25, 2016 at 5:41 pm

I had tried the same, but it didn't work out.

Carter Mcdugle

August 4, 2016 at 3:32 am

Because Big Data is much more complex than simply managing and exploring large volumes of data, it is essential that organizations are reasonably assured that their business needs and demands (improving customer loyalty, streamlining processes, improving top-line growth through predictive analytics, etc.) will be met.

Bharathiraja

August 4, 2016 at 4:03 pm

Hi Saurabh,
Thanks for sharing the useful information!
Currently I am working as an Oracle DBA. I would like to learn the Hadoop admin part.
Can you guide me on how to start?

    admin

    August 5, 2016 at 5:43 am

    Hi Bharath,

    You can start with the Hadoop admin tab on my website, and you can also visit the FAQ for any doubts.
    If you face any challenge, you can post your questions or concerns to the “Ask Questions” section and I will try my best to help you.

Rakesh

August 28, 2016 at 10:57 am

Hi sir,
Awesome website to learn big data and Hadoop. I am learning Hadoop development.

    admin

    August 28, 2016 at 11:44 am

    Thanks Rakesh. Please feel free to reach out to me anytime for any doubts.

Stan Jubilee

August 29, 2016 at 6:51 am

Here are a few ways to improve the quality of the questions and ultimately the quality of insights and the quality of actions taken.

AM

September 2, 2016 at 8:56 am

Could you please provide some write-ups on Apache YARN and how jobs get spawned in the YARN layout?
Please also provide some good write-ups on Apache Kafka.

Thanks in advance.

    admin

    September 2, 2016 at 10:26 am

    Sure AM,

    Thanks for your feedback. You will get those details soon.

    Thanks

    admin

    September 9, 2016 at 7:12 pm

    AM,

    Please find the detailed explanation on Apache Kafka.
    http://www.hadoopadmin.co.in/hadoop-developer/kafka/

      AM

      September 27, 2016 at 6:13 pm

      Thanks Saurabh…for wonderful docs around Yarn and Kafka…
      Sooner, I will catch up with more follow up questions on this…

        admin

        September 28, 2016 at 9:34 am

        thanks for your kind words. Please feel free to reach out to me for any further assistance.

somappa

September 2, 2016 at 12:52 pm

Hi Saurabh, my question is about how compressed files are stored in HDFS. My zip file is 1 GB.

Q) Is it distributed across the cluster or not? How is it stored?

    admin

    September 6, 2016 at 9:00 am

    HDFS distributes the blocks of a file across different nodes, and with a replication factor greater than 1 each block is also copied to more than one node.

    If you need the content of a non-splittable compressed file spread out for parallel processing, you can read it in a single mapper and then distribute the data to multiple nodes using reducers.

    HDFS stores data on multiple nodes even if the file is compressed (with a splittable or a non-splittable codec). HDFS splits the compressed file based on the block size. When an MR job reads the file back, it gets a single mapper if the file is compressed with a non-splittable codec; with a splittable codec it gets multiple mappers.

    How data is distributed:

    Suppose you have a 1024 MB compressed file and your Hadoop cluster has a 128 MB block size.

    When you upload the compressed file to HDFS, it is divided into 8 blocks (128 MB each) and distributed to different nodes of the cluster. HDFS decides which node receives each block depending on cluster health, node health and HDFS balance.
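The block arithmetic in the reply above can be sketched as follows. This is a minimal illustration, not HDFS code: the function names are made up for the example, and the mapper count simply follows the rule stated in the reply (one mapper for a non-splittable codec, roughly one per block otherwise).

```python
import math


def hdfs_block_count(file_size_mb: float, block_size_mb: float = 128) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_mb / block_size_mb)


def mapper_count(file_size_mb: float, block_size_mb: float, splittable: bool) -> int:
    """Approximate mapper count for one input file, per the rule in the reply."""
    if not splittable:
        # A non-splittable compressed file is read end to end by one mapper,
        # even though its blocks live on several nodes.
        return 1
    return hdfs_block_count(file_size_mb, block_size_mb)


print(hdfs_block_count(1024, 128))                # 8
print(mapper_count(1024, 128, splittable=False))  # 1
print(mapper_count(1024, 128, splittable=True))   # 8
```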

somappa

September 2, 2016 at 12:54 pm

What is the difference between Hive 1 and Hive 2?

Hive 1 does not support update and delete.

Hive 2 supports update and delete. Are there any other differences?

    admin

    September 6, 2016 at 9:13 am

    Do you mean HiveServer1 and HiveServer2?

somappa

September 2, 2016 at 1:35 pm

Hi Saurabh, I want to do a freelance project in Hadoop.

Q) Basically, what kind of systems do I need? How many systems are required, and what should the configuration of each be (RAM, hard disk, processor)?
My total data size is 100 GB.
I am planning to set up a cluster with the NameNode and the DataNodes on different systems. How do I set this up and process the data?

Dk

September 5, 2016 at 11:35 am

Sir, when I try to load data into a Hive table, it shows NULL values like this:
hive> create table dep (did int, dname string, mid int, lid int)
> row format delimited
> fields terminated by ' ';
OK
Time taken: 0.097 seconds
hive> load data local inpath '/home/castek/mudk/dep.txt' into table dep;
FAILED: SemanticException Line 1:23 Invalid path ''/home/castek/mudk/dep.txt'': No files matching path file:/home/castek/mudk/dep.txt
hive> load data local inpath '/home/castek/dep.txt' into table dep;
Loading data to table dkjkobs.dep
Table dkjkobs.dep stats: [numFiles=1, totalSize=226]
OK
Time taken: 0.217 seconds
hive> select * from dep;
OK
NULL NULL NULL NULL
NULL NULL NULL NULL
NULL NULL NULL NULL
NULL Resources,203,2400 NULL NULL
NULL NULL NULL NULL
NULL NULL NULL NULL
NULL Relations,204,2700 NULL NULL
NULL NULL NULL NULL
NULL NULL NULL NULL
NULL NULL NULL NULL
Time taken: 0.073 seconds, Fetched: 10 row(s)
hive>

    admin

    September 6, 2016 at 8:45 am

    @DK,

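    This looks like a mismatch between your table definition and the content of your file. Values such as "Resources,203,2400" in your output suggest that the file is comma-separated, while the table is declared with fields terminated by a space, so each line is being split in the wrong places. To cross-check, can you paste the content of your file (/home/castek/dep.txt) here or email it to me? It can also be because of Hive’s “schema on read” approach to table definitions: invalid values are converted to NULL when you read them, so a column whose data type does not match the content shows NULL.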
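One way to see how the NULL rows in the output above can arise is to imitate schema-on-read in a few lines. This is a simplified model, not Hive's actual SerDe: each line is split on the declared delimiter, each field is cast to the column type, and anything missing or uncastable becomes None (NULL). The sample line is a hypothetical guess at what dep.txt might contain; the real file content is not shown in the thread.

```python
from typing import Callable, List, Optional


def parse_row(line: str, delimiter: str, casts: List[Callable]) -> List[Optional[object]]:
    """Split a text line and cast each field; failures and gaps become None."""
    fields = line.split(delimiter)
    row: List[Optional[object]] = []
    for index, cast in enumerate(casts):
        if index >= len(fields):
            row.append(None)          # fewer fields than columns
            continue
        try:
            row.append(cast(fields[index]))
        except ValueError:
            row.append(None)          # value does not fit the column type
    return row


# Column types of the dep table: did int, dname string, mid int, lid int.
SCHEMA = [int, str, int, int]
line = "40,Human Resources,203,2400"  # hypothetical file content

print(parse_row(line, " ", SCHEMA))   # [None, 'Resources,203,2400', None, None]
print(parse_row(line, ",", SCHEMA))   # [40, 'Human Resources', 203, 2400]
```

With the space delimiter the result has the same shape as the "NULL Resources,203,2400 NULL NULL" row in the query output, which is why the delimiter in the table definition is worth checking first.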

Suriya

September 6, 2016 at 9:49 am

Hi

I need some live scenarios for the Hadoop admin role. I have completed my Hadoop training through an online course, but they explained only theory. I want to know the exact roles, responsibilities and commands. Can any Hadoop admin please help?

    admin

    September 6, 2016 at 10:00 am

    A Hadoop admin's main roles and responsibilities are to set up a cluster in distributed mode, maintain the cluster, take care of job configurations, do POCs on new Hadoop components, secure the cluster via Kerberos, Ranger or Knox, and manage the cluster's users. These are the real-time, day-to-day jobs that a Hadoop admin does.

suriya

September 6, 2016 at 4:02 pm

Hi Thank you for your prompt response

Sachin

September 22, 2016 at 5:23 pm

When I run the Talend MapReduce job on an HDP 3-node cluster I get the error below.
I tried changing the -Xmx value to the maximum, but the issue is still not resolved.

Has anybody faced a similar issue?
When I run the same job for 50,000 records (6 MB file) it works fine, whereas when I double the size it gives the error.
A Standard job runs fine even for 100 GB without any error for a similar workflow.

Starting job ELT_File_HDFS_Hive_TPCH100GB_batch at 05:36 19/09/2016.

[WARN ]: org.apache.hadoop.util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
[INFO ]: local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch – TalendJob: ‘ELT_File_HDFS_Hive_TPCH100GB_batch’ – Start.
[statistics] connecting to socket on port 3591
[statistics] connected
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – mapred.map.output.compression.codec is deprecated. Instead, use mapreduce.map.output.compress.codec
[INFO ]: org.apache.hadoop.yarn.client.RMProxy – Connecting to ResourceManager at ip-10-250-3-143.ec2.internal/10.250.3.143:8050
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
[INFO ]: org.apache.hadoop.yarn.client.RMProxy – Connecting to ResourceManager at ip-10-250-3-143.ec2.internal/10.250.3.143:8050
[INFO ]: org.apache.hadoop.yarn.client.RMProxy – Connecting to ResourceManager at ip-10-250-3-143.ec2.internal/10.250.3.143:8050
[WARN ]: org.apache.hadoop.mapreduce.JobResourceUploader – No job jar file set. User classes may not be found. See Job or Job#setJar(String).
[INFO ]: org.apache.hadoop.mapred.FileInputFormat – Total input paths to process : 1
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
[INFO ]: org.apache.hadoop.mapreduce.JobSubmitter – number of splits:2
[INFO ]: org.apache.hadoop.mapreduce.JobSubmitter – Submitting tokens for job: job_1474263260915_0001
[INFO ]: org.apache.hadoop.mapred.YARNRunner – Job jar is not present. Not adding any jar to the list of resources.
[INFO ]: org.apache.hadoop.yarn.client.api.impl.YarnClientImpl – Submitted application application_1474263260915_0001
[INFO ]: org.apache.hadoop.mapreduce.Job – The url to track the job: http://ip-10-250-3-143.ec2.internal:8088/proxy/application_1474263260915_0001/
[INFO ]: org.apache.hadoop.conf.Configuration.deprecation – jobclient.output.filter is deprecated. Instead, use mapreduce.client.output.filter
Running job: job_1474263260915_0001
map 0% reduce 0%
map 50% reduce 0%
map 50% reduce 2%
map 50% reduce 10%
map 50% reduce 17%
Task Id : attempt_1474263260915_0001_m_000001_0, Status : FAILED
Exception from container-launch.
Container id: container_e34_1474263260915_0001_01_000003
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 255
Error: GC overhead limit exceeded
Task Id : attempt_1474263260915_0001_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Task Id : attempt_1474263260915_0001_m_000001_2, Status : FAILED
map 100% reduce 100%
Job complete: job_1474263260915_0001
Job Failed: Task failed task_1474263260915_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

java.io.IOException: Job failed!
at org.talend.hadoop.mapred.lib.MRJobClient.runJob(MRJobClient.java:166)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.runMRJob(ELT_File_HDFS_Hive_TPCH100GB_batch.java:8650)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.access$0(ELT_File_HDFS_Hive_TPCH100GB_batch.java:8640)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch$1.run(ELT_File_HDFS_Hive_TPCH100GB_batch.java:4768)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch$1.run(ELT_File_HDFS_Hive_TPCH100GB_batch.java:1)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.tFileInputDelimited_1_HDFSInputFormatProcess(ELT_File_HDFS_Hive_TPCH100GB_batch.java:4652)
Counters: 40
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=142415
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=6383323
HDFS: Number of bytes written=0
HDFS: Number of read operations=3
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Killed reduce tasks=7
Launched map tasks=5
Launched reduce tasks=7
Other local map tasks=2
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=98207
Total time spent by all reduces in occupied slots (ms)=451176
Total time spent by all map tasks (ms)=98207
Total time spent by all reduce tasks (ms)=451176
Total vcore-seconds taken by all map tasks=98207
Total vcore-seconds taken by all reduce tasks=451176
Total megabyte-seconds taken by all map tasks=100563968
Total megabyte-seconds taken by all reduce tasks=462004224
Map-Reduce Framework
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.run(ELT_File_HDFS_Hive_TPCH100GB_batch.java:8618)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.runJobInTOS(ELT_File_HDFS_Hive_TPCH100GB_batch.java:8555)
at local_project.elt_file_hdfs_hive_tpch100gb_batch_0_1.ELT_File_HDFS_Hive_TPCH100GB_batch.main(ELT_File_HDFS_Hive_TPCH100GB_batch.java:8534)
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=98
Input split bytes=388
Combine input records=0
Combine output records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=100
CPU time spent (ms)=840
Physical memory (bytes) snapshot=278835200
Virtual memory (bytes) snapshot=2105524224
Total committed heap usage (bytes)=196608000
File Input Format Counters
Bytes Read=0
[statistics] disconnected
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
Job ELT_File_HDFS_Hive_TPCH100GB_batch ended at 05:37 19/09/2016. [exit code=1]

    admin

    September 22, 2016 at 5:24 pm

    Sachin,
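    The map task attempts are failing with "GC overhead limit exceeded", which means the task JVM is running out of heap. Try increasing the map container memory (mapreduce.map.memory.mb) and the map task heap (-Xmx in mapreduce.map.java.opts). If the problem persists, you may also need to increase the YARN “NodeManager Java heap size” in the Ambari YARN configs.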
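The job counters in the log above also give a hint about how much memory each task container had. The sketch below assumes that the "megabyte-seconds" counters are allocated megabytes multiplied by task milliseconds, which the numbers in this log are consistent with. The 0.8 heap fraction is a common rule of thumb, not something stated in the thread, and the function names are made up for the example.

```python
def container_mb(mb_time_counter: int, task_millis: int) -> float:
    """Average memory (MB) allocated per running task, derived from counters."""
    return mb_time_counter / task_millis


def suggested_heap_mb(container_size_mb: float, heap_fraction: float = 0.8) -> int:
    """Heap size (-Xmx, in MB) that leaves headroom inside the container."""
    return int(container_size_mb * heap_fraction)


# Values copied from the counters in the log above.
map_mb = container_mb(100563968, 98207)       # map megabyte-seconds / map ms
reduce_mb = container_mb(462004224, 451176)   # reduce megabyte-seconds / reduce ms
print(map_mb, reduce_mb)                      # 1024.0 1024.0

# Hypothetical next step: double the map container and derive a heap size.
print(suggested_heap_mb(2 * map_mb))          # 1638
```

On these numbers each map task ran in a 1024 MB container, so the task heap is smaller than that and is a plausible place to look when a map attempt reports GC overhead errors.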

Chandra

September 22, 2016 at 5:26 pm

I am stuck connecting to Hive in Beeline through Knox on a Kerberized cluster.
beeline -u "jdbc:hive2://c4t22317.itcs.hpecorp.net:9083/;ssl=true;transportMode=http;httpPath=cliservice/default/hive" -n test -p test
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://c4t22317.itcs.hpecorp.net:9083/;ssl=true;transportMode=http;httpPath=cliservice/default/hive
Error: Could not open client transport with JDBC Uri: jdbc:hive2://c4t22317.itcs.hpecorp.net:9083/;ssl=true;transportMode=http;httpPath=cliservice/default/hive: Could not create http connection to jdbc:hive2://c4t22317.itcs.hpecorp.net:9083/;ssl=true;transportMode=http;httpPath=cliservice/default/hive. javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection? (state=08S01,code=0)
Beeline version 1.2.1000.2.4.0.0-169 by Apache Hive

    admin

    September 22, 2016 at 5:27 pm

    Hello Chandra,

    The way you are connecting is not the right way to connect Beeline in a Kerberized cluster.
    When Kerberos is enabled, you first have to use kinit to obtain a ticket, and then you can use Beeline as follows.

    [saurkuma@sandbox ~]$ kinit saurkuma

    You can use beeline to connect from an edge-node server to hiveserver2. Below is an example:

    beeline -u "jdbc:hive2://127.0.0.1:10000/default;principal=hive/sandbox.hortonworks.com@EXAMPLE.COM;auth=kerberos" -n

    The key part of this example is the JDBC URL that has to be provided for Kerberos authentication to work correctly. Note the main sections of the JDBC URL: jdbc:hive2://127.0.0.1:10000/default; principal=hive/sandbox.hortonworks.com@EXAMPLE.COM; auth=kerberos

    The first part is a standard JDBC URL that provides information about the driver (hive2), the hostname (127.0.0.1), the port number (10000), and the default database (default).

    The second part is special to Kerberos. It tells you what service principal is used to authenticate to this URL.

    The final part tells JDBC that you want Kerberos authentication (auth=kerberos).

    You’ll also note that the command line for beeline included a specification that I wanted to connect with a specific username (-n ). This is required so that beeline knows which Kerberos TGT to look for.

    All of this assumes that when you login to the edge node server, you followed standard protocol to get a kerberos TGT. (The profile is setup so that you’re automatically prompted again for your password. This establishes your TGT.)
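To make the three parts of the JDBC URL described above concrete, here is a small sketch that assembles the URL from its pieces and splits it back apart. The function names are illustrative only; the host, port, database and principal are the values from the example in the reply.

```python
from typing import Dict, Tuple


def hive_kerberos_jdbc_url(host: str, port: int, database: str, principal: str) -> str:
    """Assemble the URL: standard JDBC part, service principal, auth setting."""
    base = f"jdbc:hive2://{host}:{port}/{database}"
    return f"{base};principal={principal};auth=kerberos"


def split_jdbc_url(url: str) -> Tuple[str, Dict[str, str]]:
    """Return the base URL and the ;key=value settings that follow it."""
    base, *settings = url.split(";")
    pairs = dict(item.split("=", 1) for item in settings if item)
    return base, pairs


url = hive_kerberos_jdbc_url(
    "127.0.0.1", 10000, "default", "hive/sandbox.hortonworks.com@EXAMPLE.COM"
)
print(url)
base, settings = split_jdbc_url(url)
print(base)      # jdbc:hive2://127.0.0.1:10000/default
print(settings)  # {'principal': 'hive/sandbox.hortonworks.com@EXAMPLE.COM', 'auth': 'kerberos'}
```

Building the string from parts makes it harder to drop a semicolon or mistype a key, either of which changes how the URL is parsed.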

      Chandra

      September 22, 2016 at 5:28 pm

      Thank you Saurabh.
      It is working now.

Somappa

October 1, 2016 at 11:48 am

These are questions from a Hadoop admin interview I attended. Can you help me?

1) After restarting the cluster, if the MapReduce jobs that were working earlier are failing now, what could have gone wrong while restarting?

2) In a large, busy Hadoop cluster, how can you identify a long-running job?

    admin

    October 3, 2016 at 10:52 am

    Somappa,

    Please find the requested details below.
    1) After restarting the cluster, if the MapReduce jobs that were working earlier are failing now, what could have gone wrong while restarting?
    Ans: You need to check the logs for the failed jobs, either in the RM logs (/var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-m2.hdp22.log) or in the NM logs (/var/log/hadoop-yarn/nodemanager/*.log). That way you will find the reason for the job failure.
    You can also run the following command to check the logs of a specific job.
    $ oozie job -log [-action 1, 3-4, 7-40] (-action is optional.)

    2) In a large, busy Hadoop cluster, how can you identify a long-running job?

    Ans: I have answered this question in an earlier answer, so you can refer to it there.

Somappa

October 1, 2016 at 11:58 am

Difference between MapReduce version one and MapReduce version two.
How do you identify a long running job and how do you troubleshoot that
How do you kill job.
How do you add a service or install a component in existing Hadoop cluster.
How do you restart the Name Node?

    admin

    October 3, 2016 at 10:11 am

    Hello Somappa,

    Please find the following answers:

    1. Difference between MapReduce version one and MapReduce version two ?

    Ans:
    MRv1, also referred to as Hadoop 1, is the version in which resource management and scheduling are tightly coupled with the MapReduce programming framework (the JobTracker does both). Because of this, non-batch applications cannot run on Hadoop 1. It also has a single NameNode, so it does not provide high availability or scalability.

    MRv2 (aka Hadoop 2): in this version, resource management and scheduling are separated from MapReduce and handled by YARN (Yet Another Resource Negotiator). The resource management and scheduling layer lies beneath the MapReduce layer. It also provides high availability and scalability, as we can create redundant NameNodes. A new snapshot feature lets us take backups of filesystems, which helps with disaster recovery.

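    2. How do you identify a long running job and how do you troubleshoot that?
    Ans: The best way is to use the RM portal to identify long-running jobs, or you can use the following command with a small trick: the StartTime column is in epoch milliseconds, so comparing it with the current time tells you how long each job has been running.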
    [yarn@m2.hdp22 ~]$ mapred job -list
    16/10/03 05:57:55 INFO impl.TimelineClientImpl: Timeline service address: http://m2.hdp22:8188/ws/v1/timeline/
    Total jobs:2
    JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
    job_1473919844018_49782 RUNNING 1475488492029 hdpbatch batch NORMAL 2 0 24576M 0M 24576M http://m2.hdp22:8088/proxy/application_1473919844018_49782/
    job_1473919844018_49761 RUNNING 1475487108530 hdpbatch batch NORMAL 5 0 61440M 0M 61440M http://m2.hdp22:8088/proxy/application_1473919844018_49761/
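
    As a rough sketch of that comparison (it is not part of any Hadoop tool), the Java snippet below takes the lines printed by mapred job -list and returns the RUNNING jobs that are older than a threshold. The class name LongRunningJobs, the method olderThan and the "current time" value are assumptions made up for this illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class LongRunningJobs {
    /** Returns the IDs of RUNNING jobs that started more than `threshold` before `nowMillis`. */
    static List<String> olderThan(List<String> jobListLines, long nowMillis, Duration threshold) {
        List<String> result = new ArrayList<>();
        for (String line : jobListLines) {
            String[] cols = line.trim().split("\\s+");
            // Data rows start with a job ID; skip log lines, the header and non-running jobs.
            if (cols.length < 3 || !cols[0].startsWith("job_") || !"RUNNING".equals(cols[1])) {
                continue;
            }
            long startMillis = Long.parseLong(cols[2]); // StartTime column, epoch milliseconds
            if (nowMillis - startMillis > threshold.toMillis()) {
                result.add(cols[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info",
            "job_1473919844018_49782 RUNNING 1475488492029 hdpbatch batch NORMAL 2 0 24576M 0M 24576M",
            "job_1473919844018_49761 RUNNING 1475487108530 hdpbatch batch NORMAL 5 0 61440M 0M 61440M");
        long now = 1475488675000L; // assumed "current time", a few minutes after the first job started
        System.out.println(olderThan(lines, now, Duration.ofMinutes(15)));
    }
}
```

    In practice you would pass System.currentTimeMillis() as the current time and pick a threshold that matches your batch windows.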

    3. How do you kill a job?
    Ans: [yarn@m2.hdp22 ~]$ hadoop job -kill <job_id>

    4. How do you add a service or install a component in an existing Hadoop cluster?
    Ans: If you are using Hortonworks or Cloudera, it is very easy to add it through Ambari or Cloudera Manager, but you can also use the REST API to add services to your cluster.
    [yarn@m2.hdp22 ~]$ curl -u admin:admin -H 'X-Requested-By: Ambari' -X PUT -d '{"RequestInfo":{"context":"Install JournalNode"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' http://localhost:8080/api/v1/clusters/CLUSTER_NAME/hosts/NEW_JN_NODE/host_components/JOURNALNODE

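    5. How do you restart the NameNode?
    Ans: The best way is to restart it through Ambari or Cloudera Manager, or you can restart it from the command line by running stop-dfs.sh followed by start-dfs.sh. Note that these scripts stop and start all HDFS daemons, not only the NameNode. To restart only the NameNode on Hadoop 2, run hadoop-daemon.sh stop namenode and then hadoop-daemon.sh start namenode on the NameNode host.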

Somappa

October 6, 2016 at 8:12 am

How to read xml file in Mapreduce ?
what are the input formats in pig ?
what is difference between hive optimization and compression in hive ?

    admin

    October 8, 2016 at 6:43 am

    How to read an XML file in MapReduce?
    Ans: There are many ways to process XML in Hadoop, for example with Mahout, but here I would advise you to look at the following articles.
    http://www.hadoopadmin.co.in/yarn/process-xml-file-via-mapreduce/
    http://www.hadoopadmin.co.in/pig/process-xml-file-via-apache-pig/

    What are the input formats in Pig?
    Ans: Pig is a platform for analyzing large data sets on top of Hadoop. To load a custom input dataset, Pig uses a loader function, which loads the data from the file system. The loader function uses a specified InputFormat, which splits the input data into logical splits. The InputFormat in turn uses a RecordReader, which reads each input split and emits records as input for the map function.

    What is the difference between Hive optimization and compression in Hive?
    Ans: Optimization tries to minimize the number of operations needed for a given amount of information, while compression tries to minimize the number of data bits needed to store a given amount of information. In Hive, optimization is about how a query runs (for example partitioning, bucketing and map joins), whereas compression is about how the data is stored and transferred (for example with a codec such as Snappy or Gzip). There is a good article on Hive query optimization that you can refer to:
    https://www.qubole.com/blog/big-data/hive-best-practices/

Somappa

October 6, 2016 at 8:14 am

I need some help to write cloudera exam certification

Thanks

    admin

    October 8, 2016 at 6:45 am

    Which Cloudera exam (CCAH, CCDH, CCP) are you planning to take?

      somappa

      October 18, 2016 at 4:34 pm

      CCAH

sureshk

October 16, 2016 at 10:13 am

Hi Saurabh,

I need some MR examples on unstructured data, can you please help on this…

Thanks
suresh.k

    admin

    October 16, 2016 at 5:09 pm

    Hi Suresh,

    Can you please look at the following blog post and let me know if it helps you? I wrote it to show how to process unstructured data with Hive.
    http://www.hadoopadmin.co.in/process-hdfs-data-with-hive/

      Sureshk

      October 17, 2016 at 9:05 am

      Thanks Saurabh.
      Can you process the same data with MR?

      Regards
      Suresh.k

        admin

        October 17, 2016 at 1:37 pm

        Sure Suresh, I will test it and write an article on it. But I would like to know whether there is any specific need to do it via Java MR code.

          Sureshk

          October 18, 2016 at 5:57 am

          Hi Saurabh,
          It is for self-learning purposes, as I am new to Hadoop.

          Thanks
          Suresh.k

          admin

          October 18, 2016 at 6:32 am

          ok np Suresh.

suresh k

October 17, 2016 at 9:08 am

Hi Saurabh,
can you please explain map join and data cache in hive.

Thanks
Suresh.k

    admin

    October 17, 2016 at 1:34 pm

    Hi Suresh,

    Please find your answer.
    Map Join in Hive:
    A map join avoids the reduce phase, because the join work is completed in the map phase. Assume that we have two tables, one of which is small. When we submit the query, a local MapReduce task is created before the original join MapReduce task; it reads the data of the small table from HDFS and stores it in an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.

    In the next stage, when the original join MapReduce task is running, it moves the data in the hash table file to the Hadoop distributed cache, which copies these files to each mapper's local disk. So all the mappers can load this persistent hash table file back into memory and do the join work as before. After optimization, the small table needs to be read just once. Also, if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to this machine.
    There are two ways to enable it. The first is by using a hint, which looks like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style comment should be placed immediately following the SELECT. It directs Hive to load aliasname (which is a table or alias of the query) into memory.
    SELECT /*+ MAPJOIN(c) */ * FROM orders o JOIN cities c ON (o.city_id = c.id);
    Or we can enable it for the whole cluster, so that Hive automatically converts joins that involve a small table. Simply set hive.auto.convert.join to true in your config, and Hive will automatically use map joins for any tables smaller than hive.mapjoin.smalltable.filesize (default is 25MB).
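
    To make the idea concrete, here is a minimal plain-Java sketch of what each mapper does in a map join: the small table sits in an in-memory hash table and the big table is streamed past it, so no reduce phase is needed. This only illustrates the technique and is not Hive's actual implementation; the class MapJoinSketch and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinSketch {
    /**
     * Inner-joins orders with cities the way a map-side join does:
     * the small table is an in-memory hash table, the big table is streamed.
     * Each order is {orderId, cityId}; orders without a matching city are dropped.
     */
    static List<String> join(Map<Integer, String> cities, List<int[]> orders) {
        List<String> joined = new ArrayList<>();
        for (int[] order : orders) {
            String cityName = cities.get(order[1]); // hash lookup, no shuffle or reduce
            if (cityName != null) {
                joined.add(order[0] + "," + cityName);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Small table: loaded fully into memory (hypothetical sample rows).
        Map<Integer, String> cities = new HashMap<>();
        cities.put(1, "Pune");
        cities.put(2, "Delhi");

        // Big table: processed row by row, as a mapper would see its input split.
        List<int[]> orders = List.of(new int[] {100, 1}, new int[] {101, 3}, new int[] {102, 2});

        System.out.println(join(cities, orders));
    }
}
```

    The hash table is the reason the small table has to fit in memory, which is what the hive.mapjoin.smalltable.filesize limit protects.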

    Data Cache:
    DistributedCache is a facility provided by the MapReduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on every node where your map/reduce tasks are running (on the local file system, not in memory). Then you can access the cached file as a local file in your Mapper or Reducer. Now you can easily read the cached file and populate some collection (e.g. an array or a HashMap) in your code.

    HDFS caching helps, but only a bit, since you save only the cost of moving bytes off disk and are still paying the cost of deserialization, do not get the benefit of JVM JIT, etc. So, with technologies like Hive LLAP (coming in Hive 2) you will get significantly better performance, because LLAP caches deserialized vectors in memory-efficient formats (2 bits for certain integer ranges rather than 4 bytes), uses CPU-efficient filters (vectorized query processing), removes JVM startup cost for tasks (100s of ms), provides JIT-enhanced CPU performance, etc. Rather excited about it!
    You can refer to the following link.
    https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html

somappa

October 18, 2016 at 4:40 pm

i need your help how setup multinode cluster in cloudera distribution.

i worked with 7 node cluster using ubuntu os, i want to do using cloudera but its not working to me.

i want real time scenario to understand setup..

i want to do project in hadoop . i need complete setup similar to real time configuration systems. how many gb Ram, hard disk.

i need to process 1 TB Data.

Sureshk

October 20, 2016 at 4:34 am

Hi Saurabh,
how we will load json files using pig ,can you give the example

Thanks
Suresh.k

    admin

    October 20, 2016 at 11:01 am

    Hello Suresh,

    You need to use the built-in JsonLoader function to load JSON data, as in the following example.
    a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');

    or without a schema:
    a = load 'a.json' using JsonLoader();
    In this example data is loaded without a schema; it assumes there is a .pig_schema (produced by JsonStorage) in the input directory.

    Note that there is no concept of a delimiter in JsonLoader or JsonStorage. The data is encoded in standard JSON format. JsonLoader optionally takes a schema as the constructor argument.

sureshk

October 20, 2016 at 4:44 am

Hi Sir,
what is Thrift Server In hive when to use..

Thanks
Suresh.k

    admin

    October 20, 2016 at 11:10 am

    Apache Thrift is a software framework for scalable cross-language services development, which combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Perl, C#, JavaScript, Node.js and other languages.

    Thrift can be used when a service developed in one language needs to be accessed from a client written in another language. HiveServer is a service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. It is built on Apache Thrift, which is why it is sometimes called the Thrift server.

    In the context of Hive, the Java language can be used to access the Hive server directly. The Thrift interface acts as a bridge, allowing other languages to access Hive, using a Thrift server that interacts with the Java client.

Sureshk

October 22, 2016 at 11:42 am

Hi sir,
can we process input formats like (doc,pdf )in pig?
can you give example

sureshk

October 22, 2016 at 5:10 pm

Hi,
how to check the health of a cluster and display the paths of corrupted files

Thanks
Suresh.k

    admin

    October 22, 2016 at 6:48 pm

    Suresh, please find your answer.

    1. If you are using a GUI tool (Cloudera Manager or Ambari) to manage your cluster, you can easily monitor the cluster from there. If not, you can check the health of your cluster by executing the following command:

    hdfs dfsadmin -report

    This command reports the status of each node in the cluster, along with counts of corrupt and under-replicated blocks.

    2. To display the paths of corrupted files, follow these steps:
    hdfs fsck / | egrep -v '^\.+$' | grep -v eplica

    Once you find a file that is corrupt, use the command below to find the location of its blocks:
    hdfs fsck /path/to/corrupt/file -locations -blocks -files

    Now we can either move the corrupted files to /lost+found or delete them if they are not required:
    hdfs fsck / -move
    hdfs fsck / -delete

    I hope it will help you to find your answers.

sureshk

October 23, 2016 at 5:59 am

Hi Sir,
Please find below hadoop interview questions.
Can you please share answers.

1. What are the steps in mapreduce job submission.
2. What are the functions of the input format class
3. Write a command to start the hadoop distributed file system
4. Sequence the steps in the mapping and reduces phases.
5. What are the major features of the hdfs
6. Which piece of code determines the format of the data that will go into the mapper
7. What is the benefit of serialization in Hadoop? Why can't you store the file as is?
8. How does serialization happen in Hadoop? What exactly happens behind the scenes?
9. How are ORC, Avro and Parquet stored?
10. Pros and cons and comparison of ORC, Avro and Parquet.
11. How can you implement updates in Hive?
Thanks
Suresh.k

sureshk

October 25, 2016 at 6:56 pm

Hi,
im installing Openoffice in CDH
But im getting below exception
[root@quickstart DEBS]# dpkg -i *.deb
bash: dpkg: command not found
[root@quickstart DEBS]# sudo dpkg -i *.deb
sudo: dpkg: command not found
any suggestions
Thanks
suresh.k

    admin

    October 26, 2016 at 12:49 pm

    Suresh,

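    Can you clarify your question, please? Are you trying to install OpenOffice on the same machine where you have CDH installed? If yes, can you tell me which OS you have there? dpkg is only available on Debian-based systems; if your OS is RPM-based (for example CentOS), you will need the RPM packages instead of the .deb files.

    Can you run the following commands to get more details:
    $ locate dpkg
    $ dpkg --help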

      sureshk

      October 26, 2016 at 5:29 pm

      thanks you Saurabh
      1.What is the benefit of serialization in Hadoop, Why cant you store the file as is.
      2. How does serialization happens in Hadoop, What exactly happens behind the scenes.
      Can You please clarify the above questions
      Thanks
      Suresh.k

        admin

        November 11, 2016 at 6:44 pm

        Suresh,

        Please find your answer.
        1. What is the benefit of serialization in Hadoop? Why can't you store the file as is?
        Serialization is the process of converting structured objects into a byte stream. It is done for two main purposes: transmission over a network (inter-process communication) and writing to persistent storage. In Hadoop, inter-process communication between nodes is done using remote procedure calls (RPCs). The RPC protocol uses serialization to turn the message into a binary stream to be sent to the remote node, which receives and deserializes the binary stream into the original message. If you stored and sent the data as is, it would be very heavy on your network and not feasible to transmit. A good serialization format also has the following advantages:
        Compact: To use network bandwidth efficiently.
        Fast: Very little performance overhead is expected for the serialization and deserialization process.
        Extensible: To adapt to new changes and requirements.
        Interoperable: The format needs to be designed to support clients that are written in different languages from the server.

        2. How does serialization happen in Hadoop? What exactly happens behind the scenes?
        Hadoop uses its own serialization format, Writables. Writable is compact and fast, but not extensible or interoperable. The Writable interface has two methods, one for writing and one for reading. The method for writing writes its state to a DataOutput binary stream and the method for reading reads its state from a DataInput binary stream.

        public interface Writable
        {
            void write(DataOutput out) throws IOException;
            void readFields(DataInput in) throws IOException;
        }

        Let us understand serialization with an example. Given below is a helper method.
        public static byte[] serialize(Writable writable) throws IOException
        {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream dataOut = new DataOutputStream(out);
            writable.write(dataOut);
            dataOut.close();
            return out.toByteArray();
        }
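
        For completeness, here is a small round-trip sketch that also reads the bytes back. To keep it runnable without Hadoop on the classpath, it uses a stand-in interface SimpleWritable with the same two methods as Writable and a hypothetical IntBox value type; both names, and the deserialize helper, are assumptions made for this illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class WritableRoundTrip {
    /** Writes the object's state into a byte array. */
    static byte[] serialize(SimpleWritable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(out)) {
            writable.write(dataOut);
        }
        return out.toByteArray();
    }

    /** Fills an existing object from bytes produced by serialize(). */
    static void deserialize(SimpleWritable writable, byte[] bytes) throws IOException {
        try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
            writable.readFields(dataIn);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new IntBox(163));
        IntBox copy = new IntBox();
        deserialize(copy, bytes);
        System.out.println(bytes.length + " bytes, value " + copy.value);
    }
}

/** Stand-in for org.apache.hadoop.io.Writable so the sketch compiles without Hadoop. */
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical value type holding a single int. */
final class IntBox implements SimpleWritable {
    int value;

    IntBox() {}

    IntBox(int value) {
        this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value); // an int is written as 4 bytes
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
}
```

        With a real Hadoop dependency you would replace SimpleWritable with Writable and IntBox with a type such as IntWritable; the read and write flow stays the same.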

          sureshk

          November 27, 2016 at 6:45 am

          Thanks
          What are the steps in mapreduce job submission.
          Can you please explain this.

          admin

          November 27, 2016 at 9:55 am

          Below are the steps that are followed from the time a user submits an MR job until it reaches the JobTracker:
          1. The user copies the input file to the distributed file system.
          2. The user submits the job.
          3. The job client gets the input file information.
          4. The job client creates the splits.
          5. The job client uploads the job information, i.e. Job.jar and Job.xml.
          6. The job output directory is validated via an HDFS API call; then the client submits the job to the JobTracker using an RPC call.

          Once the job is submitted to the JobTracker, it is the JobTracker's responsibility to distribute the job to the TaskTrackers, schedule tasks and monitor them, and provide status and diagnostic information back to the job client. Details of job submission on the JobTracker side are out of scope for this post, but I plan to write a dedicated post in the future which details the job flow for both the JobTracker and the TaskTracker.
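
          To give a feel for step 4, here is a simplified sketch of how the number of splits for one splittable file can be estimated. It is modeled on my understanding of how FileInputFormat sizes splits (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to be up to 10% larger); treat the formula, the class name SplitSketch and the sample sizes as assumptions rather than the exact Hadoop code.

```java
final class SplitSketch {
    /** Allowed overshoot of the last split (10%); an assumption modeled on FileInputFormat. */
    static final double SPLIT_SLOP = 1.1;

    /** Split size is the block size, clamped between the configured minimum and maximum. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts the splits for a single splittable file of the given length. */
    static long countSplits(long fileLength, long splitSize) {
        long splits = 0;
        long remaining = fileLength;
        // Cut full-size splits while more than 1.1 splits' worth of data is left.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println("1024 MB file: " + countSplits(1024 * mb, size) + " splits");
        System.out.println("130 MB file: " + countSplits(130 * mb, size) + " splits");
    }
}
```

          Each split normally becomes one map task, so this count is also a rough estimate of the number of mappers for the file.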

Nilesh

October 26, 2016 at 7:35 am

Hi Saurabh,

Could you please guide for how to take incremental backup of Hbase table via Export command.

    admin

    October 26, 2016 at 12:41 pm

    Hello Nilesh,

    You can do it in the following way.

    /*******************************************************************************************/
    /* STEP1: FULL backup from sourcecluster to targetcluster
    /* if no table name is specified, all tables from the source cluster will be backed up
    /*******************************************************************************************/
    [sourcecluster]$ hbase backup create full hdfs://hostname.targetcluster:9000/userid/backupdir t1_dn,t2_dn,t3_dn

    14/05/09 13:35:46 INFO backup.BackupManager: Backup request backup_1399667695966 has been executed.
    /*******************************************************************************************/
    /* STEP2: In HBase Shell, put a few rows
    /*******************************************************************************************/
    hbase(main):002:0> put 't1_dn','row100','cf1:q1','value100_0509_increm1'
    hbase(main):003:0> put 't2_dn','row100','cf1:q1','value100_0509_increm1'
    hbase(main):004:0> put 't3_dn','row100','cf1:q1','value100_0509_increm1'

    /*******************************************************************************************/
    /* STEP3: Take the 1st incremental backup
    /*******************************************************************************************/
    [sourcecluster]$ hbase backup create incremental hdfs://hostname.targetcluster:9000/userid/backupdir

    14/05/09 13:37:45 INFO backup.BackupManager: Backup request backup_1399667851020 has been executed.

    /*******************************************************************************************/
    /* STEP4: In HBase Shell, put a few more rows.
    /* update 'row100', and create new 'row101'
    /*******************************************************************************************/
    hbase(main):005:0> put 't3_dn','row100','cf1:q1','value101_0509_increm2'
    hbase(main):006:0> put 't2_dn','row100','cf1:q1','value101_0509_increm2'
    hbase(main):007:0> put 't1_dn','row100','cf1:q1','value101_0509_increm2'
    hbase(main):009:0> put 't1_dn','row101','cf1:q1','value101_0509_increm2'
    hbase(main):010:0> put 't2_dn','row101','cf1:q1','value101_0509_increm2'
    hbase(main):011:0> put 't3_dn','row101','cf1:q1','value101_0509_increm2'

    /*******************************************************************************************/
    /* STEP5: Take the 2nd incremental backup
    /*******************************************************************************************/
    [sourcecluster]$ hbase backup create incremental hdfs://hostname.targetcluster:9000/userid/backupdir

    14/05/09 13:39:33 INFO backup.BackupManager: Backup request backup_1399667959165 has been executed.

    /*******************************************************************************************/

Nilesh

October 26, 2016 at 3:27 pm

Thank you Saurabh,

How to do it within the same cluster as we don`t have the target cluster.

    admin

    October 26, 2016 at 3:48 pm

    You can try using a node name from the same cluster in place of the target cluster node in the backup path.

Adarsh

November 24, 2016 at 4:52 am

Can i have some Q&A on social media data on big data ?

    admin

    November 24, 2016 at 10:18 am

    1. How did you become interested in social media?
    I really like the affordances of social media platforms for research but am not by disposition a social media sharer. I don’t have a smart phone with a camera. I will only Tweet during a Twitter campaign for a professional organization that I’m part of, but that’s about it. I’m cyber-squatting a few social media sites, but I’m still unhatched on Twitter.

    2. What is the state of social media currently?
    Social media platforms are the current social “commons” or “public square” for this current age. It’s where people flock to engage in image-making and social performance. People will perform for an imagined audience. For example, recent news stories have covered the wide use of fembots for an online service that professed to connect married people with possible extramarital partners. It’s where people go to people-watch.

    3. In recent years, have social media platforms met your perceived expectations?
    So much of how people think of the world is through social frames and social constructs. People do social reference; they look to others to know how to interpret the events around them. They judge others’ attractiveness and social standing based on who those people socialize with; it’s that “company you keep” phenomena. It’s intriguing to see how social relationships on social media platforms so often follow power law distribution curves, with a few who garner most of the attention and the common folk and wannabes existing somewhere along the “long tail,” with just a friend or two following. The research on activating people is also eye-opening. While people can create a huge following and a lot of attention on a thing, it’s not always that attention translates into money spent or behavioral activation. It is hard to change people’s minds because of built-in confirmation biases and how people selectively pay attention to information that supports what they already think. We all wear interpretive lenses that are informed by what we want to believe. Many times, there is a lot of noise and smoke and rumor but no “fire”.

    4. In your opinion, why is social media currently such an area of research interest ?
    So beyond the human factor, researchers are finding value to exploring social media data. First, generically speaking, researchers can collect a lot of data at minimal cost. This is data in the wild, and it is empirical data. After some cleaning to remove spam and other noise, this data can be analyzed with software tools that offer quantitative and statistical based insights.

    5. Technologically, is it difficult to extract data from social media?
    Yes and no. A number of social media platforms have application programming interfaces (APIs) which enable developers to access some public (non-protected) data from their platforms, but these are often limited amounts of data—both in the sense of rate-limiting (amount accessible in a time period) and also in terms of total amount of data available. Other platforms are built on open technologies which are crawlable and which enable access to contents, to trace data, and to metadata.
    Then, there are research software tools that harness web browser add-ons to extract data from social media platforms. These tools are designed around graphical user interfaces (GUIs) and are fairly well documented in terms of procedures and processes. A number of open-source and free high-level programming (scripting) languages have defined methods for how to scrape data from social media platforms, with some able to even tap the Deep Web by auto-filling out web forms.

    6. What are some common methods for analyzing social media data?
    What I share is only going to be fairly limited. Manual methods are not uncommon. Researchers have their unique areas of expertise, and the extracted data may be analyzed by researcher teams to very positive effects. There are automated theme and subtheme extraction approaches; this topic modeling enables data summarization. Network analysis is another common approach to understand relationships between social media accounts…but also between concepts and data… Once a method is discovered, researchers can be very creative in applying that to various contexts. Cluster analyses are common—to capture similarity relationships. Sentiment analysis is also common, to understand positive and negative polarities of expressed opinions. Emotion analysis is a spin-off of sentiment analysis and offers even richer insights than sentiment alone. Linguistic analysis is common. Geographical data may be extracted from social media datasets, and this data is used to identify relationships between spatial proximity and distance, and captured research variables. There’s work going on to use computer vision and open-source computer-vision technologies to extract insights from still images and videos. Of course, every analytical approach applied in small scale can be applied in big data scale.

    7. What are people learning about and through social media?
    There are thousands of research articles related to various social media platforms and findings from both manual and auto-coded (machine-coded) insights. Researchers have gone to social media to understand collective intelligence and sentiment around particular issues. They have explored social media for business, e-governance, security, and other strategies. While there are unique locally-focused cases, there are also boggling big data approaches that include tens of millions of records. I sound like a rube when I mention the massive scale sizes of data because it’s really been years since the whole Web was mapped, and the data collection continues apace. There are a number of methods to advance understandings in the various fields.

    For more details, you can refer to:
    http://www.igi-global.com/newsroom/archive/social-media-data-extraction-content/2603/
    http://www.pulsarplatform.com/blog/2016/jamies-social-club-1-get-to-know-your-data-sources-in-social-media-research/

kedarnath

November 28, 2016 at 1:15 pm

hi i am looking for hadoop developer interview questions can u please provide me(including mapreduce,hive,hbase,pig)

as

November 28, 2016 at 7:18 pm

hi Saurabh,
This is a great forum.
Thanks for your effort and persistence.
I had a question on any help you can provide on taking the Cloudera Hadoop Admin exam.

    admin

    November 29, 2016 at 6:28 am

    Hello,

    Thanks for your valuable feedback. Please go ahead and ask your questions; we will certainly help you to crack the CCAH exam, as a few of my team members have done CCAH 4.0.

raj

January 26, 2017 at 9:40 pm

Hi saurabh ,

I am planning to appear for the Cloudera Hadoop admin exam. Can you please provide recent dumps for it, if you have any, which will help me in clearing the exam?

    admin

    January 30, 2017 at 9:40 am

    Raj,

    Sorry to say, I don't have any dumps. I would also not advise you to rely on dumps, as the questions in the certification exam change most of the time, and only your hands-on, practical knowledge will help you to crack the exam.

    I would suggest you go through the certification course content. If you have an issue with any Hadoop topic or want to understand anything, please feel free to let me know; I would be happy to help you.

sureshk

February 4, 2017 at 5:37 pm

Hi ,

I would like to have examples for PDF,XML in Scala ,can you please help me out

Thanks
Sureshk

    admin

    February 6, 2017 at 7:15 am

    Suresh,

    Please find the below answer.

    Sample Input XML file.
    GPX file example
    597.0

    597.0

    598.7

    Code to process it :

    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.{SparkConf, SparkContext}

    object XMLParser {
      def main(args: Array[String]): Unit = {
        System.setProperty("hadoop.home.dir", "D:\\hadoop\\hadoop-common-2.2.0-bin-master\\")
        val conf = new SparkConf().setAppName("XMLParsing").setMaster("local")
        val sc = new SparkContext(conf)

        val jobConf = new JobConf()
        jobConf.set("stream.recordreader.class",
          "org.apache.hadoop.streaming.StreamXmlRecordReader")
        jobConf.set("stream.recordreader.begin", "")
        org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, "D:\\RnD\\Workspace\\scala\\TestSpark\\testData\\xmlinput.xml")

        // Load documents (one per line).
        val documents = sc.hadoopRDD(jobConf, classOf[org.apache.hadoop.streaming.StreamInputFormat],
          classOf[org.apache.hadoop.io.Text],
          classOf[org.apache.hadoop.io.Text])

        import scala.xml.XML
        val texts = documents.map(_._1.toString)
          .map { s =>
            val xml = XML.loadString(s)
            val trackpts = xml \ "trkpt"
            val gpsData = trackpts.map(
              xmlNode => (
                (xmlNode \ "@lat").text.toDouble,
                (xmlNode \ "@lon").text.toDouble
              ))
            gpsData.toList
          }
        println(texts.first)

      }
    }

sureshk

February 13, 2017 at 2:56 am

Hi saurabh ,
Can you please ans the below questions.
1.What are the factors that we consider while creating a hive table?
2.What are the compression techniques and how do you decide which one to use?
3.What are the key challenges you faced while importing data in Sqoop
4.Advantages of Columnar data stores over Row-oriented data stores
5.How did you conclude on the optimal bucket size?
6.What is bloom filter, how is it different from distributed cache, how do you implement it in MR?
Thanks
Suresh.k

Atul

February 17, 2017 at 10:42 am

Hi All,

i am new to oozie. i am trying to call a hive script using oozie. every time i run the oozie workflow i am getting the error Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]

here is my workflow file

${jobTracker}
${nameNode}

mapred.job.queue.name
${queueName}

oozie.hive.defaults
/user/root/MyExamples/hive-default.xml

getcityname.hsql

Hive failed, error message

    admin

    February 17, 2017 at 11:23 am

    Hi Atul,

    Thanks for contacting us.
    This error is just a wrapper around the root cause, so you need to go to the Oozie logs and find the actual error.
    I would also advise you to follow the URL below for running a Hive action with Oozie.
    http://www.hadoopadmin.co.in/hive/hive-actions-with-oozie/

    If you still face any issue, or you see any suspicious error in the logs, please paste it here and we will help you to debug and resolve it.

suresh

February 18, 2017 at 5:29 pm

Hi,
Transactions in Hive (Insert, update and delete)

Thanks
Suresh.k

Amandeep Singh

February 19, 2017 at 6:31 am

Hi Saurabh,

I am trying for Hadoop admin certification , can you help me with following question:

Cluster Summary: 45 files and directories, 12 blocks = 57 total. Heap size is 15.31 MB/193.38MB(7%)

Configured Capacity: 17.33GB
DFS Used: 144KB
NON DFS Used :5.49GB
DFS Remaining: 11.84GB
DFS Used % : 0%
DFS Remaining %: 68.32GB
Live Nodes: 6
Dead Nodes: 1
Decommision nodes: 0
Number of under replicated blocks: 6

Refer to the above. You configure a Hadoop cluster with seven DataNodes and one of your monitoring UIs displays the details shown in the exhibit. What does this tell you?
A. The DataNode JVM on one host is not active
B. Because your under-replicated blocks count matches the Live Nodes, one node is dead, and your DFS Used % equals 0%, you can't be certain that your cluster has all the data you've written it.
C. Your cluster has lost all HDFS data which had blocks stored on the dead DataNode
D. The HDFS cluster is in safe mode

    admin

    February 20, 2017 at 5:59 am

    Hi Amandeep,

    Thanks for contacting us.
    Sure, we will help you to understand this question, and future ones as well.

    Answer is A.
    Reason: You have Dead Nodes: 1. This means that one DataNode is down and is not able to contact the NameNode. A DataNode runs in its own JVM process, so if the DataNode is down, its JVM is not running.

    Note: I also have the following observation.
    The question seems to contain an error: DFS Remaining % is shown in GB, which cannot be right, as it should be a percentage value and not a size in GB. So if you get an option like "none of the above", that should be the correct one.
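
    As a quick sanity check of that observation, the figures from the exhibit can be plugged into a few lines of Java. The class name DfsRemainingCheck is made up for this illustration, and the calculation assumes that DFS Remaining % simply means DFS Remaining divided by Configured Capacity.

```java
final class DfsRemainingCheck {
    /** Returns `part` as a percentage of `total`. */
    static double percentOf(double part, double total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        double configuredCapacityGb = 17.33; // "Configured Capacity" from the exhibit
        double dfsRemainingGb = 11.84;       // "DFS Remaining" from the exhibit
        System.out.printf("DFS Remaining: %.2f%%%n",
                percentOf(dfsRemainingGb, configuredCapacityGb));
    }
}
```

    The result is roughly 68.32, which matches the number in the exhibit, so the value itself looks right and only its unit (GB instead of %) is wrong.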

    Please feel free to reach out to us in case of any further assistance.

Amandeep Singh

February 20, 2017 at 7:07 am

Can you please help to answer below questions!!

QUESTION 1
You have A 20 node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?
A. Add another master node to increase the number of nodes running the JournalNode which increases the number of machines available to HA to create a quorum
B. Set an HDFS replication factor that provides data redundancy, protecting against node failure
C. Run a Secondary NameNode on a different master from the NameNode in order to provide automatic recovery from a NameNode failure.
D. Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing
E. Configure the cluster’s disk drives with an appropriate fault tolerant RAID level

QUESTION2
You want a node to only swap Hadoop daemon data from RAM to disk when absolutely necessary. What should you do?
A. Delete the /dev/vmswap file on the node
B. Delete the /etc/swap file on the node
C. Set the ram.swap parameter to 0 in core-site.xml
D. set vm.swappiness=1 in /etc/sysctl.conf on the node
E. set vm.swappiness=0 in /etc/sysctl.conf on the node
F. Delete the /swapfile file on the node

    admin

    February 20, 2017 at 7:09 am

    Hello Amandeep,

    Please find the answers below.
    QUESTION 1.
    You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?
    A. Add another master node to increase the number of nodes running the JournalNode which increases the number of machines available to HA to create a quorum
    B. Set an HDFS replication factor that provides data redundancy, protecting against node failure
    C. Run a Secondary NameNode on a different master from the NameNode in order to provide automatic recovery from a NameNode failure.
    D. Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing
    E. Configure the cluster’s disk drives with an appropriate fault tolerant RAID level

    Answer: D (If you run the ResourceManager and the NameNode on the same master node, that one server carries a heavy load and is more likely to go down. If the server goes down, the NameNode goes down with it and you may lose data, so we should prefer to run the RM and the NN on different servers.)

    QUESTION2.
    You want a node to only swap Hadoop daemon data from RAM to disk when absolutely necessary. What should you do?
    A. Delete the /dev/vmswap file on the node
    B. Delete the /etc/swap file on the node
    C. Set the ram.swap parameter to 0 in core-site.xml
    D. set vm.swappiness=1 in /etc/sysctl.conf on the node
    E. set vm.swappiness=0 in /etc/sysctl.conf on the node
    F. Delete the /swapfile file on the node

    Answer: E is the correct one. (vm.swappiness is a kernel parameter that controls the kernel’s tendency to swap application data from memory to disk. It should be set to a small value such as 0 or 5 so that the kernel avoids swapping whenever it can. However, even if you set it to 0, swap-out might still happen; this has been seen at Spotify.)

    You can see what value vm.swappiness is currently set to by looking at /proc/sys/vm; for example:

    cat /proc/sys/vm/swappiness
    On most systems, it is set to 60 by default. This is not suitable for Hadoop cluster nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 0; for example:

    # sysctl -w vm.swappiness=0
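Note that sysctl -w only changes the running kernel; the answer options talk about /etc/sysctl.conf because that file is what makes the setting persistent. A minimal sketch for checking what a sysctl.conf-style file assigns to vm.swappiness is shown below. The function name `read_sysctl_value` and the sample text are assumptions for illustration, not part of any Hadoop or Cloudera tool.

```python
def read_sysctl_value(conf_text, key):
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf comments start with # or ;).
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # later assignments override earlier ones
    return value


# Hypothetical file content, used only to exercise the parser.
sample = """
# kernel tuning for Hadoop worker nodes
vm.swappiness = 0
"""
swappiness = read_sysctl_value(sample, "vm.swappiness")
print(swappiness)
```

On a real node you would pass in the content of /etc/sysctl.conf and compare the result with the output of cat /proc/sys/vm/swappiness.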

Amandeep Singh

February 20, 2017 at 7:13 am

Can you also answer the following question?

Hadoop jar j.jar DriverClass /data/input /data/output
The error message returned includes the line:
PriviligedActionException as:training (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.invalidInputException:
Input path does not exist: file:/data/input

What can be the error here?

    admin

    February 20, 2017 at 7:14 am

    Hi Amandeep,

    Please find your answer below.

    Hadoop jar j.jar DriverClass /data/input /data/output
    The error message returned includes the line:
    PriviligedActionException as:training (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.invalidInputException:
    Input path does not exist: file:/data/input

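    Answer: The job is looking for /data/input on the local filesystem, not in HDFS. The file: prefix in "Input path does not exist: file:/data/input" shows that the path was resolved against the local filesystem, which typically happens when the client configuration does not point the default filesystem (fs.defaultFS, formerly fs.default.name) at the NameNode. The as:training (auth:SIMPLE) part only states which user ran the job and that simple (non-Kerberos) authentication is in use; it does not by itself indicate a permission problem.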
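To make the point about the path prefix concrete, here is a small sketch that reads the scheme of a path URI the way one would when looking at such an error message. The function name `filesystem_of` and the example NameNode address are assumptions for illustration; this is not how Hadoop itself resolves paths.

```python
from urllib.parse import urlparse


def filesystem_of(path_uri):
    """Describe which filesystem a path URI refers to, judging by its scheme."""
    scheme = urlparse(path_uri).scheme
    if scheme == "file":
        return "local filesystem"
    if scheme == "hdfs":
        return "HDFS"
    if not scheme:
        return "default filesystem (from configuration)"
    return scheme


# The path from the error message versus an explicit HDFS path
# (the host name "namenode" is a placeholder).
print(filesystem_of("file:/data/input"))
print(filesystem_of("hdfs://namenode:8020/data/input"))
```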

Amandeep Singh

February 21, 2017 at 7:33 am

Can you please check the 2 questions below as well?

These are the last of the lot 😛

Question 1)
What does CDH packaging do on install to facilitate Kerberos security setup?

A. Automatically configures permissions for log files at & MAPRED_LOG_DIR/userlogs

B. Creates users for hdfs and mapreduce to facilitate role assignment

C. Creates directories for temp, hdfs, and mapreduce with the correct permissions

D. Creates a set of pre-configured Kerberos keytab files and their permissions

E. Creates and configures your kdc with default cluster values

Correct Answer: B

Question 2)
You’re upgrading a Hadoop cluster from HDFS and MapReduce version 1 (MRv1) to one running HDFS and MapReduce version 2 (MRv2) on YARN. You want to set and enforce a block size of 128MB for all new files written to the cluster after upgrade. What should you do?

A. You cannot enforce this, since client code can always override this value

B. Set dfs.block.size to 128M on all the worker nodes, on all client machines, and on the NameNode, and set the parameter to final

C. Set dfs.block.size to 128M on all the worker nodes and client machines, and set the parameter to final. You do not need to set this value on the NameNode

D. Set dfs.block.size to 134217728 on all the worker nodes, on all client machines, and on the NameNode, and set the parameter to final

E. Set dfs.block.size to 134217728 on all the worker nodes and client machines, and set the parameter to final. You do not need to set this value on the NameNode

    admin

    February 21, 2017 at 7:33 am

    Please find your requested answer.

    Question 1)
    What does CDH packaging do on install to facilitate Kerberos security setup?

    A. Automatically configures permissions for log files at & MAPRED_LOG_DIR/userlogs

    B. Creates users for hdfs and mapreduce to facilitate role assignment

    C. Creates directories for temp, hdfs, and mapreduce with the correct permissions

    D. Creates a set of pre-configured Kerberos keytab files and their permissions

    E. Creates and configures your kdc with default cluster values

    Correct Answer: B (The question is about CDH packaging, not Cloudera Manager. By design, during CDH 5 package installation of MRv1, the hdfs and mapreduce Unix user accounts are created automatically to support security. You can find more details at https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_sg_users_groups_verify.html)

    Question 2)
    You’re upgrading a Hadoop cluster from HDFS and MapReduce version 1 (MRv1) to one running HDFS and MapReduce version 2 (MRv2) on YARN. You want to set and enforce a block size of 128MB for all new files written to the cluster after upgrade. What should you do?

    A. You cannot enforce this, since client code can always override this value

    B. Set dfs.block.size to 128M on all the worker nodes, on all client machines, and on the NameNode, and set the parameter to final

    C. Set dfs.block.size to 128 M on all the worker nodes and client machines, and set the parameter to final. You do not need to set this value on the NameNode

    D. Set dfs.block.size to 134217728 on all the worker nodes, on all client machines, and on the NameNode, and set the parameter to final

    E. Set dfs.block.size to 134217728 on all the worker nodes and client machines, and set the parameter to final. You do not need to set this value on the NameNode

    Correct answer is E (C and E say the same thing, but the syntax in C differs: there is a space between the value and the unit, as in 128 M. Also note that you can give the value with a unit suffix. You can use the following suffixes (case insensitive): k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (such as 128k, 512m, 1g, etc.), or provide the complete size in bytes (such as 134217728 for 128 MB). See https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml )
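The suffix rule quoted above can be sketched in a few lines. This is only an illustration of the arithmetic (binary multiples of 1024), written under the assumption that a value with a space before the suffix is rejected; the function name `parse_size` is made up here and is not Hadoop's own parser.

```python
import re

# Power of 1024 for each (case-insensitive) suffix.
_POWERS = {"": 0, "k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}


def parse_size(text):
    """Convert a size such as '128m' or '134217728' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmgtpe]?)", text.lower())
    if match is None:
        # Assumption: anything else, including '128 M' with a space, is invalid.
        raise ValueError(f"invalid size: {text!r}")
    number, suffix = match.groups()
    return int(number) * 1024 ** _POWERS[suffix]


print(parse_size("128m"))
```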

Amandeep Singh

February 22, 2017 at 9:48 am

Hello Saurabh,

Kindly check this question. As per my knowledge, this command will try to make nn01 the standby, and nn02 will become active.
If a graceful failover does not happen, then nn01 will be fenced and nn02 will become active.
So which answer is the correct one: option B or option C?

Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command: hdfs haadmin -failover nn01 nn02?

A. nn02 is fenced, and nn01 becomes the active NameNode

B. nn01 is fenced, and nn02 becomes the active NameNode

C. nn01 becomes the standby NameNode and nn02 becomes the active NameNode

D. nn02 becomes the standby NameNode and nn01 becomes the active NameNode

    admin

    February 22, 2017 at 9:49 am

    Yes, the answer is B. (If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, then the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one of the methods succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.)
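The sequence described in this answer (graceful transition first, then fencing methods in order, then an error) can be modelled as a small decision function. This is a toy model of the description only, not the Hadoop implementation; the function name `failover` and its boolean inputs are assumptions for illustration.

```python
def failover(graceful_ok, fencing_results):
    """Model the outcome of failing over from nn01 to nn02.

    graceful_ok: True if nn01 moved to standby gracefully.
    fencing_results: one boolean per configured fencing method, in order.
    """
    if graceful_ok:
        # No fencing is needed when the graceful transition works.
        return {"nn01": "standby", "nn02": "active", "nn01_fenced": False}
    for method_succeeded in fencing_results:
        if method_succeeded:
            # nn01 was fenced, so nn02 can safely become active.
            return {"nn01": "fenced", "nn02": "active", "nn01_fenced": True}
    raise RuntimeError("no fencing method succeeded; nn02 is not made active")


print(failover(True, []))
print(failover(False, [False, True]))
```

In this model, the first call corresponds to option C and the second to option B, which is why the two options can both look plausible depending on whether the graceful transition succeeds.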

      kunal

      April 16, 2017 at 1:17 am

      Related to question “Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command: hdfs haadmin -failover nn01 nn02?”

      Option C, if the failover is smooth, and option B if we expect a damage / failure and fencing occurs.

      Suggestion pls.

        admin

        April 16, 2017 at 3:44 am

        Hello,

        This command changes your active NameNode from nn01 to nn02. For example, if nn01 is the active NameNode and nn02 is the standby, then after executing this command (hdfs haadmin -failover nn01 nn02) the roles are reversed: nn01 becomes standby and nn02 becomes active. You can check the status by executing hdfs haadmin -getServiceState nn01.

          kunal

          April 16, 2017 at 10:35 pm

          So what would be the best answer?

Rana

April 7, 2017 at 7:59 pm

Hi Sourabh,

We are facing an issue with a workflow created in Control-M. The first workflow takes data from the source through Informatica jobs and loads it into a Hive table. The second workflow fires a Tibco job on that Hive table and loads the result into another table. The issue is in the second flow.

At times the Hive table doesn’t have any data, but when the second workflow runs, Tibco fetches 100+ records from the Hive table even though it has no records at all. Where the Tibco job fetches these records from is a mystery to us, and we would like to resolve it.

Please note that the issue occurs when the entire workflow runs automated; when it is run manually by holding certain folders in Control-M, it runs fine and returns the exact records.

Can you please share your insights on how to troubleshoot this issue?

Looking forward to hearing from you.

Thanks,
Rana

    admin

    April 10, 2017 at 12:21 pm

    Hello Rana,

    Can you tell me a few more things so I can understand your issue in detail?
    1. Do you have any delete operation after you insert records from one table into another table in Hive?
    2. Can you run a select count(*) or select * on the first table before you start the insert, just to check whether there is data in the original table?
    3. Are you using beeline or a hive action to do this job?

      Rana

      April 21, 2017 at 7:48 am

      Thanks so much Sourabh….and very sorry for the late reply. Many issues and this one is still giving us hard time 🙁 Please find my comments inline

      1.Do you have any delete operation after you insert records from one table to other table in hive ?

      No, there is no delete operation in place. The whole set of “Tibco” jobs just reads the data from Hive through an automated process in Control-M.

      2.Can you place one select count(*) or select * from first table before you start insert just to check whether we have some data in original table or not ?

      Yes, we have already been monitoring that with a script run from a cron job every minute, which writes its output to a file. This helped us confirm that there was data before the Tibco job ran, yet the result is still wrong once the Tibco job completes. Please note that this issue is intermittent and doesn’t occur every time. The Control-M job is scheduled to run from Monday to Saturday, and the issue happens only on certain days, not every day.

      We also monitored the table external path whether there was any data JUST before the Tibco job runs on that Hive table. And that helped us confirm that Hive table does have data but when automated scheduled through Control-M, it misbehaves intermittently.

      3.Are you using beeline or hive action to do this job ?

      Tibco uses WebHcat for running their queries on Hive table.

      Please do let me know if you need any other information and I will be happy to share 🙂 Thanks again so much for helping and I really appreciate this gesture 🙂

      Regards,
      Rana

viru

April 12, 2017 at 1:23 am

Hi

I have a shell script in HDFS. I have scheduled the script in oozie, the script works fine and I am getting the desired output.

For this shell script I pass a table name as an argument. Now I want to do two things.

1) Run the shell script in parallel for 10 tables at the same time.

2) I want to collect the stderr logs that we can see in oozie logs as a file for each time the script is invoked. I want the file to be in hdfs.

How can we achieve that in Oozie? What are the options to run scripts in parallel in Oozie?

Please let me know

kunal

April 15, 2017 at 3:40 am

Hello Amandeep,

Need your help in identifying correct answer as they are confusing.

Q1 In which scenario will a workload benefit from adding a faster network fabric, i.e. implementing 10 Gigabit Ethernet as the network fabric for a Hadoop cluster?

A. When your workload generates a large amount of output data, significantly larger than the amount of intermediate data
B. When your workload generates a large amount of intermediate data, on the order of the input data itself.

Q2 What will happen when this is executed “hadoop jar SampleJar MyClass” on a client machine?
A. SampleJar.Jar is sent to the ApplicationMaster which allocates a container for SampleJar.Jar
B. Sample.jar is placed in a temporary directory in HDFS

    admin

    April 16, 2017 at 3:52 am

    Please find the correct answers below.
    Answer 1: B is correct (When your workload generates a large amount of intermediate data, on the order of the input data itself.)
    Answer 2: A is correct (SampleJar.Jar is sent to the ApplicationMaster which allocates a container for SampleJar.Jar)

kunal

April 15, 2017 at 3:42 am

I also need some help with matrix multiplication using Spark. What would be a better way to do it? My executors are failing due to memory overhead.

    admin

    April 16, 2017 at 3:57 am

    Are you using PySpark or Spark MLlib to implement the matrix multiplication? Also, what are your data size and cluster size?

      kunal

      April 16, 2017 at 10:36 pm

      Using Spark and data size will be around 10billion. I have a cluster of 9 worker and 3 master node.

viru

April 16, 2017 at 6:31 am

Hi Saurabh,

I have a shell script in HDFS. I want to schedule this job in oozie. For this script I will pass table names as argument.

I have scheduled this script in oozie, and got the desired result.

Now I want to schedule this script to run for 10 tables in parallel. I am able to do this using cron jobs, but how can I schedule the same in Oozie?

Do I need to create 10 workflows or what is the ideal solution for this?

I looked at fork option in oozie but this script executes a sqoop query and writes to a hive table, so I don’t know whether this option can be used for my use case

Please let me know

Thank you
viru

    admin

    April 17, 2017 at 9:32 am

    Hi Viru,

    Thanks for reaching out to us. Please find the following update on your questions.

    You do not need to create 10 workflows; one workflow can do the same work. If you are able to do it in cron, then you can use a shell action in Oozie directly and do it with a single workflow.

    Also, can you please share a sample of your shell script?

      viru

      April 18, 2017 at 12:16 am

      Hi Saurabh,

      Here is my shell script.

      #!/bin/bash
      LOG_LOCATION=/home/$USER/logs
      exec 2>&1

      [ $# -ne 1 ] && { echo "Usage : $0 table "; exit 1; }

      table=$1

      TIMESTAMP=$(date "+%Y-%m-%d")
      touch /home/$USER/logs/${TIMESTAMP}.success_log
      touch /home/$USER/logs/${TIMESTAMP}.fail_log
      success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
      failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log

      # Function to log the status of the job creation
      function log_status
      {
          status=$1
          message=$2
          if [ "$status" -ne 0 ]; then
              echo "$(date "+%Y-%m-%d %H:%M:%S") [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
              #echo "Please find the attached log file for more details"
              exit 1
          else
              echo "$(date "+%Y-%m-%d %H:%M:%S") [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
          fi
      }

      # Run the Hive statement directly (not inside backticks) so that its output is not executed as a command
      hive -e "create table testing.${table} stored as parquet as select * from fishing.${table}"

      g_STATUS=$?
      log_status $g_STATUS "Hive create ${table}"

      echo "***********************************************************************************************************************************************************************"

      viru

      April 19, 2017 at 2:03 pm

      Hi Saurabh,

      I have tried using this as well. In another shell script I am doing the following

      nl -n rz test | xargs -n 2 --max-procs 10 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'

      Here, when max-procs is 10, it works fine in Linux without any error, but in Oozie it fails.

      But when I change max-procs to 1, the job is successful in Oozie:

      nl -n rz test | xargs -n 2 --max-procs 1 sh -c 'shell.sh "$1" > /tmp/logging/`date "+%Y-%m-%d"`/"$1"'

      Can you please let me know why this is happening

        admin

        April 20, 2017 at 10:16 am

        I am checking and will update you soon.

          admin

          April 20, 2017 at 10:29 am

          Can you try the following?
          Write your table names in tables.txt, read the input (table names) from this file, and pass each one to your main script.

          xargs --max-procs 10 -n 1 sh shell.sh < tables.txt

          The --max-procs option causes xargs to spawn up to 10 processes at a time, and it will pass a single table name to each invocation.
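For readers who want to see what "up to 10 processes at a time, one table name per invocation" means in code, here is a rough Python equivalent of that fan-out. It is an illustrative sketch, not something Oozie or xargs provides: the names `run_for_tables` and `build_command` are hypothetical, and the per-table command is supplied by the caller.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_for_tables(tables, build_command, max_procs=10):
    """Run one command per table, at most max_procs at a time.

    build_command(table) must return the argument list to execute.
    Returns a dict mapping each table name to its exit code.
    """
    def run_one(table):
        completed = subprocess.run(build_command(table))
        return table, completed.returncode

    # Each worker thread just waits on one child process, so the pool size
    # caps the number of concurrent processes, like xargs --max-procs.
    with ThreadPoolExecutor(max_workers=max_procs) as pool:
        return dict(pool.map(run_one, tables))
```

A call such as run_for_tables(names, lambda t: ["sh", "shell.sh", t]) would mirror the xargs command, and checking the returned exit codes per table makes it easier to see which invocation failed when the parallel run is killed.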

        admin

        April 20, 2017 at 10:21 am

        If possible, can you rerun it and send the oozie/hiveserver2 logs?

          viru

          April 20, 2017 at 7:45 pm

          Hi Saurabh,

          I have tried xargs --max-procs 10 -n 1 sh shell.sh < test. It shows as running and then gets killed.

          When I run xargs --max-procs 1 -n 1 sh shell.sh < test, there are no errors.

          In the Oozie logs, in stdout, I see the following:

          Stdoutput 17/04/20 12:40:13 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:13 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:13 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput 17/04/20 12:40:14 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties
          Stdoutput
          Stdoutput Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hive-common-1.1.0-cdh5.8.0.jar!/hive-log4j.properties

kunal

April 17, 2017 at 1:38 am

Hi Saurabh,

I have a couple more questions that I need to clarify…

Q1 You have installed a cluster running HDFS and MapReduce version 2 (MRv2) on YARN. You have no dfs.hosts entry(ies) in your hdfs-site.xml configuration file. You configure a new worker node by setting fs.default.name in its configuration files to point to the NameNode on your cluster, and you start the DataNode daemon on that worker node. What do you have to do on the cluster to allow the worker node to join, and start storing HDFS blocks?

A. Without creating a dfs.hosts file or making any entries, run the command hadoop dfsadmin -refreshNodes on the NameNode
B. Restart the NameNode
C. Create a dfs.hosts file on the NameNode, add the worker node’s name to it, then issue the command hadoop dfsadmin -refreshNodes on the NameNode
D. Nothing; the worker node will automatically join the cluster when NameNode daemon is started

Q2 You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?

A. Add another master node to increase the number of nodes running the JournalNode which increases the number of machines available to HA to create a quorum
B. Set an HDFS replication factor that provides data redundancy, protecting against node failure
C. Run a Secondary NameNode on a different master from the NameNode in order to provide automatic recovery from a NameNode failure.
D. Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing
E. Configure the cluster’s disk drives with an appropriate fault tolerant RAID level

    admin

    April 17, 2017 at 9:28 am

    Please find the answers below.
    Answer 1: A is the correct answer (Without creating a dfs.hosts file or making any entries, run the command hadoop dfsadmin -refreshNodes on the NameNode)

    Answer 2: D is correct answer (Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing)

Piyush Chauhan

June 11, 2017 at 7:15 am

Hi Brother,
Piyush here from KIET MCA. I am facing an issue while setting up Kerberos.

Dear All,

Requesting your help.
In Kerberos, I am getting an error while running the kinit command:

[root@kbclient hduser]# kadmin hduser/admin
Authenticating as principal hduser/admin@TCS.COM with password.
Password for hduser/admin@TCS.COM:
kadmin: kinit -e
kadmin: Unknown request "kinit". Type "?" for a request list.
kadmin: klist -e
kadmin: Unknown request "klist". Type "?" for a request list.
kadmin: list_principals
K/M@TCS.COM
hduser/admin@TCS.COM
kadmin/admin@TCS.COM
kadmin/changepw@TCS.COM
kadmin/kbclient.tcs.com@TCS.COM
krbtgt/TCS.COM@TCS.COM
kadmin: kinit hduser/admin
kadmin: Unknown request "kinit". Type "?" for a request list.
kadmin: kinit -R
kadmin: Unknown request "kinit". Type "?" for a request list.
kadmin:

    admin

    June 24, 2017 at 3:34 am

    Hello Bro,

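    Sorry for the delay. If you need immediate help in future, feel free to call me on my mobile.
    Coming to your issue: kinit and klist are shell commands, not kadmin requests, which is why kadmin reports "Unknown request". Exit kadmin (type quit) and run them from the shell prompt. It is also possible that your ticket has expired, so first check the ticket cache by running the following command.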
    [root@m1 ~]# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: admin/admin@HADOOPADMIN.COM

    Valid starting Expires Service principal
    02/10/17 05:03:01 02/11/17 05:03:01 krbtgt/HADOOPADMIN.COM@HADOOPADMIN.COM
    renew until 02/10/17 05:03:01

    and then run kinit for your principal from the shell prompt.

    [root@m1 ~]# kinit root/admin@HADOOPADMIN.COM
    Password for root/admin@HADOOPADMIN.COM:
    [root@m1 ~]#

SM

July 25, 2017 at 7:49 am

Hi Saurav,

Thanks for the wonderful discussions. I am SM and I am preparing for Hadoop admin interviews and certification. I have gone through the test cases provided earlier. Could you please provide a few more test cases, which would help build confidence for the interview and certification? Also, could you please help me understand the schedulers (FIFO, Fair and Capacity) and their use cases?

    admin

    July 27, 2017 at 12:54 pm

    Thanks SM for your valuable comments. I will help you to get more details on schedulers soon.

suresh k

July 27, 2017 at 5:26 am

HI
Can you please answer the questions below?
1. How do we remove bad data from millions of records in Hadoop?
2. How do we test Hive in production?

    admin

    July 27, 2017 at 12:56 pm

    Hello,

    1. How do we remove bad data from millions of records in Hadoop?
    Ans: What do you mean by bad data? Whether data is bad or good depends on the requirement.
    2. How do we test Hive in production?
    Ans: We can test Hive by running a sanity test or by running our own script. Do you want any specific testing?

sureshk

July 27, 2017 at 5:32 am

Hi Sir,

How Many reducers can be planned ? How do we decide on the number of Reducers ?

Thanks

    admin

    July 27, 2017 at 1:02 pm

    The number of reducers depends on the type of job and also on how the data is split. The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
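As a quick worked example of the 0.95 / 1.75 rule of thumb, the sketch below multiplies both factors by the number of nodes times the maximum containers per node. The function name `suggested_reducers` and the cluster figures (9 worker nodes, 8 containers each) are hypothetical values chosen only to show the arithmetic.

```python
def suggested_reducers(nodes, max_containers_per_node):
    """Return the (0.95x, 1.75x) reducer counts for the given cluster size."""
    slots = nodes * max_containers_per_node
    # 0.95: all reduces can launch at once as the maps finish.
    # 1.75: faster nodes finish a first round and start a second wave.
    return int(0.95 * slots), int(1.75 * slots)


# Hypothetical cluster: 9 worker nodes with up to 8 containers each.
low, high = suggested_reducers(9, 8)
print(low, high)
```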

sureshk

July 29, 2017 at 5:42 am

Hi

If both the primary and secondary nodes fail, what do we need to do?

Thanks

    admin

    July 31, 2017 at 10:55 am

    You need to recover NN metadata from NFS location.

suresh

August 5, 2017 at 5:57 am

hi
what is hadoop portrait where we need to use

Thanks

Skumar

August 15, 2017 at 1:50 pm

Hello Saurabh,

Good Evening,

As discussed, can you please help me with Hadoop capacity planning for different workload patterns?

Below is the cluster capacity plan that I have made for my environment.

Approximate data to be ingested into the cluster: ~20 TB per day.
Data growth per month: 20 TB * 31 days = 620 TB.
With replication factor 3, the data in our cluster would be approx. 620 TB * 3 = 1860 TB.
Space for metadata is approximately 25% of the raw data we are storing (1860 TB), which is around 465 TB.
The total HDFS space we require per month is 1860 + 465 = 2325 TB.
Treating this as the 70% threshold, we need about 3321 TB of space available per month (100%), rounded up to 3330 TB per month.
Total number of DataNodes required: 34 nodes, each with 100 TB (25 * 4 TB) of storage space.
Basic computation power required per DataNode: 16-core CPU and 64 GB RAM.
Master nodes: minimum 4 master nodes required, with 48-core CPU, 512 GB RAM and 40 TB HDD.
Edge nodes: 2 edge nodes with 8-core CPU and 20 TB for user space.
My queries are as below:

1. What is the recommended H/W configuration for master and slave nodes?
2. What are the different parameters that we need to consider during the design phase to achieve max performance from the cluster, like number of partitions, CPU, etc.?
3. Is there any formula to calculate the RAM and CPU for different workload patterns?
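The storage arithmetic in the plan above can be reproduced in a few lines, which makes it easy to rerun with different inputs. This is a sketch that only restates the figures from the comment; the variable names are made up, and the 25% overhead and 70% threshold are the poster's own planning assumptions rather than fixed Hadoop rules.

```python
import math

# Inputs taken from the capacity plan in the comment above.
daily_ingest_tb = 20
days_per_month = 31
replication = 3
overhead_ratio = 0.25       # extra space, as a share of replicated data
usable_fraction = 0.70      # plan to fill disks only up to 70%
node_capacity_tb = 25 * 4   # 25 disks of 4 TB per DataNode

monthly_ingest_tb = daily_ingest_tb * days_per_month         # 620
replicated_tb = monthly_ingest_tb * replication              # 1860
overhead_tb = overhead_ratio * replicated_tb                 # 465
hdfs_needed_tb = replicated_tb + overhead_tb                 # 2325
provisioned_tb = hdfs_needed_tb / usable_fraction            # about 3321.4
rounded_tb = math.ceil(provisioned_tb / 10) * 10             # 3330
data_nodes = math.ceil(rounded_tb / node_capacity_tb)        # 34

print(f"Provision {rounded_tb} TB per month on {data_nodes} DataNodes")
```

Note that these figures cover one month of growth; how many months of retention the cluster must hold is a separate input that the plan does not state.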

    admin

    August 16, 2017 at 1:11 pm

    Hello Santhosh,

    1. What is the recommended H/W configuration for master and slave nodes?
    Answer: Yes, there are hardware recommendations for both master and slave nodes. You can review the following URL for details.
    https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_cluster-planning/content/conclusion.html

    2. What are the different parameters that we need to consider during the design phase to achieve max performance from the cluster, like number of partitions, CPU, etc.?
    Answer: We need to size CPU, RAM and other configuration based on the Hadoop ecosystem components in use, for example Hive LLAP, Spark, HBase or any other highly CPU-intensive components.

    3. Is there any formula to calculate the RAM and CPU for different workload patterns?
    Answer: I feel the following URL will help you get your answer.
    https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_command-line-installation/content/determine-hdp-memory-config.html

    Please feel free to reach out to me for any further help.

      santhosh

      August 22, 2017 at 5:28 am

      Unable to submit a spark job in yarn cluster mode. Request your help

      SUCCEEDED
      Diagnostics:
      Application application_1503296712291_0694 failed 2 times due to AM Container for appattempt_1503296712291_0694_000002 exited with exitCode: 1
      For more detailed output, check the application tracking page: http://m2.tatahdp.com:8088/cluster/app/application_1503296712291_0694 Then click on links to logs of each attempt.
      Diagnostics: Exception from container-launch.
      Container id: container_e19_1503296712291_0694_02_000001
      Exit code: 1
      Stack trace: ExitCodeException exitCode=1:
      at org.apache.hadoop.util.Shell.runCommand(Shell.java:944)
      at org.apache.hadoop.util.Shell.run(Shell.java:848)
      at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
      at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:237)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Container exited with a non-zero exit code 1
      Failing this attempt. Failing the application.

      Logs:
      17/08/22 04:56:19 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
      17/08/22 04:56:19 ERROR ApplicationMaster: Uncaught exception:
      org.apache.spark.SparkException: Exception thrown in awaitResult:
      at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
      at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
      at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
      at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:766)
      at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
      at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
      at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
      at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
      at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
      Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
      at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:104)
      at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
      17/08/22 04:56:19 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
      17/08/22 04:56:19 INFO ApplicationMaster: Deleting staging directory hdfs://hdpqa/user/root/.sparkStaging/application_1503296712291_0694
      17/08/22 04:56:19 INFO ShutdownHookManager: Shutdown hook called

      Log Type: stdout
      Log Upload Time: Tue Aug 22 04:56:21 +0000 2017
      Log Length: 143
      Traceback (most recent call last):
      File "ddos_streaming.py", line 7, in <module>
      import pandas as pd
      ImportError: No module named pandas

        admin

        August 22, 2017 at 6:27 am

        Hello Santhosh,

        It seems the Python environment is not the same on all worker nodes (the traceback shows that pandas cannot be imported). Can you answer the following questions so I can better understand your issue?
        1. Are you running multiple versions of Python?
        2. Can you run the following command to collect the application log and send me the error?
        yarn logs -applicationId application_1503296712291_0694 >test.log
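
        To compare nodes, a small check like the one below could be run with the same interpreter the YARN containers use. It is only a sketch: it assumes Python 3.4 or later (for importlib.util.find_spec), and the helper name missing_modules is made up for illustration. On Python 2 a try/except around the import would be needed instead.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    # Print which interpreter is running so the nodes can be compared.
    print("python:", sys.executable, sys.version.split()[0])
    print("missing:", missing_modules(["pandas"]))
```

        If one node reports pandas as missing while another does not, the nodes do not share the same Python environment.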

Bibhu

August 17, 2017 at 10:25 am

Hi Saurav,
This is Bibhu. I am a regular follower of your site. Could you please let me know how to manually provide resources to a running job in YARN so that the job runs first?

    admin

    August 18, 2017 at 1:33 pm

    Hello Bibhu,

    Thanks for visiting my site regularly; it really motivates me to keep adding new articles and issues.
    Coming to your question: many people are waiting for this functionality, and as far as I know it is still in tech preview and not yet implemented in YARN. You can check that the following JIRA is still open (https://issues.apache.org/jira/browse/YARN-1197) and people are working on it.
    But to make your job run faster you can work with queues and schedulers.
    Please let me know if you want me to explain the Capacity Scheduler (CS) and queues.

      Bibhu

      August 18, 2017 at 4:45 pm

      Thank you so much for your time, Saurav. As far as I know, we can move our job to a different queue (yarn application -movetoqueue <application ID> -queue <queue name>) where the job will get more resources and run quickly, but I am not sure about CS. If you know more about CS and queues, please explain.

      Thanks and Regards
      Bibhu

Bibhu

August 23, 2017 at 8:29 am

Hi Saurav,

Could you please explain in detail how the Capacity Scheduler and queues can be used to expedite a job? Is there any option in the Fair Scheduler to expedite a job?
Thanks and Regards
Bibhu

    admin

    August 24, 2017 at 9:27 am

    Hello Bibhu,

    Do you use the HDP stack, Cloudera or some other distribution? If you are using HDP, you can use the fair policy in the Capacity Scheduler; if you use CDH, you can use the Fair Scheduler with weights and priorities. Please tell me your distribution and I will explain in detail accordingly.
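
    As a rough illustration of what queue weights mean, the sketch below splits cluster memory in proportion to weights. It is a deliberate simplification, not the scheduler's actual algorithm: minimum and maximum resources, current demand and preemption are ignored, and the queue names and the 102400 MB total are hypothetical.

```python
def weighted_fair_shares(total_memory_mb, queue_weights):
    """Split total_memory_mb across queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {queue: total_memory_mb * weight / total_weight
            for queue, weight in queue_weights.items()}

# Hypothetical queues: "urgent" has three times the weight of "default".
shares = weighted_fair_shares(102400, {"default": 1.0, "urgent": 3.0})
print(shares)
```

    Under this simplification, a job moved to the heavier queue competes for a larger slice of the cluster when both queues are busy.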

      Bibhu

      August 24, 2017 at 3:08 pm

      Hi Saurabh, we use the Cloudera distribution.

      Thanks and Regards
      Bibhu

Bibhu

August 30, 2017 at 6:03 pm

Hi Saurabh,
What is multi-tenancy in Hadoop? Could you please explain in detail: what the architecture is, how to set it up, how high availability works in multi-tenancy, what we need to concentrate on while working in a multi-tenant setup, etc.
Thanks and Regards
Bibhu

    admin

    August 31, 2017 at 11:35 am

    Hello Bibhu,
    Multi-tenancy means having a separate processing layer and a separate storage layer within one cluster for different business units. That is, if the HR and Finance teams want to use your single cluster, each should have dedicated storage and dedicated processing resources in it. The cluster can then be used as a unified platform within an organization by different business units and lines of business. I have seen cases where people use separate clusters for this kind of requirement, but that leads to many other problems, such as:
    Cost
    Complexity
    Maintenance
    Security challenges, etc.

    It will take some time and a detailed explanation to answer your other questions ("what the architecture is, how to set it up, how high availability works in multi-tenancy, what we need to concentrate on while working in a multi-tenant setup, etc."). So I will prepare a detailed blog post and send you a link soon.
    If it is urgent, give me a call and I will try to explain it over the phone.

santhosh kumar

September 4, 2017 at 9:29 am

Dear Admin,

I am getting "No Data Available" on the widgets of my Ambari dashboard. Below is the error message I found in the logs:
Error sending metrics to server. [Errno 110] Connection timed out
2017-09-04 08:42:07,702 [WARNING] emitter.py:111 – Retrying after 5 …

Can you please help me sort out this issue with my Ambari? The SmartSense service is also frequently going down.

Thanks in Advance!

    santhosh kumar

    September 8, 2017 at 5:52 am

    As discussed, below is the ambari-metrics-collector log:

    [root@m2 ambari-metrics-collector]# tail -f hbase-ams-master-m2.tatahdp.com.log
    2017-09-08 05:47:28,110 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system…
    2017-09-08 05:47:28,112 INFO [timeline] impl.MetricsSinkAdapter: timeline thread interrupted.
    2017-09-08 05:47:28,116 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
    2017-09-08 05:47:28,619 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
    2017-09-08 05:47:28,700 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Initializing Timeline metrics sink.
    2017-09-08 05:47:28,700 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: Identified hostname = m2.tatahdp.com, serviceName = ams-hbase
    2017-09-08 05:47:28,703 INFO [HBase-Metrics2-1] timeline.HadoopTimelineMetricsSink: No suitable collector found.
    2017-09-08 05:47:28,705 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
    2017-09-08 05:47:28,706 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
    2017-09-08 05:47:28,706 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
    2017-09-08 05:48:38,715 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
    2017-09-08 05:48:41,031 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://null:6188/ws/v1/timeline/metrics
    This exceptions will be ignored for next 100 times

    2017-09-08 05:48:41,031 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
    2017-09-08 05:50:11,012 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
    2017-09-08 05:50:13,419 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=1.05 MB, freeSize=1020.39 MB, max=1021.44 MB, blockCount=3, accesses=73, hits=67, hitRatio=91.78%, , cachingAccesses=70, cachingHits=65, cachingHitsRatio=92.86%, evictions=149, evicted=2, evictedPerRun=0.01342281885445118

      admin

      September 8, 2017 at 11:43 am

      Hello Santhosh

      As discussed, from your logs it looks like you are getting this issue because the hostname was not set properly in the Ambari Metrics collector config; it is null in your logs.
      Please check the Ambari Metrics config and set the required hostname. If you still see any error, please send me the ambari-metrics-collector.log file.
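
      A quick way to confirm this from a log file is to look for collector URLs whose host is the literal string "null". The sketch below does that; the regular expression and function name are illustrative assumptions, and the sample lines are shortened copies of the log lines posted above.

```python
import re

# Matches collector URLs such as http://null:6188/ws/v1/timeline/metrics
COLLECTOR_URL = re.compile(r"https?://([^:/\s]+):(\d+)/ws/v1/timeline/metrics")

def unresolved_collector_lines(log_lines):
    """Return the log lines whose collector URL has the literal host 'null'."""
    hits = []
    for line in log_lines:
        match = COLLECTOR_URL.search(line)
        if match and match.group(1) == "null":
            hits.append(line)
    return hits

sample = [
    "2017-09-08 05:48:41,031 INFO [timeline] timeline.HadoopTimelineMetricsSink: "
    "Unable to connect to collector, http://null:6188/ws/v1/timeline/metrics",
    "2017-09-08 05:47:28,705 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: "
    "Sink timeline started",
]
print(len(unresolved_collector_lines(sample)))
```

      Once the hostname is set correctly, no line in a fresh log should match.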

    admin

    September 8, 2017 at 11:32 am

    As discussed I will check logs and will help you to solve it.

pavan kumar

September 13, 2017 at 4:59 am

Hello Sir, my name is Pavan. I am following your blog; it is very good and very helpful. I have an issue with Pig: I am trying to fetch tables from Hive into Pig for analysis, but every time I get error 2245 or 1070 (schema not found). I am using Hive version 2.1.1 with the metastore in MySQL and Pig version 0.16.0. Please help me out, sir.

    admin

    September 13, 2017 at 10:37 am

    Hello Pavan,

    You are getting this error because of the HCat metastore; you might not be running Pig with HCatalog.
    Can you try to run the Pig shell with HCatalog as follows?
    [s0998dnz@m1.hdp22 ~]$ pig -x tez -useHCatalog

    Example:

    hive> select * from sample;
    OK
    1 23.45 54.45
    2 34.5 45.56
    3 45.5 234.56
    1 23.45 54.45
    2 34.5 45.56
    3 45.5 234.56
    1 23.45 54.45
    2 34.5 45.56
    3 45.5 234.56
    Time taken: 1.0 seconds, Fetched: 9 row(s)
    hive> describe sample;
    OK
    sno int
    value1 double
    value2 double
    Time taken: 1.227 seconds, Fetched: 3 row(s)
    hive>
    [1]+ Stopped hive

    [user@m1.hdp22 ~]$ pig -x tez -useHCatalog
    17/09/13 06:28:50 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    17/09/13 06:28:50 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
    17/09/13 06:28:50 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL
    17/09/13 06:28:50 INFO pig.ExecTypeProvider: Trying ExecType : TEZ
    17/09/13 06:28:50 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType
    2017-09-13 06:28:50,409 [main] INFO org.apache.pig.Main – Apache Pig version 0.16.0.2.6.1.0-129 (rexported) compiled May 31 2017, 03:39:20
    2017-09-13 06:28:50,409 [main] INFO org.apache.pig.Main – Logging error messages to: /home/user/pig_1505298530407.log
    2017-09-13 06:28:50,506 [main] INFO org.apache.pig.impl.util.Utils – Default bootup file /home/user/.pigbootup not found
    2017-09-13 06:28:51,131 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://HDPHA
    2017-09-13 06:28:52,222 [main] INFO org.apache.pig.PigServer – Pig Script ID for the session: PIG-default-fab04b8f-e8aa-4429-a0d8-3e55faa9438e
    2017-09-13 06:28:52,631 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl – Timeline service address: http://m2.hdp22.:8188/ws/v1/timeline/
    2017-09-13 06:28:52,633 [main] INFO org.apache.pig.backend.hadoop.PigATSClient – Created ATS Hook

    grunt> a1 = load 'test.sample' using org.apache.hive.hcatalog.pig.HCatLoader();

    2017-09-13 06:30:09,049 [main] INFO org.apache.hive.hcatalog.common.HiveClientCache – Initializing cache: eviction-timeout=120 initial-capacity=50 maximum-capacity=50
    2017-09-13 06:30:09,082 [main] INFO hive.metastore – Trying to connect to metastore with URI thrift://m2.hdp22.:9083
    2017-09-13 06:30:09,237 [main] WARN org.apache.hadoop.security.LdapGroupsMapping – Failed to get groups for user user (retry=0) by javax.naming.directory.InvalidSearchFilterException: Unbalanced parenthesis; remaining name ‘dc=,dc=com’
    2017-09-13 06:30:09,345 [main] WARN org.apache.hadoop.security.LdapGroupsMapping – Failed to get groups for user user (retry=1) by javax.naming.directory.InvalidSearchFilterException: Unbalanced parenthesis; remaining name ‘dc=,dc=com’
    2017-09-13 06:30:09,452 [main] WARN org.apache.hadoop.security.LdapGroupsMapping – Failed to get groups for user user (retry=2) by javax.naming.directory.InvalidSearchFilterException: Unbalanced parenthesis; remaining name ‘dc=,dc=com’
    2017-09-13 06:30:09,475 [main] INFO hive.metastore – Connected to metastore.

    grunt> describe a1;
    a1: {sno: int,value1: double,value2: double}

    grunt> dump a1;
    2017-09-13 06:31:28,069 [main] INFO org.apache.pig.tools.pigstats.ScriptState – Pig features used in the script: UNKNOWN
    2017-09-13 06:31:28,103 [main] INFO org.apache.pig.data.SchemaTupleBackend – Key [pig.schematuple] was not set… will not generate code.
    2017-09-13 06:31:28,188 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer – {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2017-09-13 06:31:28,253 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager – Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
    2017-09-13 06:31:28,336 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher – Tez staging directory is /tmp/user/staging and resources directory is /tmp/temp1388557227
    2017-09-13 06:31:28,379 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler – File concatenation threshold: 100 optimistic? false
    2017-09-13 06:31:28,699 [main] INFO org.apache.hadoop.mapred.FileInputFormat – Total input paths to process : 3
    2017-09-13 06:31:28,709 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1
    2017-09-13 06:31:29,857 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: libfb303-0.9.3.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: libthrift-0.9.3.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: pig-0.16.0.2.6.1.0-129-core-h2.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: hive-exec-1.2.1000.2.6.1.0-129.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: jdo-api-3.0.1.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: automaton-1.11-8.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: hive-metastore-1.2.1000.2.6.1.0-129.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: hive-hcatalog-core-1.2.1000.2.6.1.0-129.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: antlr-runtime-3.4.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: hive-hcatalog-pig-adapter-1.2.1000.2.6.1.0-129.jar
    2017-09-13 06:31:29,858 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Local resource: hive-hbase-handler-1.2.1000.2.6.1.0-129.jar
    2017-09-13 06:31:30,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder – For vertex – scope-2: parallelism=1, memory=2048, java opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
    2017-09-13 06:31:30,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder – Processing aliases: a1
    2017-09-13 06:31:30,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder – Detailed locations: a1[1,5]
    2017-09-13 06:31:30,050 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder – Pig features in the vertex:
    2017-09-13 06:31:30,134 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler – Total estimated parallelism is 1
    2017-09-13 06:31:30,201 [PigTezLauncher-0] INFO org.apache.pig.tools.pigstats.tez.TezScriptState – Pig script settings are added to the job
    2017-09-13 06:31:30,336 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Tez Client Version: [ component=tez-api, version=0.7.0.2.6.1.0-129, revision=bbcfb9e8d9cc93fb586b32199eb9492528449f7c, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-05-31T02:35:29Z ]
    2017-09-13 06:31:30,510 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.AHSProxy – Connecting to Application History server at m2.hdp22./172.29.90.11:10200
    2017-09-13 06:31:30,513 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Session mode. Starting session.
    2017-09-13 06:31:30,517 [PigTezLauncher-0] INFO org.apache.tez.client.TezClientUtils – Using tez.lib.uris value from configuration: /hdp/apps/2.6.1.0-129/tez/tez.tar.gz
    2017-09-13 06:31:30,546 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider – Looking for the active RM in [rm1, rm2]…
    2017-09-13 06:31:30,574 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider – Found active RM [rm2]
    2017-09-13 06:31:30,584 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Stage directory /tmp/user/staging doesn’t exist and is created
    2017-09-13 06:31:30,596 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Tez system stage directory hdfs://HDPHA/tmp/user/staging/.tez/application_1505289772015_0001 doesn’t exist and is created
    2017-09-13 06:31:30,987 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl – Submitted application application_1505289772015_0001
    2017-09-13 06:31:30,992 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – The url to track the Tez Session: http://m1.hdp22.:8088/proxy/application_1505289772015_0001/
    2017-09-13 06:31:39,251 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob – Submitting DAG PigLatin:DefaultJobName-0_scope-0
    2017-09-13 06:31:39,251 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Submitting dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1505289772015_0001, dagName=PigLatin:DefaultJobName-0_scope-0, callerContext={ context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-default-fab04b8f-e8aa-4429-a0d8-3e55faa9438e }
    2017-09-13 06:31:40,404 [PigTezLauncher-0] INFO org.apache.tez.client.TezClient – Submitted dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1505289772015_0001, dagName=PigLatin:DefaultJobName-0_scope-0
    2017-09-13 06:31:40,469 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.AHSProxy – Connecting to Application History server at m2.hdp22./172.29.90.11:10200
    2017-09-13 06:31:40,478 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob – Submitted DAG PigLatin:DefaultJobName-0_scope-0. Application id: application_1505289772015_0001
    2017-09-13 06:31:40,479 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider – Looking for the active RM in [rm1, rm2]…
    2017-09-13 06:31:40,482 [PigTezLauncher-0] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider – Found active RM [rm2]
    2017-09-13 06:31:41,166 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher – HadoopJobId: job_1505289772015_0001
    2017-09-13 06:31:41,480 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob – DAG Status: status=RUNNING, progress=TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
    2017-09-13 06:31:46,742 [PigTezLauncher-0] INFO org.apache.tez.common.counters.Limits – Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=3000, COUNTER_NAME_MAX=64, MAX_COUNTERS=10000
    2017-09-13 06:31:46,746 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob – DAG Status: status=SUCCEEDED, progress=TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=Counters: 25
    org.apache.tez.common.counters.DAGCounter
    NUM_SUCCEEDED_TASKS=1
    TOTAL_LAUNCHED_TASKS=1
    DATA_LOCAL_TASKS=1
    AM_CPU_MILLISECONDS=2230
    AM_GC_TIME_MILLIS=0
    File System Counters
    HDFS_BYTES_READ=123
    HDFS_BYTES_WRITTEN=213
    HDFS_READ_OPS=6
    HDFS_WRITE_OPS=2
    HDFS_OP_CREATE=1
    HDFS_OP_GET_FILE_STATUS=3
    HDFS_OP_OPEN=3
    HDFS_OP_RENAME=1
    org.apache.tez.common.counters.TaskCounter
    GC_TIME_MILLIS=260
    CPU_MILLISECONDS=11540
    PHYSICAL_MEMORY_BYTES=1182793728
    VIRTUAL_MEMORY_BYTES=3675426816
    COMMITTED_HEAP_BYTES=1182793728
    INPUT_RECORDS_PROCESSED=9
    INPUT_SPLIT_LENGTH_BYTES=123
    OUTPUT_RECORDS=9
    MultiStoreCounters
    Output records in _0_tmp1517407105=9
    TaskCounter_scope_2_INPUT_scope_0
    INPUT_RECORDS_PROCESSED=9
    INPUT_SPLIT_LENGTH_BYTES=123
    TaskCounter_scope_2_OUTPUT_scope_1
    OUTPUT_RECORDS=9
    2017-09-13 06:31:47,180 [main] INFO org.apache.pig.tools.pigstats.tez.TezPigScriptStats – Script Statistics:

    HadoopVersion: 2.7.3.2.6.1.0-129
    PigVersion: 0.16.0.2.6.1.0-129
    TezVersion: 0.7.0.2.6.1.0-129
    UserId: user
    FileName:
    StartedAt: 2017-09-13 06:31:28
    FinishedAt: 2017-09-13 06:31:47
    Features: UNKNOWN

    Success!

    DAG 0:
    Name: PigLatin:DefaultJobName-0_scope-0
    ApplicationId: job_1505289772015_0001
    TotalLaunchedTasks: 1
    FileBytesRead: 0
    FileBytesWritten: 0
    HdfsBytesRead: 123
    HdfsBytesWritten: 213
    SpillableMemoryManager spill count: 0
    Bags proactively spilled: 0
    Records proactively spilled: 0

    DAG Plan:
    Tez vertex scope-2

    Vertex Stats:
    VertexId Parallelism TotalTasks InputRecords ReduceInputRecords OutputRecords FileBytesRead FileBytesWritten HdfsBytesRead HdfsBytesWritten Alias Feature Outputs
    scope-2 1 1 9 0 9 0 0 123 213 a1hdfs://HDPHA/tmp/temp-1446456325/tmp1517407105,

    Input(s):
    Successfully read 9 records (123 bytes) from: “test.sample”

    Output(s):
    Successfully stored 9 records (213 bytes) in: “hdfs://HDPHA/tmp/temp-1446456325/tmp1517407105”

    2017-09-13 06:31:47,188 [main] WARN org.apache.pig.data.SchemaTupleBackend – SchemaTupleBackend has already been initialized
    2017-09-13 06:31:47,198 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1
    2017-09-13 06:31:47,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1

    (1,23.45,54.45)
    (2,34.5,45.56)
    (3,45.5,234.56)
    (1,23.45,54.45)
    (2,34.5,45.56)
    (3,45.5,234.56)
    (1,23.45,54.45)
    (2,34.5,45.56)
    (3,45.5,234.56)

      pavan kumar

      September 13, 2017 at 5:24 pm

      Thank you very much for your kind reply, sir. I am going to try this and will get back to you.

        pavan kumar

        September 14, 2017 at 7:21 am

        Hello Sir,

        I am getting the following error while executing.

        Command used: pig -useHCatalog

        A = load 'logs' using org.apache.hive.hcatalog.pig.HCatLoader();
        2017-09-14 12:43:15,116 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
        2017-09-14 12:43:15,151 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
        2017-09-14 12:43:15,230 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore – 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
        2017-09-14 12:43:15,245 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 2245: Cannot get schema from loadFunc org.apache.hive.hcatalog.pig.HCatLoader
        Details at logfile: /home/pavan/pig_1505373104957.log

        Do I need to configure anything else, sir? I am running Hadoop version 2.7.2 on a single node.

          admin

          September 14, 2017 at 11:38 am

          Hello pavan,

          Do you have this table in the default database? If not, then mention the database name before the table name in the load statement. Also, can you tell me whether the HCat server is running fine?
          And which distribution are you using, HDP or CDH?

santhosh

September 14, 2017 at 9:48 am

Hello Saurab,
I am facing an issue with Spark Streaming batch processing: the batch processing time keeps increasing over time.
I have tried all the options in the article below, but the processing time still keeps increasing.

https://community.hortonworks.com/articles/80301/spark-configuration-and-best-practice-advice.html

Can you please help me with your recommendations for setting up a Spark cluster with optimal performance?

pavan kumar

September 14, 2017 at 12:35 pm

Hello sir,

I have this table in the default database in Hive, sir. I have installed Hadoop and its ecosystem on Linux (Ubuntu 15.10).

    admin

    September 14, 2017 at 12:53 pm

    OK. Are you using the Hortonworks or Cloudera distribution?
    And can you check whether the WebHCat server is installed?

      pavan kumar

      September 14, 2017 at 1:34 pm

      Hello sir,

      I learned Hadoop recently. The tutor taught us to install Apache Hadoop and its ecosystem manually on Linux 15.10. I am not using any distribution such as Hortonworks or Cloudera; it is a direct installation, untarring the .tar.gz files and configuring the .bashrc file.

        admin

        September 18, 2017 at 2:34 pm

        Can you install the HCatalog service as well? It is needed to use Pig on top of Hive.

Krishna

September 16, 2017 at 11:14 am

import mrdp.logging.LogWriter; I am getting an error on this line. I think I have to add JAR files, but I cannot find any JAR for it. Can you help me rectify the error?

    admin

    September 18, 2017 at 2:33 pm

    Can you please tell me which jar you have added and what exactly you are trying to achieve, so that I can assist you better.

sureshk

October 18, 2017 at 3:54 pm

Hi Admin
Can you please answer the questions below?
1) How does Spark work on a YARN cluster?
2) How is persistence in memory managed in Spark?
3) I have JSON/XML data in an RDBMS table. How do I import or export it to Hadoop using Sqoop?
Thanks

Bibhu

November 3, 2017 at 1:32 pm

Hi Saurabh, I am getting a block count issue on two data nodes in my production cluster. I am sure that the number of small files is growing rapidly. Could you please suggest how and where I can find the small files, with any command or any way to check in Cloudera Manager? Please reply as early as possible; it is a bit urgent.
1. Environment: CM version 5.10.1
2. Two data nodes are in amber color
3. Health test messages in Cloudera Manager are as mentioned below:
DataNode Health Suppress…
Healthy DataNode: 3. Concerning DataNode: 2. Total DataNode: 5. Percent healthy: 60.00%. Percent healthy or concerning: 100.00%. Warning threshold: 95.00%.

    admin

    November 4, 2017 at 7:54 am

    Hello Bibhu,

    I know this is a common issue in the Hadoop world. If you would like to find the location of all small files, you need to create your own custom script around the hadoop fs -ls -R / command, and then either use Hadoop Archives (HAR) or delete those files if they are not required.
    But as an immediate, temporary fix you need to raise the DN heap size so it can continue serving blocks at the same performance, via the CM -> HDFS -> Configuration -> Monitoring section fields. You are getting this alert because the default DN block count warning threshold is 200k, though I think it was revised to 600k in CDH 5.x.

    If you face any problem solving it or creating the script, let me know and I will help you.
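
    A minimal sketch of such a script is shown below. It assumes the usual eight-column output of hadoop fs -ls -R (permissions, replication, owner, group, size, date, time, path); the 1 MB threshold, the function name and the sample paths are hypothetical.

```python
def small_files(ls_lines, max_bytes=1024 * 1024):
    """Yield (size, path) for regular files smaller than max_bytes.

    Expects lines laid out as:
    permissions replication owner group size date time path
    """
    for line in ls_lines:
        parts = line.split(None, 7)      # the path itself may contain spaces
        if len(parts) < 8 or parts[0].startswith("d"):
            continue                     # skip directories and non-listing lines
        size = int(parts[4])
        if size < max_bytes:
            yield size, parts[7].rstrip("\n")

# Hypothetical listing lines, used only to exercise the parser.
sample = [
    "drwxr-xr-x   - hive hive          0 2017-11-03 13:32 /user/hive/warehouse",
    "-rw-r--r--   3 hive hive       2048 2017-11-03 13:30 /user/hive/warehouse/part-00000",
    "-rw-r--r--   3 hive hive  268435456 2017-11-03 13:31 /user/hive/warehouse/part-00001",
]
print(list(small_files(sample)))
```

    On a real cluster the listing would be read from the command's output (for example from standard input) instead of the sample list, and the results could be grouped by directory to decide what to archive.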

      Bibhu

      November 4, 2017 at 1:38 pm

      Thank you so much, Saurabh. I will update you soon.

Bibhu

November 14, 2017 at 11:31 am

Hi Saurabh,
The application team is facing an error when executing the query below in Hive. These tables, when queried individually, do not throw any error. The error log is attached.

SELECT a.address_id AS Addrs_Id,
a.addr_line_one,
a.addr_line_two,
a.addr_line_three,
a.prov_state_tp_nm,
a.postal_code,
a.city_name,
c.country_code AS country_tp_nm
FROM axa_us_mdm.mv_address a
LEFT OUTER JOIN us_aml.country c ON a.country_tp_nm = c.country_code;

ERROR LOG:
[sghosh@l51bxp11 ~]$ beeline -u 'jdbc:hive2://l51hdpp02.na.bigdata.intraxa:10000/;principal=hive/_HOST@NA.BIGDATA.INTRAXA' -e "SELECT a.address_id AS Addrs_Id,
> a.addr_line_one,
> a.addr_line_two,
> a.addr_line_three,
> a.prov_state_tp_nm,
> a.postal_code,
> a.city_name,
> c.country_code AS country_tp_nm
> FROM axa_us_mdm.mv_address a
> LEFT OUTER JOIN us_aml.country c ON a.country_tp_nm = c.country_code"
scan complete in 2ms
Connecting to jdbc:hive2://l51hdpp02.na.bigdata.intraxa:10000/;principal=hive/_HOST@NA.BIGDATA.INTRAXA
Connected to: Apache Hive (version 1.1.0-cdh5.10.1)
Driver: Hive JDBC (version 1.1.0-cdh5.10.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO : Compiling command(queryId=hive_20171114041919_8741f00f-6e97-4dc3-8c85-4a980a6d1372): SELECT a.address_id AS Addrs_Id,
a.addr_line_one,
a.addr_line_two,
a.addr_line_three,
a.prov_state_tp_nm,
a.postal_code,
a.city_name,
c.country_code AS country_tp_nm
FROM axa_us_mdm.mv_address a
LEFT OUTER JOIN us_aml.country c ON a.country_tp_nm = c.country_code
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:addrs_id, type:string, comment:null), FieldSchema(name:a.addr_line_one, type:string, comment:null), FieldSchema(name:a.addr_line_two, type:string, comment:null), FieldSchema(name:a.addr_line_three, type:string, comment:null), FieldSchema(name:a.prov_state_tp_nm, type:string, comment:null), FieldSchema(name:a.postal_code, type:string, comment:null), FieldSchema(name:a.city_name, type:string, comment:null), FieldSchema(name:country_tp_nm, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20171114041919_8741f00f-6e97-4dc3-8c85-4a980a6d1372); Time taken: 0.157 seconds
INFO : Executing command(queryId=hive_20171114041919_8741f00f-6e97-4dc3-8c85-4a980a6d1372): SELECT a.address_id AS Addrs_Id,
a.addr_line_one,
a.addr_line_two,
a.addr_line_three,
a.prov_state_tp_nm,
a.postal_code,
a.city_name,
c.country_code AS country_tp_nm
FROM axa_us_mdm.mv_address a
LEFT OUTER JOIN us_aml.country c ON a.country_tp_nm = c.country_code
INFO : Query ID = hive_20171114041919_8741f00f-6e97-4dc3-8c85-4a980a6d1372
INFO : Total jobs = 1
INFO : Starting task [Stage-4:MAPREDLOCAL] in serial mode
ERROR : Execution failed with exit status: 2
ERROR : Obtaining error information
ERROR :
Task failed!
Task ID:
Stage-4

Logs:

ERROR : /var/log/hive/hadoop-cmf-hive-HIVESERVER2-l51hdpp02.na.bigdata.intraxa.log.out
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
INFO : Completed executing command(queryId=hive_20171114041919_8741f00f-6e97-4dc3-8c85-4a980a6d1372); Time taken: 4.645 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=2)
Closing: 0: jdbc:hive2://l51hdpp02.na.bigdata.intraxa:10000/;principal=hive/_HOST@NA.BIGDATA.INTRAXA

My understanding:
Return code 2 is basically camouflage for a Hadoop/YARN memory problem, i.e. not enough resources are configured in Hadoop/YARN to run the query.

Could you please suggest a solution?
