Monthly Archives: November 2016


Top Hadoop Interview Questions

1. What are the Side Data Distribution Techniques?

Side data is the extra, small, read-only data that a MapReduce job needs in order to process its main data set. The main challenge is making that data available on every node where a map task will be executed. Hadoop provides two side data distribution techniques: the job configuration and the distributed cache (covered in question 4).

Using the job configuration

An arbitrary key-value pair can be set in the job configuration by the driver, and every task can read it back from the configuration it receives. This is only practical for small amounts of data (a few kilobytes), because the configuration is loaded by every daemon and task.
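A minimal sketch of this technique using the Java MapReduce API (the property name example.side.data and the class names are illustrative, not part of any standard):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataExample {

    // Mapper reads the side data back from the job configuration in setup().
    public static class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String sideData;

        @Override
        protected void setup(Context context) {
            sideData = context.getConfiguration().get("example.side.data", "default");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new Text(sideData), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Driver sets an arbitrary key-value pair; keep it small (a few KB at most).
        conf.set("example.side.data", "lookup-value");
        Job job = Job.getInstance(conf, "side-data-example");
        job.setJarByClass(SideDataExample.class);
        job.setMapperClass(SideDataMapper.class);
        // ... input/output paths, formats and output types would be configured here ...
    }
}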

2. What is shuffling in MapReduce?

As map tasks start to complete, the reducers begin fetching their output: each reducer pulls the map output partitions assigned to it, while the nodes continue to process other tasks at the same time. This transfer of map output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of deciding which reducer instance will receive each piece of map output. Before a mapper's (key, value) pairs are sent to the reducers, the partitioner determines the recipient reducer for each key. All values for a given key, no matter which mapper generated them, must end up at the same reducer. A minimal custom partitioner is sketched below.
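A small sketch of a custom partitioner (the class name and the first-letter scheme are illustrative; Hadoop's default is HashPartitioner, which partitions on the key's hash code):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives each (key, value) pair emitted by the mappers.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // All keys starting with the same letter go to the same reducer.
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : Character.toLowerCase(k.charAt(0));
        return first % numPartitions;
    }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);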

4. What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache read-only files (plain text, archives, JARs) that an application needs, which improves performance because the files are distributed once per job rather than read repeatedly over the network. The application registers the files to cache with the job configuration, and the MapReduce framework copies them to each worker node before any task of the job runs there, so tasks can read them from the local disk.
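A minimal sketch using the Hadoop 2.x Job API (the HDFS path /apps/lookup/lookup.txt and the class names are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-example");
        // Register an HDFS file to be copied to every task node before tasks start.
        job.addCacheFile(new URI("/apps/lookup/lookup.txt"));
        // job.addFileToClassPath(...) and job.addArchiveToClassPath(...) do the same
        // for extra JARs and archives.
    }

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws java.io.IOException {
            // Tasks can list the cached files and open them from the local file system.
            URI[] cachedFiles = context.getCacheFiles();
        }
    }
}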

5. What is a job tracker?

The JobTracker is a daemon that runs on a master node and handles the submission and tracking of jobs (a job in Hadoop terminology means a MapReduce job). It breaks each job into tasks, which are deployed to the DataNodes holding the required data. In a Hadoop cluster the JobTracker is the master and the TaskTrackers are its workers: they execute the tasks and report progress back to the JobTracker through heartbeats.

6. How do you set which framework is used to run a MapReduce program?

The mapreduce.framework.name property controls this (a short example of setting it follows the list). It can be set to:

  1. local
  2. classic
  3. yarn
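A sketch of setting the property programmatically; in practice it is normally configured cluster-wide in mapred-site.xml, and the job name here is just an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "local" runs everything in a single JVM, "classic" uses the Hadoop 1.x
        // JobTracker, and "yarn" submits the job to a YARN ResourceManager.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "framework-example");
    }
}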

7. What is the replication factor for a job’s JAR?

The job’s JAR is one of the resources read most heavily while tasks are starting, so it is replicated more widely than ordinary HDFS files: its default replication factor is 10.
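In Hadoop 2 this default is controlled by the mapreduce.client.submit.file.replication property (mapred.submit.replication in older releases); a sketch of overriding it for a very large cluster, assuming that property name (verify it against your Hadoop version):

import org.apache.hadoop.conf.Configuration;

public class SubmitReplicationExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Job resources such as the job JAR are fetched by many tasks at once,
        // so a higher replication factor (default 10) spreads the read load.
        conf.setInt("mapreduce.client.submit.file.replication", 20);
    }
}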

8. What is the mapred.job.tracker property used for?

The job runner uses the mapred.job.tracker property (Hadoop 1.x) to decide where to submit the job. If it is set to local, the runner submits the job to a local job tracker running in a single JVM; otherwise the job is sent to the JobTracker at the host:port given in the property.
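A small sketch of the two settings (the hostname and port below are placeholders, not values from the original post):

import org.apache.hadoop.conf.Configuration;

public class JobTrackerModeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Run the whole job in a single local JVM:
        conf.set("mapred.job.tracker", "local");
        // ...or submit to a real JobTracker at a given host:port (placeholder address):
        // conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
    }
}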

9. What is the difference between Job.submit() and waitForCompletion()?

Job.submit() internally creates a submitter instance, submits the job, and returns immediately. waitForCompletion() submits the job and then polls its progress at a regular interval (one second by default), printing updates to the console; if the job succeeds it reports success, otherwise it prints the relevant error message.
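A minimal driver sketch contrasting the two calls (the job name is illustrative and mapper/reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-vs-wait");
        // ... mapper, reducer, input and output configuration would go here ...

        // Fire and forget: submits the job and returns immediately.
        // job.submit();

        // Submit, then poll progress and print it to the console until the job
        // finishes; returns true on success.
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}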

 

10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

The main difference is the behaviour of DROP TABLE: dropping a managed table deletes both the metadata and the data, whereas dropping an external table deletes only the metadata and leaves the data in place. Otherwise the two types of table behave very similarly.

11. Does Hive support record level Insert, delete or update?

Hive does not provide record-level update, insert, or delete, and hence it does not provide transactions either. However, users can often emulate these DML operations with CASE statements and Hive’s built-in functions, so an update that is a single statement in an RDBMS may need many lines of code in Hive.

12. What kind of datawarehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) The data is not changing rapidly.

Hive doesn’t provide the crucial features required for OLTP (Online Transaction Processing); it is closer to an OLAP (Online Analytical Processing) tool. So Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in hive be written to a file?

By using the awk command in the shell, the output of the HiveQL DESCRIBE command can be written to a file:

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

14. CONCAT function in Hive with Example?

The CONCAT function concatenates its input strings. You can pass any number of strings, separated by commas.

Example:

CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:

Hive-performs-good-in-Hadoop

Here the delimiter '-' has to be repeated between every pair of strings. If the delimiter is the same for all of the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times specified in the second argument.

Example:

REPEAT('Hadoop',3);

Output:

HadoopHadoopHadoop

Note: you can include a trailing space in the input string if you want the repetitions separated.

16. How does Pig integrate with MapReduce to process data?

Pig makes data processing easier to express. When a programmer writes a Pig Latin script to analyse a data set, the Pig compiler converts the script into a series of MapReduce jobs, and the Pig engine submits those jobs to the cluster. MapReduce processes the data and produces the output; the output is not returned to Pig but is stored directly in HDFS.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

18. How many ways we can run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both Local and Mapreduce modes of execution.

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • Ctrl-E key combination will move the cursor to the end of the line.
  • Grunt remembers command history, and can recall lines in the history buffer using up or down cursor keys.
  • Grunt supports an auto-completion mechanism, which tries to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution in a single JVM: Pig runs on the local host and reads and writes files on the local file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster; this is the default mode.

21. What are the main differences between local mode and MapReduce mode?

Local mode:

There is no need to install or start Hadoop. Pig scripts run on the local system and, by default, read and write data on the local file system. The commands are identical in local and MapReduce mode, so nothing in a script needs to change.

MapReduce Mode:

A running Hadoop cluster is mandatory. Pig scripts run as MapReduce jobs and the data is stored in HDFS. In both modes, Java and Pig must be installed.

22. Can we process vast amounts of data in local mode? Why not?

No. A single system has a limited, fixed amount of storage, whereas a Hadoop cluster can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing vast amounts of data.

23. Does Pig support multi-line commands?

Yes

24. Hive doesn’t support multi-line comments; what about Pig?

Pig supports both single-line and multi-line comments.

Single-line comments (--):

Dump B; -- executes and displays the relation, but does not store it in the file system.

Multi-line comments (/* */):

Store B into '/output'; /* stores/persists the data in HDFS or the local file system. In production, the Store command is the one most often used. */

25. What is the difference between Pig and SQL?

  • Pig is procedural; SQL is declarative.
  • Pig has a nested relational data model; SQL has a flat relational data model.
  • In Pig the schema is optional; in SQL a schema is required.
  • Pig is geared towards OLAP workloads; SQL supports both OLAP and OLTP workloads.
  • Pig offers limited query optimization; SQL offers significant opportunity for query optimization.

 



Installing Grafana fails with resource_management.core.exceptions.Fail: Ambari Metrics Grafana data source creation failed. POST request status: 401 Unauthorized

Category : Ambari

When you do a fresh install of Grafana in Ambari 2.4 and then start it, it may fail with the following error.

stderr:   /var/lib/ambari-agent/data/errors-14517.txt

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_grafana.py", line 67, in <module>
    AmsGrafana().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 725, in restart
    self.start(env)
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_grafana.py", line 49, in start
    create_ams_datasource()
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_grafana_util.py", line 261, in create_ams_datasource
    (response.status, response.reason, data))
resource_management.core.exceptions.Fail: Ambari Metrics Grafana data source creation failed. POST request status: 401 Unauthorized {"message":"Invalid username or password"}

Root Cause: On a fresh installation Grafana uses the default username and password (admin/admin); the failure usually means a wrong username and password were supplied, or something went wrong during installation.

Resolution: The Grafana username and password are stored in an SQLite3 database. One way to recover is to reset the password back to admin first and then change it from the Grafana dashboard. The following steps can be used:

1. Log on to the node where Grafana is installed, sudo to the ams user, and open the Grafana SQLite3 database as follows.

[s0998dnz@server1 ~]$ sudo su - ams

Last login: Thu Nov 24 04:11:25 EST 2016

[ams@server1 ~]$ sqlite3 /var/lib/ambari-metrics-grafana/grafana.db

SQLite version 3.6.20

Enter ".help" for instructions

Enter SQL statements terminated with a ";"

sqlite> select salt, password from user;

GZtvpYh3e0|56053ec1580b26a14b339f3c95c1e51117f7ce730c5400955c0288c650deba14a3dbeb70ba4a65464d822a9fa47fc7f7c6ba

sqlite> update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';

sqlite> .exit

2. Once done, edit the Ambari Metrics service configs in Ambari and update the Grafana password to admin.

3. Restart the Ambari Metrics Server

4. Access the Grafana page using the Quick Links under the Ambari Metrics service dashboard.

5. Click on the Grafana symbol in the top-left corner of the screen and sign in as the admin user.



Standby NameNode is failing and only one NameNode is running

Category : HDFS

The standby NameNode is unable to start up, or, once the standby NameNode is brought up, the active NameNode goes down soon after, leaving only one live NameNode. The NameNode log shows:

FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to ))

java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. 

ROOT CAUSE: 

This can be caused by a network issue that makes the JournalNodes take a long time to sync. The following snippet from a JournalNode log shows an unusually long sync:

WARN server.Journal (Journal.java:journal(384)) - Sync of transaction range 187176137-187176137 took 44461ms

WARN ipc.Server (Server.java:processResponse(1027)) - IPC Server handler 3 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from Call#10890 Retry#0: output error

INFO ipc.Server (Server.java:run(2105)) - IPC Server handler 3 on 8485 caught an exception

java.nio.channels.ClosedChannelException

at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)

at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)

at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2573)

at org.apache.hadoop.ipc.Server.access$1900(Server.java:135)

at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:977)

at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1042)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2094)

 

RESOLUTION:

Increase the values of the following JournalNode timeout properties (set in hdfs-site.xml):

dfs.qjournal.select-input-streams.timeout.ms = 60000 

dfs.qjournal.start-segment.timeout.ms = 60000 

dfs.qjournal.write-txns.timeout.ms = 60000