Top Hadoop Interview Questions


1. What are the Side Data Distribution Techniques?

Side data refers to the extra, small, static data that a MapReduce job needs in order to do its work. The main challenge is making that side data available on the node where a map task executes. Hadoop provides two side data distribution techniques.

Using Job Configuration

Arbitrary key-value pairs can be set in the job configuration, and every task can read them back from its own copy of the configuration. (The second technique, the distributed cache, is covered in question 4 below.)
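
A minimal sketch of this technique, assuming the newer org.apache.hadoop.mapreduce API: the driver would call conf.set("filter.country", "IN") before submitting the job, and the mapper below reads that value back in setup(). The property name filter.country and the filtering logic are illustrative, not part of Hadoop.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String country;

    @Override
    protected void setup(Context context) {
        // Read the side data back from the job configuration on whichever
        // node this map task happens to run.
        Configuration conf = context.getConfiguration();
        country = conf.get("filter.country", "IN"); // hypothetical property name
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(country)) {
            context.write(new Text(country), value);
        }
    }
}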

2. What is shuffling in MapReduce?

Once map tasks start to complete, communication with the reducers begins: each reducer fetches the map output it needs to process, while the data nodes may still be running other tasks. This transfer of mapper output to the reducers is known as shuffling.

3. What is partitioning?

Partitioning is the process of identifying the reducer instance that will receive a given piece of mapper output. Before the mapper emits a (key, value) pair, it determines which reducer is the recipient. Every occurrence of a key, no matter which mapper generated it, must go to the same reducer.
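
A minimal sketch of a custom partitioner, assuming the newer org.apache.hadoop.mapreduce API; the class name and the first-letter scheme are illustrative. Because the partition is computed only from the key, every record with the same key lands on the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Derive the reducer index from the key alone, so identical keys
        // (regardless of which mapper emitted them) go to the same reducer.
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

The driver registers it with job.setPartitionerClass(FirstLetterPartitioner.class); when no partitioner is set, Hadoop's default HashPartitioner applies the same idea to the key's hash code.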

4. What is Distributed Cache in mapreduce framework?

Distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives, and JARs, which an application can use to improve performance. The application registers the files to be cached with the job, and the MapReduce framework copies them to the task nodes before any task executes, so every mapper and reducer can read them locally.
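
A minimal driver sketch, assuming Hadoop 2.x where cache files are registered on the Job object (older releases used the DistributedCache class with JobConf); the HDFS path and alias are illustrative.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");

        // The framework copies this file to the local disk of every node that
        // runs a task, so mappers and reducers can open it as "countries"
        // from their working directory.
        job.addCacheFile(new URI("/lookup/countries.txt#countries"));

        // ... set mapper, reducer, input and output paths as usual ...
    }
}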

5. What is a job tracker?

The JobTracker is a daemon that runs on the master node (often alongside the NameNode) for submitting and tracking jobs. A job in Hadoop terminology refers to a MapReduce job. The JobTracker breaks the job up into tasks, which are deployed to the data nodes holding the required data. In a Hadoop cluster the JobTracker is the master and the tasks act as its children: they run, make progress, and report that progress back to the JobTracker through heartbeats.

6. How do you set which framework is used to run a MapReduce program?

The mapreduce.framework.name property controls this (a minimal example of setting it follows the list). It can be set to:

  1. local
  2. classic
  3. yarn
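
A minimal sketch of setting the property from a driver; in practice the same key is usually placed in mapred-site.xml instead. The value "yarn" here is just one of the three options listed above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkSelection {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // or "local" / "classic"
        Job job = Job.getInstance(conf, "framework-demo");
        // ... rest of the job setup ...
    }
}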

7. What is the replication factor for a job's JAR?

The job's JAR is one of the most heavily used resources while tasks are running, so its replication factor is set to 10 by default, which makes a nearby copy available to more of the task nodes.

8. What is the mapred.job.tracker property used for?

The mapred.job.tracker property tells the job runner where to find the JobTracker. If it is set to local, the runner submits the job to a local job tracker running inside a single JVM; otherwise, the job is sent to the address given in the property.
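
A minimal sketch using the classic (MR1) JobConf; the host and port below are illustrative.

import org.apache.hadoop.mapred.JobConf;

public class TrackerSelection {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Run the job inside a single local JVM (the local job runner).
        conf.set("mapred.job.tracker", "local");

        // Or submit to a real JobTracker at the given address instead:
        // conf.set("mapred.job.tracker", "jobtracker-host:8021");
    }
}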

9. What is the difference between Job.submit() and waitForCompletion()?

Job.submit() internally creates a submitter instance and submits the job, returning immediately. waitForCompletion() also submits the job, but then polls its progress at a regular interval of one second; if the job completes successfully it prints a success message to the console, otherwise it prints the relevant error message.
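
A minimal driver sketch contrasting the two calls (newer API); the input/output arguments and job name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-vs-wait");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Option 1: submit() returns immediately, leaving the job to run on its own.
        // job.submit();

        // Option 2: waitForCompletion(true) submits, then polls and prints progress
        // until the job finishes, returning true on success.
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}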


10. What are the types of tables in Hive?

There are two types of tables.

  1. Managed tables.
  2. External tables.

The main difference is the DROP TABLE behaviour: dropping a managed table removes both the metadata and the underlying data, while dropping an external table removes only the metadata and leaves the data in place. Otherwise, the two types of tables are very similar.

11. Does Hive support record level Insert, delete or update?

Hive does not provide record-level update, insert, or delete, and consequently it does not provide transactions either. However, users can use CASE expressions and Hive's built-in functions to achieve the effect of these DML operations, so a complex update that is a single query in an RDBMS may need many lines of code in Hive.

12. What kind of data warehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) The data is not changing rapidly.

Hive doesn't provide crucial features required for OLTP (Online Transaction Processing); it is closer to an OLAP (Online Analytical Processing) tool. So Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

13. How can the columns of a table in Hive be written to a file?

The output of the HiveQL DESCRIBE command can be piped through awk in the shell and redirected to a file; awk keeps only the first field, which is the column name.

hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

14. CONCAT function in Hive with Example?

The CONCAT function concatenates its input strings. You can specify any number of strings, separated by commas.

Example:

CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:

Hive-performs-good-in-Hadoop

Here the '-' delimiter has to be repeated between every pair of strings. When the same delimiter is used throughout, Hive provides another function, CONCAT_WS, where the delimiter is specified once as the first argument.

CONCAT_WS('-','Hive','performs','good','in','Hadoop');

Output: Hive-performs-good-in-Hadoop

15. REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times specified in the command.

Example:

REPEAT('Hadoop',3);

Output:

HadoopHadoopHadoop

Note: you can also include a space in the input string, for example to separate the repetitions.

16. How does Pig integrate with MapReduce to process data?

Pig makes these jobs easier to write. When a programmer writes a Pig Latin script to analyze a data set, the Pig compiler converts the script into MapReduce jobs and the Pig engine runs them. MapReduce processes the data and generates the output; the output is not returned to Pig but is stored directly in HDFS.

17. What is the difference between logical and physical plan?

Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

18. In how many ways can we run Pig programs?

Pig programs or commands can be executed in three ways.

  • Script – Batch Method
  • Grunt Shell – Interactive Method
  • Embedded mode

All these ways can be applied to both Local and Mapreduce modes of execution.

19. What is Grunt in Pig?

Grunt is an Interactive Shell in Pig, and below are its major features:

  • Ctrl-E key combination will move the cursor to the end of the line.
  • Grunt remembers command history, and can recall lines in the history buffer using up or down cursor keys.
  • Grunt supports an auto-completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key.

20. What are the modes of Pig Execution?

Local Mode:

Local execution in a single JVM; all files are accessed and processed using the local host and the local file system.

Mapreduce Mode:

Distributed execution on a Hadoop cluster; it is the default mode.

21. What are the main differences between local mode and MapReduce mode?

Local mode:

There is no need to install or start Hadoop. Pig scripts run on the local system and, by default, Pig stores data in the local file system. The commands are exactly the same in local and MapReduce modes, so nothing in the scripts needs to change.

MapReduce Mode:

It is mandatory to start Hadoop. Pig scripts run against data stored in HDFS. In both modes, Java and Pig installations are mandatory.

22. Can we process a vast amount of data in local mode? Why?

No. A single system has a limited, fixed amount of storage, whereas a Hadoop cluster can handle vast amounts of data. So pig -x mapreduce mode is the best choice for processing a vast amount of data.

23. Does Pig support multi-line commands?

Yes

24. Hive doesn’t support multi-line commands, what about Pig?

Pig supports both single-line and multi-line commands, and likewise single-line and multi-line comments.

Single-line comment:

Dump B; -- executes the statement but does not store the result in the file system.

Multi-line comment:

Store B into '/output'; /* stores/persists the data in HDFS or the local file system; in production, Store is the command most often used */

25. What is the difference between Pig and SQL?

Pig and SQL differ in several ways:

  • Pig is procedural; SQL is declarative.
  • Pig uses a nested relational data model; SQL uses a flat relational model.
  • In Pig a schema is optional; in SQL a schema is required.
  • Pig suits OLAP workloads; SQL supports both OLAP and OLTP workloads.
  • Pig offers limited query optimization; SQL engines have significant opportunity for query optimization.