Big Data Hadoop Interview Questions for Freshers and Experienced

1. What is the difference between Hadoop and a traditional RDBMS?

A. Datatypes :

Hadoop : Processes semi-structured and unstructured data.

RDBMS : Processes structured data.

B. Best Fit for Applications :

Hadoop : Data discovery and Massive Storage/Processing of Unstructured data.

RDBMS : Best suited for OLTP and complex ACID transactions.

C. Schema :

Hadoop : Schema on Read

RDBMS : Schema on Write

D. Speed :

Hadoop : Writes are fast, since no schema validation is performed when data is loaded.

RDBMS : Reads are fast, because the schema (and any indexes) of the data are already known.

2. How does Big Data help increase business revenue?

  • Highly scalable: Traditional data platforms struggle to scale to heavy workloads, whereas a big data platform scales out horizontally by adding commodity nodes.
  • Cost savings: Most big data platforms come with open-source licensing options, so the total cost of ownership of a big data BI architecture is significantly lower.
  • MapReduce: The MapReduce programming model provides powerful custom data management and processing capabilities.
  • Unstructured data support: A major advantage of big data platforms is that they support structured, semi-structured, and unstructured data.

3. Which companies are using the Hadoop platform?

  • Yahoo
  • Google
  • Facebook
  • Twitter
  • Amazon
  • Ebay
  • Hotstar
  • Spotify
  • Adobe
  • IBM
  • Microsoft
  • Intel
  • Flipkart
  • Alibaba
  • Infosys
  • Cognizant

4. What are the most common Input Formats in Hadoop?

  • Text Input Format: the default input format in Hadoop; each line of the file becomes a record, with the byte offset as the key and the line contents as the value.
  • Key Value Text Input Format: used for plain text files where each line is split into a key and a value by a separator character (tab by default).
  • Sequence File Input Format: used for reading SequenceFiles, Hadoop's binary key/value file format.
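As an illustration, the input format is selected in the job driver. A minimal sketch (the job name and input path here are hypothetical, and running it requires the Hadoop client libraries on the classpath):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("input-format-demo");

        // TextInputFormat is the default, so it needs no explicit call.
        // To read tab-separated key/value lines instead:
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
    }
}
```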

5. How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, check whether any orphaned jobs are still running; if so, locate the ResourceManager (RM) logs.

  1. Run: “ps -ef | grep -i ResourceManager”
    and look for the log directory in the output. Find the job ID in the displayed list and check whether there is any error message associated with that job.
  2. Using the RM logs, identify the worker node that was involved in the execution of the task.
  3. Log in to that node and run: “ps -ef | grep -i NodeManager”
  4. Examine the NodeManager log. The majority of errors come from the user-level logs of each MapReduce job.


6. How to compress mapper output but not the reducer output?

To achieve this, set the following boolean properties on the job configuration (these are booleans, so the typed setter is used):

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
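In a full job driver the same settings are typically applied together with a compression codec for the map output. A hedged sketch (the choice of SnappyCodec is an assumption; any installed codec works, and the fragment needs the Hadoop client libraries to compile):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class CompressionConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Compress intermediate map output to reduce shuffle traffic...
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // ...but leave the final reducer output uncompressed.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
    }
}
```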

7. What is the difference between Map Side join and Reduce Side Join?

A map-side join is performed while the data is being read, before it reaches the map function. It requires a strict structure: the input datasets must already be partitioned and sorted on the join key. A reduce-side join (repartitioned join), on the other hand, is simpler because the input datasets need not be structured; however, it is less efficient, since all the data must pass through the sort and shuffle phases, incurring network overhead.
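A reduce-side join is commonly implemented by having each mapper tag its records with their source dataset, so the reducer can tell the two inputs apart when they arrive grouped by key. A minimal sketch of one such mapper (the comma-separated field layout and the "A" tag are hypothetical, and the class needs the Hadoop client libraries to compile):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for one side of a reduce-side join: emits (joinKey, "A|record")
// so the reducer can group matching records from both datasets by key.
public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // fields[0] is assumed to hold the join key
        ctx.write(new Text(fields[0]), new Text("A|" + line));
    }
}
```

A second mapper would emit the same keys with a different tag (e.g. "B|"), and the reducer would combine the tagged records sharing a key.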

8. What are real-time industry applications of Hadoop?

Hadoop, well known as Apache Hadoop, is an open-source software framework for scalable, distributed storage and processing of large volumes of data. It provides rapid, high-performance, and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise, and it is used in almost all departments and sectors today. Some of the instances where Hadoop is used:

  • Managing traffic on streets.
  • Streaming processing.
  • Content Management and Archiving Emails.
  • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
  • Fraud detection and Prevention.
  • Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream, transaction, video and social media data.
  • Managing content, posts, images and videos on social media platforms.
  • Analyzing customer data in real-time for improving business performance.
  • Public sector fields such as intelligence, defense, cyber security and scientific research.
  • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

9. What are the core methods of a Reducer?

The three core methods of a Reducer are:

  1. setup(): called once at the start of the task; used for configuring parameters such as the input data size or the distributed cache.
    protected void setup(Context context)
  2. reduce(): the heart of the reducer; called once per key, with all the values associated with that key.
    public void reduce(Key key, Iterable<Value> values, Context context)
  3. cleanup(): called only once, at the end of the task, to clean up temporary files and resources.
    protected void cleanup(Context context)
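For instance, a word-count style reducer overriding all three methods might look like the following sketch (it assumes the new-API `org.apache.hadoop.mapreduce.Reducer` and needs the Hadoop client libraries to compile):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the integer values grouped under each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result;

    @Override
    protected void setup(Context context) {
        // Called once per task before any reduce() call: initialize state,
        // read configuration, open side files from the distributed cache, etc.
        result = new IntWritable();
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all the values grouped under that key.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last reduce() call:
        // close resources and delete temporary files here.
    }
}
```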

10. What is SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

  1. Uncompressed key/value records.
  2. Record compressed key/value records – only ‘values’ are compressed here.
  3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
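To illustrate, here is a small writer that creates a block-compressed SequenceFile using the option-based `SequenceFile.Writer` API (a sketch; the output path comes from the command line, and it needs the Hadoop client libraries to compile):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The compression type corresponds to the three formats above:
        // NONE, RECORD, or BLOCK.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[0])),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            writer.append(new Text("alpha"), new IntWritable(1));
        }
    }
}
```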

 
