Qs 1. What is the Hadoop framework?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
- Hadoop is part of the Apache project sponsored by the Apache Software Foundation.
- Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
- The Hadoop framework is used by major players including Yahoo, Facebook and IBM, largely for applications involving search engines and advertising.
- Linux is the preferred operating system, but Hadoop can also run on Windows, BSD and OS X.
Qs 2. What is MapReduce?
MapReduce is a parallel programming model used to process large data sets across hundreds or thousands of servers in a Hadoop cluster.
- MapReduce brings the computation to the data's location, in contrast to traditional parallelism, which brings the data to the computation.
- The term MapReduce refers to two phases, Map and Reduce. The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce job is always performed after the map job. The primary programming language for MapReduce is Java, and all data emitted in the flow of a MapReduce program is in the form of key/value pairs.
Qs 3. What are the main parts of a MapReduce program?
A MapReduce program consists of the following 3 parts:
- Driver
- Mapper
- Reducer
1. The Driver code runs on the client machine; it builds the job configuration and submits the job to the Hadoop cluster.
2. The Mapper code reads the input files as <Key,Value> pairs and emits key/value pairs. In the classic API the Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.
3. The Reducer code reads the outputs generated by the different mappers as <Key,Value> pairs and emits key/value pairs. In the classic API the Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface also expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables (see the sketch below).
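For illustration, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API described above; the class names and the whitespace tokenization are illustrative choices, not part of the original answer:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: input <LongWritable, Text> (byte offset, line), output <Text, IntWritable> (word, 1)
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        output.collect(new Text(token), ONE);   // emit <word, 1>
      }
    }
  }
}

// Reducer: input <Text, list of IntWritable>, output <Text, IntWritable> (word, total)
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));  // emit <word, total count>
  }
}
```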
Qs 4. Which interfaces need to be implemented to create a Mapper and Reducer for Hadoop?
- org.apache.hadoop.mapreduce.Mapper
- org.apache.hadoop.mapreduce.Reducer
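Note that in the newer org.apache.hadoop.mapreduce API these are base classes whose map()/reduce() methods you override, while the older org.apache.hadoop.mapred API exposes Mapper and Reducer as interfaces (implemented together with extending MapReduceBase). A minimal new-API mapper sketch, with illustrative names:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: extend Mapper and override map(); records are emitted via the Context object.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);  // emit <word, 1>
      }
    }
  }
}
```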
Qs 5. What is a Mapper?
The Mapper is the first phase of a MapReduce job and carries out the map tasks. A Mapper reads input key/value pairs and emits intermediate key/value pairs.
- Maps are the individual tasks that transform input records into intermediate records.
- The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
Qs 6. What are the daemons that make up a Hadoop cluster?
Hadoop is comprised of five separate daemons, each of which runs in its own JVM.
The following 3 daemons run on master nodes:
NameNode - This daemon stores and maintains the metadata for HDFS. The NameNode is the master server in Hadoop and manages the file system namespace and access to the files stored in the cluster.
Secondary NameNode - The secondary NameNode is not a redundant standby for the NameNode; instead it provides periodic checkpointing and housekeeping tasks.
JobTracker - Each cluster has a single JobTracker that manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers.
The following 2 daemons run on each slave node:
DataNode – Stores the actual HDFS data blocks. The DataNode manages the storage attached to a node, of which there can be many in a cluster; each node storing data runs a DataNode daemon.
TaskTracker – Responsible for instantiating and monitoring individual map and reduce tasks; one TaskTracker per DataNode performs the actual work.
Qs 7. What is InputSplit in Hadoop?
An InputSplit is the chunk of input processed by a single map task; each split is processed by exactly one Mapper. In other words, an InputSplit represents the data to be processed by an individual Mapper. Each split is divided into records, and the map processes each record, which is a key/value pair. Loosely speaking, a split covers a group of records (for example, a set of lines in a file), and a record is one key/value pair within that split.
- The length of the InputSplit is measured in bytes.
- Every InputSplit has storage locations (hostname strings). The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible.
For example, for a 128 MB input file with an HDFS block size of 64 MB:
- Case 1 – input split size [64 MB] = block size [64 MB], # of map tasks: 2
- Case 2 – input split size [32 MB] < block size [64 MB], # of map tasks: 4
- Case 3 – input split size [128 MB] > block size [64 MB], # of map tasks: 1
Qs 8. What is the InputFormat in Hadoop?
The InputFormat class is one of the fundamental classes in the Hadoop MapReduce framework. This class is responsible for defining two main things:
- Data splits
- Record reader
- The data split is a fundamental concept in the Hadoop MapReduce framework which defines both the size of individual map tasks and their potential execution servers.
- The record reader is responsible for actually reading records from the input file and submitting them (as key/value pairs) to the mapper, as sketched below.
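These two responsibilities correspond to the two methods of the classic org.apache.hadoop.mapred.InputFormat interface; the sketch below simply restates that contract for reference (the interface name here is illustrative, not the real class):

```java
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Simplified restatement of the classic InputFormat contract.
public interface MyInputFormat<K, V> {
  // Splits the job's input into logical InputSplits, one per map task.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Returns a RecordReader that turns a split's bytes into <key, value> records.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException;
}
```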
Qs 9. Suppose a job's input consists of three files of sizes 64 KB, 65 MB and 127 MB, and the HDFS block size is 64 MB. How many input splits will Hadoop create?
Hadoop will make 5 splits, as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
Qs 10. What is the JobTracker?
The JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Qs 11. What if the JobTracker machine goes down?
In Hadoop 1.0 the JobTracker is a single point of failure: if the JobTracker fails, all running jobs must be restarted and the overall execution flow is interrupted. Due to this limitation, in Hadoop 2.0 the JobTracker concept is replaced by YARN.
In YARN, the terms JobTracker and TaskTracker disappear. YARN splits the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate components:
- a global ResourceManager
- a per-application ApplicationMaster
(The node-specific NodeManager takes over the TaskTracker's role of launching and monitoring work on each slave node.)
Qs 12. What happens when a DataNode fails?
When a DataNode fails:
- The JobTracker and NameNode detect the failure.
- All tasks that were running on the failed node are re-scheduled on other nodes.
- The NameNode re-replicates the user's data to another node.
Qs 13. What are the typical tasks of the JobTracker?
The following are some typical tasks of the JobTracker:
- Client applications submit MapReduce jobs to the JobTracker.
- The JobTracker talks to the Name node to determine the location of the data.
- The JobTracker locates Tasktracker nodes with available slots at or near the data
- The JobTracker submits the work to the chosen Tasktracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
Qs 14. What does the user of the MapReduce framework need to specify?
The user of the MapReduce framework needs to specify the following (see the driver sketch after this list):
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- Input format
- Output format
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
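A minimal driver sketch covering these items, assuming the classic org.apache.hadoop.mapred API and the illustrative WordCountMapper/WordCountReducer classes from the earlier sketch; the input and output paths are taken from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // The JAR containing the mapper, reducer and driver classes.
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // job input location in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // job output location in HDFS

    conf.setInputFormat(TextInputFormat.class);               // input format
    conf.setOutputFormat(TextOutputFormat.class);             // output format

    conf.setMapperClass(WordCountMapper.class);               // class containing the map function
    conf.setReducerClass(WordCountReducer.class);             // class containing the reduce function

    conf.setOutputKeyClass(Text.class);                       // final output key type
    conf.setOutputValueClass(IntWritable.class);              // final output value type

    JobClient.runJob(conf);                                   // submit and wait for completion
  }
}
```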
Qs 15. What is a TaskTracker?
A TaskTracker is a node in the cluster that accepts tasks – map, reduce and shuffle operations – from a JobTracker.
- Each TaskTracker is responsible for executing and managing the individual tasks assigned to it by the JobTracker.
- The TaskTracker also handles the data motion between the map and reduce phases.
- One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks back to the JobTracker.
- If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster
Qs 16. What is a Heartbeat in Hadoop?
A heartbeat is a signal sent periodically from a DataNode to the NameNode and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving heartbeats from a DataNode or TaskTracker, it concludes that there is a problem with that node and stops assigning work to it.
Qs 17. Explain what Sqoop is in Hadoop.
Sqoop is a connectivity tool that transfers data in both directions between relational databases (MySQL, Oracle, Teradata) or data warehouses and Hadoop storage such as HDFS, Hive and HBase.
- Sqoop allows easy import and export of data from structured data stores.
- Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Qs 18. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
Hadoop will restart the task on some other TaskTracker; only if the task fails more than four times (the default setting, which can be changed) will it kill the whole job.
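As a rough sketch, assuming a Hadoop 1.x JobConf named conf as in the driver example under Qs 14, the per-task retry limit can be changed before the job is submitted:

```java
// Inside the driver, before JobClient.runJob(conf):
conf.setInt("mapred.map.max.attempts", 6);     // allow each map task up to 6 attempts
conf.setInt("mapred.reduce.max.attempts", 6);  // allow each reduce task up to 6 attempts
```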
Qs 19. What is speculative execution (also called backup tasks)? What problem does it solve?
During speculative execution, Hadoop launches a certain number of duplicate (backup) tasks: multiple copies of the same map or reduce task may run on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop starts a duplicate copy of that task on another node; whichever copy finishes first is kept, and the remaining slower copies are killed. This solves the problem of a single slow node (a straggler) dragging out the completion time of the whole job. A configuration sketch follows.
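A small sketch of how speculative execution can be toggled per job, assuming the Hadoop 1.x property names and a JobConf named conf as in the driver sketch under Qs 14:

```java
// Speculative execution is enabled by default for both phases.
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
// It can be disabled selectively, e.g. for reduce tasks with side effects.
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
```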
Qs 20. Explain the use of TaskTracker in the Hadoop cluster?
- A TaskTracker is a slave-node daemon in the cluster that accepts tasks from the JobTracker, such as map, reduce or shuffle operations. The TaskTracker also runs in its own JVM process.
- Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this is to ensure that a process failure does not take down the TaskTracker.
- The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker.
- The TaskTracker also sends out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Qs 21. What are the basic parameters of a Mapper?
The basic parameters of a Mapper are:
- LongWritable and Text (the input key and value types)
- Text and IntWritable (the output key and value types)
Qs 22. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. A usage sketch follows.
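A minimal usage sketch, assuming the classic API and a hypothetical HDFS path: the driver registers the file, and each task then reads the locally cached copy:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetupExample.class);
    // Driver side: register a file (hypothetical path) so the framework copies it
    // to every slave node before any task of this job runs there.
    DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.txt"), conf);
    // ... configure mapper/reducer/input/output and submit the job as usual ...
  }
}
// Task side, e.g. inside Mapper.configure(JobConf job), the locally cached copies
// can be located with:
//   Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
```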
Qs 23. What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
The distributed cache is much faster. It copies the file to every TaskTracker once, at the start of the job. If that TaskTracker then runs 10 or 100 mappers or reducers, they all share the same local copy. On the other hand, if you write code in the MapReduce job to read the file from HDFS, then every mapper accesses it from HDFS, so if a TaskTracker runs 100 map tasks it will read the file from HDFS 100 times. HDFS is also not very efficient when used in this way.
Qs 24. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically, by calling the method setNumReduceTasks on the JobConf class, or set it as a configuration setting (the mapred.reduce.tasks property), as sketched below.
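A minimal sketch, again assuming the JobConf driver shown under Qs 14 (jar and class names are illustrative):

```java
// Programmatically, in the driver before the job is submitted:
conf.setNumReduceTasks(10);   // request 10 reduce tasks for this job

// Or as a configuration setting, e.g. on the command line of a driver that
// uses ToolRunner/GenericOptionsParser:
//   hadoop jar wordcount.jar WordCountDriver -D mapred.reduce.tasks=10 input output
```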
Qs 25. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you will have to do, at a minimum, the following three things (see the sketch after this list):
- Create a new class that extends the Partitioner class
- Override the method getPartition
- In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the method setPartitionerClass, or add it to the job as a config file (if your wrapper reads from a config file or Oozie)
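A minimal sketch using the newer org.apache.hadoop.mapreduce API; the partitioning rule and class name are illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers based on the first character of the key.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;                                   // empty keys all go to reducer 0
    }
    int hash = key.toString().charAt(0);          // first character of the key
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver (new API):
//   job.setPartitionerClass(FirstCharPartitioner.class);
```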
Qs 26. What are the differences between HDFS and NAS?
Following are the differences between HDFS and NAS (Network Attached Storage):
- In HDFS Data Blocks are distributed across local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
- HDFS is designed to work with MapReduce System, since computation is moved to data. NAS is not suitable for MapReduce since data is stored separately from the computations.
- HDFS runs on a cluster of machines and provides redundancy by replicating data blocks, whereas NAS is provided by a single machine and therefore does not provide the same data redundancy.