Friday, October 28, 2016

Hadoop 1.x: Architecture, Major Components and How HDFS and MapReduce Work


In this post, we are going to discuss the Apache Hadoop 1.x architecture and how its components work, in detail.

Post’s Brief Table of Contents

  • Hadoop 1.x Architecture
  • Hadoop 1.x Major Components
  • How Hadoop 1.x Major Components Work
  • How Store and Compute Operations Work in Hadoop

Hadoop 1.x Architecture

Apache Hadoop 1.x (and earlier versions) uses the following Hadoop architecture. This is the Hadoop 1.x high-level architecture; we will discuss the detailed low-level architecture in the coming sections.
If you don't fully understand this architecture at this stage, there is no need to worry. Read the next sections in this post, and the coming posts, to understand it very well.
[Image: Hadoop 1.x components]
  • The Hadoop Common Module is the Hadoop base API (a JAR file) for all Hadoop components. All other components work on top of this module.
  • HDFS stands for Hadoop Distributed File System. It is also known as HDFS V1, as it is part of Hadoop 1.x. It is used as the distributed storage system in the Hadoop architecture.
  • MapReduce is a batch processing or distributed data processing module. It is built by following Google's MapReduce algorithm. It is also known as "MR V1" or "Classic MapReduce", as it is part of Hadoop 1.x.
  • All remaining Hadoop Ecosystem components work on top of these two major components: HDFS and MapReduce. We will discuss all Hadoop Ecosystem components in detail in my coming posts.
NOTE:-
Hadoop 1.x MapReduce is also known as "Classic MapReduce", as it was developed by following Google's MapReduce tech paper.

Hadoop 1.x Major Components

The Hadoop 1.x major components are HDFS and MapReduce. They are also known as the "Two Pillars" of Hadoop 1.x.
HDFS:
HDFS is the Hadoop Distributed File System, where our Big Data is stored using commodity hardware. It is designed to work with large data sets, with a default block size of 64 MB (we can change it as per our project requirements).
HDFS component is again divided into two sub-components:
  1. Name Node
    The Name Node is placed on the Master Node. It stores metadata about the Data Nodes, such as how many blocks are stored on each Data Node, which Data Nodes hold the data, Slave Node details, Data Node locations, timestamps, etc.
  2. Data Node
    Data Nodes are placed on the Slave Nodes. They store our application's actual data, in data blocks of 64 MB by default (a small configuration sketch follows this list).
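As a minimal illustration (a sketch, assuming a typical Hadoop 1.x setup), the Java code below uses the standard Hadoop FileSystem API to write a file into HDFS while overriding the Hadoop 1.x block size and replication properties. The NameNode address and file path here are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally taken from core-site.xml
    conf.set("fs.default.name", "hdfs://master:9000");
    // Hadoop 1.x property names: 64 MB block size and replication factor 3
    conf.set("dfs.block.size", String.valueOf(64 * 1024 * 1024));
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    // The Name Node records the metadata; the blocks themselves are stored on Data Nodes
    FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"));
    out.writeUTF("hello hdfs");
    out.close();
    fs.close();
  }
}

In a real cluster, these values normally live in core-site.xml and hdfs-site.xml rather than being set in code.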
MapReduce:
MapReduce is a distributed data processing or batch processing programming model. Like HDFS, the MapReduce component also uses commodity hardware to process a high volume and variety of data at a high velocity, in a reliable and fault-tolerant manner.
MapReduce component is again divided into two sub-components:
  1. Job Tracker
    The Job Tracker assigns MapReduce tasks to the Task Trackers in the cluster of nodes. Sometimes it reassigns the same tasks to other Task Trackers when the original Task Trackers fail or are shut down.
    The Job Tracker maintains the status of all Task Trackers: up/running, failed, recovered, etc. (A small client-side sketch follows this list.)
  2. Task Tracker
    The Task Tracker executes the tasks assigned by the Job Tracker and sends the status of those tasks back to the Job Tracker.
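For illustration, here is a tiny sketch (assuming a typical Hadoop 1.x setup) of the client-side property that tells MapReduce where the Job Tracker runs; the host and port are hypothetical and would normally come from mapred-site.xml.

import org.apache.hadoop.mapred.JobConf;

public class JobTrackerAddressExample {
  public static void main(String[] args) {
    // JobConf is the classic (MR V1) job configuration class
    JobConf conf = new JobConf();
    // Hypothetical Job Tracker address; Task Trackers on the Slave Nodes report to it
    conf.set("mapred.job.tracker", "master:9001");
    System.out.println("Job Tracker: " + conf.get("mapred.job.tracker"));
  }
}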
[Image: Hadoop 1.x HDFS and MapReduce components]
We will discuss these four sub-components' responsibilities, and how they interact with each other to perform client application tasks, in detail in the next section.

How Hadoop 1.x Major Components Work

Hadoop 1.x components follow this architecture to interact with each other and to work in parallel in a reliable and fault-tolerant manner.
Hadoop 1.x Components High-Level Architecture
[Image: Hadoop 1.x components high-level architecture]
  • Both the Master Node and the Slave Nodes contain two Hadoop components:
    1. HDFS Component
    2. MapReduce Component
  • Master Node’s HDFS component is also known as “Name Node”.
  • Slave Node’s HDFS component is also known as “Data Node”.
  • Master Node’s “Name Node” component is used to store Meta Data.
  • Slave Node’s “Data Node” component is used to store our application’s actual Big Data.
  • HDFS stores data in “Data Slots” or “Data Blocks” of 64 MB each.
  • Master Node’s MapReduce component is also known as “Job Tracker”.
  • Slave Node’s MapReduce component is also known as “Task Tracker”.
  • Master Node’s “Job Tracker” takes care of assigning tasks to the “Task Trackers” and receiving results from them.
  • Slave Node’s MapReduce component “Task Tracker” contains two MapReduce Tasks:
    1. Map Task
    2. Reduce Task
    We will discuss MapReduce tasks (Mapper and Reducer) in detail in my coming post with some simple end-to-end examples; a minimal sketch follows this list.
  • Slave Node’s “Task Tracker” actually performs the client’s tasks by using the MapReduce batch processing model.
  • The Master Node is the primary node that takes care of all the remaining Slave Nodes (secondary nodes).
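To make the Map Task and Reduce Task a bit more concrete before that post, here is a minimal word-count sketch written against the classic Hadoop 1.x API (org.apache.hadoop.mapred). The class names are hypothetical and only meant to show the general shape of a Mapper and a Reducer.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map Task: emits (word, 1) for every word in its input split
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);
    }
  }
}

// Reduce Task: sums the counts emitted for each word
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}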
Hadoop 1.x Components In-detail Architecture
[Image: Hadoop 1.x components in-detail architecture]
Hadoop 1.x Architecture Description
  • Clients (one or more) submit their work to the Hadoop system.
  • When the Hadoop system receives a client request, it is first received by the Master Node.
  • Master Node’s MapReduce component “Job Tracker” is responsible for receiving the client work, dividing it into manageable independent tasks, and assigning them to Task Trackers.
  • Slave Node’s MapReduce component “Task Tracker” receives those tasks from the “Job Tracker” and performs them by using the MapReduce components.
  • Once all Task Trackers have finished their work, the Job Tracker takes those results and combines them into a final result.
  • Finally, the Hadoop system sends that final result to the client.
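A minimal client-side driver (a sketch, assuming the hypothetical WordCountMapper and WordCountReducer shown earlier) illustrates this flow: the client builds a JobConf and submits it with JobClient.runJob, the Job Tracker splits the work into tasks for the Task Trackers, and the call returns once the combined result has been written to the output path.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Output key/value types produced by the reducer
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Hypothetical Mapper/Reducer classes from the earlier sketch
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Input and output locations in HDFS, supplied by the client
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the Job Tracker and waits for all tasks to finish
    JobClient.runJob(conf);
  }
}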

How Store and Compute Operations Work in Hadoop

All these Master and Slave Nodes are organized into a network of clusters. Each cluster is again divided into racks, and each rack contains a set of nodes (commodity computers).
When the Hadoop system receives a "Store" operation, such as storing a large data set into HDFS, it stores that data on 3 different nodes (the default Replication Factor is 3). The complete data is not stored on one single node: the large data file is divided into manageable, meaningful blocks and distributed across different nodes, with 3 copies of each block.
When the Hadoop system receives a "Compute" operation, it talks to nearby nodes to retrieve the required blocks of data. While reading data or computing, if one or more nodes fail, it automatically picks those tasks up again by approaching a nearby, available node.
That's why the Hadoop system provides highly available and fault-tolerant Big Data solutions.
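As a rough illustration of this block placement, the sketch below asks HDFS (via the standard FileSystem API) for the block locations of a file and prints the Data Nodes that hold each block's replicas; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical file already stored in HDFS
    FileStatus status = fs.getFileStatus(new Path("/user/demo/large-dataset.txt"));

    // One entry per block; each block lists the Data Nodes holding its replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}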
NOTE:-
  • The Hadoop 1.x architecture has a lot of limitations and drawbacks, so the Hadoop community re-evaluated and redesigned it into the Hadoop 2.x architecture.
  • The Hadoop 2.x architecture is completely different and resolves all of the Hadoop 1.x architecture's limitations and drawbacks.
That's all about the Hadoop 1.x architecture, its major components, and how those components work together to fulfill client requirements. We will discuss "Hadoop 2.x Architecture, Major Components and How Those Components Work" in my coming post.
We hope you now have a good understanding of the Hadoop 1.x architecture and how it works.
