Friday, October 28, 2016

Hadoop 1.x: Architecture, Major Components and How HDFS and MapReduce Works


In this post, we are going to discuss the Apache Hadoop 1.x Architecture and how its components work, in detail.

Post’s Brief Table of Contents

  • Hadoop 1.x Architecture
  • Hadoop 1.x Major Components
  • How Hadoop 1.x Major Components Work
  • How Store and Compute Operations Work in Hadoop

Hadoop 1.x Architecture

Apache Hadoop 1.x and earlier versions use the following architecture. This is the Hadoop 1.x high-level architecture; we will discuss the detailed low-level architecture in the coming sections.
If you don't fully understand this architecture at this stage, there is no need to worry. Read the next sections in this post, and the coming posts, to understand it well.
[Image: hadoop1.x-components]
  • The Hadoop Common Module is the Hadoop base API (a JAR file) for all Hadoop components. All other components work on top of this module.
  • HDFS stands for Hadoop Distributed File System. It is also known as HDFS V1, as it is part of Hadoop 1.x. It is used as the Distributed Storage System in the Hadoop Architecture.
  • MapReduce is a Batch Processing or Distributed Data Processing module. It is built by following Google's MapReduce algorithm. It is also known as "MR V1" or "Classic MapReduce", as it is part of Hadoop 1.x.
  • All remaining Hadoop Ecosystem components work on top of these two major components: HDFS and MapReduce. We will discuss all Hadoop Ecosystem components in detail in my coming posts.
NOTE:-
Hadoop 1.x MapReduce is also known as "Classic MapReduce", as it was developed by following Google's MapReduce paper.

Hadoop 1.x Major Components

Hadoop 1.x's major components are HDFS and MapReduce. They are also known as the "Two Pillars" of Hadoop 1.x.
HDFS:
HDFS is the Hadoop Distributed File System, where our Big Data is stored on commodity hardware. It is designed to work with large data sets, with a default block size of 64MB (we can change it as per our project requirements; a quick way to check these settings is sketched after the list below).
The HDFS component is again divided into two sub-components:
  1. Name Node: placed on the Master Node. It stores metadata about the Data Nodes, such as how many blocks are stored on which Data Nodes, which Data Nodes have data, Slave Node details, Data Node locations, timestamps, etc.
  2. Data Node: placed on the Slave Nodes. Data Nodes store our application's actual data, in data blocks of 64MB each by default.
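Once a file is stored in HDFS, you can check which block size and replication factor it was written with. The command below is a quick, hedged sketch (the file path is a placeholder; on Hadoop 1.x the shell is invoked as hadoop fs, while newer releases also provide hdfs dfs):
# print the block size (%o, in bytes) and replication factor (%r) of a stored file
hadoop fs -stat "%o %r" /user/hadoop/sample.txt
With the 64MB default, the first value printed would be 67108864 (64 * 1024 * 1024 bytes).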
MapReduce:
MapReduce is a Distributed Data Processing or Batch Processing programming model. Like HDFS, the MapReduce component also uses commodity hardware to process "a high volume and variety of data at a high velocity" in a reliable and fault-tolerant manner.
The MapReduce component is again divided into two sub-components:
  1. Job Tracker: assigns MapReduce tasks to the Task Trackers in the cluster of nodes. If a Task Tracker fails or is shut down, it reassigns those tasks to other Task Trackers. The Job Tracker also maintains the status of all Task Trackers (up/running, failed, recovered, etc.).
  2. Task Tracker: executes the tasks assigned by the Job Tracker and sends the status of those tasks back to the Job Tracker.
[Image: hadoop1.x-hdfs-mr-components]
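To make the Job Tracker / Task Tracker roles concrete, here is a hedged sketch of submitting one of the example MapReduce jobs that ships with Hadoop 1.x. The exact jar name depends on your installation, and the input/output paths are placeholders:
# put some input text into HDFS (path and file are placeholders)
hadoop fs -put input.txt /user/hadoop/wordcount/input/
# submit the bundled WordCount example; the Job Tracker splits it into map and
# reduce tasks and schedules them on the Task Trackers
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /user/hadoop/wordcount/input /user/hadoop/wordcount/output
# read the combined result written by the reduce tasks
hadoop fs -cat /user/hadoop/wordcount/output/part*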
We will discuss these four sub-components' responsibilities, and how they interact with each other to perform client application tasks, in detail in the next section.

How Hadoop 1.x Major Components Work

Hadoop 1.x components follow this architecture to interact with each other and to work in parallel in a reliable and fault-tolerant manner.
Hadoop 1.x Components High-Level Architecture
[Image: hadoop1.x-components-architecture]
  • Both Master Node and Slave Nodes contain two Hadoop Components:
    1. HDFS Component
    2. MapReduce Component
  • Master Node’s HDFS component is also known as “Name Node”.
  • Slave Node’s HDFS component is also known as “Data Node”.
  • Master Node’s “Name Node” component is used to store Meta Data.
  • Slave Node's "Data Node" component is used to store our application's actual Big Data.
  • HDFS stores data in "Data Blocks" of 64MB each by default.
  • Master Node’s MapReduce component is also known as “Job Tracker”.
  • Slave Node’s MapReduce component is also known as “Task Tracker”.
  • Master Node's "Job Tracker" takes care of assigning tasks to the "Task Trackers" and receiving results from them.
  • Slave Node’s MapReduce component “Task Tracker” contains two MapReduce Tasks:
    1. Map Task
    2. Reduce Task
    We will discuss MapReduce tasks (Mapper and Reducer) in detail, with some simple end-to-end examples, in my coming post.
  • Slave Node's "Task Tracker" actually performs the client's tasks by using the MapReduce Batch Processing model.
  • Master Node is the primary node that takes care of all the remaining Slave Nodes (secondary nodes).
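One quick, hedged way to see this Master/Slave split on a running Hadoop 1.x cluster is the JDK's jps tool, which lists the Java daemons on each machine (the exact daemon list varies by installation; a Secondary NameNode often runs alongside the master daemons):
# on the Master Node, expect the HDFS and MapReduce master daemons,
# e.g. NameNode and JobTracker (plus SecondaryNameNode in many setups)
jps
# on a Slave Node, expect the worker daemons, e.g. DataNode and TaskTracker
jps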
Hadoop 1.x Components Detailed Architecture
[Image: Hadoop 1.x components detailed architecture]
Hadoop 1.x Architecture Description
  • Clients (one or more) submit their work to the Hadoop System.
  • When the Hadoop System receives a client request, it is first received by the Master Node.
  • Master Node's MapReduce component "Job Tracker" is responsible for receiving the client work, dividing it into manageable, independent tasks, and assigning them to Task Trackers.
  • Slave Node's MapReduce component "Task Tracker" receives those tasks from the "Job Tracker" and performs them by using the MapReduce components.
  • Once all Task Trackers finish their tasks, the Job Tracker collects those results and combines them into the final result.
  • Finally, the Hadoop System sends that final result back to the client.
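If you want to watch this flow on a Hadoop 1.x cluster, the classic job client can list the jobs the Job Tracker is running and show a job's progress while the Task Trackers work through it. A hedged sketch (the job ID below is a placeholder):
# list jobs currently tracked by the Job Tracker
hadoop job -list
# show the map/reduce completion status of a specific job
hadoop job -status job_201610280001_0001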

How Store and Compute Operations Work in Hadoop

All these Master and Slave Nodes are organized into a network of clusters. Each cluster is divided into racks, and each rack contains a set of nodes (commodity computers).
When the Hadoop system receives a "Store" operation, such as storing a large data set into HDFS, it stores that data on 3 different nodes (the Replication Factor is 3 by default, and it is configurable). The complete data is not stored on one single node: the large data file is divided into manageable blocks, and each block is distributed to different nodes with 3 copies.
When the Hadoop system receives a "Compute" operation, it talks to nearby nodes to retrieve the required blocks of data. If one or more nodes fail while reading data or computing, the system automatically picks up those tasks again using any nearby, available node that holds a copy.
That is why the Hadoop system provides highly available and fault-tolerant Big Data solutions.
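You can observe this behaviour from the command line: the replication factor can be adjusted per file, and fsck reports how each block is spread across nodes and whether any replicas are missing. A hedged sketch (paths are placeholders; on Hadoop 1.x use hadoop fs / hadoop fsck):
# store a file (it is split into blocks, each replicated 3 times by default)
hadoop fs -put bigdata.csv /user/hadoop/bigdata.csv
# explicitly set the replication factor of the file and wait for it to apply
hadoop fs -setrep -w 3 /user/hadoop/bigdata.csv
# report the file's blocks, their Data Node locations and replication health
hadoop fsck /user/hadoop/bigdata.csv -files -blocks -locations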
NOTE:-
  • The Hadoop 1.x Architecture has a lot of limitations and drawbacks, so the Hadoop community re-evaluated and redesigned it into the Hadoop 2.x Architecture.
  • The Hadoop 2.x Architecture is quite different and resolves the Hadoop 1.x Architecture's limitations and drawbacks.
That's all about the Hadoop 1.x Architecture, its major components, and how those components work together to fulfill client requirements. We will discuss "Hadoop 2.x Architecture, Major Components and How Those Components Work" in my coming post.
We hope you now have a good understanding of the Hadoop 1.x Architecture and how it works.

Thursday, October 27, 2016

HDFS Hands-on Lab

Download the Hortonworks Sandbox and install it on your laptop.

1. Open a terminal on your local machine, SSH into the sandbox:
ssh root@127.0.0.1 -p 2222
Note: If you're on VMware or Azure, insert the appropriate IP address in place of 127.0.0.1. Azure users will need to replace port 2222 with 22.
2. Copy and paste the commands to download the sf-salaries-2011-2013.csv and sf-salaries-2014.csv files. We will use them while we learn file management operations.
# download sf-salaries-2011-2013
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2011-2013.csv
# download sf-salaries-2014
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2014.csv
[Image: sf_salary_datasets]


STEP 1: CREATE A DIRECTORY IN HDFS, UPLOAD A FILE AND LIST CONTENTS 

Let's learn by writing the syntax. You will be able to copy and paste the following example commands into your terminal. Let's log in as the hdfs user, so we can give the root user permission to perform file operations:
su hdfs
cd
We will use the following syntax to run filesystem commands on HDFS, the Hadoop file system:
hdfs dfs [command_operation]
Refer to the File System Shell Guide to view various command_operations.

HDFS DFS -CHMOD:

  • Changes the permissions of a folder or file; it controls who has read/write/execute privileges.
  • We will give the root user access to read and write to the /user directory. Later we will perform an operation in which we send a file from our local filesystem to HDFS.
hdfs dfs -chmod 777 /user
  • Warning: in production environments, setting a folder to the permissions above is not a good idea, because anyone can then read/write/execute its files or folders.
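To confirm the change, you can list the root of HDFS and check the permission bits shown for /user (a quick check, not part of the original tutorial steps):
# the /user entry should now show permissions drwxrwxrwx
hdfs dfs -ls /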
Type the following command to switch back to the root user. We can perform the remaining file operations under the /user folder since the permissions were changed.
exit

HDFS DFS -MKDIR:

  • Takes path URIs as arguments and creates a directory or multiple directories.
# Usage:
        # hdfs dfs -mkdir <paths>
# Example:
        hdfs dfs -mkdir /user/hadoop
        hdfs dfs -mkdir /user/hadoop/sf-salaries-2011-2013 /user/hadoop/sf-salaries /user/hadoop/sf-salaries-2014
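If a parent directory does not exist yet, hdfs dfs -mkdir also accepts a -p flag on recent Hadoop releases (including the HDP sandbox), just like the Linux mkdir, to create missing parents in one step:
# create the directories and any missing parents in a single command
hdfs dfs -mkdir -p /user/hadoop/sf-salaries-2011-2013 /user/hadoop/sf-salaries /user/hadoop/sf-salaries-2014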

HDFS DFS -PUT:

  • Copies a single source file or multiple source files from the local file system to the Hadoop Distributed File System.
# Usage:
        # hdfs dfs -put <local-src> ... <HDFS_dest_path>
# Example:
        hdfs dfs -put sf-salaries-2011-2013.csv /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv
        hdfs dfs -put sf-salaries-2014.csv /user/hadoop/sf-salaries-2014/sf-salaries-2014.csv

HDFS DFS -LS:

  • Lists the contents of a directory
  • For a file, returns stats of a file
# Usage:  
        # hdfs dfs  -ls  <args>  
# Example:
        hdfs dfs -ls /user/hadoop
        hdfs dfs -ls /user/hadoop/sf-salaries-2011-2013
        hdfs dfs -ls /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv
[Image: list_folder_contents]

STEP 2: FIND OUT SPACE UTILIZATION IN AN HDFS DIRECTORY 

HDFS DFS -DU:

  • Displays the sizes of the files and directories contained in the given directory, or the size of a file if it's just a file.
# Usage:  
        # hdfs dfs -du URI
# Example:
        hdfs dfs -du  /user/hadoop/ /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv
[Image: displays_entity_size]
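On recent Hadoop releases, -du also accepts -s to summarize a directory into a single total and -h to print human-readable sizes:
# total size of everything under /user/hadoop, in human-readable units
hdfs dfs -du -s -h /user/hadoop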

STEP 3: DOWNLOAD FILE FROM HDFS TO LOCAL FILE SYSTEM 

HDFS DFS -GET:

  • Copies/Downloads files from HDFS to the local file system
# Usage:
        # hdfs dfs -get <hdfs_src> <localdst>
# Example:
        hdfs dfs -get /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv /home/
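You can verify that the download landed on the local filesystem with an ordinary ls (run in the sandbox shell, not against HDFS):
# the file should now be present under /home on the local filesystem
ls -lh /home/sf-salaries-2011-2013.csv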

STEP 4: EXPLORE TWO ADVANCED FEATURES 

HDFS DFS -GETMERGE

  • Takes a source directory, file, or files as input and concatenates the files in src into a single local destination file.
  • Concatenates files from the same directory or from multiple directories, as long as we specify their locations, and outputs them to the local file system, as can be seen in the Usage below.
  • Let's concatenate the San Francisco salaries from two separate directories and output them to our local filesystem. In the result, the salaries from 2014 are appended below the last row of 2011-2013.
# Usage:
        # hdfs dfs -getmerge <src> <localdst> [addnl]
        # hdfs dfs -getmerge <src1> <src2> <localdst> [addnl]
# Option:
        # addnl: can be set to add a newline at the end of each file
# Example:
        hdfs dfs -getmerge /user/hadoop/sf-salaries-2011-2013/ /user/hadoop/sf-salaries-2014/ /root/output.csv
Merges the files in sf-salaries-2011-2013 and sf-salaries-2014 into output.csv in /root (the root user's home directory) on the local filesystem. The first directory contained about 120,000 rows and the second almost 30,000 rows. This file operation is important because it saves you from having to concatenate the files manually.
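A quick way to sanity-check the merge is to count the rows of the combined file on the local filesystem; with roughly 120,000 rows from 2011-2013 and almost 30,000 from 2014, the total should be on the order of 150,000 lines (exact counts depend on headers and trailing newlines):
# count the lines in the merged file and peek at the first few rows
wc -l /root/output.csv
head -n 3 /root/output.csv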

HDFS DFS -CP:

  • Copies files or directories recursively; all of a directory's files and subdirectories down to the bottom of the directory tree are copied.
  • It can copy within a single cluster or between clusters; for very large inter/intra-cluster copies, the dedicated DistCp tool is usually preferred (see the sketch at the end of this step).
# Usage:
        # hdfs dfs -cp <src-url> <dest-url>
# Example:
        hdfs dfs -cp /user/hadoop/sf-salaries-2011-2013/ /user/hadoop/sf-salaries-2014/ /user/hadoop/sf-salaries
-cp: copies sf-salaries-2011-2013, sf-salaries-2014 and all their contents to sf-salaries
Verify the files or directories successfully copied to the destination folder:
hdfs dfs -ls /user/hadoop/sf-salaries/
hdfs dfs -ls /user/hadoop/sf-salaries/sf-salaries-2011-2013
hdfs dfs -ls /user/hadoop/sf-salaries/sf-salaries-2014
[Image: visual_result_of_distcp]
Visual result of the -cp file operation. Notice that both source directories and their contents were copied to the destination directory.
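As a side note, when you need to copy very large data sets within or between clusters, Hadoop ships a separate DistCp tool that runs the copy as a MapReduce job instead of through a single client. A minimal, hedged sketch (the NameNode addresses below are placeholders):
# distributed copy between two clusters; replace the hosts/ports with your own
hadoop distcp hdfs://namenode1:8020/user/hadoop/sf-salaries hdfs://namenode2:8020/user/hadoop/sf-salaries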

STEP 5: USE HELP COMMAND TO ACCESS HADOOP COMMAND MANUAL 

The help command lists the commands supported by the Hadoop Distributed File System (HDFS) shell.
# Example:  
        hdfs dfs  -help
[Image: hadoop_help_command_manual]
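You can also ask for help on a single command instead of the full listing; for example:
# show usage and options for the -ls command only
hdfs dfs -help ls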
Hope this short tutorial was useful to get the basics of file management.

SUMMARY 

Congratulations! We just learned to use commands to manage our sf-salaries-2011-2013.csv and sf-salaries-2014.csv dataset files in HDFS. We learned to create directories, upload files, and list the contents of our directories. We also acquired the skills to download files from HDFS to our local file system and explored a few advanced features of HDFS file management using the command line.