Hadoop Introduction

From WikiOD

Remarks

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Reference:

Apache Hadoop

Versions

Version        Release Date
3.0.0-alpha1   2016-08-30
2.7.3          2016-08-25
2.6.4          2016-02-11
2.7.2          2016-01-25
2.6.3          2015-12-17
2.6.2          2015-10-28
2.7.1          2015-07-06

Installation of Hadoop on Ubuntu

Creating a Hadoop group:

sudo addgroup hadoop

Adding a user:

sudo adduser --ingroup hadoop hduser001


Configuring SSH:

su - hduser001
ssh-keygen -t rsa -P ""
cat .ssh/id_rsa.pub >> .ssh/authorized_keys

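Note: The error [bash: .ssh/authorized_keys: No such file or directory] while writing the authorized key means that the .ssh directory does not exist relative to the current directory. Run the command from the user's home directory, after the key pair has been generated.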


Add the Hadoop user to the sudoers list:

sudo adduser hduser001 sudo


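Disabling IPv6:

Add the following lines to /etc/sysctl.conf (root privileges are required):

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1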

Installing Hadoop:

sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get update
sudo apt-get install hadoop


Installation or Setup on Linux

A Pseudo-Distributed Cluster Setup Procedure


Install JDK 1.7 and set the JAVA_HOME environment variable.

Create a new user named "hadoop".

useradd hadoop

Set up password-less SSH login to the user's own account:

 su - hadoop
 ssh-keygen -t rsa
 << Press ENTER for all prompts >>
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 chmod 0600 ~/.ssh/authorized_keys

Verify by running ssh localhost; it should log in without asking for a password.
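
The verification can also be scripted. The sketch below relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password. The function name check_passwordless_ssh is made up for illustration. The check also fails if the host key of the target has not been accepted yet, so connect once interactively first.

```shell
# Hypothetical helper (not part of Hadoop): succeeds only if key-based
# login to the given host works without any interactive prompt.
check_passwordless_ssh() {
    # BatchMode=yes disables password prompts, so a missing key is an error
    if ssh -o BatchMode=yes "${1:-localhost}" true; then
        echo "password-less SSH to ${1:-localhost} works"
    else
        echo "password-less SSH to ${1:-localhost} failed" >&2
        return 1
    fi
}
```

Run it as check_passwordless_ssh localhost while logged in as the hadoop user.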

Disable IPv6 by adding the following lines to /etc/sysctl.conf:

 net.ipv6.conf.all.disable_ipv6 = 1
 net.ipv6.conf.default.disable_ipv6 = 1
 net.ipv6.conf.lo.disable_ipv6 = 1

Apply the change with sudo sysctl -p (or reboot), then check the setting with cat /proc/sys/net/ipv6/conf/all/disable_ipv6

(should return 1)
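
The same check can be wrapped in a small function, for example as part of a setup script. This is a minimal sketch: the function name ipv6_disabled and its optional path argument are illustrative assumptions, and the default path is the one used in the check above.

```shell
# Hypothetical helper: reads a disable_ipv6 flag file and reports its state.
# Returns 0 if IPv6 is disabled, 1 if it is enabled, 2 if the file is unreadable.
ipv6_disabled() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    [ -r "$flag_file" ] || { echo "cannot read $flag_file" >&2; return 2; }
    # The kernel exposes the setting as a single character: 1 means disabled
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "IPv6 is disabled"
    else
        echo "IPv6 is still enabled"
        return 1
    fi
}
```

Calling ipv6_disabled without arguments inspects the live kernel setting.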

Installation and Configuration:

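Download the required version of Hadoop from the Apache archives using the wget command.

 cd /opt/hadoop/
 # Replace the placeholder URL with the address of the release archive
 wget http://addresstoarchive/hadoop-2.x.x/xxxxx.gz
 # Extract the downloaded archive; this creates the hadoop-2.x.x directory
 tar -xvf xxxxx.gz
 # Point a version-independent "hadoop" link at the extracted directory
 ln -s hadoop-2.x.x hadoop
 # Hand the extracted tree over to the hadoop user
 chown -R hadoop:hadoop hadoop-2.x.x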

Add the following environment variables to .bashrc or .kshrc, depending on your shell:

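  export HADOOP_PREFIX=/opt/hadoop/hadoop
  export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
  export JAVA_HOME=/java/home/path
  # Make the hdfs command and the start/stop scripts available by name
  export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin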
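
After reloading the shell profile, it can be worth confirming that the variables are actually set before continuing. The following is a minimal sketch; the function name check_hadoop_env is an assumption made for illustration.

```shell
# Hypothetical helper: verifies that the variables exported above are set.
# Prints one message per missing variable and returns 1 if any is missing.
check_hadoop_env() {
    rc=0
    for name in HADOOP_PREFIX HADOOP_CONF_DIR JAVA_HOME; do
        # Indirect lookup of the variable called $name (POSIX-compatible)
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_hadoop_env in a new login shell shows whether the profile changes took effect.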

In the $HADOOP_PREFIX/etc/hadoop directory (that is, $HADOOP_CONF_DIR), edit the site configuration files.
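
As an illustration only, the sketch below writes minimal core-site.xml and hdfs-site.xml files of the kind the official single-node setup guide uses for pseudo-distributed mode. The values (hdfs://localhost:9000 and a replication factor of 1) and the function name write_hdfs_conf are assumptions, and may differ from what this procedure originally specified. Existing files of the same name are overwritten.

```shell
# Hypothetical helper: writes minimal pseudo-distributed HDFS settings
# into a configuration directory (defaults to $HADOOP_CONF_DIR).
write_hdfs_conf() {
    conf_dir="${1:-${HADOOP_CONF_DIR:-}}"
    [ -d "$conf_dir" ] || { echo "no such directory: $conf_dir" >&2; return 1; }
    # fs.defaultFS: address clients use to reach the NameNode (assumed port 9000)
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
    # dfs.replication: keep one copy per block, since there is a single DataNode
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

With the environment variables from the previous step in place, write_hdfs_conf can be called without arguments.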




Create mapred-site.xml from its template

cp mapred-site.xml.template mapred-site.xml
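
To run MapReduce jobs on YARN, the official single-node setup guide sets mapreduce.framework.name in mapred-site.xml and yarn.nodemanager.aux-services in yarn-site.xml. The sketch below writes both files with only those two properties. The function name write_yarn_conf is an assumption, and the function replaces the file copied from the template rather than editing it.

```shell
# Hypothetical helper: writes minimal MapReduce-on-YARN settings into a
# configuration directory (defaults to $HADOOP_CONF_DIR).
write_yarn_conf() {
    conf_dir="${1:-${HADOOP_CONF_DIR:-}}"
    [ -d "$conf_dir" ] || { echo "no such directory: $conf_dir" >&2; return 1; }
    # Submit MapReduce jobs to YARN instead of running them locally
    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
    # Enable the shuffle service that MapReduce needs on each NodeManager
    cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```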






Create the parent folder to store the Hadoop data:

mkdir -p /home/hadoop/hdfs

Format the NameNode (this cleans up the NameNode directory and creates the necessary metadata files):

hdfs namenode -format

Start all services:

start-dfs.sh && start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

Alternatively, use start-all.sh (deprecated).

Check all running Java processes:

jps
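
The JDK's jps tool prints one line per running Java process, in the form "<pid> <class name>". The sketch below reads that output and lists the daemons that the start commands above would be expected to launch but that are not running. The function name missing_daemons and the exact list of expected daemons are assumptions for a pseudo-distributed setup.

```shell
# Hypothetical helper: reads `jps` output on stdin and prints every
# expected Hadoop daemon that does not appear in it.
# Returns 0 if all daemons are present, 1 otherwise.
missing_daemons() {
    # Keep only the second column (the process name)
    running=$(awk '{print $2}')
    rc=0
    for daemon in NameNode DataNode SecondaryNameNode \
                  ResourceManager NodeManager JobHistoryServer; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
            echo "$daemon"
            rc=1
        fi
    done
    return "$rc"
}
```

Typical use is jps | missing_daemons, which prints nothing when every daemon is up.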


Namenode Web Interface: http://localhost:50070/

Resource manager Web Interface: http://localhost:8088/
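
Both interfaces can also be probed from the command line. This is a minimal sketch that assumes curl is installed; the function name check_web_ui is made up for illustration.

```shell
# Hypothetical helper: reports whether a web interface answers over HTTP.
check_web_ui() {
    # -f: fail on HTTP errors, -s: no progress output, -o /dev/null: drop the body
    if curl -fs -o /dev/null "$1"; then
        echo "UP   $1"
    else
        echo "DOWN $1"
        return 1
    fi
}
```

For example, check_web_ui http://localhost:50070/ probes the NameNode interface and check_web_ui http://localhost:8088/ probes the resource manager.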

To stop the daemons (services):

stop-dfs.sh && stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver

Alternatively, use stop-all.sh (deprecated).

Hadoop overview and HDFS



  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
  • It was originally developed to support distribution for the Nutch search engine project.

Major modules of Hadoop

Hadoop File System: Basic Features

Namenode and Datanodes


Hadoop Shell Commands

Common commands used:
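
As a rough illustration, the sketch below strings together a few frequently used file system shell commands (mkdir, put, ls, cat, rm) into one round trip. The function name hdfs_round_trip and the default target directory /user/hadoop/demo are assumptions; the hdfs dfs subcommands themselves are described in the file system shell documentation linked below.

```shell
# Hypothetical walk-through: uploads a local file to HDFS, lists and
# prints it, then removes it again. Stops at the first failing step.
hdfs_round_trip() {
    local_file="$1"
    hdfs_dir="${2:-/user/hadoop/demo}"   # assumed target directory
    name=$(basename "$local_file")
    # Create the target directory, including missing parents
    hdfs dfs -mkdir -p "$hdfs_dir" || return 1
    # Copy the local file into HDFS
    hdfs dfs -put "$local_file" "$hdfs_dir/" || return 1
    # List the directory contents
    hdfs dfs -ls "$hdfs_dir" || return 1
    # Print the uploaded file
    hdfs dfs -cat "$hdfs_dir/$name" || return 1
    # Delete the file from HDFS
    hdfs dfs -rm "$hdfs_dir/$name"
}
```

With the daemons running, hdfs_round_trip /etc/hostname exercises each command once.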


Link for Hadoop shell commands: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html