Getting started with Hadoop

Installation of Hadoop on Ubuntu

Creating a Hadoop group:

sudo addgroup hadoop

Adding a user:

sudo adduser --ingroup hadoop hduser001

Configuring SSH:

su - hduser001
ssh-keygen -t rsa -P ""
cat .ssh/id_rsa.pub >> .ssh/authorized_keys

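Note: If writing the authorized key fails with the error [bash: .ssh/authorized_keys: No such file or directory], check that the .ssh directory exists in the home directory of hduser001 and that the command is run from that home directory.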

Adding the Hadoop user to the sudoers list:

sudo adduser hduser001 sudo

Disabling IPv6:

Set the three disable_ipv6 entries in /etc/sysctl.conf, as listed in the pseudo-distributed setup procedure below.

Installing Hadoop:

sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get update
sudo apt-get install hadoop

Installation or Setup on Linux

A Pseudo Distributed Cluster Setup Procedure


  • Install JDK 1.7 and set the JAVA_HOME environment variable.

  • Create a new user named "hadoop".

    useradd hadoop

  • Set up password-less SSH login to its own account

     su - hadoop
     ssh-keygen
     << Press ENTER for all prompts >>
     cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     chmod 0600 ~/.ssh/authorized_keys
  • Verify by running ssh localhost

  • Disable IPv6 by adding the following lines to /etc/sysctl.conf:

     net.ipv6.conf.all.disable_ipv6 = 1
     net.ipv6.conf.default.disable_ipv6 = 1
     net.ipv6.conf.lo.disable_ipv6 = 1
  • Check the setting with cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    (should return 1 once the settings are applied, for example with sudo sysctl -p or after a reboot)
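
The prerequisites above can be checked from a script before moving on to the installation. The following is a minimal sketch, not part of Hadoop: the three function names are hypothetical, and the only paths used are the ones named in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given flag file contains 1.
# The default path is the one checked in the step above.
ipv6_disabled() {
    local flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Hypothetical helper: succeeds when JAVA_HOME is set and contains
# an executable bin/java.
java_home_ok() {
    [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]
}

# Hypothetical helper: succeeds when key-based login to localhost works
# without any prompt. BatchMode makes ssh fail instead of asking for input.
passwordless_ssh_ok() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null
}
```

A possible use is `ipv6_disabled && java_home_ok && passwordless_ssh_ok && echo "prerequisites look fine"`. Because of BatchMode, the SSH check also fails if the host key of localhost has not been accepted yet, so run ssh localhost by hand once first.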

Installation and Configuration:

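  • Download the required version of Hadoop from the Apache archives using wget.

     cd /opt/hadoop/
     # placeholder URL: substitute the archive address of the release you need
     wget http://addresstoarchive/hadoop-2.x.x/xxxxx.gz
     # extract the downloaded archive (named hadoop-2.x.x.gz here)
     tar -xvf hadoop-2.x.x.gz
     # either rename the extracted directory ...
     mv hadoop-2.x.x hadoop
     # ... or keep it and create a symlink instead (use one of the two, not both)
     # ln -s hadoop-2.x.x hadoop
     chown -R hadoop:hadoop hadoop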
  • Add the following environment variables to .bashrc or .kshrc, depending on your shell

      export HADOOP_PREFIX=/opt/hadoop/hadoop
      export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
      export JAVA_HOME=/java/home/path
  • In the $HADOOP_PREFIX/etc/hadoop directory (that is, $HADOOP_CONF_DIR), edit the following files

    • core-site.xml

    • mapred-site.xml

      Create mapred-site.xml from its template

      cp mapred-site.xml.template mapred-site.xml

    • yarn-site.xml

    • hdfs-site.xml


    Create the parent folder to store the Hadoop data

    mkdir -p /home/hadoop/hdfs
  • Format NameNode (cleans up the directory and creates necessary meta files)

    hdfs namenode -format
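  • Start all services:

    start-dfs.sh && start-yarn.sh && mr-jobhistory-daemon.sh start historyserver

    Alternatively, start-all.sh (deprecated) starts the HDFS and YARN daemons in one step.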

  • Check all running Java processes with jps

  • Namenode Web Interface: http://localhost:50070/

  • Resource manager Web Interface: http://localhost:8088/

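  • To stop the daemons (services):

    stop-dfs.sh && stop-yarn.sh && mr-jobhistory-daemon.sh stop historyserver

    Alternatively, stop-all.sh (deprecated) stops the HDFS and YARN daemons in one step.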
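
For a single-node setup, the four site files edited in the procedure above usually need only a handful of properties. The following is a minimal sketch rather than a definitive configuration. The function name is hypothetical, port 9000 and the namenode/datanode subfolder names are assumptions of this sketch, and the data folder defaults to the /home/hadoop/hdfs folder created above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): writes minimal single-node site
# files into the directory given as $1, for example "$HADOOP_CONF_DIR".
# $2 is the data folder; it defaults to the one created with mkdir above.
write_pseudo_distributed_conf() {
    local conf_dir="$1"
    local data_dir="${2:-/home/hadoop/hdfs}"
    if [ ! -d "$conf_dir" ]; then
        echo "no such directory: $conf_dir" >&2
        return 1
    fi

    # core-site.xml: where clients find the default file system (the NameNode).
    # Port 9000 is an assumption of this sketch.
    cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # hdfs-site.xml: one replica per block (there is only one DataNode)
    # and the local folders used by the NameNode and the DataNode.
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$data_dir/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$data_dir/datanode</value>
  </property>
</configuration>
EOF

    # mapred-site.xml: run MapReduce jobs on YARN.
    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

    # yarn-site.xml: enable the shuffle service that MapReduce needs.
    cat > "$conf_dir/yarn-site.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```

Calling `write_pseudo_distributed_conf "$HADOOP_CONF_DIR"` overwrites any existing copies of the four files, so back them up first if they have already been edited. The files must be in place before the NameNode is formatted and the services are started.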

Hadoop overview and HDFS

  • Hadoop is an open-source software framework for the storage and large-scale processing of data sets in a distributed computing environment.
  • It is sponsored by the Apache Software Foundation.
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.


  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
  • Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
  • It was originally developed to support distribution for the Nutch search engine project.

Major modules of Hadoop

  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Hadoop File System Basic Features

  • Highly fault-tolerant.
  • High throughput.
  • Suitable for applications with large data sets.
  • Can be built out of commodity hardware.

Namenode and Datanodes

  • Master/slave architecture.
  • An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
  • The DataNodes manage storage attached to the nodes that they run on.
  • HDFS exposes a file system namespace and allows user data to be stored in files.
  • A file is split into one or more blocks, and these blocks are stored on a set of DataNodes.
  • DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.
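
On a running cluster, the split between NameNode metadata and DataNode storage described above can be inspected from the command line. The following is a minimal sketch: the two function names and the HDFS_CMD variable are hypothetical conveniences, while dfsadmin -report and fsck with -files -blocks -locations are subcommands of the hdfs launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting how HDFS has laid out data.
# HDFS_CMD is an assumption of this sketch so the launcher can be swapped out;
# it defaults to the hdfs command on the PATH.

# Asks the NameNode for the DataNodes it knows about and their capacity.
hdfs_datanode_report() {
    "${HDFS_CMD:-hdfs}" dfsadmin -report
}

# Shows the blocks of the given HDFS path and the DataNodes holding each one.
hdfs_block_locations() {
    local hdfs_path="$1"
    "${HDFS_CMD:-hdfs}" fsck "$hdfs_path" -files -blocks -locations
}
```

For example, `hdfs_block_locations /user/hadoop` prints one entry per block under that directory. The path is only an illustration; any existing HDFS path works.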

  • HDFS is designed to store very large files across machines in a large cluster.
  • Each file is a sequence of blocks.
  • All blocks in the file except the last are of the same size.
  • Blocks are replicated for fault tolerance.
  • The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
  • A BlockReport lists all the blocks on a DataNode.
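
The block layout described above comes down to simple arithmetic. The following is a minimal sketch with hypothetical function names. It assumes a block size of 128 MB and a replication factor of 3, the usual defaults in Hadoop 2.x, and both can be passed explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of HDFS blocks needed for a file.
# $1 = file size in bytes, $2 = block size in bytes (128 MB assumed by default).
hdfs_block_count() {
    local file_size="$1"
    local block_size="${2:-134217728}"
    # Ceiling division: every block is full except possibly the last one.
    echo $(( (file_size + block_size - 1) / block_size ))
}

# Hypothetical helper: raw bytes stored across the cluster for a file.
# $1 = file size in bytes, $2 = replication factor (3 assumed by default).
hdfs_raw_bytes() {
    local file_size="$1"
    local replication="${2:-3}"
    echo $(( file_size * replication ))
}
```

With these assumptions, a 300 MB file (314572800 bytes) needs 3 blocks: two full blocks of 128 MB and a last block of 44 MB. With three replicas of each block it occupies 943718400 bytes of raw storage in total.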

Hadoop Shell Commands

  • Commonly used commands:
    1. ls Usage: hadoop fs -ls Path (directory or file path to list).
    2. cat Usage: hadoop fs -cat PathOfFileToView
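
The two commands above are typically used together with -mkdir and -put when getting a first file into HDFS. The following is a minimal sketch: the function name and the HADOOP_CMD variable are hypothetical conveniences, and the paths in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copies a local file into HDFS, lists the target
# directory, then prints the file back.
# HADOOP_CMD is an assumption of this sketch so the launcher can be swapped
# out; it defaults to the hadoop command on the PATH.
hdfs_roundtrip() {
    local local_file="$1"
    local hdfs_dir="$2"
    local hadoop_cmd="${HADOOP_CMD:-hadoop}"

    # Each step runs only if the previous one succeeded.
    "$hadoop_cmd" fs -mkdir -p "$hdfs_dir" &&
    "$hadoop_cmd" fs -put "$local_file" "$hdfs_dir/" &&
    "$hadoop_cmd" fs -ls "$hdfs_dir" &&
    "$hadoop_cmd" fs -cat "$hdfs_dir/$(basename "$local_file")"
}
```

For example, `hdfs_roundtrip ./notes.txt /user/hadoop/demo` uploads notes.txt, lists /user/hadoop/demo, and prints the file contents. The -put step fails if a file of the same name already exists in the target directory.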

The full list of shell commands is in the File System Shell guide of the Apache Hadoop documentation.