Getting started with Hadoop
Installing Hadoop on Ubuntu
Creating a Hadoop user:
sudo addgroup hadoop
Add a user:
sudo adduser --ingroup hadoop hduser001
Configuring SSH:
su - hduser001
ssh-keygen -t rsa -P ""
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
Note: If you get the error [bash: .ssh/authorized_keys: No such file or directory] while appending the authorized key, see the fix below.
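That error usually means the ~/.ssh directory does not exist yet; a common fix, assuming that is the cause:
mkdir -p ~/.ssh
chmod 700 ~/.ssh
Then re-run the cat command above.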
Add the hadoop user to the sudoers list:
sudo adduser hduser001 sudo
Disable IPv6:
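This step is not spelled out here; the same sysctl settings used in the Linux section below should work. Append these lines to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Then reboot, or run sudo sysctl -p to apply the change.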
Installing Hadoop:
sudo add-apt-repository ppa:hadoop-ubuntu/stable
sudo apt-get install hadoop
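If apt-get cannot find the hadoop package, the package index may be stale; refreshing it after adding the PPA usually helps:
sudo apt-get update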
Installation or Setup on Linux
Pseudo-distributed cluster setup procedure
Prerequisites
- Install JDK 1.7 and set the JAVA_HOME environment variable.
- Create a new user, e.g. "hadoop":
useradd hadoop
- Set up passwordless SSH login to the user's own account:
su - hadoop
ssh-keygen   << Press ENTER for all prompts >>
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
- Verify by running:
ssh localhost
- Disable IPv6 by editing /etc/sysctl.conf with the following:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
- Verify it using:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
(should return 1)
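Note that edits to /etc/sysctl.conf are not applied to a running system automatically; reload them (or reboot) before checking:
sudo sysctl -p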
Installation and configuration:
- Download the required version of Hadoop from the Apache archives using the wget command:
cd /opt/hadoop/
wget http:/addresstoarchive/hadoop-2.x.x/xxxxx.gz
tar -xvf hadoop-2.x.x.gz
mv hadoop-2.x.x hadoop
(or)
ln -s hadoop-2.x.x hadoop
chown -R hadoop:hadoop hadoop
- Update .bashrc/.kshrc, depending on your shell, with the following environment variables:
export HADOOP_PREFIX=/opt/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export JAVA_HOME=/java/home/path
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$JAVA_HOME/bin
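Reload the shell configuration so the new variables take effect in the current session (the file name depends on your shell):
source ~/.bashrc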
- In the $HADOOP_PREFIX/etc/hadoop directory, edit the files below.
- core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
- mapred-site.xml
Create mapred-site.xml from its template:
cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
- hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
Create the parent folder to store the Hadoop data:
mkdir -p /home/hadoop/hdfs
- Format the NameNode (this cleans up the directory and creates the necessary meta files):
hdfs namenode -format
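If the format succeeded, the name directory configured in hdfs-site.xml above should now contain a current/ folder with a VERSION file and an initial fsimage:
ls /home/hadoop/hdfs/namenode/current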
- Start all the services:
start-dfs.sh && start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Alternatively, start-all.sh can be used in place of start-dfs.sh && start-yarn.sh, but it is deprecated.
- Check all running Java processes:
jps
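On a healthy pseudo-distributed setup, jps should list roughly the following daemons (process IDs omitted; the exact set depends on which scripts you started):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer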
- NameNode web interface: http://localhost:50070/
- ResourceManager web interface: http://localhost:8088/
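To confirm from the shell that both web UIs are listening, a quick check against the default ports above:
curl -sI http://localhost:50070/ | head -n 1
curl -sI http://localhost:8088/ | head -n 1
Each should print an HTTP status line (200 or a redirect).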
- To stop the daemons (services):
stop-dfs.sh && stop-yarn.sh
mr-jobhistory-daemon.sh stop historyserver
Alternatively, stop-all.sh can be used, but it is deprecated.
Overview of Hadoop and HDFS
- Hadoop is an open-source software framework for storage and large-scale processing of data-sets in a distributed computing environment.
- It is sponsored by the Apache Software Foundation.
- It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
History
- Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
- Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
- It was originally developed to support distribution for the search engine project.
Main Hadoop modules
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
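To see both modules working together once the pseudo-distributed cluster from the previous section is up, you can run the word-count example that ships with Hadoop 2.x; a minimal sketch (the input/output paths and sample file are illustrative, and /output must not exist beforehand):
# Stage some input in HDFS
hadoop fs -mkdir -p /input
hadoop fs -put /etc/hosts /input/
# Submit the bundled example job (the jar version matches your release)
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
# Read the reducer output
hadoop fs -cat /output/part-r-00000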
Basic features of the Hadoop file system
- Highly fault-tolerant.
- High throughput.
- Suitable for applications with large data sets.
- Can be built out of commodity hardware.
NameNode and DataNode
- Master/slave architecture.
- An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
- The DataNodes manage storage attached to the nodes that they run on.
- HDFS exposes a file system namespace and allows user data to be stored in files.
- A file is split into one or more blocks, and those blocks are stored in DataNodes.
- DataNodes serve read and write requests, and perform block creation, deletion, and replication on instruction from the NameNode.
- HDFS is designed to store very large files across machines in a large cluster.
- Each file is a sequence of blocks.
- All blocks in the file except the last are of the same size.
- Blocks are replicated for fault tolerance.
- The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
- BlockReport contains all the blocks on a Datanode.
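You can inspect the block and replica layout described above on a live cluster with the HDFS fsck tool; for example, checking the whole namespace:
hdfs fsck / -files -blocks -locations
This prints, per file, its blocks, their sizes, and which DataNodes hold each replica.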
Hadoop shell commands
- Commonly used commands:
- ls Usage: hadoop fs -ls Path (the directory or file path to list).
- cat Usage: hadoop fs -cat PathOfFileToView
Link to the Hadoop shell commands reference: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
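A short worked example combining the two commands, assuming HDFS is running and notes.txt exists locally (the file name is illustrative):
hadoop fs -mkdir -p /user/hadoop          # create a home directory in HDFS
hadoop fs -put notes.txt /user/hadoop/    # copy a local file into HDFS
hadoop fs -ls /user/hadoop                # list the directory
hadoop fs -cat /user/hadoop/notes.txt     # print the file contents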