Hadoop installation tutorial

The purpose of this post is to explain how to install Hadoop on your computer. It assumes you have a Linux-based system available; I am doing this on an Ubuntu system.

If you want to know how to install the latest version, Hadoop 2.0, then see the Hadoop 2.0 Install Tutorial.

Before you begin, create a separate user named hadoop on the system and perform all of these operations as that user.

This document covers the steps to
1) Configure SSH
2) Install JDK
3) Install Hadoop

Update your package lists
#sudo apt-get update

You can copy the commands directly from this post and run them on your system.

Hadoop requires that the various systems in a cluster can talk to each other freely. Hadoop uses SSH to prove identity for these connections.

Let's download and configure SSH

#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown `whoami` $HOME/.ssh/authorized_keys

Testing your SSH

#ssh localhost
When prompted, say yes

It should open an SSH connection
#exit
This will close the SSH session
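If you want to confirm that key-based login works non-interactively, a quick optional check is the following (BatchMode tells ssh to fail rather than prompt, so a lingering password prompt shows up as an error):

#ssh -o BatchMode=yes localhost echo ok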

Java 1.6 is mandatory for running Hadoop

Let's download and install the JDK

#sudo mkdir /usr/java
#cd /usr/java
#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin

Wait until the JDK download completes
Install Java
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin
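As a quick sanity check, the unpacked JDK should report its version (this assumes the installer extracted to jdk1.6.0_31 under /usr/java, which is what the 6u31 installer does):

#/usr/java/jdk1.6.0_31/bin/java -version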

Now comes Hadoop :)

Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.

Download the latest Hadoop 1.0.x tar.gz release from the Hadoop website:

http://hadoop.apache.org/common/releases.html

Extract it into some folder (say /home/hadoop/software/); in this post all software has been downloaded to that location.
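For example, assuming the tarball was saved as hadoop-1.0.1.tar.gz in your home directory (adjust the path to match your download):

#mkdir -p /home/hadoop/software
#tar -xzf ~/hadoop-1.0.1.tar.gz -C /home/hadoop/software/

This creates /home/hadoop/software/hadoop-1.0.1, the path used as HADOOP_INSTALL later in this post.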

For the other modes (standalone and fully distributed), please see the Hadoop documentation

Go to the conf directory in the Hadoop folder, open core-site.xml, and add the following property between the empty configuration tags

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>

Similarly, do this for

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Environment variables

In the conf/hadoop-env.sh file, change JAVA_HOME to the location where you installed Java
e.g.
export JAVA_HOME=/usr/java/jdk1.6.0_31
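If you prefer doing this from the shell instead of an editor, one way (assuming Hadoop is extracted at /home/hadoop/software/hadoop-1.0.1) is to append the line:

#echo 'export JAVA_HOME=/usr/java/jdk1.6.0_31' >> /home/hadoop/software/hadoop-1.0.1/conf/hadoop-env.sh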

Configure the environment variables for the JDK and Hadoop as follows

Open the ~/.profile file in the current user's home directory and add the following

You can change the variable paths if you have installed Hadoop and Java at other locations

export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
export PATH=$PATH:$HADOOP_INSTALL/bin
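Reload the profile so the current shell picks up the new variables, and verify that the hadoop command is found (hadoop version prints the installed release):

#source ~/.profile
#hadoop version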

Testing your installation
Format the HDFS
# hadoop namenode -format

hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out
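To confirm all five daemons came up, run jps (it ships with the JDK); in pseudo-distributed mode you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker listed:

#jps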

Open a browser and point it to

localhost:50030 (JobTracker status)
localhost:50070 (NameNode status)

These pages show the status of your Hadoop daemons
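As a final smoke test you can list the root of HDFS; if this returns without a connection error, the NameNode is up and answering:

#hadoop fs -ls /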

That's it, this completes the installation of Hadoop. Now you are ready to play with it.

Comments:

  1. This guide made me realize I left out part of a config file, thanks.

    Is there a guide to just using HDFS as a distributed file system, like a replacement for NFS, AFS, or Gluster?

  2. Hello Genewitch

    Thank you for your comment.

    There is a very interesting discussion on standalone HDFS.

    Just go through the mailing list

    http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201102.mbox/%3CAANLkTi=+Wic=e4uj3vpHrihctr7Uu84uh8YbS1fuXccw@mail.gmail.com%3E

  3. Hi,

    I have a two-node cluster; rsi1 and rsi2 are the hostnames of the two machines.

    What should the values of fs.default.name and mapred.job.tracker be on both federated namenodes?

    I want to make both nodes federated nodes.

    Appreciate your reply.

    Rashmi

  4. How can we install Hive? Can you please guide me?

    1. Please download the Hive tarball from the Apache website

      Extract it to some place

      In the conf file (hive-default.xml) specify the following parameter

      You can create (or find) this xml in the conf folder

      <property>
        <name>mapred.job.tracker</name>
        <value>JobTrackerIP:8021</value>
      </property>

      Also read one of the posts on my blog to configure a Hive MySQL metastore instead of the default Derby DB

      http://jugnu-life.blogspot.com.au/2012/05/hive-mysql-setup-configuration.html

      Thanks

    2. So here we will add mapred.job.tracker and JobTrackerIP:8021 as a property inside hive-default.xml, so my new property will look like mapred.job.tracker set to JobTrackerIP:8021... then I will restart Hive and Hadoop... please reply sir, it's urgent for me now

  5. hive> show tables;
    FAILED: Error in metadata: MetaException(message:Got exception: java.net.ConnectException Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

    Why am I getting this error? Sir, can you help me with this?

  6. How do I know which node is my datanode and which is my namenode?

    1. Do jps on the nodes; you can see where the namenode service is running.


Please share your views and comments below.

Thank You.