Hadoop Installation on RHEL6/CentOS6

We concatenated the files to bring them close to, but less than, 64 MB, and the difference was huge. Without changing anything else we went from 214 minutes to 3 minutes!
— Elia Mazzaw

Hadoop Installation on CentOS 6.3

Prerequisite:

Red Hat Enterprise Linux 6 / CentOS 6 (it will work on RHEL5/CentOS5 too)

Download the Hadoop 1.0.4 tarball for your architecture from the Apache Hadoop releases page.

Hadoop is available for other Linux distributions too; choose as per your requirement. For Red Hat, CentOS and Novell SUSE you will need the RPM, and for Ubuntu there is a deb package as well. But we will look into the tarball installation so that every Linux distribution is treated equally.

Here I am using CentOS 6.3 Linux.

Other than the Hadoop tarball (hadoop-1.0.4.tar.gz) we require:

Java 1.6 (at least); 1.7 is better.

I am using JRE 1.6:

[root@hadoopmaster html]# java -version

java version "1.6.0_37"

Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) Client VM (build 20.12-b01, mixed mode, sharing)

Here we are discussing this installation for a two-node Hadoop cluster, so arrange two CentOS 6.3 Linux boxes (VMs are fine).

Set the hostname on both machines: hadoopmaster for server1 (the master) and hadoopslave for server2 (the slave).

Also update the /etc/hosts file on both nodes; it will look something like this:

[hadoop@hadoopmaster bin]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.56.103 hadoopmaster
192.168.56.104 hadoopslave

Other than this, set your hostname in the /etc/sysconfig/network file:

[hadoop@hadoopmaster bin]$ cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoopmaster

And HOSTNAME=hadoopslave on the other machine.

This sets up the hostname.
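
To apply the new hostname without a reboot, a quick sketch (run as root; the /etc/sysconfig/network entry makes it persistent across reboots):

#hostname hadoopmaster     (on the master)
#hostname hadoopslave      (on the slave)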

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

#adduser hadoop

#passwd hadoop    <-- set your password here

After creating the hadoop user, go to:

#cd /hadoop/hadoop-1.0.4/bin

#su hadoop

[hadoop@hadoopmaster bin]$

and add SSH keys for both nodes as the hadoop user.

The commands are:

#ssh-keygen

and

#ssh-copy-id

Cross-check by SSHing into both nodes; it should log in without asking for any password.
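
A minimal sketch of the whole key exchange, run as the hadoop user (accept the defaults and an empty passphrase; hostnames as configured above):

[hadoop@hadoopmaster ~]$ ssh-keygen -t rsa                  # generate the key pair
[hadoop@hadoopmaster ~]$ ssh-copy-id hadoop@hadoopmaster    # authorize the key locally as well
[hadoop@hadoopmaster ~]$ ssh-copy-id hadoop@hadoopslave     # authorize the key on the slave
[hadoop@hadoopmaster ~]$ ssh hadoopslave                    # should log in without a password prompt

Repeat the same steps on hadoopslave so both nodes can reach each other without passwords.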

Now create a directory:

#mkdir /hadoop

Put your downloaded Hadoop tarball here.

Extract it:

#tar -xvf hadoop-1.0.4.tar.gz

Do all the same steps on the other machine too, which is going to be our Hadoop slave.

Extract hadoop tar ball and go to :

#cd /hadoop/hadoop-1.0.4/conf/

Here we need to change the following files on the master node:

core-site.xml
hdfs-site.xml
mapred-site.xml
masters
slaves
hadoop-env.sh

The slave node needs the same configuration; the simplest way is to copy these files over from the master once they are ready (more on that in the slave section below).

1. So, let's start with the core-site.xml file; the <property> block shown below is the part that needs to be added:

[hadoop@hadoopmaster conf]$ cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoopmaster:54310</value>
    </property>
</configuration>

2. Then go to the hdfs-site.xml file:

[hadoop@hadoopmaster conf]$ cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

3. Next is the mapred-site.xml file:

[hadoop@hadoopmaster conf]$ cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>hadoopmaster</value>
    </property>
</configuration>

4. And the important one, hadoop-env.sh:

[hadoop@hadoopmaster conf]$ cat hadoop-env.sh
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr          # <<< only this line needs to be edited: point JAVA_HOME at your Java installation

# Extra Java CLASSPATH elements. Optional.
<<snip >>
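
If you are not sure what to set JAVA_HOME to, something like this helps locate it (a sketch; the exact path depends on how Java was installed on your box):

[hadoop@hadoopmaster conf]$ readlink -f $(which java)
(JAVA_HOME is the directory above the reported .../bin/java, e.g. /usr/java/jre1.6.0_37 for a hypothetical /usr/java/jre1.6.0_37/bin/java.)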

5. Go to the masters file and replace the localhost entry with your master's hostname:

[hadoop@hadoopmaster conf]$ cat masters
hadoopmaster

6. And finally the slaves file:

[hadoop@hadoopmaster conf]$ cat slaves
hadoopmaster
hadoopslave

Add both nodes' entries here.

On the slave node:

Copy over the same core-site.xml, hdfs-site.xml, mapred-site.xml and hadoop-env.sh from the master, so the slave knows where the NameNode and JobTracker live (the masters and slaves files are only read by the start scripts on the node you launch them from, i.e. the master).
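
A quick way to do that, as a sketch (assuming the same /hadoop/hadoop-1.0.4 path on both nodes and the passwordless SSH set up earlier):

[hadoop@hadoopmaster conf]$ scp core-site.xml hdfs-site.xml mapred-site.xml hadoop-env.sh hadoop@hadoopslave:/hadoop/hadoop-1.0.4/conf/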

So the configuration part is done. Now we need to run a few commands to get Hadoop up and running. :)

Start with the following:

[hadoop@hadoopmaster bin]$ ./hadoop version
Hadoop 1.0.4
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290
Compiled by hortonfo on Wed Oct 3 05:13:58 UTC 2012
From source with checksum xxxxxxxxxxxxxxxxxxxxxxxx

It might show a Java error; in that case, check the JAVA_HOME defined in conf/hadoop-env.sh.

After that

[hadoop@hadoopmaster bin]$ ./hadoop namenode -format
12/12/13 07:47:37 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoopmaster/192.168.56.103
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
12/12/13 07:47:37 INFO util.GSet: VM type = 32-bit
12/12/13 07:47:37 INFO util.GSet: 2% max memory = 19.33375 MB
12/12/13 07:47:37 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/12/13 07:47:37 INFO util.GSet: recommended=4194304, actual=4194304
12/12/13 07:47:37 INFO namenode.FSNamesystem: fsOwner=hadoop
12/12/13 07:47:37 INFO namenode.FSNamesystem: supergroup=supergroup
12/12/13 07:47:37 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/12/13 07:47:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/12/13 07:47:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/12/13 07:47:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/12/13 07:47:37 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/12/13 07:47:37 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/12/13 07:47:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoopmaster/192.168.56.103
************************************************************/

And then, finally:

[hadoop@hadoopmaster bin]$ ./start-dfs.sh
starting namenode, logging to /hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-namenode-hadoopmaster.out
hadoopsalve: ssh: Could not resolve hostname hadoopsalve: Temporary failure in name resolution
hadoopmaster: starting datanode, logging to /hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-hadoopmaster.out
hadoopmaster: starting secondarynamenode, logging to /hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out
[hadoop@hadoopmaster bin]$ ./start-mapred.sh
starting jobtracker, logging to /hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-jobtracker-hadoopmaster.out
hadoopsalve: ssh: Could not resolve hostname hadoopsalve: Temporary failure in name resolution
hadoopmaster: starting tasktracker, logging to /hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-hadoopmaster.out

Note: the "hadoopsalve: Could not resolve hostname" lines above came from a hostname typo in the slaves file during this particular run; with the name spelled correctly (hadoopslave), the DataNode and TaskTracker start on the slave node as well.

If everything is fine, then go to your browser and check the running Hadoop status there.

Further, you will have a web GUI to see your hard work of the last 30 minutes.

The NameNode URL would be: http://<masternodename/IP address>:50070 (the JobTracker UI runs on port 50030).
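
A quick sanity check from the shell, as a sketch (jps comes with a full JDK rather than the bare JRE, and the daemon names below are the standard Hadoop 1.x ones):

[hadoop@hadoopmaster bin]$ jps      # expect NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker here
[hadoop@hadoopslave bin]$ jps       # expect DataNode and TaskTracker on the slave
[hadoop@hadoopmaster bin]$ ./hadoop fs -put /etc/hosts /hosts.txt    # copy a small test file into HDFS
[hadoop@hadoopmaster bin]$ ./hadoop fs -ls /                         # the file should be listed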

Maybe I have forgotten a few things; please mention them in the comments below… Hope this is useful 🙂

!! Enjoy your day !!

Hadoop: What is Hadoop exactly? I've never heard about this…


So the very basic question that arises in our mind is: what exactly is this thing? Are we really talking about a technology, or have I landed somewhere else? Well, no worries, you are in the right place, and this blog will explain various things about Hadoop: what Hadoop exactly is, how it originated, how it is helping today's major internet giants accomplish their tasks and computing needs, and many more things, like how it works and its enormous capabilities.

Where did Hadoop come from?

So, the underlying technology of Hadoop was basically invented by Google for their bulk data processing and web crawling. These underlying technologies were MapReduce and GFS (Google File System). There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications; Hadoop is an open source project developed by Doug Cutting, an engineer at Yahoo.

What problems can Hadoop solve?

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

How is Hadoop architected?

Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, push those into a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set. I will publish a further article on how Map and Reduce actually work.

Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together, to complete your BIG job and process that large amount of data.

More

Yes, that's only a beginning. To work further with the Hadoop system, you need to be aware of the following terms:

1. HDFS

2. MapReduce

3. Hive

4. HBase

Basic working of Hadoop:

In a traditional non-distributed architecture, you'll have data stored on one server, and any client program will access this central data server to retrieve the data. The non-distributed model has a few fundamental issues. In this model, you mostly scale vertically by adding more CPU, more storage, etc. This architecture is also not reliable: if the main server fails, you have to go back to the backup to restore the data. From a performance point of view, this architecture will not return results quickly when you run a query against a huge data set.

In a Hadoop distributed architecture, both data and processing are distributed across multiple servers. The following are some of the key points to remember about Hadoop:

  • Each and every server offers local computation and storage, i.e., when you run a query against a large data set, every server in this distributed architecture executes the query on its local machine against its local data set. Finally, the result sets from all these local servers are consolidated.
  • In simple terms, instead of running a query on a single server, the query is split across multiple servers, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.
  • You don't need a powerful server. Just use several less expensive commodity servers as individual Hadoop nodes.
  • High fault tolerance. If any node fails in the Hadoop environment, it will still return the dataset properly, as Hadoop takes care of replicating and distributing the data efficiently across the multiple nodes.
  • A simple Hadoop implementation can use just two servers, but you can scale up to several thousand servers without any additional effort.
  • Hadoop is written in Java, so it can run on any platform.
  • Please keep in mind that Hadoop is not a replacement for your RDBMS. You'll typically use Hadoop for unstructured data.
  • Originally Google started using the distributed computing model based on GFS (Google File System) and MapReduce. Later Nutch (open source web search software) was rewritten using MapReduce. Hadoop was branched out of Nutch as a separate project. Now Hadoop is a top-level Apache project that has gained tremendous momentum and popularity in recent years.

HDFS

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop. At a high level, it works as described below.

The following are some of the key points to remember about HDFS:

  • There is one NameNode and multiple DataNodes (servers); the data blocks (b1, b2, …) live on the DataNodes.
  • When you dump a file (or data) into HDFS, it is stored in blocks on the various nodes in the Hadoop cluster. HDFS creates several replicas of the data blocks and distributes them across the cluster in a way that is reliable and allows fast retrieval. A typical HDFS block size is 64 MB or 128 MB (64 MB is the default in Hadoop 1.x). Each data block is replicated to multiple nodes across the cluster.
  • Hadoop internally makes sure that a node failure never results in data loss.
  • There is one NameNode that manages the file system metadata.
  • There are multiple DataNodes (these are the real cheap commodity servers) that store the data blocks.
  • When you execute a query from a client, it first reaches out to the NameNode to get the file metadata, and then reaches out to the DataNodes to read the actual data blocks.
  • Hadoop provides a command line interface for administrators to work with HDFS (see the short example after this list).
  • The NameNode comes with a built-in web server from which you can browse the HDFS filesystem and view some basic cluster statistics.
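
A few of those commands, as a sketch (all are standard Hadoop 1.x CLI calls, run from the bin directory of the installation described above):

[hadoop@hadoopmaster bin]$ ./hadoop fs -ls /              # list the HDFS root
[hadoop@hadoopmaster bin]$ ./hadoop fs -mkdir /data       # create a directory in HDFS
[hadoop@hadoopmaster bin]$ ./hadoop fsck /                # check filesystem health and block replication
[hadoop@hadoopmaster bin]$ ./hadoop dfsadmin -report      # per-DataNode capacity and usage report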

MapReduce


The following are a few of the key points to remember about MapReduce:

  • MapReduce is a parallel programming model that is used to process and retrieve data from the Hadoop cluster.
  • In this model, the library handles a lot of messy details that programmers don't need to worry about. For example, the library takes care of parallelization, fault tolerance, data distribution, load balancing, etc.
  • It splits the tasks and executes them on the various nodes in parallel, thus speeding up the computation and retrieving the required data from a huge dataset quickly.
  • It provides a clear abstraction for programmers. They just have to implement (or use) two functions: map and reduce (see the wordcount sketch after this list).
  • The data is fed into the map function as key/value pairs to produce intermediate key/value pairs.
  • Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output.
  • The JobTracker keeps track of all the MapReduce jobs running on the various nodes. It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes; if any of them fails, it reallocates the task to another node. In simple terms, the JobTracker is responsible for making sure that the query on a huge dataset runs successfully and the data is returned to the client reliably.
  • The TaskTracker performs the map and reduce tasks assigned by the JobTracker. The TaskTracker also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether or not to delegate a new task to that particular node.
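
To get a feel for map and reduce without writing any code, you can run the wordcount example that ships with the 1.0.4 tarball (a sketch; the input/output paths are arbitrary choices, and the examples jar sits in the top level of the extracted tarball from the installation above):

[hadoop@hadoopmaster bin]$ ./hadoop fs -mkdir /wc-in
[hadoop@hadoopmaster bin]$ ./hadoop fs -put /etc/hosts /wc-in/                          # any small text file will do
[hadoop@hadoopmaster bin]$ ./hadoop jar ../hadoop-examples-1.0.4.jar wordcount /wc-in /wc-out
[hadoop@hadoopmaster bin]$ ./hadoop fs -cat /wc-out/part-*                              # one "word<TAB>count" per line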

This was only a first look into Hadoop; many more terms are to come. Stay tuned for further Hadoop insights.

Take a look at the Apache Hadoop project home page.