Hadoop Cluster configuration using Ansible

Akshansh Singh
Jan 6, 2021

In this blog, we will configure a fully functional Hadoop cluster and start the cluster services using Ansible.

Hadoop + Ansible

Ansible is an automation tool for apps and IT infrastructure that offers features such as application deployment, configuration management, continuous delivery and more. You can learn more about Ansible and its architecture in my previous blog mentioned below.

What is Hadoop?

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It works on a master-slave architecture in which the master node (the name node) keeps track of where data is stored and directs the flow of data into the cluster, while the slave nodes (the data nodes) mainly provide the storage resources.

In this task, we need to set up the master node, slave node and client of the Hadoop cluster. To do so, we first need to set up the control node for Ansible.

Steps to configure Ansible in Control node:

Step 1: The following commands install Ansible and check its version on your system. You also need to install sshpass, which Ansible uses for password-based SSH authentication.

#to install Ansible
pip3 install ansible
#to install sshpass
yum install sshpass
#to check the version of Ansible installed in the system
ansible --version

Step 2: Create an inventory file (text file) and add the details of your managed nodes in the format given below:

<ip address> ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh
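As an illustration of the format above, here is a minimal sketch of an inventory that puts each Hadoop role in its own group, so that a playbook can target it with "hosts:". The file name inventory.txt, the group names and the IP addresses (taken from the documentation-only range 192.0.2.0/24) are assumptions, not values from this setup.

```shell
# Write a sample inventory with one group per Hadoop role.
# Replace the placeholder IPs and <password> with your own values.
cat > inventory.txt <<'EOF'
[namenode]
192.0.2.10 ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh

[datanode]
192.0.2.11 ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh

[client]
192.0.2.12 ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh
EOF
```

Keeping passwords in plain text is only acceptable for a throwaway lab; SSH keys or Ansible Vault are the usual alternatives.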

Now open the Ansible configuration file and set the location of your inventory file in it. Note: Create the folder “/etc/ansible/” if it does not already exist.

#to open ansible configuration file
vim /etc/ansible/ansible.cfg
Configuration file
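A minimal configuration could look like the sketch below. The inventory path /root/inventory.txt is a hypothetical location; use wherever you saved your own file. The host_key_checking line is an addition that is usually needed with sshpass, because password-based SSH cannot answer the interactive host-key prompt.

```shell
# Write a minimal ansible.cfg in the current directory.
# Copy it to /etc/ansible/ansible.cfg to make it the system-wide default.
cat > ansible.cfg <<'EOF'
[defaults]
# Path to the inventory file created in Step 2 (hypothetical location)
inventory = /root/inventory.txt
# Do not stop at the SSH host-key prompt on first connection
host_key_checking = False
EOF
```

Ansible also picks up an ansible.cfg from the directory you run it in, which is handy for trying settings before changing the file under /etc/ansible/.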

Step 3: Now that you’ve configured Ansible, check whether your managed nodes are connected and active.

#to check the list of managed nodes
ansible all --list-hosts
#to check the connectivity
ansible all -m ping

Now that we’ve configured the Ansible controller node, let’s create the Ansible playbooks for the Hadoop cluster configuration.

Playbook Configuration:

First, we create a playbook for the name node configuration:

#create a playbook for name node configuration
vim namenode.yml
playbook for name node configuration
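As a rough idea of what a name node playbook can contain, here is a minimal sketch, not necessarily the exact playbook used here. It assumes Hadoop 1.x and a JDK are already installed on the node, that the Hadoop configuration lives in /etc/hadoop/, and that the inventory has a group called namenode. The directory /nn and port 9001 are hypothetical values. The file is written with a heredoc instead of vim so it can be pasted in one go.

```shell
# Write namenode.yml; the quoted 'EOF' keeps the shell from touching {{ ... }}.
cat > namenode.yml <<'EOF'
- hosts: namenode            # assumed inventory group name
  vars:
    nn_dir: /nn              # hypothetical metadata directory
    nn_port: 9001            # hypothetical name node port
  tasks:
    - name: Create the name node metadata directory
      file:
        path: "{{ nn_dir }}"
        state: directory

    - name: Point HDFS at the metadata directory
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ nn_dir }}</value>
            </property>
          </configuration>

    - name: Listen for data nodes and clients on all interfaces
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:{{ nn_port }}</value>
            </property>
          </configuration>

    - name: Format the name node on the first run only
      shell: echo Y | hadoop namenode -format
      args:
        creates: "{{ nn_dir }}/current"

    - name: Start the name node daemon unless it is already running
      shell: jps | grep -qw NameNode || hadoop-daemon.sh start namenode
EOF
```

The "creates" argument makes the format step safe to re-run: once the current directory exists inside the metadata directory, Ansible skips the task instead of wiping the name node again.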

Next, we create a playbook for the data node configuration:

#create a playbook for datanodes configuration
vim datanode.yml
playbook for data node configuration
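A data node playbook follows the same pattern. The sketch below is again only an illustration under assumptions: Hadoop 1.x and a JDK are already installed, the configuration directory is /etc/hadoop/, and the inventory has a group called datanode. The storage directory /dn, the name node address 192.0.2.10 and port 9001 are hypothetical placeholders.

```shell
# Write datanode.yml; the quoted 'EOF' keeps the shell from touching {{ ... }}.
cat > datanode.yml <<'EOF'
- hosts: datanode            # assumed inventory group name
  vars:
    dn_dir: /dn              # hypothetical storage directory
    namenode_ip: 192.0.2.10  # placeholder: IP address of your name node
    nn_port: 9001            # must match the port the name node listens on
  tasks:
    - name: Create the data node storage directory
      file:
        path: "{{ dn_dir }}"
        state: directory

    - name: Point HDFS at the storage directory
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>{{ dn_dir }}</value>
            </property>
          </configuration>

    - name: Tell the data node where the name node is
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ namenode_ip }}:{{ nn_port }}</value>
            </property>
          </configuration>

    - name: Start the data node daemon unless it is already running
      shell: jps | grep -qw DataNode || hadoop-daemon.sh start datanode
EOF
```

Because the play targets a whole group, adding more data nodes is just a matter of adding more lines under that group in the inventory.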

Finally, we create a playbook for the client node:

#create a playbook for client node
vim client.yml
playbook for client node configuration
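The client does not store data or run a daemon, so its playbook can be much shorter: it only needs to know where the name node is. A minimal sketch, assuming Hadoop 1.x is already installed on the client, the configuration directory is /etc/hadoop/, and the inventory has a group called client (the address 192.0.2.10 and port 9001 are placeholders):

```shell
# Write client.yml; the quoted 'EOF' keeps the shell from touching {{ ... }}.
cat > client.yml <<'EOF'
- hosts: client              # assumed inventory group name
  vars:
    namenode_ip: 192.0.2.10  # placeholder: IP address of your name node
    nn_port: 9001            # must match the port the name node listens on
  tasks:
    - name: Tell the client where the name node is
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ namenode_ip }}:{{ nn_port }}</value>
            </property>
          </configuration>
EOF
```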

Now that the playbooks are created, we just need to run them in order: name node first, then data nodes, then client. To run the playbooks, use the commands below:

Running name node playbook:

ansible-playbook namenode.yml

We can also check whether our name node service has started or not:

name node is running successfully

Here we can see that name node is running successfully.

Running data node playbook:

ansible-playbook datanode.yml

We can also check whether our data node is running successfully or not by running the following command in data node:

jps

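Since jps lists a DataNode process, the data node service is running on this system. To confirm that it has also joined the cluster, you can, for example, run hadoop dfsadmin -report on the name node and look for this node in the output.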

Finally, we can run the client playbook:

ansible-playbook client.yml

Since the name node, data node and client have been successfully configured, we can use the client to perform various actions in the newly created Hadoop storage cluster.
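For example, a quick smoke test from the client could look like the sketch below. The file name sample.txt is a hypothetical test file, and the check at the top is only there so the snippet does nothing harmful on a machine without Hadoop.

```shell
# Smoke test for the client node; only runs when the hadoop CLI is on PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfsadmin -report          # summary of data nodes and capacity
  echo "hello hdfs" > sample.txt   # hypothetical test file
  hadoop fs -put sample.txt /      # upload it to the HDFS root
  hadoop fs -ls /                  # list the root to confirm the upload
else
  echo "hadoop command not found; run this on the configured client node"
fi
```

If the report shows the data node and the listing shows the uploaded file, the client is talking to the cluster.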

GitHub repository link:

Thank you!!


Akshansh Singh

Final Year Undergrad from Indian Institute of Information Technology Ranchi interested in learning the ins and outs of Technology