Hadoop Cluster configuration using Ansible
In this blog, we will configure a fully functional Hadoop cluster and start its services using Ansible.
Ansible is an open-source tool that offers a simple way to automate apps and IT infrastructure, covering application deployment, configuration management, continuous delivery and much more. You can learn more about Ansible and its architecture in my previous blog, linked below.
How Microsoft is using Ansible.
Automation has made setting up of infrastructure available on one’s fingertips. Before the age of automation and…
What is Hadoop?
Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications across scalable clusters of servers. It works on a master-slave architecture in which the master node (the name node) controls the flow of data to the cluster, while the slave nodes (the data nodes) are mainly responsible for providing storage resources to the cluster.
In this task, we need to set up the master node, slave node and client of the Hadoop cluster. To do so, we first need to set up the Ansible control node.
Steps to configure Ansible on the control node:
Step 1: The following commands install Ansible and check the version installed on your system. You also need to install sshpass, because Ansible relies on it for password-based SSH connections to the managed nodes.
#to install Ansible
pip3 install ansible
#to install sshpass
yum install sshpass
#to check the version of Ansible installed in the system
ansible --version
Step 2: Create an inventory file (text file) and add the details of your managed nodes in the format given below:
<ip address> ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh
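Now open the Ansible configuration file and provide the location of your inventory file by setting the inventory key under the [defaults] section. Note: Create the folder “/etc/ansible/” if it does not already exist.
#to open ansible configuration file (any text editor works)
vim /etc/ansible/ansible.cfg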
Step 3: Now that you’ve configured Ansible, check whether your managed nodes are connected and active.
#to check the list of managed nodes
ansible all --list-hosts
#to check the connectivity
ansible all -m ping
Now that we’ve configured the Ansible control node, let’s create the Ansible playbooks for the Hadoop cluster configuration.
First of all, we will create a playbook for the name node configuration:
#create a playbook for name node configuration
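As a rough illustration of what such a playbook can contain, here is a minimal sketch. It is not the exact playbook from this project. It assumes Hadoop 3.x and Java are already installed on the node, that hadoop-env.sh sets JAVA_HOME, and that the inventory has a group called namenode. The install path, metadata directory, IP address and port are all placeholders, and Hadoop 1.x uses different property names and start scripts.

```yaml
# Sketch only: every path, address, port and group name is an assumption.
- name: Configure and start the HDFS name node
  hosts: namenode                        # assumed inventory group
  vars:
    hadoop_home: /opt/hadoop             # assumed install location
    hadoop_conf_dir: "{{ hadoop_home }}/etc/hadoop"
    namenode_host: 192.0.2.10            # placeholder name node address
    namenode_port: 9000                  # placeholder port
    name_dir: /nn                        # placeholder metadata directory
  tasks:
    - name: Create the name node metadata directory
      ansible.builtin.file:
        path: "{{ name_dir }}"
        state: directory
        mode: "0755"

    - name: Write hdfs-site.xml with the metadata directory
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>{{ name_dir }}</value>
            </property>
          </configuration>

    - name: Write core-site.xml with the name node address
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ namenode_host }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    # "creates" skips this task once the directory has been formatted.
    - name: Format the name node
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs namenode -format -nonInteractive"
        creates: "{{ name_dir }}/current/VERSION"

    - name: Check whether the name node daemon is already running
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs --daemon status namenode"
      register: nn_status
      changed_when: false
      failed_when: false

    - name: Start the name node daemon if it is not running
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs --daemon start namenode"
      when: nn_status.rc != 0
```

The format task is guarded so that re-running the playbook does not wipe existing metadata, and the status check keeps the start task from failing when the daemon is already up.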
Now, we will create a playbook for the data node configuration:
#create a playbook for datanodes configuration
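A data node playbook follows the same pattern. The sketch below is again only an illustration under assumptions: Hadoop 3.x and Java already installed, an inventory group called datanode, and placeholder values for the install path, storage directory and name node address.

```yaml
# Sketch only: every path, address, port and group name is an assumption.
- name: Configure and start an HDFS data node
  hosts: datanode                        # assumed inventory group
  vars:
    hadoop_home: /opt/hadoop             # assumed install location
    hadoop_conf_dir: "{{ hadoop_home }}/etc/hadoop"
    namenode_host: 192.0.2.10            # placeholder name node address
    namenode_port: 9000                  # placeholder port
    data_dir: /dn                        # placeholder storage directory
  tasks:
    - name: Create the directory that is contributed to the cluster
      ansible.builtin.file:
        path: "{{ data_dir }}"
        state: directory
        mode: "0755"

    - name: Write hdfs-site.xml with the storage directory
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.datanode.data.dir</name>
              <value>{{ data_dir }}</value>
            </property>
          </configuration>

    - name: Write core-site.xml pointing at the name node
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ namenode_host }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    - name: Check whether the data node daemon is already running
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs --daemon status datanode"
      register: dn_status
      changed_when: false
      failed_when: false

    - name: Start the data node daemon if it is not running
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs --daemon start datanode"
      when: dn_status.rc != 0
```

Because the play targets a group, adding more data nodes only requires adding their entries to that group in the inventory.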
Finally, we will create a playbook for the client node:
#create a playbook for client node
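The client only needs to know where the name node is, so its playbook can be much shorter. This is a hedged sketch rather than the project's actual playbook: the client group name, install path and name node address are placeholders, and Hadoop 3.x with Java is assumed to be installed already.

```yaml
# Sketch only: every path, address, port and group name is an assumption.
- name: Configure the Hadoop client
  hosts: client                          # assumed inventory group
  vars:
    hadoop_home: /opt/hadoop             # assumed install location
    hadoop_conf_dir: "{{ hadoop_home }}/etc/hadoop"
    namenode_host: 192.0.2.10            # placeholder name node address
    namenode_port: 9000                  # placeholder port
  tasks:
    - name: Write core-site.xml pointing at the name node
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ namenode_host }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    # Read-only check, so it never reports a change.
    - name: List the HDFS root directory to confirm the client reaches the cluster
      ansible.builtin.command:
        cmd: "{{ hadoop_home }}/bin/hdfs dfs -ls /"
      register: hdfs_ls
      changed_when: false

    - name: Show the listing
      ansible.builtin.debug:
        var: hdfs_ls.stdout_lines
```

The final two tasks are optional. They only confirm that the client can talk to the name node once the rest of the cluster is up.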
Now that the playbooks have been created, we just need to run them in sequence: name node first, then data node, then client. Each playbook is run with the ansible-playbook command followed by the playbook file name:
Running name node playbook:
We can also check whether our name node service has started or not:
Here we can see that name node is running successfully.
Running data node playbook:
We can also check whether our data node is running by executing the jps command on the data node:
Since the jps output lists a DataNode process, the data node daemon is running successfully. A cluster report from the name node (for example, hadoop dfsadmin -report) would additionally confirm that the data node is connected to the cluster.
Finally, we can run the client playbook:
Since the name node, data node and client have been successfully configured, we can use the client to perform various actions in the newly created Hadoop storage cluster.
GitHub repository link:
This project helps to create a Hadoop storage cluster consisting of master, slave and client nodes using Ansible…