HADOOP: Contribute a limited/specific amount of storage as a slave to the cluster

In this task (4.1) we will discuss how a slave node (DataNode) can contribute a limited or specific amount of its storage to the Hadoop cluster.

Let's first briefly revisit the main concepts of Hadoop known so far:

Hadoop

Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license.

Hadoop was developed based on the paper written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and ranks among the top-level Apache projects. It was created by Doug Cutting and Michael J. Cafarella.

Features of Hadoop

Reliability

When machines work as a single unit and one of them fails, another machine takes over its responsibility, so the cluster keeps working in a reliable and fault-tolerant fashion. The Hadoop infrastructure has inbuilt fault-tolerance features, and hence Hadoop is highly reliable.

Economical

Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all your DataNodes can have normal configurations like 8–16 GB of RAM with 5–10 TB of hard disk and Xeon processors.

But if I had used hardware-based RAID with Oracle for the same purpose, I would have ended up spending at least 5x more. So, the cost of ownership of a Hadoop-based project is minimized. It is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is open-source software, and hence there is no licensing cost.

Scalability

Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor because you can go ahead and procure more hardware and expand your setup within minutes whenever required.

Flexibility

Hadoop is very flexible in terms of the ability to deal with all kinds of data. We discussed “Variety” in our previous blog on Big Data Tutorial, where data can be of any kind, and Hadoop can store and process them all, whether it is structured, semi-structured, or unstructured data.

These four characteristics make Hadoop a front-runner as a solution to Big Data challenges. Now that we know what Hadoop is, we can explore its core components.

NameNode

  • It is the master daemon that maintains and manages the DataNodes (slave nodes)
  • It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
  • It records every change that takes place to the file system metadata
  • If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
  • It keeps a record of all the blocks in HDFS and the DataNodes in which they are stored
  • It has high availability and federation features, which I will discuss in detail in the HDFS architecture (a minimal configuration sketch follows this list)
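
For reference, the location where the NameNode persists this metadata (the fsimage and the EditLog) is controlled by a property in hdfs-site.xml. This is a minimal sketch, not part of the original walkthrough; /nn is a purely illustrative path:

  # hdfs-site.xml on the NameNode (illustrative value):
  # <property>
  #   <name>dfs.namenode.name.dir</name>   <!-- dfs.name.dir on Hadoop 1.x -->
  #   <value>/nn</value>                   <!-- holds the fsimage and EditLog -->
  # </property>
  # On Hadoop 2.x+ the effective value can be checked with:
  hdfs getconf -confKey dfs.namenode.name.dir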

DataNode

  • It is the slave daemon which runs on each slave machine
  • The actual data is stored on DataNodes
  • It is responsible for serving read and write requests from the clients
  • It is also responsible for creating blocks, deleting blocks, and replicating the same based on the decisions taken by the NameNode
  • It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds (see the sketch after this list)
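
For reference, this 3-second interval is the default value of the dfs.heartbeat.interval property in hdfs-site.xml; a minimal sketch (the value shown is simply the default):

  # hdfs-site.xml (value is in seconds; 3 is the default):
  # <property>
  #   <name>dfs.heartbeat.interval</name>
  #   <value>3</value>
  # </property>
  # On Hadoop 2.x+ the effective value can be checked with:
  hdfs getconf -confKey dfs.heartbeat.interval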

Initially, when the two DataNodes were connected to the NameNode, the full size of their root volumes was shown as the contributed storage. Now let's see how a DataNode can contribute only a specific amount of storage to the NameNode.
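
A quick way to see this is the report command used later in this post; run on the NameNode (or any node with the client configured), it lists each DataNode along with its configured capacity:

  hadoop dfsadmin -report    # on newer releases: hdfs dfsadmin -report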

First, we need to create a volume, let's say a 1 GB volume, in the same environment where our cluster instance is running. Then attach the newly created volume to one of the DataNodes (slave nodes).
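
If the cluster is running on AWS EC2 (an assumption; any environment with attachable block storage works the same way), the volume can be created and attached roughly like this. The availability zone, volume ID, instance ID, and device name are placeholders:

  aws ec2 create-volume --size 1 --availability-zone ap-south-1a --volume-type gp2
  aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
      --instance-id i-0123456789abcdef0 --device /dev/xvdf
  # on the DataNode itself, confirm that the new disk is visible:
  lsblk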

As visible, our newly created 1 GB volume has been attached to this slave. The next step is to create a partition on the recently attached volume.

Here 'n' represents the creation of a new partition, 'p' prints the partition table, and 'w' writes the table to the disk and exits.
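
Assuming the attached volume shows up as /dev/xvdf (a placeholder device name), the interactive session with fdisk looks roughly like this:

  fdisk /dev/xvdf
  # at the fdisk prompt:
  #   n   -> create a new partition (accept the defaults for one full-size partition)
  #   p   -> print the partition table to verify it
  #   w   -> write the table to disk and exit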

As visible, the formatting is done successfully. The next step is to create a directory, mount the new partition on it, and then use this directory in place of the existing directory that the DataNode contributes to HDFS.
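
A sketch of the formatting step and of the steps just described, assuming the new partition is /dev/xvdf1, ext4 as the filesystem, and /dn1 as the DataNode directory (all of these names are illustrative):

  mkfs.ext4 /dev/xvdf1        # format the newly created partition
  mkdir /dn1                  # directory the DataNode will contribute to HDFS
  mount /dev/xvdf1 /dn1       # mount the 1 GB partition on that directory
  # then point the DataNode at this directory in hdfs-site.xml:
  # <property>
  #   <name>dfs.datanode.data.dir</name>   <!-- dfs.data.dir on Hadoop 1.x -->
  #   <value>/dn1</value>
  # </property>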

Now stop the DataNode daemon and start it once more; after entering hadoop dfsadmin -report we will be able to see the 1 GB of storage contributed by this DataNode in the NameNode's report.
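
A sketch of this last step, assuming the classic hadoop-daemon.sh scripts are used to manage the daemons:

  hadoop-daemon.sh stop datanode     # stop the DataNode on the slave
  hadoop-daemon.sh start datanode    # start it again so it picks up the new data directory
  hadoop dfsadmin -report            # the report should now show roughly 1 GB for this node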

Thank you for your patient reading! I hope I was able to deliver to the best of my knowledge.
