How HDFS and YARN Store and Process Big Data in Hadoop Ecosystem

When it comes to working with Hadoop, a sound knowledge of its ecosystem and the features of its components is every bit as important as familiarity with the core components themselves, because the two are closely connected in this open-source distributed framework.


Unlike a service or a programming language, the Hadoop Ecosystem is a framework or platform for resolving various issues related to big data. As a suite, it encompasses numerous services for the storage, ingestion, maintenance and analysis of colossal amounts of structured and unstructured data; these tasks would be virtually impossible to carry out with a traditional system.

While there are several components in the Hadoop Ecosystem, two of them, HDFS and YARN, occupy a special place of significance. The reasons for this are discussed in the sections that follow.

HDFS: HDFS is the acronym for Hadoop Distributed File System. As the central storage component of the Hadoop Ecosystem, it makes storing large amounts of data, structured as well as unstructured, simple and effortless.

Interestingly, HDFS adds a layer of abstraction over its resources, so that the entire distributed storage appears as a single unit. It also contributes in two other ways: apart from housing the user data across different nodes, it records the metadata needed to maintain that data in log files. This is made possible by its two primary components: the DataNode and the NameNode.

Storing vast amounts of data and processing it at the same time can be an uphill task in any framework. The DataNode and NameNode work in collaboration with one another to meet this challenge. The NameNode, which runs on hardware with higher computational resources, serves as the primary source of metadata, while the DataNodes, with their colossal storage resources, provide the space that holds the actual big data. DataNodes run on commodity hardware in the distributed environment, much like ordinary desktops or laptops. Together, these components account for Hadoop's success in providing a cost-effective solution for storing and retrieving unstructured, semi-structured and structured data.
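
To make this more concrete, here is a minimal sketch of how a client could write a file to HDFS and read it back through the Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholders chosen for illustration, not values prescribed by this article.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point fs.defaultFS at your own NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // hypothetical path

            // The client asks the NameNode for target DataNodes, then streams
            // the bytes to those DataNodes; the NameNode only records metadata.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Reading follows the reverse path: metadata from the NameNode,
            // the actual blocks from the DataNodes that hold them.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```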

When writing data, a client first communicates with the NameNode. The NameNode responds with the set of DataNodes on which the data should be placed, and the data is then not only stored but also replicated across those distinct DataNodes.
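
Building on that, the sketch below queries the NameNode's metadata for a file to show its replication factor and which DataNodes hold each block. The NameNode address and file path are again assumed placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());

            // Each block is reported along with the DataNodes holding a replica.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }
}
```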

YARN: YARN stands for Yet Another Resource Negotiator. If the entire Hadoop Ecosystem is considered as a body, YARN is its brain. Its primary tasks are to allocate resources and schedule jobs; in essence, it runs the processing activities of the cluster.

To manage the different resources and nodes, it consists of two core components: the ResourceManager and the NodeManager.

The ResourceManager is the main processing authority. After accepting a processing request, it forwards parts of it to the relevant NodeManagers according to the nature of the request, and in this way coordinates all processing-related tasks. To simplify this work, the ResourceManager contains two further components: the Scheduler and the ApplicationsManager.
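
As an illustration of the ResourceManager's cluster-wide view, the minimal sketch below uses the YarnClient API to print the number of registered NodeManagers and the queues managed by the Scheduler. The ResourceManager address used here is a placeholder.

```java
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterOverview {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        // Placeholder address: point the client at your own ResourceManager.
        conf.set("yarn.resourcemanager.address", "resourcemanager:8032");

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        try {
            // Cluster-wide metrics maintained by the ResourceManager.
            System.out.println("NodeManagers registered: "
                + yarn.getYarnClusterMetrics().getNumNodeManagers());

            // Queues are owned by the Scheduler inside the ResourceManager.
            for (QueueInfo queue : yarn.getRootQueueInfos()) {
                System.out.printf("queue=%s capacity=%.2f%n",
                    queue.getQueueName(), queue.getCapacity());
            }
        } finally {
            yarn.stop();
        }
    }
}
```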

A NodeManager runs on every DataNode (worker node) and executes the node-related tasks assigned to it.
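
Similarly, a client can ask the ResourceManager for a report on every running NodeManager. The following sketch lists each node's identifier, the number of containers it is running and its advertised capacity; as before, the ResourceManager address is only a placeholder.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodeList {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        conf.set("yarn.resourcemanager.address", "resourcemanager:8032"); // placeholder

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        try {
            // One NodeReport per NodeManager that has registered with the ResourceManager.
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.printf("node=%s containers=%d capacity=%s%n",
                    node.getNodeId(),
                    node.getNumContainers(),
                    node.getCapability());
            }
        } finally {
            yarn.stop();
        }
    }
}
```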
