Apache Hadoop is an open source distributed data processing framework. The core of Hadoop consists of HDFS (Hadoop Distributed File System), YARN, MapReduce, and the common utilities that support them. Hadoop also works alongside related projects such as ZooKeeper, Pig, and Flume, which extend and customize the functionality of the core. HDFS is a Java-based file system for storing and managing large data sets across cluster nodes, while the YARN framework handles cluster resource management, job scheduling, and task planning.
Apache Hadoop is a framework that facilitates massively distributed parallel processing of large data sets. Its modules can be distributed across hundreds or thousands of commodity servers. Hadoop was born out of the need for companies and industries to process data at massive scales and deliver web results faster. The Hadoop community has developed several different technologies that extend the Hadoop library. Here is a brief overview of some of these tools:
Much of Hadoop's scalability comes from MapReduce, a programming model that enables massively parallel processing of large, often unstructured data sets. A job runs in two distinct phases: the map phase, which transforms input records into intermediate key-value pairs, and the reduce phase, which aggregates those pairs into the final output; between the two, the framework shuffles and sorts the intermediate data by key. If a node fails or becomes unresponsive during processing, Hadoop isolates it and reassigns its tasks to another node.
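To make the model concrete, here is a minimal word-count job sketched against the standard org.apache.hadoop.mapreduce API; the input and output paths are supplied as command line arguments, and the surrounding cluster setup is assumed rather than shown.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the shuffle grouped under each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}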
Hadoop also includes application and storage components. Cluster nodes run the services that make up HDFS, the distributed file system: data is stored as blocks on DataNodes, and each block is replicated across multiple nodes, so the data survives the loss of any single machine and can be rebalanced across the cluster. In the event of a failure, the ApplicationMaster retries failed tasks, and if the ApplicationMaster itself dies, the ResourceManager attempts to restart the application; a node that stops responding is removed from the list of active nodes and its work is rescheduled elsewhere.
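The following sketch writes a small file to HDFS and reads it back through the org.apache.hadoop.fs.FileSystem API; the NameNode address and the file path are placeholders, not values from a real cluster.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");

            // Write a small file; HDFS replicates its blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back; the client asks the NameNode for block locations,
            // then streams the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}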
Another component of the ecosystem is Apache Hive, a data warehouse application for querying and transforming large data sets with a SQL-like language called HiveQL. Pig, by contrast, has its own data flow language called Pig Latin, which helps developers write complex MapReduce pipelines without coding in Java. Flume, meanwhile, is a Java-based distributed service for collecting and moving large volumes of log data, typically delivering it into HDFS; it is aimed at streaming event data rather than at importing structured, relational data sets.
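As an example of what the SQL-like interface looks like in practice, a client might run HiveQL through the standard HiveServer2 JDBC driver; the host name, credentials, table, and query below are illustrative assumptions, not values from a real deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (older JDKs require this explicitly).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host and database are placeholders.
        String url = "jdbc:hive2://hive.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into distributed jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}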
Big data applications tend to write data once and read it many times. Hadoop's assumption that a file, once created, is never changed simplifies the coherency model and enables high throughput. Architecturally, HDFS follows a master/worker design: a single NameNode maintains the file system metadata and namespace and regulates client access, while the DataNodes store the actual blocks and serve read and write requests.
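That split of responsibilities shows up in the client API: metadata questions such as file length and replication factor are answered by the NameNode, while the block data itself lives on DataNodes. A short sketch, again with a placeholder cluster address and path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt"));

            // File metadata (size, replication factor) is served by the NameNode.
            System.out.println("Length: " + status.getLen()
                    + ", replication: " + status.getReplication());

            // Each block is replicated on several DataNodes.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }
}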
Hadoop is a popular open source big data framework that makes use of commodity hardware and provides high availability, built-in failure detection, and fast recovery. While the framework itself is written in Java, it ships with native C libraries and shell scripts for command line management, and jobs can be written in languages such as Python through Hadoop Streaming. Other projects built on Hadoop include HBase, a distributed NoSQL database that stores its data in HDFS, and HCatalog, a table and storage management layer; tools such as Apache Ambari let users manage a Hadoop cluster from a dashboard.
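To illustrate one of these ecosystem projects, here is a minimal put-and-get against HBase using its standard Java client; the ZooKeeper quorum, table name, and column family are assumptions, and the table is expected to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk.example.com"); // placeholder ZooKeeper quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // assumed existing table

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}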