
Getting Started with Big Data on Hadoop: Part I

– Narayana Murthy Pola, Sr. Project Manager, DST India

“Big Data” technologies increasingly touch every walk of human life, and few aspects of it are left untouched by these ever-evolving technologies. In this article series, we start with the various components of a Big Data solution and get into their nuts and bolts without overwhelming the reader.

Big Data Lifecycle
A typical Big Data solution can be represented with the lifecycle diagram shown in Figure 1.
Hadoop forms one of the key components of a Big Data solution. Many times, Hadoop and Big Data are used synonymously. In fact, the success and popularity of the Hadoop framework drove the growth of Big Data technologies and of the myriad applications that use them.

Doug Cutting, together with Mike Cafarella, is credited with creating Hadoop while solving search engine problems, and much of its early development took place at Yahoo. The Hadoop ecosystem comprises various components that help immensely in solving complex Big Data scenarios.
Figure 2 provides a schematic representation of the Hadoop Ecosystem.

Hadoop is most commonly defined as a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

MapReduce: This is a programming paradigm introduced by Google for parallel processing of large data sets. The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform. The first is the map task, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task takes the output of a map as its input and combines those tuples into a smaller set of tuples.
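To make the two tasks concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It only illustrates the paradigm and is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the grouping step stands in for the work the framework does between map and reduce on a real cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return (key, sum(values))


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on hadoop", "hadoop stores big data"]))
```

On a cluster, many map tasks would run in parallel on different blocks of the input, and the reduce tasks would each receive the grouped values for a subset of the keys.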

Hadoop Distributed File System (HDFS): This is the default storage layer of the Hadoop ecosystem and provides scalable and reliable data storage. HDFS is highly fault tolerant and is designed to handle large data sets and to be deployed on commodity hardware.

Yet Another Resource Negotiator (YARN): YARN is the heart of Hadoop, providing a centralized platform for resource management that assigns CPU and memory to applications running on a Hadoop cluster. It also enables application frameworks other than MapReduce to run on Hadoop.

Oozie: Apache Oozie is used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center.

Ambari: A web-based tool for provisioning, managing, and monitoring Hadoop clusters and also integrating Hadoop with other existing infrastructure. Ambari also provides a dashboard for viewing cluster health and applications visually along with features to diagnose their performance characteristics.

ZooKeeper: ZooKeeper provides operational services for a Hadoop cluster: a distributed configuration service, a synchronization service and a naming registry for distributed systems.

Avro: Avro provides data serialization and data exchange services for Hadoop. This helps programs written in different languages exchange (big) data. Using the serialization service, programs can efficiently serialize data into files or into messages. The data storage is compact and efficient. Avro stores both the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.
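The idea of keeping the data definition and the data in the same file can be sketched with the Python standard library alone. This is only an illustration of a self-describing file: it uses plain JSON lines rather than Avro's binary format or the Avro library API, and the helper names (write_container, read_container) and the sample record are hypothetical. The schema dictionary follows the general shape of an Avro record schema.

```python
import io
import json


def write_container(stream, schema, records):
    """Write the schema first, then one record per line."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_container(stream):
    """Read the schema back from the stream itself, then the records."""
    schema = json.loads(stream.readline())
    field_names = [field["name"] for field in schema["fields"]]
    records = [json.loads(line) for line in stream if line.strip()]
    return field_names, records


# Hypothetical schema and data, for illustration only.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

buffer = io.StringIO()
write_container(buffer, schema, [{"name": "Asha", "age": 31}])
buffer.seek(0)

# The reader needs no prior knowledge of the fields.
fields, rows = read_container(buffer)
print(fields, rows)
```

Because the definition travels with the data, a reader written in another language could discover the field names and types in the same way.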

Pig: Pig helps developers focus on the actual data processing using a high-level scripting language called Pig Latin, which offers SQL-like operations. Pig Latin allows developers to write a data flow that describes how the data will be transformed.

Hive: Hive allows developers to explore and analyze data using the Hive Query Language (HiveQL), which has SQL-like commands. It is mostly used for ad hoc querying of data stored in a Hadoop cluster.

HBase: HBase is a column-oriented database management system that runs on top of HDFS. It supports structured data storage and scales to tables with billions of rows.

Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (continuous streams of data such as tweets) into the Hadoop Distributed File System (HDFS).

Spark: Spark (one of the latest tools in the Hadoop ecosystem) is an in-memory processing engine for Hadoop data. Its developers claim processing speeds up to 100 times faster than Hadoop MapReduce for workloads that fit in memory.

Mahout: Mahout is a machine learning and data mining library implemented on top of Hadoop.

Sqoop: Sqoop supports the transfer of data between Hadoop and structured data stores such as relational databases.

With this brief introduction, we shall explore these tools in more detail in the coming months.
