
The Data Analysis Toolkit of Hadoop - Hive

Last month we looked at Pig, the data-flow engine in the Hadoop data toolkit. This month, let us take a closer look at Hive, one of the most widely used query engines on Hadoop.

PCQ Bureau

- Narayana Murthy Pola, Sr. Project Manager, DST India


Hive is open source data warehouse software for analyzing large datasets held in distributed storage. It is built on Hadoop and is closely integrated with HDFS (the Hadoop Distributed File System), as well as with other storage systems that integrate with Hadoop, such as HBase (the columnar storage engine in the Hadoop suite) and Cassandra.

Hive was conceived and developed by Facebook engineers to ease the learning curve for their analysts and bring the Hadoop ecosystem within reach of the larger analyst community. Hive provides an SQL dialect called Hive Query Language (HQL), similar to SQL, for querying data on a Hadoop cluster. Internally, Hive submits a MapReduce job to execute the query. A typical 'Hello World' MapReduce program of around 63 lines of Java code can be written in about 8 lines of HQL. This is an example of the simplicity and ease of analysis that Hive brings to the table.
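The classic 'Hello World' of MapReduce is word count, and a minimal sketch of it in HQL might look like the following (the table name docs, its single column line and the input path are invented here for illustration, not taken from the article):

CREATE TABLE docs (line STRING);

LOAD DATA INPATH '/user/demo/docs.txt' OVERWRITE INTO TABLE docs;

-- Split each line into words, then count the occurrences of each word
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;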

Narayana Murthy Pola, Sr. Project Manager, DST-India


“Hive was conceived and developed by Facebook engineers to ease the learning curve for their analysts and bring the Hadoop ecosystem within reach of the larger analyst community. Hive provides an SQL dialect called Hive Query Language (HQL), similar to SQL, for querying data on a Hadoop cluster.”

The Hive architecture

Hive sits on top of HDFS (the distributed storage layer) and a processing engine, YARN or MapReduce, above it. Hortonworks, a third-party distributor of Hadoop, developed Tez, a framework for writing native YARN applications that optimize data-processing workloads. If Tez is included in the distribution, it is embedded with Hive and can be represented in the scheme as in Fig. 1. The Hive Driver compiles, optimizes and executes HiveQL commands and queries, generally through MapReduce jobs.
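Where Tez is available, the engine used for a session can be chosen through a Hive setting. A minimal sketch, assuming a Tez-enabled distribution (the table name sales is hypothetical):

-- Run the following queries on Tez instead of MapReduce
SET hive.execution.engine=tez;
SELECT COUNT(*) FROM sales;

-- Switch back to classic MapReduce execution
SET hive.execution.engine=mr;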


The Hive Thrift server is one of the key components of Hive and allows programmatic access to it. Thrift provides a software framework that lets applications written in languages such as Java, C and C++ access Hive remotely. Likewise, JDBC and ODBC drivers enable programmatic access to Hive. Hive also includes a CLI (Command Line Interface) for working with Hive interactively; the CLI can be used to run scripts of HQL commands as well.
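As a minimal sketch of interactive use, the statements below could be typed at the Hive CLI prompt, or saved to a file such as report.hql and run in batch with hive -f report.hql (the table name page_views is an assumption for illustration):

SHOW TABLES;
-- Inspect the schema of a table, then run a simple aggregate over it
DESCRIBE page_views;
SELECT COUNT(*) FROM page_views;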


The Metastore is used by Hive to store metadata such as table schemas. This metadata is specified when a table is created and is consulted every time the table is referenced. The Metastore works with three main objects: Database, Table and Partition. A Database is essentially a catalog or namespace of tables. A Table holds information such as columns, types, owner and storage details. A Partition can have its own columns and storage information, which can be used to support schema evolution in the future. The Metastore itself is implemented on a relational database. By default, Hive uses an embedded Derby database, which provides single-process storage, but any JDBC-compliant DBMS such as MySQL can be used as the Metastore.
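To make the Database, Table and Partition hierarchy concrete, here is a hedged HQL sketch that creates a database and a partitioned table and then asks Hive for the metadata the Metastore holds about it (all names are invented for illustration):

CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;

CREATE TABLE visits (ip STRING, url STRING, hits INT)
PARTITIONED BY (visit_date STRING);

-- Both statements below are answered from the Metastore, not from the data in HDFS
DESCRIBE FORMATTED visits;
SHOW PARTITIONS visits;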


A simple Hive Web Interface (HWI) provides remote web access to Hive. Apart from HWI, many open source graphical user interfaces such as Hue, as well as proprietary tools available in the market, provide easy access and an abstraction layer for using Hive.

Hive organizes data into databases, tables and partitions, as explained in the Metastore component above, and stores databases, tables and partitions as directories in HDFS. Within each partition, the data can be bucketed into files. Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP and TINYINT, as well as complex types like maps, lists and structures. Hive is also extensible, supporting user-defined functions and custom-built MapReduce routines.
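As a sketch of how this organization looks in a table definition (names and the bucket count are invented for illustration), the following creates a table partitioned by country and bucketed into 32 files by user id, mixing primitive and complex column types:

CREATE TABLE users (
  id        BIGINT,
  name      STRING,
  signup_ts TIMESTAMP,
  prefs     MAP<STRING, STRING>,               -- complex type: map
  visits    ARRAY<STRING>,                     -- complex type: list
  address   STRUCT<city:STRING, zip:STRING>    -- complex type: structure
)
PARTITIONED BY (country STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;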

Areas of application

Hive, as initially designed, is best suited for data warehouse applications where relatively static data is analyzed and fast response times are not required. However, the Tez and Stinger initiatives spearheaded by Hortonworks (a third-party provider of Hadoop) have improved Hive’s performance, and Hive is now actively used for interactive querying over large datasets. It is still not suitable for use cases that require sub-second responses. Projects to improve Hive further and to integrate it with in-memory processing paradigms over Hadoop remain very active, and they promise to make Hive the de facto analytics engine on Hadoop.
