
Decoding Pig: The Data Toolkit of Hadoop

Writing MapReduce programs to order and join large datasets for analysis is a cumbersome process. Pig is a data analysis tool built on top of Hadoop that eases the pain and increases Hadoop's adoption

PCQ Bureau

- Narayana Murthy Pola, Sr. Project Manager, DST India


We have seen over the last few months how the Hadoop ecosystem, with its MapReduce programming model, has grown into the most cost-effective distributed computing model for processing large datasets. However, most of the analytics community is well-versed in SQL, the de facto language for data analysis. Writing MapReduce programs invites a steep learning curve, if not over-dependence on MapReduce programmers. Moreover, writing MapReduce programs to order and join large datasets for analysis is a cumbersome process. This necessitated the development of data analysis tools on top of Hadoop to increase its adoption.


Engineers at Yahoo and Facebook pioneered such data analysis tools, Pig and Hive respectively, to offset this problem and bring the Hadoop ecosystem within the reach of the larger analyst community.

Pig: Pig is a data flow system that runs on top of Hadoop. It has two components: Pig Latin, a data flow language, and a runtime environment to execute the data flow instructions. Under the hood, Pig compiles a Pig Latin script into one or more MapReduce jobs and executes them.

Pig Latin provides standard data-processing operations such as join, filter, group by, order by and union. Pig analyzes a Pig Latin script and understands the data flow requested by the user beforehand, which helps in early error detection and subsequent optimization.
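
To make this concrete, here is a minimal Pig Latin sketch; the input path /data/pages.tsv and its (url, visits) columns are assumptions made purely for illustration. Pig compiles these few lines into one or more MapReduce jobs behind the scenes.

    -- assumed input: a tab-separated file with a URL and a visit count per line
    pages   = LOAD '/data/pages.tsv' AS (url:chararray, visits:int);
    busy    = FILTER pages BY visits > 1000;
    grouped = GROUP busy BY url;
    totals  = FOREACH grouped GENERATE group AS url, SUM(busy.visits) AS total_visits;
    ranked  = ORDER totals BY total_visits DESC;
    STORE ranked INTO '/output/top_pages';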

Philosophy behind Pig

Pig was developed with the following theme in mind.

• Pigs eat anything: Pig can operate on any data: structured, unstructured, relational, key/value stores, etc.

• Pigs live anywhere: Pig is not tied to the Hadoop framework alone; it is meant to be implementable on any parallel data-processing framework.

• Pigs are domestic animals: Pig is easily controlled by users and supports user-defined functions (UDFs) that can be written in Java or any scripting language that compiles down to Java (a sketch of this follows the list).

• Pigs can fly: Pig processes data quickly.
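
As a rough illustration of the "domestic animals" point, the snippet below registers a Java UDF and applies it like a built-in function. The jar name myudfs.jar and the class com.example.pig.ToUpper are hypothetical, shown only to demonstrate the mechanism.

    -- hypothetical jar and class names, for illustration only
    REGISTER myudfs.jar;
    DEFINE ToUpper com.example.pig.ToUpper();
    names   = LOAD '/data/names.txt' AS (name:chararray);
    shouted = FOREACH names GENERATE ToUpper(name);
    DUMP shouted;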

Hadoop ecosystem developers share a unique sense of humour in naming their tools: the names convey nothing about the function, but they add a fun element and a theme, as is evident above. The theme also sets the tone for what can be expected from future versions of Pig.


Areas of application

Pig, being a data flow language, allows users to describe how one or more inputs are read, joined, processed and stored, and thus finds application in three key areas.

• Traditional ETL data pipelines: As a data flow language, Pig is a natural fit for traditional ETL: for example, extracting data from different sources (including streams such as web server logs), cleansing and enriching the data, and loading it in the required format into HDFS or any other underlying data store (an illustrative script follows this list).

• Research on raw data: Analysis of raw data used to be carried out through ad hoc SQL queries or specialised tools that submit SQL internally. The schema of raw data is generally not known; Pig fits such scenarios well because it can operate when the schema is unknown, incomplete or inconsistent. Many analysts also prefer Pig's data flow paradigm to the declarative paradigm of SQL. An example of such a use case is engineers at an internet search engine (like Yahoo) who wish to analyse petabytes of data that do not conform to any schema.

• Iterative processing: In this use case there is generally one large dataset that is maintained, and processing involves a continuous addition of incremental data loads that change the state of the dataset. A classic example is an online news portal that keeps all its news stories as a huge graph, with each story as a node and links between stories that refer to the same events. Whenever a news story is added, which happens periodically (say every few minutes), a new node must be added to the graph, related stories found, and links created. There is a constant inflow of changes and a need for incremental processing. Pig Latin can do all the standard database operations in an incremental way, making this the right area for Pig. A popular social networking site like LinkedIn is a classic case of this scenario.
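
As promised above, here is an illustrative sketch of the ETL use case. The log path, field layout and cleansing rule are all assumptions, since real pipelines vary.

    -- assumed: a space-separated web server log with ip, timestamp, url and status fields
    raw  = LOAD '/logs/access_log' USING PigStorage(' ')
               AS (ip:chararray, ts:chararray, url:chararray, status:int);
    ok   = FILTER raw BY status == 200;      -- cleanse: drop failed requests
    slim = FOREACH ok GENERATE ip, url;      -- keep only the fields needed downstream
    STORE slim INTO '/warehouse/clean_logs' USING PigStorage('\t');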

How Pig Works

Pig runs on Hadoop. It is written in Java and is portable across operating systems. It need not be installed on the Hadoop cluster; it runs on the machine from which the user launches the Hadoop job, so it can be run from one's own laptop. However, as a best practice, cluster owners can set up one or more machines outside the cluster on which Pig is installed and from which jobs are submitted. This helps in securing the cluster and eases updating Pig and other tools to their next versions.

Pig can run in two modes: local mode and on a Hadoop cluster. The only thing Pig needs to know to run on a Hadoop cluster is the location of the cluster's NameNode and JobTracker. Local mode is useful for prototyping and debugging Pig Latin scripts.
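
A quick sketch of launching the two modes from the command line, assuming Pig is on the PATH and myscript.pig is a placeholder name:

    pig -x local myscript.pig       # local mode: runs against the local filesystem, no cluster needed
    pig -x mapreduce myscript.pig   # MapReduce mode (the default): runs on the Hadoop cluster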

Pig provides a shell for interactive sessions with users, called Grunt. It enables users to enter Pig Latin interactively and to interact with HDFS. Grunt provides command-line history and editing, as well as Tab completion. One of its main uses is quick sampling and prototyping of new Pig Latin scripts. Grunt can also be used to control Pig and MapReduce jobs.
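
A short Grunt session might look like the sketch below; the file name is illustrative. Each Pig Latin statement is entered interactively, and DUMP triggers execution:

    $ pig -x local
    grunt> fs -ls /data
    grunt> names = LOAD '/data/names.txt' AS (name:chararray);
    grunt> few = LIMIT names 5;
    grunt> DUMP few;        -- runs the pipeline and prints the sampled records
    grunt> quit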

Pig submits MapReduce jobs in the background, so latency should be expected; it is not suitable for online or interactive analytical processing. It is best suited for data flow operations, as discussed above. However, Pig is extensible: users can create custom functions in Python or in Java. Likewise, Pig is well integrated with the Hadoop ecosystem and can share metadata with Hive, MapReduce and other tools integrated with Hadoop. Pig is patronised by commercial distribution vendors like Hortonworks, projects to reduce its latency are underway, and it enjoys many committers. It would not be a surprise if Pig became the de facto ETL standard in the Hadoop ecosystem.
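
For instance, a Python UDF can be registered through Jython and then used like any built-in function; the script scrub.py and its clean() function are hypothetical names used only for illustration:

    -- hypothetical script and function names
    REGISTER 'scrub.py' USING jython AS scrub;
    raw  = LOAD '/data/comments.txt' AS (text:chararray);
    tidy = FOREACH raw GENERATE scrub.clean(text);
    STORE tidy INTO '/data/comments_clean';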
