Understanding MapReduce in CouchDB

PCQ Bureau

Author: Hrishikesh Dewan. He works for Siemens and is a PhD student at the Dept of CSA, IISc, Bangalore.


In the last two articles we focused on the organization of data in the form of B+Trees in a CouchDB installation. In this issue, we will see how to query that data. Most cloud storage solutions provide a distributed, parallel data-processing framework popularized by Google and known as MapReduce. Map and Reduce are two independent functions that work in stages: Map functions are executed first, and the Reduce function then works on the output of the Map. The concept of MapReduce, however, is not new, and several variations of it existed earlier. In fact, there has been a large debate in the DBMS world on this point since Google was awarded a patent by the USPTO (United States Patent and Trademark Office) for the framework. Functions named map and reduce originally appeared in Lisp (List Processing), a language used mostly in building Artificial Intelligence applications. However, the functions defined in Lisp and in MapReduce are different.

Applies to:

Cloud developers

USP: Learn the essentials of CouchDB for building scalable apps

Related Articles: CouchDB Part 1; CouchDB Part 2

Search Engine Keywords: couchdb


Introducing MapReduce

The essence of MapReduce is that it can be applied effectively to a large number of problems. The original paper on MapReduce cited examples such as distributed grep, URL link counting and distributed sort, among many others. More recently, MapReduce has been used in varied environments such as multi-core processors and GPUs to solve problems in statistical, analytical, machine-learning and social web applications.

The Map function


In MapReduce there are essentially two important functions: Map and Reduce. Apart from these core functions, other functions are involved in producing the final data-analysis output. The entire MapReduce sequence executes in stages, and the input of each stage is the output of the previous stage. The input to the map stage generally comes from an input partitioner, which divides the input into chunks; each chunk is fed to a Map function. Each map invocation is a single routine, independent of every other map invocation. In a cluster of PCs where MapReduce is implemented, a scheduler runs Map functions on several different nodes, so each Map function is given a unique chunk of data to process. The input to the map function is a key-value pair (k1, v1), and its output is also a key-value pair (k2, v2) or a list of such pairs. As per the MapReduce specification, there need not be any relation between the keys and values passed to the map function and those it outputs.

A simple example that illustrates the map function along with the input partitioner is word count. Say we have a thousand lines of text and want to count the occurrences of each word in each line. Here, a map function takes one line of the document and counts the occurrences of each word in it. Each line of the document in this problem is disjoint and has no relation to the other lines, so the problem can easily be made to run in parallel. Also note the input and output key-value pairs: the input could be a (line number, line text) pair, whereas the output is a list of (word, count) pairs. Figure 1 illustrates this step.
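The per-line map step described above can be sketched in a few lines of JavaScript (CouchDB's own view language). `mapLine` is a hypothetical helper of our own, not part of any MapReduce library: it takes one (line number, line text) pair and returns that line's (word, count) pairs.

```javascript
// A minimal sketch of the word-count map step, assuming whitespace-separated
// words. Each call is independent of every other call, which is what lets a
// scheduler run many of them in parallel on different chunks.
function mapLine(lineNumber, lineText) {
  const counts = {};
  for (const word of lineText.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts[word] = (counts[word] || 0) + 1;
  }
  // The output keys (words) bear no relation to the input key (line number),
  // exactly as the MapReduce specification allows.
  return Object.entries(counts); // a list of [word, count] pairs
}
```

For instance, `mapLine(1, "the cat and the hat")` yields the pairs [["the", 2], ["cat", 1], ["and", 1], ["hat", 1]] for that line alone.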

The Reduce function


The next stage in MapReduce could be a sort or a combine function, which sorts the key-value pairs generated by the maps. An intermediate partitioning function is then applied to group keys of the same type so that individual Reduce functions can act on them. Thus, based on the input and the intermediate partitioning function, a large number of reduce functions may run in parallel. The input to a reduce function is what the Maps produced, and the output of the Reduce can be a single value or a list of values, depending on the problem we are trying to solve.

We can extend the word-count example used in the Map description to illustrate the Reduce function. Earlier we counted the occurrences of words in each line of a text document; if we elevate the problem and now ask for the occurrences of words across the entire document, we need a reduce function. Each map runs in parallel and produces an arbitrary list of key-value pairs, so we first need to sort them into order. Based on the sorted output, the intermediate partitioning function hands each reduce function a group of key-value pairs that share the same key. For example, if the map functions produced (apple, 1), (orange, 1), (apple, 1), (orange, 1), then the two reduce functions receive (apple, [1, 1]) and (orange, [1, 1]) as inputs. The Reduce functions, in this example, aggregate the result and produce new outputs in the form (apple, 2) and (orange, 2). Note that, to perform optimally, the inputs need to be disjoint. Further, a reduce function may call yet another reduce, and the recursion may continue until we reach the level of granularity we intend to achieve. Combining these steps with Figure 1, we have the complete illustration in Figure 2. At first glimpse, it looks like two cones joined together upside down.
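The sort, intermediate partitioning and reduce steps above can be sketched as follows. `groupByKey` and `reduceWordCounts` are hypothetical helpers of our own, standing in for what a real MapReduce runtime would provide.

```javascript
// Sort + intermediate partitioning: collect values that share a key,
// returning groups in sorted key order (mimicking the sort stage).
function groupByKey(pairs) {
  const groups = {};
  for (const [key, value] of pairs) {
    (groups[key] = groups[key] || []).push(value);
  }
  return Object.keys(groups).sort().map(k => [k, groups[k]]);
}

// Reduce: sum the per-line counts for one word into a document-wide total.
function reduceWordCounts(word, counts) {
  return [word, counts.reduce((a, b) => a + b, 0)];
}

// Combined map output from two hypothetical lines:
const mapped = [["apple", 1], ["orange", 1], ["apple", 1], ["orange", 1]];
const grouped = groupByKey(mapped);            // [["apple",[1,1]], ["orange",[1,1]]]
const totals = grouped.map(([w, c]) => reduceWordCounts(w, c));
// totals is [["apple", 2], ["orange", 2]]
```

Each group is disjoint from the others, which is why the two reduce calls could run on different nodes in parallel.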

Looking back

The above description seems easy to understand and implement. But there are some larger issues that need to be thought through before seriously implementing such a framework. In the design of distributed systems, dependability, or the ability to withstand failures, is an important criterion. We need to design schedulers that keep track of jobs, can identify failed jobs and can restart them on another node. The original MapReduce paper offers solutions to these problems, and an open-source implementation called Hadoop puts many of them in place. Still, these issues remain open in several contexts, for example peer-to-peer systems, and consequently a large body of research continues in this area. Note that in many use cases, as in our example of counting word occurrences, a successful MapReduce run requires the entire set of map and reduce functions to complete; partial output carries no considerable meaning for the problem we are trying to solve.


Further, closely tied to this framework is the storage and distribution of the data that the map, reduce and other intermediate functions act on. In our example the input to map and reduce is very small, but there are applications where the input size is very large. In image-processing applications, particularly in healthcare, a single image can be as large as 1 GB. In problems like these it is wasteful to move the data to the process executing a map function; rather, the map function should move closer to the data. This considerably speeds up the processing of the map-reduce functions and is a good example of utilizing a distributed system effectively. Jim Gray, the pioneering computer scientist, described this paradigm, among others, in a beautiful paper called "Distributed Computing Economics". In most map-reduce applications, a large-scale distributed database or file system is used to store this data, and the scheduler is designed to optimize for such scenarios.

MapReduce and CouchDB

Though map-reduce is not new, it is nonetheless an interesting and important framework for solving several problems that exist today. This century is all about data and processing it to determine its meaning, so it is essential that we have frameworks of this kind. In most cloud storage platforms, where large chunks of data are stored across multiple parallel data nodes, MapReduce implementations are used to make sense of the data, and CouchDB is no exception. In CouchDB, queries are nothing but Map or Map-Reduce functions. Each query is defined first and stored in a special document called a 'design document'. CouchDB has two different types of views: temporary views and permanent views. A temporary view is executed once, at the time it is written, whereas permanent views are disk-resident, meaning they are saved to disk and can be queried at any time. These views are actually indexes, stored as B+Trees and accessed over REST (Representational State Transfer). In the next article, we will see how these views are created and how they can be built using a simplified data model.
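To give a flavour of what such a view looks like, here is a sketch of CouchDB-style map and reduce functions, run outside the server. In CouchDB the map function would live in a design document and emit() would be supplied by the view engine; here we stub emit() so the sketch runs standalone. The documents and their "type" field are invented for illustration.

```javascript
// Stand-in for CouchDB's view engine: collect emitted rows.
const rows = [];
function emit(key, value) { rows.push([key, value]); }

// Map: emit one (type, 1) pair per document that has a "type" field.
// In CouchDB this function body would be stored in a design document.
function map(doc) {
  if (doc.type) emit(doc.type, 1);
}

// Reduce: sum the emitted values, the same effect as CouchDB's built-in
// _sum reduce. (A real CouchDB reduce also receives a rereduce flag.)
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Run the view over three hypothetical documents.
const docs = [{ type: "book" }, { type: "book" }, { type: "dvd" }];
docs.forEach(map);
const total = reduce(null, rows.map(([, v]) => v)); // total is 3
```

Grouped by key, this view would report 2 for "book" and 1 for "dvd", which is exactly the kind of query a permanent view answers over REST.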