Understanding MapReduce in CouchDB

PCQ Bureau

Author: Hrishikesh Dewan. He works for Siemens and is a PhD student at the Dept of CSA, IISc, Bangalore.


In the last two articles we focused on the organization of data in the form of B+Trees in a CouchDB installation. In this issue, we will see how to query that data. Most cloud storage solutions provide a distributed, parallel data-processing framework popularized by Google and known as MapReduce. Map and Reduce are two independent functions that work in stages: Map functions are executed first, and the Reduce function then works on the output of the Map. The concept of MapReduce, however, is not new, and several variations of it existed earlier. In fact, there has been a large debate in the DBMS world on this point since Google was awarded a patent by the USPTO (United States Patent and Trademark Office) for the framework. Functions named map and reduce originally appeared in Lisp (List Processing), a language used mostly in building Artificial Intelligence applications. However, the functions defined in Lisp and in MapReduce are different.

Applies to:

Cloud developers

USP: Learn the essentials of CouchDB for building scalable apps

Related Articles: CouchDB Part 1; CouchDB Part 2

Search Engine Keywords: couchdb


Introducing MapReduce

The essence of MapReduce is that it can be applied effectively to a large number of problems. The original paper on MapReduce cited examples such as distributed grep, URL link counting and distributed sort, among many others. More recently, MapReduce has been used in varied environments such as multi-core processors and GPUs to solve problems in statistical, analytical, machine-learning and social web applications.

The Map function


In MapReduce there are essentially two important functions: Map and Reduce. Apart from these core functions, other functions are involved in producing the final data-analysis output. The entire MapReduce sequence executes in stages, and the input of each stage is the output of the previous stage. The input to the map stage generally comes from an input partitioner, which divides the input into chunks; each chunk is fed to a Map function. Each map invocation is a single routine, independent of every other map invocation. In a cluster of PCs where MapReduce is implemented, a scheduler runs Map functions on several different nodes, so each Map function is given a unique chunk of data to process. The input to the map function is a key-value pair (k1, v1), and its output is also a key-value pair (k2, v2) or a list of such pairs. As per the MapReduce specification, there need not be any relation between the keys and values passed to the map function and those it outputs.

A simple example that illustrates the map function along with the input partitioner is word count. Say we have a thousand lines of text and want to count the occurrences of each word in each line. Here, a map function takes one line of the document and counts the occurrences of each word in it. Each line of the document in this problem is disjoint and has no relation to the other lines, so the problem can easily be made to run in parallel. Also note the input and output key-value pairs: the input could be a (line number, line text) pair, whereas the output is a list of (word, count) pairs. Figure 1 illustrates this step.
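The per-line map step described above can be sketched in a few lines of JavaScript (CouchDB's own view language). `mapLine` is a hypothetical helper of our own, not part of any MapReduce library: it takes one (line number, line text) pair and returns that line's (word, count) pairs.

```javascript
// A minimal sketch of the word-count map step, assuming whitespace-separated
// words. Each call is independent of every other call, which is what lets a
// scheduler run many of them in parallel on different chunks.
function mapLine(lineNumber, lineText) {
  const counts = {};
  for (const word of lineText.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts[word] = (counts[word] || 0) + 1;
  }
  // The output keys (words) bear no relation to the input key (line number),
  // exactly as the MapReduce specification allows.
  return Object.entries(counts); // a list of [word, count] pairs
}
```

For instance, `mapLine(1, "the cat and the hat")` yields the pairs [["the", 2], ["cat", 1], ["and", 1], ["hat", 1]] for that line alone.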

The Reduce function


The next stage in MapReduce could be a sort or a combine function, which sorts the key-value pairs generated by the maps. An intermediate partitioning function is then applied to group keys of the same type so that individual Reduce functions can act on them. Thus, based on the input and the intermediate partitioning function, a large number of reduce functions may run in parallel. The input to a reduce function is what the Maps produced, and the output of the Reduce can be a single value or a list of values, depending on the problem we are trying to solve.

We can extend the word-count example used in the Map description to illustrate the Reduce function. Earlier we counted the occurrences of words in each line of a text document; if we elevate the problem and now ask for the occurrences of words across the entire document, we need a reduce function. Each map runs in parallel and produces an arbitrary list of key-value pairs, so we first need to sort them into order. Based on the sorted output, the intermediate partitioning function hands each reduce function a group of key-value pairs that share the same key. For example, if the map functions produced (apple, 1), (orange, 1), (apple, 1), (orange, 1), then the two reduce functions receive (apple, [1, 1]) and (orange, [1, 1]) as inputs. The Reduce functions, in this example, aggregate the result and produce new outputs in the form (apple, 2) and (orange, 2). Note that, to perform optimally, the inputs need to be disjoint. Further, a reduce function may call yet another reduce, and the recursion may continue until we reach the level of granularity we intend to achieve. Combining these steps with Figure 1, we have the complete illustration in Figure 2. At first glimpse, it looks like two cones joined together upside down.
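The sort, intermediate partitioning and reduce steps above can be sketched as follows. `groupByKey` and `reduceWordCounts` are hypothetical helpers of our own, standing in for what a real MapReduce runtime would provide.

```javascript
// Sort + intermediate partitioning: collect values that share a key,
// returning groups in sorted key order (mimicking the sort stage).
function groupByKey(pairs) {
  const groups = {};
  for (const [key, value] of pairs) {
    (groups[key] = groups[key] || []).push(value);
  }
  return Object.keys(groups).sort().map(k => [k, groups[k]]);
}

// Reduce: sum the per-line counts for one word into a document-wide total.
function reduceWordCounts(word, counts) {
  return [word, counts.reduce((a, b) => a + b, 0)];
}

// Combined map output from two hypothetical lines:
const mapped = [["apple", 1], ["orange", 1], ["apple", 1], ["orange", 1]];
const grouped = groupByKey(mapped);            // [["apple",[1,1]], ["orange",[1,1]]]
const totals = grouped.map(([w, c]) => reduceWordCounts(w, c));
// totals is [["apple", 2], ["orange", 2]]
```

Each group is disjoint from the others, which is why the two reduce calls could run on different nodes in parallel.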

Looking back

The above description seems easy to understand and implement. But there are some larger issues that need to be thought through before seriously implementing such a framework. In the design of distributed systems, dependability, or the ability to withstand failures, is an important criterion. We need to design schedulers that keep track of jobs, can identify failed jobs and can restart them on another node. The original MapReduce paper offers solutions to these problems, and an open-source implementation called Hadoop puts many of them in place. Still, these issues remain open in several contexts, for example peer-to-peer systems, and consequently a large body of research continues in this area. Note that in many use cases, as in our example of counting word occurrences, a successful MapReduce run requires the entire set of map and reduce functions to complete; partial output carries no considerable meaning for the problem we are trying to solve.


Further, closely tied to this framework is the storage and distribution of the data that the map, reduce and other intermediate functions act on. In our example the input to map and reduce is very small, but there are applications where the input size is very large. In image-processing applications, particularly in healthcare, a single image can be as large as 1 GB. In problems like these it is wasteful to move the data to the process executing a map function; rather, the map function should move closer to the data. This considerably speeds up the processing of the map-reduce functions and is a good example of utilizing a distributed system effectively. Jim Gray, the pioneering computer scientist, described this paradigm, among others, in a beautiful paper called "Distributed Computing Economics". In most map-reduce applications, a large-scale distributed database or file system is used to store this data, and the scheduler is designed to optimize for such scenarios.

MapReduce and CouchDB

Though map-reduce is not new, it is nonetheless an interesting and important framework for solving several problems that exist today. This century is all about data and processing it to determine its meaning, so it is essential that we have frameworks of this kind. In most cloud storage platforms, where large chunks of data are stored across multiple parallel data nodes, MapReduce implementations are used to make sense of the data, and CouchDB is no exception. In CouchDB, queries are nothing but Map or Map-Reduce functions. Each query is defined first and stored in a special document called a 'design document'. CouchDB has two different types of views: temporary views and permanent views. A temporary view is executed once, at the time it is written, whereas permanent views are disk-resident, meaning they are saved to disk and can be queried at any time. These views are actually indexes, stored as B+Trees and accessed over REST (Representational State Transfer). In the next article, we will see how these views are created and how they can be built using a simplified data model.
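To give a flavour of what such a view looks like, here is a sketch of CouchDB-style map and reduce functions, run outside the server. In CouchDB the map function would live in a design document and emit() would be supplied by the view engine; here we stub emit() so the sketch runs standalone. The documents and their "type" field are invented for illustration.

```javascript
// Stand-in for CouchDB's view engine: collect emitted rows.
const rows = [];
function emit(key, value) { rows.push([key, value]); }

// Map: emit one (type, 1) pair per document that has a "type" field.
// In CouchDB this function body would be stored in a design document.
function map(doc) {
  if (doc.type) emit(doc.type, 1);
}

// Reduce: sum the emitted values, the same effect as CouchDB's built-in
// _sum reduce. (A real CouchDB reduce also receives a rereduce flag.)
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Run the view over three hypothetical documents.
const docs = [{ type: "book" }, { type: "book" }, { type: "dvd" }];
docs.forEach(map);
const total = reduce(null, rows.map(([, v]) => v)); // total is 3
```

Grouped by key, this view would report 2 for "book" and 1 for "dvd", which is exactly the kind of query a permanent view answers over REST.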