by August 2, 2011 0 comments

In the last issue we discussed BTree and emphasised on thefact that CouchDB is nothing but a set of BTrees with a welldefined interface to access remotely. In this issue, we dis cuss a few more concepts that are essential in understandingCouchDB. Note that as I wrote in my previous article, theseconcepts are used in several cloud storage platforms and areone of the best known distributed computing algorithms tosolve problems of different kinds. The first thing we explore isan extension of BTree, known as B+Tree. It is actually aB+Tree, not a BTree that is used in CouchDB.


Applies: Cloud developers
USP: Learn the essentials of CouchDB forbuilding scalable apps
Related articles:CouchDB Part 1:
Search engine keywords: couchdb

The B+Tree

B+Tree is a variant of the general purpose BTree, where thedata structure is modified to handle a few interesting usecases. Two important modifications are that in a B+Tree theintermediate and the root nodes hold only the key; they don’thave any reference to values and all key value pairs are lo cated in the leaf. The root and the intermediate node acts aslocation finders for the search key and all values correspon ding to them are found in the keys. Second, the leaf nodeswith M distinct elements are interconnected. That is there isa link list of leaf nodes and so if you are at a given leaf, youcan traverse all through the rightmost node in the BTree.These two simple modifications provide a lot of flexibilitieswhen we look for nodes for a specific key or keys in betweena specific range. The latter is explicitly known as Range Query in database systems. The leaf nodes are intercon nected to each other and values are placed in the leaf node.In the root node certain elements are just pointers and theirkey value pairs are repeated at the leaf.

[image_library_tag 489/67489, border=”0″ align=”right” hspace=”4″ vspace=”4″ ,default]

In previous examples, we searched for a single key and if itwas found we either retrieved it or reported a miss. Let us to day, modify this search key criterion. Instead of a single key, letus say that we want to retrieve all values which are between 3and 5. Such queries are common in SQL where instead of a sin gle predicate you include multiple predicate values. In aB+Tree, such a search is trivial to solve. Here, you locate the el ement which is close to 3. The element with key is in the firstleaf node. This leaf node is connected to the next leaf node andhence we don’t need to do yet another search but instead moveon linearly in the link list till we reach 5.

CouchDB uses B+Trees to store your documents, each doc ument is provided an ID and also uses B+Tree to store views.However, it is important to know that for every key, you de fined, there is a B+Tree that is generated. For example, let ussuppose that the B+Tree in the figure is the sorted arrange ment of the age of an employee. I assume that we have a largecollection of employee documents where in we are storing thename, address and the age of an employee. If you to choose tocreate a B+Tree or a View to be precise, of the employees withthe age key, then searching for an employee with first name will bear no results. In such a case, there is no index createdand hence, you will not receive any performance benefit.Searching for a non key element will be in the order on O(N).In CouchDB, views are created using MapReduce type of pro gramming patterns and they are incremental. MapReduce cre ates key value pairs and each of these key values is stored inthe CouchDB as a B+Tree. Incremental means that after a keyis processed and stored, any updates to the view are added in crementally without re computing the view or generating theB+Tree again. For instance, after generating the view for em ployee age, we might decide to insert a few moreemployee records into the CouchDB store. At thispoint, the CouchDB software would realize thatapart from the document store there are viewsassociated with documents and so it would alsoinsert the age ‘keys’ in the view store as well.This rule is also applied when documents aredeleted as well.

Click on the image to enlarge

[image_library_tag 491/67491, border=”0″ align=”middle” hspace=”4″ vspace=”4″ ,default]

In CouchDB, every document has an ID andthis ID is generated by the CouchDB system it self. Apart from this ID, there is one more im portant field that is associated with theCouchDB document. It is known as Revision_ID.Revision_ID has a lot to do on concurrent man agement of request and a technique called Multi Version Con currency Control. We will study this technique as this is veryrelevant to operations executed in the CouchDB B+Tree.

Multi Version Concurrency Control (MVCC)

[image_library_tag 493/67493, border=”0″ align=”right” hspace=”4″ vspace=”4″ ,default]

Generally, update operations on a structure say an integer overwrites previous values stored in that location. That is if wehave variable x of type integer with initial value 10, updatingthe contents of x will wipe out the value and re instate a newvalue in its place. In multi version techniques, however, a value is not wiped out. Instead all operations executed on a variableare retained. That is, if there is a sequence of operations suchas x+ 10, X+20, X 30 is executed on the variable x with 10 asthe value, there would be four memory locations pointing tovarious states of x. To distinguish between different states andalso to know the latest state, the variables are marked with onemore attribute called revision number. A revision could be aninteger, a double or time stamp. When used as an integer, it isincremented every time when it is updated (insert or delete)and when time stamp is used, then it is used to denote the timethe operation was executed. Clearly from each of these revisionidentifier types, we can ascertain the latest state and also rollback to previous state if required. In CouchDB, documents arebased on such a technique, so whenever you update an existingdocument, a snapshot of the old document is created and anincreasing revision number is used. By virtue of this, CouchDB increases the throughput of read and write operations execut ing concurrently without using the costly lock primitives. Ear lier, designers used several different varieties of locks to solveconcurrent updates to a single entity but with the gradualadoption of MVCC, locks are nowaday’s very sparingly used. Infact, most RDBMSes today, like the open source Post GRESQL,use MVCC to isolate and run concurrent transactions.

CouchDB uses MVCC to support a concurrency controlscheme called Optimistic Concurrency. In this model, each ofthe ‘write’ originated from multiple clients is allowed to mod ify the record at the same time. The first client that updates therecord wins and the other client updates are discarded. To un derstand it precisely, let us illustrate the example of the vari able ‘x’. Let us suppose we use time stamps to denote therevision number. Initially the value of x is 10 and the revisionnumber is ‘t0’. Clients who wish to update the value of x haveto first read the value of x. When each of the clients reads thevalue of x, the CouchDB application also returns the revisionnumber t0. Clients modify the value x in their own way andsubmit the new value along with the revision number. As soonas the new value is reached, CouchDB compares the revisionnumber sent by the client to the highest revision number asso ciated with the variable x in the server. Clearly, out of all theclients, only one wins. The CouchDB system then updates thevalue of x based on the value updated by the winning clientand also updates the revision number. Note that it doesn’tdelete the old value; the old value remains intact as it was be fore and instead a new location is used to store the updatedvalue with the revision number. The other client receives an er ror and they either re read the old value or simply discard theoperation. This simple technique has a great benefit and it isproven to be a very effective solution for achieving concurrentoperations. In fact most of the cloud vendors and storage solu tions have adopted this solution of solving concurrent opera tions.

Thus a B+Tree in CouchDB is not just a simple one, everynode in the B+Tree includes elements and each element mayhave multiple versions. You might be wondering that with sucha technique there are issues with space as every update wouldun necessary increase the size of the B+Tree even though wemay not be interested in keeping them for ages. The solution tosolve this problem is to use a garbage collector that runs peri odically and deletes old values. And your favourite cloud stor age system uses such a technique as well!

In the next issue, we shall discuss on a few more interest ing techniques. Watch out for the next issue.

No Comments so far

Jump into a conversation

No Comments Yet!

You can be the one to start a conversation.

Your data will be safe!Your e-mail address will not be published. Also other data will not be shared with third person.