Open Source Apps: Deploying Cassandra, the cloud friendly database

PCQ Bureau

03 Mar 2011 05:29 IST

Cassandra was developed by Facebook and open sourced in 2008. It later became an Apache project, hosted at http://cassandra.apache.org/. It falls under a category of databases called NoSQL, which stands for Not Only SQL. Such databases do not use the popular SQL (Structured Query Language) to create tables and to insert, delete or update data. Expecting a NoSQL database to offer the features found in an RDBMS (Relational Database Management System) makes the learning curve even steeper. In this article, we show you how typical RDBMS operations can be performed against a Cassandra database.

Installing Cassandra

For this article, we will install Cassandra on a machine running Fedora 13 Linux. While the Cassandra binary and source are available at http://cassandra.apache.org, we will install Cassandra via RPM. Log in as root and issue the following commands to install Cassandra.

rpm -ivh http://rpm.riptano.com/Fedora/13/i386/riptano-release-5-3.fc13.noarch.rpm
yum install apache-cassandra

This will install Cassandra (version 0.7.1 as of this writing). RPMs for other versions of Fedora, Red Hat Enterprise Linux and CentOS are also available. Browse to http://rpm.riptano.com/ to find them.

To start Cassandra, issue the following command:

service cassandra start

While it is considered inappropriate to map RDBMS concepts to a NoSQL database, it is still the quickest way to start learning the latter. We will deal with querying Cassandra later in the article; in this section we map the basic database objects to their counterparts in Cassandra.

A database corresponds to a Keyspace, and a table to a Column Family. A Keyspace contains one or more Column Families, and a Column Family contains one or more Columns. To query Cassandra, we use the cassandra-cli command line client, which is installed by the RPM (see above). Open a terminal window or console and issue the following to run the client:

cassandra-cli

This will show the following output and drop you to a prompt to start querying Cassandra.

Welcome to cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.

This leaves you at the [default@unknown] prompt, from where you can start issuing statements to query Cassandra. To quit from this prompt, issue:

quit;

Connecting to Cassandra

By default, Cassandra listens on port 9160. To connect to it, issue the following at the [default@unknown] prompt:

connect localhost/9160;

This will produce output saying: Connected to: "Test Cluster" on localhost/9160. In the following sections, we explain querying Cassandra alongside example SQL (RDBMS) queries, using MySQL as the RDBMS. A database is called a Keyspace in Cassandra. In MySQL, a database can be created as follows:

create database news;

To create a Keyspace in Cassandra, issue the following at the cassandra-cli prompt:

create keyspace news;

To drop a database:

drop database news; (SQL)
drop keyspace news; (Cassandra)

In the news database, we now create a table called article in SQL as follows:

create table article
( ID int,
Title varchar(50),
SubTitle varchar(50),
Dated datetime,
Category int,
index (ID),
index(Title),
index(Dated) );

A table in Cassandra is called a Column Family. The above SQL translates into the following for Cassandra:

create column family article
with comparator=UTF8Type
and column_metadata=[{column_name:Title, validation_class: UTF8Type, index_type: KEYS},
{column_name:SubTitle, validation_class: UTF8Type},
{column_name:Dated, validation_class:LongType, index_type: KEYS}];

Note comparator=UTF8Type. This specifies the data type of the column names, which is UTF-8 text in this case. This will sound new to those from an RDBMS background, where column names are alphanumeric by default.

Next, the syntax for defining each column is as follows:

{column_name: <name>, validation_class: <type>, index_type: KEYS}

Note the missing ID column. We will use the key (of the key-value pair) as the ID, as explained later. Since there is no datetime type in Cassandra, we use LongType to store a timestamp value for the Dated column. In an RDBMS, it is recommended, but not mandatory, to index columns that are searched (in a 'where' condition). In Cassandra, it is mandatory to have indexes for columns that you want to search. To create the Column Family called article in the news Keyspace, issue the following at the cassandra-cli prompt:

use news;

create column family article with comparator=UTF8Type and column_metadata=[{column_name:Title, validation_class: UTF8Type, index_type: KEYS}, {column_name:SubTitle, validation_class: UTF8Type}, {column_name:Dated, validation_class:LongType, index_type: KEYS}];

Note the statement “use news;”. This sets the current Keyspace, just as the current database is set in SQL with “use <database>;”. To truncate and drop a Column Family:

truncate table article; (SQL)
truncate article; (Cassandra)
drop table article; (SQL)
drop column family article; (Cassandra)

Insert, update and delete

To list the data in a Column Family (article in our case), issue the following command at the Cassandra prompt:

use news;
list article;

“list article” is similar to issuing “select * from article” in SQL. This will produce an output as follows:

Using default limit of 100
0 Row Returned.

There is no data in the Column Family because we haven't entered any yet. To insert data in SQL we use:

insert into article (ID, Title, SubTitle, Dated) values (1,'Nokia and Microsoft','Nokia and Microsoft enter strategic alliance on Windows Phone','2011-02-05 10:10:10');

The corresponding way to insert data in Cassandra is as follows:

set article[1]['Title']='Nokia and Microsoft';
set article[1]['SubTitle']='Nokia and Microsoft enter strategic alliance on Windows Phone';
set article[1]['Dated']='1296880810';

In the case of Cassandra, the ID becomes the key for the row, and each field is stored as a Column under that key. Those with a programming background will find inserting data in Cassandra similar to adding values to a multidimensional array.
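The array analogy can be made concrete with a small Python sketch. This is only a conceptual model of a Column Family as a dictionary of dictionaries, not a Cassandra client; the helper names set_column and list_rows are hypothetical and exist purely for illustration.

```python
# Conceptual model only: a Column Family pictured as {row_key: {column: value}}.
# Nothing here talks to Cassandra; the helper names are made up for illustration.
article = {}


def set_column(column_family, row_key, column_name, value):
    """Mimic `set article[key][column] = value;` (insert or overwrite)."""
    column_family.setdefault(row_key, {})[column_name] = value


def list_rows(column_family):
    """Mimic `list article;` by printing each row key with its columns."""
    for row_key, columns in column_family.items():
        print("RowKey:", row_key)
        for name in sorted(columns):
            print("  column=%s, value=%s" % (name, columns[name]))


set_column(article, "1", "Title", "Nokia and Microsoft")
set_column(article, "1", "SubTitle",
           "Nokia and Microsoft enter strategic alliance on Windows Phone")
set_column(article, "1", "Dated", 1296880810)

list_rows(article)
```

Because the same call both creates and overwrites a column, the model also hints at why Cassandra uses set for inserts and updates alike.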

Note that for each column, the index or key remains 1, which is the ID of the article. For Dated, we have used the UNIX timestamp for the date 2011-02-05 10:10:10 (Indian Standard Time, UTC+5:30). For more on UNIX timestamps, refer to http://en.wikipedia.org/wiki/Unix_time. Now, when we issue list article; we will get output that looks similar to:

Using default limit of 100
------------------
RowKey: 1
=> (column=Dated, value=1296880810, timestamp=1297953137021000)
=> (column=SubTitle, value=Nokia and Microsoft enter strategic alliance on Windows Phone, timestamp=1297953102353000)
=> (column=Title, value=Nokia and Microsoft, timestamp=1297953054788000)
1 Row Returned.

Here, timestamp is a value that Cassandra generates automatically for each column. Note that the RowKey (ID) is set to 1 and the value of each column is shown on a separate line. Let us add one more row to the article Column Family:

set article[2]['Title']='Google and Bing';
set article[2]['SubTitle']='Google accuses Bing of copying its search results';
set article[2]['Dated']='1295949600';

Now, list article; will show both the rows with their RowKeys.

list article;
Using default limit of 100
-------------------
RowKey: 2
=> (column=Dated, value=1295949600, timestamp=1298059203710000)
=> (column=SubTitle, value=Google accuses Bing of copying its search results, timestamp=1298059135156000)
=> (column=Title, value=Google and Bing, timestamp=1298059119524000)
-------------------
RowKey: 1
=> (column=Dated, value=1296880810, timestamp=1297953137021000)
=> (column=SubTitle, value=Nokia and Microsoft enter strategic alliance on Windows Phone, timestamp=1297953102353000)
=> (column=Title, value=Nokia and Microsoft, timestamp=1297953054788000)
2 Rows Returned.
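The Dated values stored above are UNIX timestamps, so they have to be computed before being handed to cassandra-cli. The following Python sketch shows one way to do that. It assumes the article's date is in Indian Standard Time (UTC+5:30), and it assumes that the per-column timestamp printed by list is a count of microseconds since the UNIX epoch; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dates in the article are Indian Standard Time (UTC+5:30).
IST = timezone(timedelta(hours=5, minutes=30))


def to_unix_seconds(text, tz=IST):
    """Parse 'YYYY-MM-DD HH:MM:SS' in the given zone and return epoch seconds."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(local.timestamp())


def from_column_timestamp(micros):
    """Convert a column timestamp, assumed to be in microseconds, to UTC."""
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)


dated = to_unix_seconds("2011-02-05 10:10:10")
print(dated)  # 1296880810, the value used for Dated in row 1

written = from_column_timestamp(1297953137021000)
print(written.date())  # the day the Dated column of row 1 was written
```

Storing the date as a plain number keeps it comparable, at the cost of converting it back to a readable date in application code.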

Updating data is as simple as:

update article set SubTitle='Nokia and Microsoft enter strategic alliance on Windows Phone 7' where ID=1; (SQL)
set article[1]['SubTitle']='Nokia and Microsoft enter strategic alliance on Windows Phone 7'; (Cassandra)

And to delete:

delete from article where ID=1; (SQL)
del article[1]; (Cassandra)
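The Title and Dated columns were declared with index_type: KEYS because Cassandra can only search on indexed columns. One way to picture such an index is as a map from a column value back to the row keys that hold it. The Python sketch below illustrates that idea only; it is not how Cassandra stores its indexes, and build_index and find_row_keys are hypothetical names.

```python
from collections import defaultdict

# Sample rows taken from the article; the dict layout is only a mental model.
rows = {
    "1": {"Title": "Nokia and Microsoft", "Dated": 1296880810},
    "2": {"Title": "Google and Bing", "Dated": 1295949600},
}


def build_index(rows, column_name):
    """Map each value of one column to the set of row keys that contain it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        if column_name in columns:
            index[columns[column_name]].add(row_key)
    return index


def find_row_keys(index, value):
    """Look up row keys by column value without scanning every row."""
    return sorted(index.get(value, ()))


title_index = build_index(rows, "Title")
print(find_row_keys(title_index, "Google and Bing"))
print(find_row_keys(title_index, "No such title"))
```

Without such a map, finding a row by Title would mean reading every row, which is the cost that the mandatory index avoids.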

Cassandra attempts to improve performance by mandating the use of a range condition with a smaller subset of data.

Did we tell you that Cassandra is a distributed database from its very core? Throughout this article we have been working on a single-node installation of Cassandra. For more, refer to http://wiki.apache.org/cassandra.
