Developers

Using MySQL Indexes Effectively

PCQ Bureau

01 Jul 2011 07:40 IST

New Update

Sufyan bin Uzayr, Freelance Writer, Graphic Artist and Photographer, www.sufyan.co.nr

Advertisment

In the last issue, we covered brief MySQL concepts like storage engines, EXPLAIN, the problem of full table scans, etc. We had concluded by adding indexes to overcome full table scans. This time, we shall resume where we left and take a detailed look at indexes.

Full table scans can strike back

First up, coming to common MySQL errors or issues that are often ignored. Lets create a table along the lines of the following sample (this sample table has few flaws in it, as we shall see later on):

Advertisment

Applies to:

Database admins

USP: Learn to optimize the use indexes in a MySQL database

Related articles: Performance Tuning a MySQL Database: http://ld2.in/3dp

Maatkit:

A Handy Tool for MySQL Users: http://ld2.in/3dq

CREATE TABLE 'awesome_table' (

'awe_a' INT(10) NOT NULL AUTO_INCREMENT,

awe_date' DATE NOT NULL,

PRIMARY KEY ('awe_a'),

INDEX 'awe_date' ('awe_date')

)

Additionally, you may suffix the following to the above table too (it depends on the environment you have at your disposal, though the following code is a recommended addition, if possible):

Advertisment

COLLATE = 'utf8_general_ci'

ROW_FORMAT = DEFAULT

Populate the table with some sample records (say, 10 records). The Primary Key lies with the awe_a column, while the awe_date column is indexed as well. So, simply because indexing is on for awe_date column, we can assume that any queries done on the column will not run unoptimised, right? Apparently, not! Run the following sample query:

EXPLAIN SELECT * FROM awesome_table

WHERE awe_date < '1980'

Advertisment

What did you get? Correct! It runs a full table scan, yet again, in spite of the index placed on the awe_date column. Now, let us modify the above query slightly:

EXPLAIN SELECT * FROM awesome_table

WHERE awe_date < '1980'-01-02'

What did you see now? It no longer performs a full table scan, but instead, shows the scan type as 'range' rather than 'index'. The outcome? Faster processing! As you must have noticed, in the first query, '1980' is an ambiguous parameter but in the second query the entire date eliminates the possibility of a scan type such as 'ALL'. Another common scenario wherein an otherwise-not-required full table scan is called upon is one comprising of UCASE and LCASE. More often than not, applications perform case-insensitive searches. For example:

Advertisment

EXPLAIN SELECT * FROM table-name

WHERE UCASE (column-name) = 'THIS IS SO WONDERFUL' ;

In such searches, MySQL will ignore the indexes, convert the values held by the specified column in each row to UCASE and then perform the search for the given sample text. The easiest way out of such a situation is to store either UCASE or LCASE values (the requisite case conversion should ideally be performed when the record is inserted in the table). Following that, the case of the value under consideration can be automatically compared, as shown in the query below:

EXPLAIN SELECT * FROM table-name

WHERE column-name = UCASE ('this is so wonderful') ;

Advertisment

This shall compel MySQL to convert the given value into UCASE in order to match a rule that allows for storage of only UCASE values in the given column.

MyISAM storage engine-a closer look

As we covered in last month's article, MyISAM is MySQL's default storage engine. Now we shall take a closer look at it. MyISAM by default stores a table in two files (one for data, the other for indexes). For the data file, the extension is .MYD while for the index file, the extension is .MYI.

Advertisment

You can also use the DATA DIRECTORY and INDEX DIRECTORY options along with the CREATE TABLE command to specify the location of each file of the given table. Since these files are platform independent, most databases support specifying of the directories. Also, all readers having SELECT associated with queries need to obtain read locks and multiple users can do the same by means of shared locks. However, on the contrary, all writers need to have exclusive locks. Thus, while two users can acquire read locks (SELECT) at the same time, they cannot perform write operations (INSERT, UPDATE or DELETE, etc.) simultaneously.

This is the precise reason why indexing is crucial and needless to say, if you fail to handle your indexes well, write operations will become slow and time consuming resulting in heavy load on the system. MyISAM supports Full Text Index (also known as Full Text Search) but doesn't yet support transactions. Table locks are a possibility, but row locks are not. Further more, MyISAM supports compressed tables too (read only). However, it must be noted that only individual rows are compressed and not the entire table as a whole. MyISAM also has the advantage of specifying NULL values even in indexed fields, as well as providing a different character set for each CHAR or VARCHAR column type.

MyISAM and B-tree

In a MyISAM powered table, the type of each index is B-tree. So before going any further, let us analyze what a B-tree is, and to do so, we shall turn to Wikipedia (http://en.wikipedia.org/wiki/B-tree): “... a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic amortized time. With the introduction out of the way, we now turn our attention once again to MyISAM and B-tree, with special focus on indexes.”

First, we can briefly sum up the theoretical aspect of the issue. It can be said that the B-tree index has a root node on the top (since it is a tree, it has to have a root). In B-tree, any node that doesn't have a child attached to it is called a leaf node. Therefore, the root node is a non-leaf node while all the nodes that spring from it are called leaf nodes. The links between a node and its immediate children can be shown as 'pointers'. Do not confuse the 'pointers' to be C/C++ pointers.

Going below the leaf nodes (ones without children nodes), you'll find the actual table data. The data is linked to the leaf nodes on the basis of 'key values'. Thus, it becomes quite obvious that effective and speedy searches depend on how the key values associate the data to the leaf nodes, or, in simple terms, how effectively a table is indexed. At this junction, we can also tear apart the myth that MyISAM supports clustered indexes. Truth is, MyISAM does not store data in a sorted fashion, whereas for a clustered index to work, data must be sorted. MyISAM, on the other hand, stores data as and when it is inserted into the table. It does sort the indexes, but as we have already covered, indexes are stored in a separate file (.MYI) than data itself (.MYD). MyISAM uses indexes to point to the exact location of unsorted data and as a result, removes the need of data storage in a sorted manner. The most obvious benefit of employing a B-tree is that it considerably improves the search functionality (SELECT queries). However, on the down side, queries such as INSERT and DELETE tend to become slower as each time a record is either inserted or deleted, the indexes located in .MYI file also need to be modified. The cure in such a case is to index selectively.

Index selectivity implies the difference in values stored or recorded in the columns of a table. Selectivity is measured on a scale of 0 or 1, wherein 1 implies that each value in the selected column is UNIQUE. Generally, selectivity of 1 occurs with columns that are UNIQUE or PRIMARY KEY, though this isn't always the case and it varies with the nature of values stored in the given columns. For the sake of simplicity, we can stick to the following formula:

SELECTIVITY = NO. OF DISTINCT RECORDS / TOTAL NO. OF RECORDS

The above formula is a stripped down and simplified version for the purpose of understanding. If you so desire, you can use the alternate way to calculate selectivity by employing a production class database and finding the number of DISTINCT rows in it. Bear in mind though, that the number of DISTINCT values in a column may or may not always work perfectly.

Higher selectivity means the operations shall be of shorter duration and vice-versa. As a result, lower selectivity is termed as an expensive operation while higher selectivity is an inexpensive operation.

Finally, coming back to the sample table that we created at the start of the article. The awe_a column is a PRIMARY KEY, and will thus have a selectivity of 1. the awe_date column is indexed as well, so lets focus on it. Quite obviously, all dates cannot be distinct or UNIQUE and this column is bound to have a low selectivity. In such a case, it will not serve as a good index and as a result, in spite of indexing the column, we got a full table scan in the first query that we ran earlier.

Before performing a query, MySQL calculates the 'cost' of the different ways in which the query can be performed and then picks the cheapest or most effective way. So if a low selectivity column is used for an index, it will overload the system. To avoid such overloading, MySQL may choose not to use your index if the selectivity is low. This is precisely the reason why even after using multiple indexes, your queries may still result in full table scans (read: slower outputs) and burden the system resources. In simple terms, the entire input and output process depends on the appropriateness of the indexing and querying. Hence, it becomes vital that indexes are used judiciously and selectively.

Advertisment