MySQL is a pretty fast database and serves most requirements for a decent
sized Web application. But the dragon starts raising its head, when the number
of records grow. In this series we will look into various ways of optimizing the
MySQL database so as to tame the dragon from time to time.
The simplest optimization in MySQL is to choose correct data types, depending
on the record set stored in the table. Many MySQL database designers fail to do
this, resulting in database driven websites failing miserably. In this article,
we will cover the most common data types used in MySQL and how to use built-in
MySQL functionalities to figure out the most optimal data type for a column.
Size does matter
A standard practice among database designers is to define primary key as an
auto-increment INTEGER. By default MySQL will set this as Unsigned INT. This
means you can store 4,294,967,295 records in the given table. Sounds good from a
scalability point of view but if you are designing the table for a small website
ask yourself a question, is your website going to get 4,294,967,295 users ? Of
course we wish your website gets these many users but most websites do not have
so many users.
Thus choosing Unsigned INT as userID might not be a good choice. For small to
medium sized websites choosing MEDIUIMINT as userID will be the bast choice.
When compared to INT, MEDIUMINT consumes approximately 25% less space. A
column with INT type consumes 4 bytes whereas a column with MEDIUMINT type will
consume 3 bytes.
Direct Hit! |
Applies To: Database designers USP: Learn various ways of optimizing MySQL database Primary Link: None Keywords: Optimizing MySQL Database |
MySQL has several specialized data types which require lesser memory and
space. Smaller the data type size, faster is the record look up and faster the
query execution. Specially the primary index of a table should be as small as
possible to make row identification faster.
Of course when more users start joining your website you can always alter the
table to go to next higher data type. MySQL manual is a good source to know size
of different data types and identify the most suitable, smallest data type for a
column. Storage requirements for each data type can be found here http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
.
Rule to remember is to start small and grow the field type as record set
increase in the table.
NULL is not NULL
NULL columns require additional one bit per column. Setting column as NOT NULL
can make everything faster. Specially when your query involves subqueries, MySQL
will perform an additional search for rows that contain NULL. This search can be
avoided when there are no NULLs in the table. You should definitely use NULL if
your application requires them but do not set columns as NULLable by default.
Using a constant value which falls outside domain range for the column instead
of NULL can speed up the look ups.
Date and time
Setting record creation date and record update date to DATETIME is a favorite
among database designer. Most people are not aware that DATETIME is among the
costliest data types in terms of space. DATETIME consumes 8 bytes of space. Date
and time types should be avoided unless you need to compare dates in your quries.
If you have to store dates for display purpose only, storing Unix time stamp
as a MEDIUMINT can surve the purpose. All programing languages provide mechanism
for creating current Unix time stamp and converting Unix time stamp back to date
format. So, 1223890532 will result in faster look up compared to 2008-10-13
05:35:32.
CHAR or VARCHAR ?
CHAR is fixed width and VARCHAR is variable length. But VARCHAR requires
extra byte to store the record. If length of VARCHAR is 255 or less it requires
one byte but if the length of VARCHAR is greater than 255 bytes it requires two
bytes. A VARCHAR is store as one-byte or two-byte length prefix plus data.
The length prefix contains the number of bytes in the record. A field
declared as VARCHAR(5) may take 6 bytes instead of 5 bytes as expected.
Finding the correct data type
MySQL comes with a built-in procedure to analyze a query and suggests the
optimal data types for each column. This procedure is called "Procedure Analyse".
Procedure analyse will analyze the existing data in a table and try to guess
the optimal data type based on the distinct values per column present in the
table.
The syntax for this procedure is following;
SELECT col1, col2 FROM table1 PROCEDURE
ANALYSE(
Procedure Analyze expects two optional parameters max_elements and max_memory.
max_memory is maximum memory that shall be allocated on per column basis while
trying to find all distinct values for a column.
max_elements defaults to a value of 256 and is used to identify the maximum
number of distinct values analyse will use per column. Distinct values are used
to identify if ENUM is a
suitable data type for a given column.
To test procedure analyse we created a table called user_master here is the
description of the table
mysql> desc user_master ;
+-----------------+---------------------+---
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------------+---
| user_id | int(11) | NO | PRI | | auto_increment |
| first_name | varchar(20) | YES | | NULL | |
| last_name | varchar(20) | YES | | NULL | |
+-----------------+----------------------+--
3 rows in set (0.00 sec)
We loaded the table with 16 users but ensured that first_name of the users is
"Pcquest" or "Preview" or "Test". We also ensured every user has unique
last_name. Following command was used to analyze the table and identify the
optimal data type.
mysql> select user_id,first_name,last_name from
user_master procedure analyse(10,50)\G
********* Row 1*****************
Field_name: mydb.user_master.user_id
Min_value: 1
Max_value: 16
Min_length: 1
Max_length: 2
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 8.5000
Std: 4.6098
Optimal_fieldtype: TINYINT(2) UNSIGNED NOT NULL
**************Row 2********** ***
Field_name: mydb.user_master.first_name
Min_value: Pcquest
Max_value: Test
Min_length: 4
Max_length: 7
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 4.1875
Std: NULL
Optimal_fieldtype: ENUM('Pcquest','Preview','Test') NOT NULL
*************Row 3************ Field_name: mydb.user_master.last_name
Min_value: Usr1
Max_value: User101
Min_length: 4
Max_length: 7
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 4.1875
Std: NULL
Optimal_fieldtype: VARCHAR(7) NOT NULL
As you can notice the optimal data type suggested for user_id is TINYINT, as
for only 16 users INT type is an overkill. TINYINT will consume 1 byte where as
INT will consume 4 bytes, a saving of 75% of space and memory.
Procedure Analyse at times reports optimal data type as ENUM where you least
expect. In the example above first_name is suggested to be ENUM as procedure
analyse will only see 3 distinct values in 16 records. As procedure analyse
works on existing records in the table and distinct records in the current data
set, ENUM will the most optimal data type for first_name.
By trying different values to procedure analyse, more suggestions about optimal
data type can be obtained. You may have to play with settings till procedure
analyse does not return ENUM data type when it is not appropriate.
On a fairly large record set, results of procedure analyse are very useful.
Vivek Khurana