When working on web sites with small MySQL databases and with a small number of visitors, both fast and slow queries would be quick enough to provide adequate performance. Thus, there is no need to further optimize the queries. However, when working with large amounts of data (a few hundreds of thousands rows) and with large number of visitors, every millisecond of speed improvement can have a significant impact on overall performance of the web site. This ability to handle the growing amounts of data with acceptable performance is called scalability. We will examine the most common scalability techniques in MySQL databases below. Note that the query execution times mentioned in the text are done on a 80,000-row table taking up 195MB of disk space on an average home PC.
1. MySQL Cluster
MySQL cluster is one of the technologies that provides high availability and scalability for the MySQL database management system. Its main features are automatic database partitioning (auto-sharding), separate write operations for each partition and real-time responsiveness using low latency in-memory tables and indexes. The instructions on how to install and configure MySQL cluster can be found here.
2. Primary Keys and Indexes
The primary key uniquely identifies each record in a database table. The means that each row has a different value for the primary key and that null values are not allowed. Since MySQL works fast with integers, the primary key column's type is usually an integer type with auto increment attribute. Auto increment generates the primary key value automatically, where the primary key of the next row is greater than the previous one.
Primary keys are crucial for MySQL database performance. For example, fetching a row by primary key (e.g. SELECT * FROM news WHERE news_id=40000) took only 0.0953s, while the same query without a primary key needed 3.6120s to return the result. Other types of indexes, such as unique and key, also provide performance improvements.
3. MySQL Query Caching
MySQL databases have built-in query caching. The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client, thus improving query performance. First 25 rows from a test table were fetched in 0.0821s, but after the query was cached, the execution time was 0.0014s. However, the query caching has some limitations. Although faster than raw queries, it is still slower than the other cache implementations, such as Memcached or APC. Also, MySQL query caching is not distributed, meaning that it cannot be effectively used on systems with multiple machines.
4. MySQL Aggregate Functions
Aggregate functions in MySQL (COUNT, AVG, SUM, etc.) are quite slow and should be avoided when building scalable applications. The reason they are slow is that the where clause usually contains a non-indexed field that needs to be checked throughout the entire table (for example, we need a number of comments for an article, but need to count only approved comments). Instead of using these functions, their results should be calculated using a programming language (usually PHP) and stored in the database. If we use the above example with one-to-many relationship between article and comments, the article table would have an additional field called comment_num, where the number of comments will be stored. That field would be updated whenever a comment is approved or deleted by incrementing or decrementing its current value. So, when the comment number needs to be shown, only one query would be executed on the article table.
5. ORDER BY RAND()
Using ORDER BY RAND() in a MySQL query is a common, but very inefficient and slow way of fetching a number of random results from the database. On the test table, it takes 6.2802s to fetch a single row. The reason why it is slow is that the whole table is randomly ordered first, and the desired number of results is fetched after that. There are a few better solutions to this problem. One of them is to find the maximum id by executing the following query: SELECT MAX(news_id) FROM news. After that, generate a random id in PHP by using rand() function. Finally, fetch a random row by its primary key.
Database scalability is a complex issue and, although the tips above are the most common ones, they might be sufficient in certain use-cases that would require further optimization. If you have experience with building scalable applications, feel free to share your thoughts and solutions in the comments.