Database performance is one of the most critical characteristics of almost any application, and for the developer or DBA it usually depends on how fast data can be retrieved. Hence, many a performance optimization and tuning book discusses ways to make queries faster. Even RDBMS makers understand the need for fast data retrieval and provide different tools (indexes, configuration options, and so on) to facilitate it. However, database performance is not always a matter of data-retrieval speed. In some situations, it depends instead on the speed of data inserts.
Suppose you need to design a control measuring system or a survey site that stores results in a database. The task seems pretty straightforward, but a closer look at the specifications reveals that things are not as simple as you might assume:
- You need to process and insert a very high volume of incoming transactions into a database.
- You have to provide 24x7 availability.
- The number of fields (parameters) that have to be stored in the tables can vary widely. For example, the number of questions differs from survey to survey (i.e., the number of controls or elements on the Web pages), or the number of detectors or measuring devices can differ for different processes in the control measuring system.
- The application will be used not only for data inserts but also for data retrieval, though not very intensively. (For extensive analyses, you can create an OLAP system, but that is another story.)
Comprehensive analyses of all the possible solutions for this scenario are beyond the scope of this article. However, you can generally attack the requirements from different directions. For example, you can use all or some of the following options:
- High-speed Internet and fast networks
- Powerful servers with fast CPUs and lots of RAM
- Fast disks and high-performance RAID(s)
- Load balancing for the Web servers
- Failover clustering for database servers
- Table partitioning
- Lots of storage space (SAN), and so on
All the above solutions, except for partitioning, are quite expensive. And even if you are ready to make a big investment, you still need to solve a few puzzles.
For instance, any database needs to be maintained on a regular basis: indexes, backups, purges of old data, defragmentation, and so on. If you feed your data from the application (Web) server(s) directly, then during database maintenance you can lose some of your ready-for-insert data or crash your application (or database) server. So you need to provide some kind of buffer that can temporarily hold your data during heavy resource consumption (which means slow inserts) on the database server. Of course, you can gain plenty of maintenance time by adding more database servers. But in that case, you create another problem: consolidating data from different servers into one database.
A good choice for a buffer would be Berkeley DB, which is very efficient for repetitive static queries and stores data in key/value pairs. (Recall that the survey site and control measuring system examples submit data as control (element) name/value or detector position/value pairs.) But no buffer can grow endlessly, and if you can't transfer data to a regular database quickly enough, your servers will still end up crashing.
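The buffering idea above can be sketched in a few lines. This is a minimal illustration, not the article's actual implementation: Python's standard-library `dbm` module stands in for Berkeley DB, and the key scheme (a sequence number plus the control or detector name) is an assumption.

```python
import dbm

# Buffer incoming name/value pairs in a local key/value store
# (stdlib dbm here as a stand-in for Berkeley DB) while the
# database server is busy with maintenance.
with dbm.open("insert_buffer", "c") as buf:
    # Each submission is stored under a unique key, e.g. a sequence
    # number combined with the control/detector name (hypothetical scheme).
    buf["1:question_42"] = "7"      # survey answer: control name -> value
    buf["2:detector_03"] = "1250"   # measuring device: position -> value

# Later, a transfer job drains the buffer into the regular database.
with dbm.open("insert_buffer", "r") as buf:
    pending = {k.decode(): buf[k].decode() for k in buf.keys()}

print(pending)
```

A real deployment would drain the buffer in batches and delete keys only after the database commit succeeds, so that a mid-transfer crash loses no data.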
Thus, the speed of inserts becomes one of the most critical aspects of the example applications.
How to Store Data?
The design of your database can significantly affect the performance of inserts, so you should be careful when choosing your database (storage) structure. For example, you might want to store data as XML. That choice is very attractive, but it will slow down the inserts and occupy a lot of storage space. You may also want to build the database in the best traditions of database design: each table represents a real-world object, and each column in the table corresponds to one of the object's properties. But in this case (a survey site or control measuring system), the number of properties (columns) is dynamic. It can vary from tens to thousands, making the classical design unsuitable.
You most likely will choose the next solution: storing your data in name/value pairs, which map perfectly to the HTML controls' (elements') name/value pairs and to Berkeley DB key/value data. Since most survey (control device) data values can be interpreted as integers, you probably will find it convenient to split the data by type. You can create two tables, say tbl_integerData and tbl_textData. Both tables will have exactly the same structure with only one exception: the data type of the "value" column will be integer in the first table and text (varchar) in the second.
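The two-table design can be sketched as follows. An in-memory SQLite database stands in for the real RDBMS; the table names tbl_integerData and tbl_textData come from the article, while the column names, sizes, and routing logic are illustrative assumptions.

```python
import sqlite3

# In-memory SQLite as a stand-in for the production database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tbl_integerData (
                    record_id INTEGER,
                    name      VARCHAR(50),
                    value     INTEGER)""")
conn.execute("""CREATE TABLE tbl_textData (
                    record_id INTEGER,
                    name      VARCHAR(50),
                    value     VARCHAR(255))""")

def insert_pair(record_id, name, value):
    """Route each name/value pair to the table matching its value type."""
    try:
        conn.execute("INSERT INTO tbl_integerData VALUES (?, ?, ?)",
                     (record_id, name, int(value)))
    except ValueError:
        conn.execute("INSERT INTO tbl_textData VALUES (?, ?, ?)",
                     (record_id, name, value))

# A survey submission mixing an integer answer and a free-text answer:
insert_pair(1, "question_01", "5")
insert_pair(1, "question_02", "No comment")

int_rows = conn.execute("SELECT name, value FROM tbl_integerData").fetchall()
text_rows = conn.execute("SELECT name, value FROM tbl_textData").fetchall()
print(int_rows, text_rows)
```

Splitting by type keeps the integer table narrow and densely packed, which matters when the overwhelming majority of survey or detector values are numeric.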