Robin Schumacher is MySQL's Director of Product Management and has over 13 years of database experience with DB2, MySQL, Oracle, SQL Server, and other database engines. Before joining MySQL, Robin was Vice President of Product Management at Embarcadero Technologies.
More than at any other time, database professionals are being aggressively challenged by mushrooming volumes of data in corporate business systems. While some industry analysts project data growth at an average rate of 42% a year, even that figure is conservative for some installations, where growth rates run several hundred percent a year with no end in sight.
What's to blame for sky-rocketing data volumes? First, corporations have realized the golden-nugget potential of all the data floating around in their transactional systems and are utilizing data warehousing more than ever before. The strategic use of data is top of mind for nearly every CIO and business executive, so in response, transactional data is stockpiled into data warehouses, where business intelligence users constantly aim their analytic queries to produce the forecasts used to make key business decisions.
Second, government regulations such as Sarbanes-Oxley, HIPAA, and others have created an atmosphere of confusion over exactly what types of data must be retained to be in compliance with the powers that be. As a result, rather than risk being in violation of federal mandates, corporations, educational institutions, and not-for-profit organizations are keeping everything that relates to their operations. Naturally, this creates a heavy burden for the DBA and others who are charged with managing a never-ending tide of data.
Finally, one aspect of government compliance that is increasing in scope and importance for data professionals is security control and data auditing. Corporate officials in charge of vast data storehouses must oftentimes answer for how that data is accessed, and this translates into being able to track who-changed-what-when and who-looked-at-what-when. For businesses with thousands of active users and countless transactions running through their systems every day, this equates to massive volumes of audit data that must be kept close at hand.
To help meet these and other challenges of the corporate data explosion, MySQL 5.0 introduces a brand new storage engine called Archive. This progressive data management tool gives MySQL professionals a new weapon for handling the special issues that come with managing large volumes of data.
MySQL 5.0's new Archive storage engine has been designed specifically to help data management professionals cope with the data retention needs of today's corporations and other organizations. Let's quickly go over some of the unique characteristics of the Archive storage engine that make it so attractive for the data challenges previously mentioned.
Because corporations are struggling with managing huge volumes of seldom-referenced historical data, the use of Information Lifecycle Management or ILM is beginning to grow. ILM basically involves putting together business rules that control the migration of operational (often-referenced) data to offline storage areas that are still capable of being reached should the situation require it. MySQL 5.0's new Archive storage engine is the perfect solution for retaining seldom-referenced data in data warehouses or operational data stores because it employs a compression algorithm that results in impressive storage savings for data retained in an Archive table.
For years, MySQL has offered compressed (or packed) MyISAM tables to reduce the size of datasets; however, the Archive storage engine is more beneficial than a compressed MyISAM table for the following reasons:

- A compressed MyISAM table is read-only once packed, whereas an Archive table continues to accept new rows through INSERT.
- A MyISAM table must be packed with the myisampack utility from the operating system command line, whereas an Archive table can be created via simple DDL.

MySQL DBAs who are charged with data archiving or security compliance retention tasks will like the Archive engine because they no longer have to back up older, historical data to tape to save on storage costs. In addition, the availability of such historical data is increased because no lengthy/selective restore of data from tape to disk has to occur.
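To illustrate the DDL point above, here is a minimal sketch of creating an Archive table directly (the table and column names are hypothetical, invented for this article):

mysql> CREATE TABLE trade_hist (
    ->   trade_id   int(11) NOT NULL,
    ->   client_id  int(11) NOT NULL,
    ->   trade_date datetime NOT NULL,
    ->   amount     decimal(12,2) NOT NULL
    -> ) ENGINE=ARCHIVE;

No myisampack step and no operating system access required - one statement and the compressed table is ready for inserts.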
In addition to its compression and ease-of-use capabilities, the Archive storage engine is ideal for data auditing uses. Archive tables support only SELECT and INSERT operations, which should cause every data security officer or auditor to smile. Why? Because auditing restrictions require that, once created, audit data must not be manipulated in any way. Archive tables enforce just these types of restrictions, since data cannot be updated or selectively removed - in other words, no tampering allowed.
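As a quick hypothetical sketch (the audit table and its columns are invented for this article), an audit trail could be captured like so:

mysql> CREATE TABLE user_audit (
    ->   user_name   varchar(30) NOT NULL,
    ->   action_type varchar(10) NOT NULL,
    ->   object_name varchar(64) NOT NULL,
    ->   action_time datetime NOT NULL
    -> ) ENGINE=ARCHIVE;

mysql> INSERT INTO user_audit
    -> VALUES ('jsmith','UPDATE','client_transaction',now());

Any later attempt to UPDATE or DELETE rows in user_audit will fail with an error, as demonstrated later in this article.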
Other database vendors offer compression capabilities for tables or databases, but none are targeted directly at data archiving or auditing situations the way the MySQL Archive storage engine is. For example, Oracle introduced compressed tables in release 9.2, but they lack the auditing-friendly trait of supporting only SELECT and INSERT. Microsoft SQL Server DBAs sometimes use operating system compression to shrink down their database files, but many have been scared off from this practice by occasional reports of data corruption. MySQL Archive tables are easier to use, require no operating system intervention, prevent unauthorized data manipulation, and keep data safe and secure.
So whether it's a data warehousing, data archiving, or data auditing situation, the new MySQL 5.0 Archive storage engine is an excellent option for those wishing to secure and store data, with the added bonus of having a great ROI on the amount of storage saved. Let's now dive down and see the Archive storage engine in action.
At the time of this writing, the Archive storage engine is included in the current MySQL 5.0 Max beta build. To determine whether you have the Archive engine enabled, you can issue a SHOW ENGINES command from the mysql command line utility:
mysql> show engines;
+------------+---------+----------------------------------------------------------------+
| Engine     | Support | Comment                                                        |
+------------+---------+----------------------------------------------------------------+
| MyISAM     | YES     | Default engine as of MySQL 3.23 with great performance         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables      |
| HEAP       | YES     | Alias for MEMORY                                               |
| MERGE      | YES     | Collection of identical MyISAM tables                          |
| MRG_MYISAM | YES     | Alias for MERGE                                                |
| ISAM       | NO      | Obsolete storage engine, now replaced by MyISAM                |
| MRG_ISAM   | NO      | Obsolete storage engine, now replaced by MERGE                 |
| InnoDB     | DEFAULT | Supports transactions, row-level locking, and foreign keys     |
| INNOBASE   | YES     | Alias for INNODB                                               |
| BDB        | YES     | Supports transactions and page-level locking                   |
| BERKELEYDB | YES     | Alias for BDB                                                  |
| NDBCLUSTER | NO      | Clustered, fault-tolerant, memory-based tables                 |
| NDB        | NO      | Alias for NDBCLUSTER                                           |
| EXAMPLE    | NO      | Example storage engine                                         |
| ARCHIVE    | YES     | Archive storage engine                                         |
| CSV        | NO      | CSV storage engine                                             |
| FEDERATED  | YES     | Federated MySQL storage engine                                 |
| BLACKHOLE  | YES     | /dev/null storage engine (anything you write to it disappears) |
+------------+---------+----------------------------------------------------------------+
To test how well the new Archive engine compresses data, let's create several tables of different engine types from an existing table that contains a little over 100,000 rows, and then check the storage consumption. One quick note: all tests shown below were performed on a Dell Windows XP laptop with a Pentium M 2.00 GHz processor and 1GB of RAM, running MySQL 5.0.10 beta.
mysql> create table test_myisam engine=myisam as select * from client_transaction_hist;
Query OK, 112050 rows affected (1.06 sec)
Records: 112050  Duplicates: 0  Warnings: 0

mysql> create table test_innodb engine=innodb as select * from client_transaction_hist;
Query OK, 112050 rows affected (3.72 sec)
Records: 112050  Duplicates: 0  Warnings: 0

mysql> create table test_archive engine=archive as select * from client_transaction_hist;
Query OK, 112050 rows affected (1.92 sec)
Records: 112050  Duplicates: 0  Warnings: 0

mysql> SELECT table_name table_name,
    ->        engine,
    ->        ROUND(data_length/1024/1024,2) total_size_mb,
    ->        table_rows
    -> FROM   information_schema.tables
    -> WHERE  table_schema = 'gim' and
    ->        table_name like 'test%'
    -> ORDER BY 3;
+--------------+---------+---------------+------------+
| table_name   | engine  | total_size_mb | table_rows |
+--------------+---------+---------------+------------+
| test_archive | ARCHIVE |          1.64 |     112050 |
| test_myisam  | MyISAM  |          6.46 |     112050 |
| test_innodb  | InnoDB  |          9.52 |     112050 |
+--------------+---------+---------------+------------+
The Archive table is almost 75% smaller than the MyISAM table and almost 83% smaller than the transactional InnoDB table. Let's now pack the MyISAM table and see what the differences are:
C:\Program Files\MySQL\MySQL Server 5.0\data\gim>..\..\bin\myisampack test_myisam.MYI
Compressing test_myisam.MYD: (112050 records)
- Calculating statistics
- Compressing file
67.76%

+--------------+---------+---------------+------------+
| table_name   | engine  | total_size_mb | table_rows |
+--------------+---------+---------------+------------+
| test_archive | ARCHIVE |          1.64 |     112050 |
| test_myisam  | MyISAM  |          2.08 |     112050 |
| test_innodb  | InnoDB  |          9.52 |     112050 |
+--------------+---------+---------------+------------+
Even with compression applied, the Archive table is still about 21% smaller than the packed MyISAM table.
What about INSERT speed? The Archive engine can be somewhat slower on inserts than standard MyISAM tables, as illustrated in the CREATE TABLE ... AS SELECT ... examples above (which involve rows under 16K in size). However, in other cases, Archive can actually be faster, as shown in this example of inserting 1,000,000 rows through a stored procedure:
mysql> create table insert_test (c1 int, c2 varchar(20)) engine=myisam;
Query OK, 0 rows affected (0.00 sec)

mysql> drop procedure test_insert;
Query OK, 0 rows affected (0.00 sec)

mysql> delimiter //
mysql> create procedure test_insert()
    -> begin
    ->   declare v_ctr int;
    ->   set v_ctr = 0;
    ->   while v_ctr < 1000000
    ->   do
    ->     insert into insert_test values (1,'testing insert');
    ->     set v_ctr = v_ctr + 1;
    ->   end while;
    -> end
    -> //
Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;
mysql> call test_insert();
Query OK, 1 row affected (33.06 sec)

mysql> truncate table insert_test;
Query OK, 0 rows affected (0.01 sec)

mysql> alter table insert_test engine=archive;
Query OK, 0 rows affected
Records: 0  Duplicates: 0  Warnings: 0

mysql> call test_insert();
Query OK, 1 row affected (21.42 sec)
As you can see, the singleton INSERT operations are faster against the Archive engine in this case than against MyISAM. In some situations, the INSERT DELAYED option may speed up inserts into Archive tables that are being concurrently queried. The Archive engine employs a compression buffer to house insert activity when the DELAYED option is used, which should also help performance.
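For instance, reusing the hypothetical user_audit table sketched earlier, a delayed insert looks like this:

mysql> INSERT DELAYED INTO user_audit
    -> VALUES ('jsmith','SELECT','client_transaction_hist',now());

Control returns to the client immediately, and the queued row is written when the table is not in use by other threads.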
Another approach is to load a standard MyISAM table and then issue an ALTER TABLE statement to compress the table in bulk:
mysql> alter table myisam_insert engine=archive;
Query OK, 3000000 rows affected, 0 warning (8.84 sec)
Records: 3000000  Duplicates: 0  Warnings: 0
As shown above, the mass move of a 3 million row MyISAM table to an Archive table only took about 9 seconds, which is quite good.
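One related tip: the MySQL documentation notes that OPTIMIZE TABLE can be used to repack an Archive table for better compression after heavy insert activity. A quick sketch against the table just converted:

mysql> OPTIMIZE TABLE myisam_insert;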
What about read operations? How does an Archive table compare to other MySQL storage engines? First, one of the current restrictions of Archive tables is that they do not support any indexes, so you will be looking at a full table scan for any SELECT task. MySQL is examining index support for Archive tables in upcoming releases, but for now, one thing you can depend on to help table scan times is the fact that Archive tables are supported by the MySQL Query Cache, which can dramatically reduce response times for Archive table queries that are issued repetitively.
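Note that the query cache must be enabled and sized for this benefit to apply; a quick way to check your server's settings:

mysql> SHOW VARIABLES LIKE 'query_cache%';

A non-zero query_cache_size value means the cache is available to serve repeated queries.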
Let's check table scan response times for a simple query against three tables (each containing a little over 100,000 rows) that use different storage engines:
mysql> select count(*) from test_myisam where client_id = 50;
+----------+
| count(*) |
+----------+
|       24 |
+----------+
1 row in set (0.25 sec)

mysql> select count(*) from test_innodb where client_id = 50;
+----------+
| count(*) |
+----------+
|       24 |
+----------+
1 row in set (0.59 sec)

mysql> select count(*) from test_archive where client_id = 50;
+----------+
| count(*) |
+----------+
|       24 |
+----------+
1 row in set (0.41 sec)
The Archive engine appears to fare pretty well, so now let's check a simple table scan over 3 million rows:
mysql> select count(*) from myisam_insert where c1 = 1;
+----------+
| count(*) |
+----------+
|  3000000 |
+----------+
1 row in set (1.05 sec)

mysql> select count(*) from archive_insert where c1 = 1;
+----------+
| count(*) |
+----------+
|  3000000 |
+----------+
1 row in set (2.20 sec)

mysql> flush status;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from archive_insert where c1 = 1;
+----------+
| count(*) |
+----------+
|  3000000 |
+----------+
1 row in set (0.00 sec)

mysql> show status like 'qcache_hits';
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Qcache_hits             | 1        |
+-------------------------+----------+
In the above test, the Archive table is slower than a standard MyISAM table, but it still does well by crossing the finish line in about two seconds. And as the test also shows, the Archive table scan is picked up by the Query Cache, so identically issued statements will experience near-instantaneous response times (as long as the query/result-set combination is found in the cache).
When reading data, an Archive table employs a consistent or 'snapshot' read in much the same way as InnoDB does, so high levels of concurrency can be sustained because readers never block writers. Also like InnoDB, Archive tables use row-level locking.
Will an Archive table scan ever beat a standard MyISAM table scan? Yes, according to MySQL engineering. When a table approaches about 1.5GB in size, a table scan on an Archive table can take less time than one on a MyISAM table if the server has a fast CPU, as the CPU - rather than the I/O subsystem - will likely be the bottleneck. Remember that much less physical I/O is done on an Archive table, because it is so much smaller than its other-engine counterparts.
Keep in mind that the smaller footprint of Archive tables not only benefits performance during SQL operations, but also helps in areas such as backups. The backup window (for cold/file system backups) of large historical data volumes can be substantially reduced because there is much less data to physically move.
The smaller size of Archive tables also helps if you are doing a platform move of a MySQL database from one server to another (like Unix to Linux). Archive tables support the same ease-of-use portability as other MySQL objects: you can simply move the underlying files from one machine to another, and they are immediately ready for use.
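As a rough sketch of such a file-level move (using the client_transaction_hist table from the upcoming example), you can quiesce write activity and copy the table's files - an Archive table consists of an .frm file plus .ARZ/.ARM data and metadata files - at the operating system level:

mysql> FLUSH TABLES WITH READ LOCK;
-- copy client_transaction_hist.frm, .ARZ, and .ARM from the data
-- directory to the target server at the operating system level
mysql> UNLOCK TABLES;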
Data archiving, a subset of Information Lifecycle Management (ILM), typically involves horizontally partitioning data so that a set of current information resides in one or more objects, while the historical/archived data is placed in separate objects. Oftentimes, historical data can be transferred to lower-cost storage devices (like EMC Centera) that still allow easy disk access while providing significant financial savings. Of course, the idea behind data archiving is to improve response times by reducing the sheer amount of data that must be read for current access. However, for analytic purposes, there are times when both the current and the archived data need to be accessed together.
For data archiving environments, Archive tables can easily be accessed in standalone fashion or be set up in union views, which are the primary vehicle used to reference both current and archived data in one object. By combining MySQL's Archive storage engine with the new 5.0 Federated table support, a DBA can smartly transfer historical data to cheaper storage devices and then reference both current and archived data when needed.
For example, let's say you have a set of recent information regarding customer transactions in a transactional InnoDB table called CLIENT_TRANSACTION, and a set of historical transactions stored in an Archive table called CLIENT_TRANSACTION_HIST. To lessen the load on the active server, you can house the historical table on a separate MySQL Server, convert it to the Archive engine there, and then reference it from the active server through a Federated table and a union view:
On the historical MySQL Server:
mysql> alter table client_transaction_hist engine=archive;
Query OK, 112050 rows affected, 0 warning (1.98 sec)
Records: 112050  Duplicates: 0  Warnings: 0
On the MySQL Server that contains active data:
mysql> select count(*) from client_transaction;
+----------+
| count(*) |
+----------+
|    18675 |
+----------+
1 row in set (0.00 sec)

mysql> CREATE TABLE client_transaction_hist (
    ->   client_transaction_id int(11) NOT NULL,
    ->   client_id int(11) NOT NULL,
    ->   investment_id int(11) NOT NULL,
    ->   action varchar(10) NOT NULL,
    ->   price decimal(12,2) NOT NULL,
    ->   number_of_units int(11) NOT NULL,
    ->   transaction_status varchar(10) NOT NULL,
    ->   transaction_sub_timestamp datetime NOT NULL,
    ->   transaction_comp_timestamp datetime NOT NULL,
    ->   description varchar(200) default NULL,
    ->   broker_id bigint(10) default NULL,
    ->   broker_commission decimal(10,2) default NULL
    -> ) ENGINE=FEDERATED
    -> COMMENT='mysql://mysql:password@serv1:3306/gim/client_transaction_hist';
Query OK, 0 rows affected (0.14 sec)

mysql> CREATE VIEW client_transaction_all as
    -> select * from client_transaction
    -> union all
    -> select * from client_transaction_hist;
Query OK, 0 rows affected (0.08 sec)

mysql> select count(*) from client_transaction_all;
+----------+
| count(*) |
+----------+
|   130725 |
+----------+
1 row in set (1.20 sec)
Note: rather than use a SELECT * in the view definition above, the actual columns should be listed out; SELECT * was used here only to save space in this article.
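With the view in place, queries can transparently span the active and historical servers; for example:

mysql> select client_id, action, price
    -> from client_transaction_all
    -> where client_id = 50;

The historical portion of the result is fetched from the remote Archive table through the Federated definition.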
The targets for Archive tables in data warehousing or data archiving environments are fact and summary tables as they usually contain the bulk of the information. Dimension tables are rarely large enough to warrant a switch to Archive.
We talked earlier about the Archive engine being a good choice for storing data auditing results, as only INSERT and SELECT are supported. Let's check to be sure:
mysql> delete from test_archive where client_id = 50;
ERROR 1031 (HY000): Table storage engine for 'test_archive' doesn't have this option

mysql> update test_archive set price=120 where client_id = 50;
ERROR 1031 (HY000): Table storage engine for 'test_archive' doesn't have this option
We also talked about how Archive doesn't support indexes at this time:
mysql> create index my_index on test_archive (client_transaction_id);
ERROR 1069 (42000): Too many keys specified; max 0 keys allowed
One special thing to note about Archive tables and indexes: if you want to alter a table that uses another storage engine to the Archive engine, you must first drop any indexes that have been defined on the table.
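For example, with a hypothetical MyISAM table named my_history that carries a secondary index named client_idx, the conversion looks like this:

mysql> DROP INDEX client_idx ON my_history;
mysql> ALTER TABLE my_history ENGINE=ARCHIVE;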
Archive tables are supported by the new sql_mode data integrity enhancements in MySQL 5.0, so you can ensure that only proper data finds its way into your Archive tables:
mysql> create table test (c1 int) engine=archive;
Query OK, 0 rows affected (0.11 sec)

mysql> insert into test values ('hi');
Query OK, 1 row affected, 1 warning (0.05 sec)

mysql> select * from test;
+------+
| c1   |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

mysql> set sql_mode='strict_all_tables';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test values ('hi');
ERROR 1264 (22003): Out of range value adjusted for column 'c1' at row 1
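To enforce strict input handling for all connections rather than just the current session, the mode can also be set server-wide (this requires the SUPER privilege) or placed in the server's option file:

mysql> set global sql_mode='strict_all_tables';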
DBAs facing the problem of corporate data explosion have an excellent new tool to help them in the MySQL 5.0 Archive storage engine. Whether it's a data warehousing, data archiving, or data auditing situation, MySQL Archive tables can be just what the doctor ordered when it comes to maintaining large amounts of standard or sensitive information, while keeping storage costs at a bare-bones minimum.
Updated: 2005-09-16