Robin Schumacher is MySQL's Director of Product Management and has over 13 years of database experience in DB2, MySQL, Oracle, SQL Server and other database engines. Before joining MySQL, Robin was Vice President of Product Management at Embarcadero Technologies.
It's amazing in an age where relational databases reign supreme when it comes to managing data that so much information still exists outside RDBMS engines in the form of flat files and other such constructs. In most enterprises, data is passed back and forth between disparate systems in a fashion and speed that would rival the busiest expressways in the world, with much of this data existing in common, delimited files. Target systems intercept these source files and then typically proceed to load them via ETL (extract, transform, load) processes into databases that then utilize the information for business intelligence, transactional functions, or other standard operations. ETL tasks and data movement jobs can consume quite a bit of time and resources, especially if large volumes of data are present that require loading into a database. This being the case, many DBAs welcome alternative means of accessing and managing data that exists in file format.
In MySQL 5.1, a new engine has been provided that assists with the management and use of delimited data found in common flat files. Let's take a quick look at this new engine, along with the features and benefits it provides users of MySQL.
The CSV (which stands for “Comma Separated Value”) engine was originally debuted in MySQL 5.0, although it wasn't enabled for most installations and platforms with the release of version 5.0 that occurred in October of 2005. But in 5.1 of MySQL, which is currently in Beta, the CSV engine is fully enabled and ready for action.
Those using Oracle will recognize that the CSV engine somewhat mirrors what are called External Tables in Oracle, although MySQL offers a simpler implementation than what is required in Oracle to make External Tables function. Let's just see how easy it is to use the CSV engine from within MySQL, and examine the practical uses of this new engine when it comes to managing data.
To begin with, you will need to download the latest Beta of MySQL 5.1, which can be found at http://dev.mysql.com/downloads/mysql/5.1.html. Once the Beta has been installed, a quick query can be used to validate that the CSV is available and open for business:
mysql> select engine, support from information_schema.engines order by 1; +------------+----------+ | engine | support | +------------+----------+ | ARCHIVE | ENABLED | | BLACKHOLE | ENABLED | | CSV | ENABLED | | EXAMPLE | ENABLED | | FEDERATED | ENABLED | | InnoDB | ENABLED | | MEMORY | ENABLED | | MRG_MYISAM | ENABLED | | MyISAM | ENABLED | | ndbcluster | DISABLED | +------------+----------+ 10 rows in set (0.02 sec)
The CSV engine can be used just like any other storage engine when it comes to creating tables and populating them with data; the only thing to remember is to ensure that you specify engine=csv when you create a CSV table:
mysql> create table testcsv (c1 int, c2 datetime, c3 varchar(10)) -> engine=csv; Query OK, 0 rows affected (0.00 sec) mysql> insert into testcsv values -> (1,now(),'test 1'), -> (2,now() + 1, 'test 2'), -> (3,now() + 2, 'test 3'), -> (4,now() + 3, 'test 4'); Query OK, 4 rows affected (0.09 sec) Records: 4 Duplicates: 0 Warnings: 0 mysql> select * from testcsv; +------+---------------------+--------+ | c1 | c2 | c3 | +------+---------------------+--------+ | 1 | 2006-05-01 09:22:04 | test 1 | | 2 | 2006-05-01 09:22:05 | test 2 | | 3 | 2006-05-01 09:22:06 | test 3 | | 4 | 2006-05-01 09:22:07 | test 4 | +------+---------------------+--------+ 4 rows in set (0.00 sec)
If we go to the operating system and check out what happens when you create a CSV table, you'll see a number of files:
[mysql@linux1 gim]$ ls -l test* -rw-rw---- 1 mysql mysql 35 May 1 09:22 testcsv.CSM -rw-rw---- 1 mysql mysql 140 May 1 09:22 testcsv.CSV -rw-rw---- 1 mysql mysql 8608 May 1 09:21 testcsv.frm
Like any other MySQL table, a .frm format file is created for the table structure. A small file - .CSM - that is used for metadata management is also present, and the actual data in the table is found in the <table name>.CSV file:
[mysql@linux1 gim]$ more testcsv.CSV "1","2006-05-01 09:22:04","test 1" "2","2006-05-01 09:22:05","test 2" "3","2006-05-01 09:22:06","test 3" "4","2006-05-01 09:22:07","test 4"
Notice that all data is contained within quotes regardless of its datatype and that all data is delimited, naturally, by commas. This data format allows you to easily manipulate the contents of your CSV table outside of MySQL, such as when the database happens to be down. You can also alter the contents while the object is online if you'd like:
[mysql@linux1 gim]$ vi testcsv.CSV "1","2006-05-01 09:22:04","test 1" "2","2006-05-01 09:22:05","test 2" "3","2006-05-01 09:22:06","test 3" "4","2006-05-01 09:22:07","test 4" "5","2006-05-01 09:22:08","test 5" "6","2006-05-01 09:22:09","test 6" ~ ~ ~ mysql> flush tables; Query OK, 0 rows affected (0.00 sec) mysql> select * from testcsv; +------+---------------------+--------+ | c1 | c2 | c3 | +------+---------------------+--------+ | 1 | 2006-05-01 09:22:04 | test 1 | | 2 | 2006-05-01 09:22:05 | test 2 | | 3 | 2006-05-01 09:22:06 | test 3 | | 4 | 2006-05-01 09:22:07 | test 4 | | 5 | 2006-05-01 09:22:08 | test 5 | | 6 | 2006-05-01 09:22:09 | test 6 | +------+---------------------+--------+ 6 rows in set (0.01 sec)
Realize that if you choose to edit the contents of a CSV table via a file editor outside of MySQL while the object is online, you can possibly interfere with data changes that are directed at the CSV table inside of MySQL. Most file editors will warn you if a file has been written to since it was loaded for edit, so just be aware that the potential for contention does exist.
Naturally, you can easily pour data from a CSV table into standard spreadsheet program by just opening the file:
This capability makes CSV tables exceptional well-suited for things like statistical analysis and the like, and is one of the reasons why in MySQL 5.1, the CSV engine is used to manage data from the slow and general query logs, which are logs that contain information about slow running queries or all queries encountered by the MySQL Server.
One thing that the CSV engine allows for is the instantaneous data transfer from flat files to information accessible via SQL in MySQL. For example, imagine having the need to load a file containing 5 million records into a MySQL table that uses the MyISAM storage engine:
mysql> desc client_detail; +-----------------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------------+---------------+------+-----+---------+-------+ | client_transaction_id | decimal(22,0) | YES | | NULL | | | transaction_timestamp | datetime | YES | | NULL | | | transaction_comment | varchar(30) | YES | | NULL | | +-----------------------+---------------+------+-----+---------+-------+ 3 rows in set (0.02 sec) mysql> load data infile '/usr/local/mysql519/data/gim/flatdata.dat' -> into table client_detail -> fields terminated by ','; Query OK, 5023575 rows affected, 0 warnings (27.38 sec) Records: 5023575 Deleted: 0 Skipped: 0 Warnings: 0
While MyISAM accepts the data pretty quickly, you can make all the data in the flat file instantly available to MySQL in the following manner. First, create a CSV table that mirrors the format of the incoming flat file:
mysql> create table client_detail_csv (client_transaction_id decimal(22,0), -> transaction_timestamp datetime, transaction_comment varchar(30)) -> engine=csv; Query OK, 0 rows affected (0.01 sec)
Then, go to the operating system and simply rename the operating system flat file to the file MySQL created for the new CSV table:
[mysql@linux1 ~]$ cd /usr/local/mysql519/data/gim [mysql@linux1 gim]$ ls -l total 785848 -rw-rw---- 1 mysql mysql 35 May 1 06:54 client_detail_csv.CSM -rw-rw---- 1 mysql mysql 0 May 1 06:54 client_detail_csv.CSV -rw-rw---- 1 mysql mysql 8718 May 1 06:54 client_detail_csv.frm -rw-rw---- 1 mysql mysql 8718 May 1 05:56 client_detail.frm -rw-rw---- 1 mysql mysql 221037300 May 1 06:00 client_detail.MYD -rw-rw---- 1 mysql mysql 1024 May 1 06:00 client_detail.MYI -rw-rw---- 1 mysql mysql 291367350 May 1 05:55 flatdata.dat [mysql@linux1 gim]$ mv flatdata.dat client_detail_csv.CSV
The rename operation occurs instantly as nothing data-wise changes at the operating system level. At that point, MySQL has all the data available to it:
mysql> flush tables; Query OK, 0 rows affected (0.00 sec) mysql> select count(*) from client_detail_csv; +----------+ | count(*) | +----------+ | 5023575 | +----------+ 1 row in set (16.17 sec)
Thus, you've effectively loaded 5 million records instantaneously into MySQL. The great thing is the same effect would be experienced if the file contained 10 million, 20 million, or 100 million records.
There are a few limitations to the current 5.1 implementation of the CSV engine that need to be remembered. Most notably, a CSV table does not currently support indexing. This means that any read operations against the table cause a full table scan. A second thing to keep in mind is that, because the actual file is in simple comma-delimited format, a CSV table will use more space than other engines such as MyISAM.
Final notes include the fact that the CSV engine does automatic recovery for corruption, and it takes advantage of all MySQL Enterprise Services such as replication, backup, federated table support, and more.
The CSV storage engine is a great new option in MySQL that allows the fast and easy inclusion of flat file data into the MySQL Server. Why not download a copy of the MySQL 5.1 Beta and give the CSV engine a try? And if you find any bugs or issues with the new engine, please make sure and report them via the bug system on our web site (you could win an iPod for helping us...!)