Robin Schumacher is VP of Products at Calpont, which makes the analytic database InfiniDB. Robin has over 20 years of database experience with MySQL, Oracle, SQL Server, Teradata, and other database engines. Before joining Calpont, Robin was Director of Product Management for MySQL/Sun.
In Part 1 of this article, I took you through some of the reasons why a column-oriented database based on MySQL can be compelling. We also examined the architecture of InfiniDB, which is an open source analytic, reporting, and read-intensive database, and how its modular design helps you scale both up and out. Now let’s kick the tires of the database so you can see these things in action.
First, let's load some data into a basic database to work with. The load and query tests below were run on a small 32-bit Fedora Core 6 machine with 2GB of RAM and four CPUs (2.40GHz with 512KB cache).
With InfiniDB, you have a high-speed loader to work with called cpimport that does a decent job of taking flat files and pushing their data into database tables. There are two primary steps to using the cpimport utility:

1. Use the colxml utility to create a job file
2. Use the cpimport utility to perform the data import

Note that bulk loads are an append operation to a table, so existing data can still be read and remains unaffected during the process. Upon completion of the load operation, a high water mark in each column file is moved in an atomic operation, which allows any subsequent queries to read the newly loaded data.
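At a high level, the whole sequence boils down to two commands run from the Calpont bin directory. The following is a minimal sketch only; the schema name 'mydb' and job number 300 are hypothetical placeholders, and the real run against our TPC-H schema is shown step by step below.

./colxml mydb -j300     # step 1: generate the Job_300.xml job file describing the load for schema 'mydb'
./cpimport -j 300       # step 2: bulk load the flat files listed in that job file, appending to the tables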
You can create a load job to load one table at a time or load multiple tables from different flat files. Let’s go ahead and load up an entire TPC-H database:
mysql> show tables;
+-----------------+
| Tables_in_tpch2 |
+-----------------+
| customer        |
| lineitem        |
| nation          |
| orders          |
| part            |
| partsupp        |
| region          |
| supplier        |
+-----------------+
8 rows in set (0.00 sec)
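For reference, each of these is an ordinary InfiniDB table created through the MySQL interface. Here is a minimal sketch of what one of the smaller definitions might look like, using the standard TPC-H region columns and the infinidb storage engine name; note that no indexes or partitioning clauses are declared, which matches the plain column-oriented setup used for the query tests later in this article:

CREATE TABLE region (
  r_regionkey INTEGER,
  r_name      CHAR(25),
  r_comment   VARCHAR(152)
) ENGINE=infinidb;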
First, you put all your load files in InfiniDB’s import directory:
[rschumacher@srvsn import]$ pwd
/usr/local/Calpont/data/bulk/data/import
[rschumacher@srvsn import]$ ls -l
total 1076072
-rw-rw-r-- 1 root calpont  24346144 Sep 25 11:39 customer.tbl
-rw-rw-r-- 1 root calpont 759863287 Sep 25 11:39 lineitem.tbl
-rw-rw-r-- 1 root calpont      2224 Sep 25 11:39 nation.tbl
-rw-rw-r-- 1 root calpont 171952161 Sep 25 11:39 orders.tbl
-rw-rw-r-- 1 root calpont 118984616 Sep 25 11:39 partsupp.tbl
-rw-rw-r-- 1 root calpont  24207240 Sep 25 11:39 part.tbl
-rw-rw-r-- 1 root calpont       389 Sep 25 11:39 region.tbl
-rw-rw-r-- 1 root calpont   1409184 Sep 25 11:39 supplier.tbl
Next, you create a job file that's used by the import operation. The job file is created with the colxml utility. The reason for this step is that loads are often repeated (especially in dev/test environments), and a job file can be reused many times to reload data into the same objects. So, let's create a job file to import all our data into a database called 'tpch2', where the flat files use the default pipe ('|') character as the field delimiter, and give our job a number of 500:
[rschumacher@srvsn bin]$ ./colxml tpch2 -j500
Running colxml with the following parameters:
2009-10-07 15:14:20 (9481) INFO : Schema: tpch2
 Tables:
 Load Files:
 -b 0
 -c 1048576
 -d |
 -e 10
 -f CSV
 -j 500
 -m 50
 -n
 -p /usr/local/Calpont/data/bulk/job/
 -r 5
 -s
 -u
 -w 10485760
 -x tbl
File completed for tables:
        tpch2.customer
        tpch2.lineitem
        tpch2.nation
        tpch2.orders
        tpch2.part
        tpch2.partsupp
        tpch2.region
        tpch2.supplier
Normal exit.
Now, let's perform the actual import for all eight tables using the cpimport utility:
[rschumacher@srvsnp bin]$ ./cpimport -j 500
Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_500.xml
2009-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_500.xml
2009-10-07 15:14:59 (9952) INFO : PreProcessing check starts
2009-10-07 15:15:04 (9952) INFO : PreProcessing check completed
2009-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
2009-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
2009-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
2009-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
2009-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
2009-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
2009-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
2009-10-07 15:16:45 (9952) INFO : Bulk load completed, total run time : 106 seconds
The cpimport utility pushed about 8.6 million rows into our database in a little over 100 seconds (8,661,245 rows in 106 seconds, or roughly 82,000 rows/second), which isn't bad for a small 32-bit machine. Now, let's run some query tests.
To run queries against InfiniDB, you'll be using a mysql client interface named calpontmysql.
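Starting it looks just like starting the standard mysql client. A quick sketch, assuming the utility is launched from the Calpont bin directory and that the tpch2 schema already exists:

[rschumacher@srvsn bin]$ ./calpontmysql tpch2
mysql>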
As nearly everyone likes to test query speed by issuing a count(*) against a table, let me start off by saying that InfiniDB does not keep a count of a table's rows immediately on hand, so it will always do an actual count of the data:
mysql> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
1 row in set (0.81 sec)
With that caveat out of the way, let’s now run a few queries and see what we see. Again, the database has no indexes or manual partitioning of data, so it’s just pure column-oriented tables we’re working with. Let’s start with a 3-way table join between customer, orders, and lineitem:
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and
    -> l_orderkey = o_orderkey and
    -> l_shipdate = '1992-01-02'
    -> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |           35.00 |
| Customer#000146399 |           21.00 |
| Customer#000104149 |           20.00 |
| Customer#000081157 |           12.00 |
| Customer#000071485 |           27.00 |
.
.
.
17 rows in set (1.12 sec)
InfiniDB brings back the result set in a little more than a second, which isn't bad. Now, let's switch to another TPC-H database where I have 20 times the data in the lineitem table (120 million rows instead of just 6 million) and run the same query:
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and
    -> l_orderkey = o_orderkey and
    -> l_shipdate = '1992-01-02'
    -> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |          700.00 |
| Customer#000009263 |          380.00 |
| Customer#000146399 |          420.00 |
| Customer#000072862 |          100.00 |
.
.
.
17 rows in set (10.48 sec)
If we added CPUs to the box or increased the data cache size, we could reduce the response time further. Or, if we decided to use the Enterprise Edition and go MPP, we could expect to cut the response time roughly in half by adding one node, and in half again by adding two more nodes after that.
Now, let’s try a full 8-way join with the smaller TPC-H database:
mysql> select c_name, p_name, ps_availqty, s_name,
    -> o_custkey, r_name, n_name, sum(l_quantity)
    -> from orders, lineitem, customer, part,
    -> partsupp, supplier, nation, region
    -> where o_orderkey = l_orderkey and
    -> c_custkey = o_custkey and
    -> p_partkey = l_partkey and
    -> ps_partkey = p_partkey and
    -> s_suppkey = ps_suppkey and
    -> r_regionkey = n_regionkey and
    -> s_nationkey = n_nationkey and
    -> o_orderkey = 4161537
    -> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| c_name             | p_name                              | ps_availqty | s_name             | o_custkey | r_name  | n_name        | sum(l_quantity) |
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell |        8343 | Supplier#000000597 |    122437 | AMERICA | UNITED STATES |           47.00 |
.
.
28 rows in set (5.68 sec)
And with the bigger TPC-H lineitem table (120 million rows):
mysql> select c_name, p_name, ps_availqty, s_name,
    -> o_custkey, r_name, n_name, sum(l_quantity)
    -> from orders, lineitem, customer, part,
    -> partsupp, supplier, nation, region
    -> where o_orderkey = l_orderkey and
    -> c_custkey = o_custkey and
    -> p_partkey = l_partkey and
    -> ps_partkey = p_partkey and
    -> s_suppkey = ps_suppkey and
    -> r_regionkey = n_regionkey and
    -> s_nationkey = n_nationkey and
    -> o_orderkey = 4161537
    -> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| c_name             | p_name                              | ps_availqty | s_name             | o_custkey | r_name  | n_name        | sum(l_quantity) |
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell |        8343 | Supplier#000000597 |    122437 | AMERICA | UNITED STATES |           47.00 |
.
.
28 rows in set (22.98 sec)
With this join, we see pretty good performance in the smaller 8-way join and roughly a 4x increase in response time for the larger table in the same 8-way operation, which again is not bad for our little 32-bit system.
One interesting item of note is that InfiniDB supports hash joins, which, at the time of this writing, MySQL does not. Moreover, InfiniDB's hash joins can be distributed and processed in parallel when a multi-node, MPP configuration is used.
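If you're curious how a particular query was actually executed, the client can show you an execution trace. The sketch below assumes the calsettrace() and calgettrace() helper functions are available in your InfiniDB build (their exact names and output format may vary by release), so treat it as an illustration rather than a guaranteed recipe:

mysql> select calsettrace(1);   -- assumed helper: enable statement tracing for this session
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and l_orderkey = o_orderkey
    -> group by c_name;
mysql> select calgettrace();    -- assumed helper: display the trace, including the join steps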
Now let's test InfiniDB's Extent Map to see whether queries that reference a value outside a column's minimum/maximum range can be satisfied from the Extent Map alone, without reading any actual data blocks:
mysql> select min(l_orderkey), max(l_orderkey) from lineitem;
+-----------------+-----------------+
| min(l_orderkey) | max(l_orderkey) |
+-----------------+-----------------+
|               1 |         6000000 |
+-----------------+-----------------+
1 row in set (1.25 sec)

mysql> select count(*) from lineitem where l_orderkey < 1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from lineitem where l_orderkey > 6000000;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)
The Extent Map seems to be doing its job here, which is good. There are few things more frustrating in my book than waiting a long time for a result set that ends up with zero rows sent back.
I could run through other example queries, but the above should give you a decent first impression of how InfiniDB will handle data loads and queries.
So what are some rules of thumb as to when you should and shouldn’t use InfiniDB? In general, InfiniDB may be a help to you when you have:
Places where you don’t want to use InfiniDB include:
Yes, working with big databases is fun and not fun at the same time. But hopefully, if you're running into roadblocks using MySQL for big data and analytics, InfiniDB will take away some of the 'unfun' parts of managing large databases.
You can now download the open source version of InfiniDB and give it a whirl yourself. InfiniDB runs on 32-bit and 64-bit Intel and AMD Linux boxes and can be downloaded from the InfiniDB.org website at www.infinidb.org. Also on the site, you'll find forums, roadmaps, and more concerning the future of InfiniDB.
If you have comments or questions, please don’t hesitate to make your voice heard in the online forums or by shooting us an email. Thanks for your support of InfiniDB and Calpont!