Robin Schumacher is VP of Products at Calpont, which makes the analytic database InfiniDB. Robin has over 20 years of database experience with MySQL, Oracle, SQL Server, Teradata, and other database engines. Before joining Calpont, Robin was Director of Product Management for MySQL/Sun.
In Part 1 of this article, I took you through some of the reasons why a column-oriented database based on MySQL can be compelling. We also examined the architecture of InfiniDB, which is an open source analytic, reporting, and read-intensive database, and how its modular design helps you scale both up and out. Now let’s kick the tires of the database so you can see these things in action.
First, let's load some data into a basic database to work with. The load and query tests below were run on a small 32-bit Fedora Core 6 machine with 2GB of RAM and four CPUs (2.40GHz with 512KB cache).
With InfiniDB, you have a high-speed loader to work with called cpimport that does a decent job of taking flat files and pushing their data into database tables. There are two primary steps to using the cpimport utility:

1. Use the colxml utility to create a job file
2. Use the cpimport utility to perform the data import

Note that bulk loads are an append operation to a table, so existing data can still be read and remains unaffected during the process. Upon completion of the load operation, a high water mark in each column file is moved in an atomic operation, which allows any subsequent queries to read the newly loaded data.
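At a high level, the whole sequence boils down to two commands run from the Calpont bin directory. The following is a minimal sketch only; the schema name 'mydb' and job number 300 are hypothetical placeholders, and the real run against our TPC-H schema is shown step by step below.

./colxml mydb -j300     # step 1: generate the Job_300.xml job file describing the load for schema 'mydb'
./cpimport -j 300       # step 2: bulk load the flat files listed in that job file, appending to the tables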
You can create a load job to load one table at a time or load multiple tables from different flat files. Let’s go ahead and load up an entire TPC-H database:
mysql> show tables;
+-----------------+
| Tables_in_tpch2 |
+-----------------+
| customer        |
| lineitem        |
| nation          |
| orders          |
| part            |
| partsupp        |
| region          |
| supplier        |
+-----------------+
8 rows in set (0.00 sec)
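For reference, each of these is an ordinary InfiniDB table created through the MySQL interface. Here is a minimal sketch of what one of the smaller definitions might look like, using the standard TPC-H region columns and the infinidb storage engine name; note that no indexes or partitioning clauses are declared, which matches the plain column-oriented setup used for the query tests later in this article:

CREATE TABLE region (
  r_regionkey INTEGER,
  r_name      CHAR(25),
  r_comment   VARCHAR(152)
) ENGINE=infinidb;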
First, you put all your load files in InfiniDB’s import directory:
[rschumacher@srvsn import]$ pwd
/usr/local/Calpont/data/bulk/data/import
[rschumacher@srvsn import]$ ls -l
total 1076072
-rw-rw-r-- 1 root calpont  24346144 Sep 25 11:39 customer.tbl
-rw-rw-r-- 1 root calpont 759863287 Sep 25 11:39 lineitem.tbl
-rw-rw-r-- 1 root calpont      2224 Sep 25 11:39 nation.tbl
-rw-rw-r-- 1 root calpont 171952161 Sep 25 11:39 orders.tbl
-rw-rw-r-- 1 root calpont 118984616 Sep 25 11:39 partsupp.tbl
-rw-rw-r-- 1 root calpont  24207240 Sep 25 11:39 part.tbl
-rw-rw-r-- 1 root calpont       389 Sep 25 11:39 region.tbl
-rw-rw-r-- 1 root calpont   1409184 Sep 25 11:39 supplier.tbl
Next, you create a job file that's used by the import operation. The job file is created with the colxml utility. The reason for this step is that loads are often repeated (especially in dev/test environments), and a job file can be reused many times to reload data into the same objects. So, let's create a job file to import all our data into a database called 'tpch2', where the flat files use the default pipe ('|') character as the field delimiter, and give our job a number of 500:
[rschumacher@srvsn bin]$ ./colxml tpch2 -j500
Running colxml with the following parameters:
2009-10-07 15:14:20 (9481) INFO : Schema: tpch2
 Tables:
 Load Files:
 -b 0
 -c 1048576
 -d |
 -e 10
 -f CSV
 -j 500
 -m 50
 -n
 -p /usr/local/Calpont/data/bulk/job/
 -r 5
 -s
 -u
 -w 10485760
 -x tbl
File completed for tables:
        tpch2.customer
        tpch2.lineitem
        tpch2.nation
        tpch2.orders
        tpch2.part
        tpch2.partsupp
        tpch2.region
        tpch2.supplier
Normal exit.
Now, let's perform the actual import for all eight tables using the cpimport utility:
[rschumacher@srvsnp bin]$ ./cpimport -j 500
Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_500.xml
2009-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_500.xml
2009-10-07 15:14:59 (9952) INFO : PreProcessing check starts
2009-10-07 15:15:04 (9952) INFO : PreProcessing check completed
2009-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
2009-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
2009-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
2009-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
2009-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
2009-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
2009-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
2009-10-07 15:16:45 (9952) INFO : Bulk load completed, total run time : 106 seconds
The cpimport utility pushed about 8.6 million rows into our database in a little over 100 seconds (8,661,245 rows in 106 seconds, or roughly 82,000 rows/second), which isn't bad for a small 32-bit machine. Now, let's run some query tests.
To run queries against InfiniDB, you'll be using a mysql client interface named calpontmysql.
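Starting it looks just like starting the standard mysql client. A quick sketch, assuming the utility is launched from the Calpont bin directory and that the tpch2 schema already exists:

[rschumacher@srvsn bin]$ ./calpontmysql tpch2
mysql>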
As nearly everyone likes to test query speed by issuing a count(*) against a table, let me start off by saying that InfiniDB does not keep a count of a table's rows immediately on hand, so it will always do an actual count of the data:
mysql> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
1 row in set (0.81 sec)
With that caveat out of the way, let’s now run a few queries and see what we see. Again, the database has no indexes or manual partitioning of data, so it’s just pure column-oriented tables we’re working with. Let’s start with a 3-way table join between customer, orders, and lineitem:
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and
    -> l_orderkey = o_orderkey and
    -> l_shipdate = '1992-01-02'
    -> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |           35.00 |
| Customer#000146399 |           21.00 |
| Customer#000104149 |           20.00 |
| Customer#000081157 |           12.00 |
| Customer#000071485 |           27.00 |
.
.
.
17 rows in set (1.12 sec)
InfiniDB brings back the result set in a little more than a second, which isn't bad. Now, let's switch to another TPC-H database where I have 20 times the data in the lineitem table (120 million rows instead of just 6 million) and run the same query:
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and
    -> l_orderkey = o_orderkey and
    -> l_shipdate = '1992-01-02'
    -> group by c_name;
+--------------------+-----------------+
| c_name             | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 |          700.00 |
| Customer#000009263 |          380.00 |
| Customer#000146399 |          420.00 |
| Customer#000072862 |          100.00 |
.
.
.
17 rows in set (10.48 sec)
If we added CPUs to the box or increased the data cache size, we could reduce the response time further. Or, if we decided to use the Enterprise Edition and go MPP, we could expect to cut the response time roughly in half by adding one node, and in half again by adding two more nodes after that.
Now, let’s try a full 8-way join with the smaller TPC-H database:
mysql> select c_name, p_name, ps_availqty, s_name,
    -> o_custkey, r_name, n_name, sum(l_quantity)
    -> from orders, lineitem, customer, part,
    -> partsupp, supplier, nation, region
    -> where o_orderkey = l_orderkey and
    -> c_custkey = o_custkey and
    -> p_partkey = l_partkey and
    -> ps_partkey = p_partkey and
    -> s_suppkey = ps_suppkey and
    -> r_regionkey = n_regionkey and
    -> s_nationkey = n_nationkey and
    -> o_orderkey = 4161537
    -> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| c_name             | p_name                              | ps_availqty | s_name             | o_custkey | r_name  | n_name        | sum(l_quantity) |
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell |        8343 | Supplier#000000597 |    122437 | AMERICA | UNITED STATES |           47.00 |
.
.
28 rows in set (5.68 sec)
And with the bigger TPC-H lineitem table (120 million rows):
mysql> select c_name, p_name, ps_availqty, s_name,
    -> o_custkey, r_name, n_name, sum(l_quantity)
    -> from orders, lineitem, customer, part,
    -> partsupp, supplier, nation, region
    -> where o_orderkey = l_orderkey and
    -> c_custkey = o_custkey and
    -> p_partkey = l_partkey and
    -> ps_partkey = p_partkey and
    -> s_suppkey = ps_suppkey and
    -> r_regionkey = n_regionkey and
    -> s_nationkey = n_nationkey and
    -> o_orderkey = 4161537
    -> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| c_name             | p_name                              | ps_availqty | s_name             | o_custkey | r_name  | n_name        | sum(l_quantity) |
+--------------------+-------------------------------------+-------------+--------------------+-----------+---------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell |        8343 | Supplier#000000597 |    122437 | AMERICA | UNITED STATES |           47.00 |
.
.
28 rows in set (22.98 sec)
With this join, we see pretty good performance in the smaller 8-way join and roughly a 4x increase in response time for the larger table in the same 8-way operation, which again is not bad for our little 32-bit system.
One interesting item of note is that InfiniDB supports hash joins, which, at the time of this writing, MySQL does not. Moreover, InfiniDB's hash joins can be distributed and processed in parallel when a multi-node, MPP configuration is used.
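If you're curious how a particular query was actually executed, the client can show you an execution trace. The sketch below assumes the calsettrace() and calgettrace() helper functions are available in your InfiniDB build (their exact names and output format may vary by release), so treat it as an illustration rather than a guaranteed recipe:

mysql> select calsettrace(1);   -- assumed helper: enable statement tracing for this session
mysql> select c_name, sum(l_quantity)
    -> from customer, orders, lineitem
    -> where c_custkey = o_custkey and l_orderkey = o_orderkey
    -> group by c_name;
mysql> select calgettrace();    -- assumed helper: display the trace, including the join steps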
Now let's test InfiniDB's Extent Map to see whether queries that reference a value outside a column's minimum/maximum range can be satisfied from the Extent Map alone, without reading any actual data blocks:
mysql> select min(l_orderkey), max(l_orderkey) from lineitem;
+-----------------+-----------------+
| min(l_orderkey) | max(l_orderkey) |
+-----------------+-----------------+
|               1 |         6000000 |
+-----------------+-----------------+
1 row in set (1.25 sec)

mysql> select count(*) from lineitem where l_orderkey < 1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from lineitem where l_orderkey > 6000000;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)
The Extent Map seems to be doing its job here, which is good. There are few things more frustrating in my book than waiting a long time for a result set that ends up with zero rows sent back.
I could run through other example queries, but the above should give you a decent first impression of how InfiniDB will handle data loads and queries.
So what are some rules of thumb as to when you should and shouldn’t use InfiniDB? In general, InfiniDB may be a help to you when you have:
Places where you don’t want to use InfiniDB include:
Yes, working with big databases is fun and not fun at the same time. But hopefully, if you're running into roadblocks using MySQL for big data and analytics, InfiniDB will take away some of the 'unfun' parts of managing large databases.
You can now download the open source version of InfiniDB and give it a whirl yourself. InfiniDB runs on 32-bit and 64-bit Intel and AMD Linux boxes and can be downloaded from the InfiniDB.org website at www.infinidb.org. Also on the site, you'll find forums, roadmaps, and more concerning the future of InfiniDB.
If you have comments or questions, please don’t hesitate to make your voice heard in the online forums or by shooting us an email. Thanks for your support of InfiniDB and Calpont!