Robin Schumacher is VP of Products at Calpont, which makes the analytic database, InfiniDB. Robin as over 20 years of database experience in MySQL, Oracle, SQL Server, Teradata, and other database engines. Before joining Calpont, Robin was Director of Product Management for MySQL/Sun.
Let’s be honest: working with big databases is a lot of fun. There’s something cool about dealing with tables that have hundreds of millions or billions of rows in them, loading huge amounts of data, building star and snowflake schemas for data warehouses/marts, optimizing query performance, and all that jazz. Yes, working with big databases is a lot of fun.
On the other hand, let’s be honest: working with big databases is not a lot of fun. There’s a lot of pain in dealing with tables that have hundreds of millions or billions of rows in them, waiting for huge amounts of data to be loaded only to have the load job toss its cookies and fail when it’s 99% done, building special schemas that you wonder whether make any difference at all, and trying to figure out why just a simple two-way join query has been hanging for over an hour. Yes, working with big databases is not a lot of fun.
But here’s the thing: big databases aren’t going away; in fact, there’s more of them than ever before and they just keep multiplying and getting bigger. I could quote a lot of industry statistics and TDWI reports on this fact, but that would just bore you because you likely know all this stuff already. You also know that there are a number of database vendors out there who can take on big data, but they also come with big price tags.
Surveys done by MySQL show that data warehousing is the fifth most common use case for MySQL users, so it’s understandable that many try and get around the sticker shock of proprietary data warehouse DB’s with MySQL. One of the challenges is scaling MySQL for very large data volume sizes or for complex/analytic queries with fast response times. Moreover, one of the drawbacks that most generic RDBMS’s have is a row-based architecture, whereas most data warehouses and read-intensive query situations are selective column-based use cases.
Lately, the idea of column-oriented databases has been catching new wind in its sails. I say ‘new wind’ because Sybase IQ did the column database thing when columns weren’t that cool. But now there are a number of column-oriented databases that are really taking off, however most still are proprietary and cost quite a bit. But what happens when you marry a column-oriented database that’s modular in design, can scale both up and out, with MySQL and open source? You get Calpont’s InfiniDB.
To be frank, I’m as skeptical as they come when claims are made that say how this or that different ‘type’ of database is better than a general purpose RDBMS. When object-oriented databases came out, I scoffed. When OLAP/cube databases were hogging all the tech press, I smirked. I believed that neither could really challenge the good ol’ relational model, which isn’t perfect by any means, but hey: it still has enough punch to relegate those two other challengers to the very boutique marketplace. That being the case, why should we entertain a relational database that is vertical vs. horizontally architected?
Because, for querying and processing medium to huge amounts of data, they actually work. And, many times, they work better at finishing the SQL query horse race than a general purpose RDBMS. That's why.
Before column databases came out, experienced data modelers creating data warehouses, marts, and reporting databases knew that by designing vertically partitioned tables that were subsets of an overall table, they could sometimes buy better physical query performance because the rows were shorter and thus I/O could somewhat be reduced. Using views, they’d then tie the vertically partitioned tables together to form one logically-based table that could be referenced. And sometimes this practice worked. But oftentimes, it didn’t make the grade because (1) even the vertically partitioned tables were still row-based and I/O wasn’t always reduced by much; (2) it was really hard to predict what columns would be consistently queried together; (3) the practice didn’t reduce the need for indexes (and in fact made the matter worse sometimes because you needed indexes to smartly do joins between the vertically partitioned tables in the view definitions); (4) like with the difficulty in predicting what columns would be queried together, it was difficult to do index design because user’s query patterns could change at the drop of a hat and suddenly queries would do full scans instead of index searches. And on and on it went.
Column-oriented databases designed especially for analytics overcome these limitations by storing, managing, and querying data based on columns rather than rows. Because only the necessary columns in a query are accessed rather than entire rows, the column can act as the index and I/O activities as well as overall query response times can oftentimes be reduced. The end result is the ability to interrogate and return query results against either moderate or large amounts of data in a pretty darn efficient manner. On average, a row-based system can do up to 5-10x the physical I/O that a column-based database does to retrieve the same information. This is why column bigot Philip Howard (Research Director for Bloor Research) says: "Columns provide better performance at a lower cost with a smaller footprint: it is difficult to understand why any company seriously interested in query performance would not consider a column-based solution."
Nice quote, but should you really check out a column database? Can they really help you? Read on…
Column databases can be a help for sure in the right situation, but there’s more that’s needed to crunch through large data volumes than just having a vertical data design. In addition to being column-oriented, the open source version of InfiniDB provides the following core feature set, which adds a number of needed items to the mix:
InfiniDB also uses MySQL as its front end, so if you’re familiar with MySQL, you’ll be right at home with InfiniDB. MySQL is used mainly in InfiniDB for security, SQL parsing, and initial query plan output.
In addition to the above open source product feature set, a commercial/Enterprise version of InfiniDB will be offered that has a pay-for scale out option where multiple machines can be used in a massive parallel processing (MPP) configuration. However, for many data warehousing, data mart, and analytic database tasks, the open source version of InfiniDB provides plenty of power to plow through big databases.
The architecture of InfiniDB is modular and consists of three basic components:
All the above modular components can exist on one server, or in the Enterprise version, be broken up and run on different machines. If you want to compare the architecture and design of InfiniDB to other proprietary analytic databases in the market that do both scale up (use a box’s available CPU’s) and scale out (MPP), InfiniDB is comparable to Vertica, Sybase IQ, and Paraccel. There are other analytic databases available that scale and do MPP, but all others are row-based (e.g. Greenplum, Aster Data, Teradata, Netezza, etc.) vs. column-oriented.
One interesting feature that allows InfiniDB to handle lots of data is a structure called the Extent Map. It’s the Extent Map that removes the need for you to do any indexing, manual partitioning of data, create summary tables, etc.
An “extent” in InfiniDB is a logical block of space that exists within a physical file, with an extent being anywhere from 8-64MB in size depending on a column’s datatype. Each extent supports the same number of rows, with smaller data types using less space on disk. The Extent Map catalogs all extents and their corresponding blocks (identified with logical block identifiers or LBID’s). The Extent Map also maintains minimum and maximum values for a column’s data within an extent.
The Extent Map provides the ability for InfiniDB to only retrieve the blocks needed to satisfy a query, but it also provides another benefit – that of logical range partitioning. This is accomplished via the minimum and maximum values for each extent that are contained within the Extent Map. Extent elimination is first accomplished in InfiniDB via the column-oriented architecture (only needed columns are scanned), but the process is accelerated because of this logical horizontal partitioning that is implemented in the Extent Map.
This automatic extent elimination behavior is well suited for time-based data where data is loaded frequently and often referenced by time. Near real-time loads with queries against the leading edge of the data can easily show good extent elimination for all of the date/time columns as well as an ascending key value. Any column with clustered values is a good candidate for extent elimination.
To eliminate an extent when a column scan involves a filter, that filter value is compared to the minimum and maximum values stored in each extent for the column:
In the above figure, if a WHERE
column filter of col1 between 220 and 250
is specified, InfiniDB will eliminate extents 1, 2 and 4 from being scanned, saving ¾ of the I/O and many comparison operations. If the extent cannot possibly contain any rows that would match the filter, that extent is ignored entirely. Additionally, since each extent eliminated for a column also eliminates the same extents for all the other columns in a table, impossible filter chains can be quickly identified without any I/O being required. For example, take the following two columns and their Extent Map information:
If a column WHERE
filter of col1 between 220 and 250 and col2 < 10000
is specified, InfiniDB will eliminate extents 1, 2 and 4 from the first column filter, then, looking at just the matching extents for col2
(i.e. just extent 3), it will determine that no extents match and return zero rows without doing any I/O at all.
OK, enough design and architecture talk for now. In Part 2 of this article, we’ll take InfiniDB for a test drive and see all these things in action. You can now download the open source of InfiniDB and test it yourself. InfiniDB will run on a 32 or 64-bit Intel and AMD Linux box and can be downloaded at the InfiniDB.org website at: www.infinidb.org. Also on the site, you’ll find forums, roadmaps, and more concerning the future of InfiniDB.
If you have comments or questions, please don’t hesitate to make your voice heard in the online forums or by shooting us an email. Thanks for your support of InfiniDB and Calpont!