Anders Karlsson is a sales engineer with MySQL AB, and lives in Stockholm, Sweden.
This is the first of an on-going series of articles that explain some the new features in MySQL 4.1, which as of this writing is in the gamma phase of its development cycle, on the way to a production-ready release in the near future.
By Anders Karlsson
MySQL 4.1 introduces spatial functionality in MySQL. This article describes some of the uses of spatial extensions in a relational database, how it can be implemented in a relational database, what features are present in MySQL and some simple examples.
Often the spatial functions are called geographic information system (GIS) functions, because GIS applications are the most obvious use-case for the spatial functionality. The spatial functions can be used both to provide a means to organize GIS data together with more traditional types of data and to represent non-GIS data with a spatial attribute.
A GIS (geographic information system) stores and looks up objects which have one or more spatial attributes, such as size and position, and is used to process such objects. A simple example would be a system that stores addresses in a town using geographic coordinates. If this rather static data was then combined with other information, such as the location of a taxi-cab, then this data could be used to find the closest cab to a certain location. Many GIS applications are very specialized, in areas such as the mapping of real-world objects, map creation and meteorology.
Among the many attributes of a general GIS is that the objects may have multiple dimensions, and that complex shapes are supported. Typically two-dimensional objects with nearly unlimited complexity need to be supported.
Another attribute of a general GIS is that it supports combining objects and looking for different types of overlaps, such as looking for points that are contained within a given geometrical object. Lastly, a GIS provides a means of organizing objects in layers, i.e. REGIONs inside DISTRICTs.
A good example of an application where GIS is of importance and which also has real-world implications is in the area of meteorology. When it comes to producing weather maps we see every day on the TV screen, meteorology is a complex science. Very advanced and high-powered IT systems that are concerned with the actual number-crunching are combined with large databases. The issue is not that data is not available, the issue is what data to use and how, and how the different types of data are matched. The data from the number-crunching systems is supplied in the form of large files with multi-dimensional arrays of prognosticated data. The issue now is, how can a specific data object be matched with another, and provide a relevant prognosis of the weather in the coming five days.
In other words, we have data covering at least three dimensions (two-dimensional geography plus a height of a measurement for example, maybe combined with a time dimension).
What is obvious from the above discussion is that a database that only works
on simple data types, such as INTEGER
or DECIMAL
,
will just not be good enough for this type of application. The geographical
shapes might be any type of polygon that then needs to be matched with some
other object, i.e. what is the total area of two objects (which is not
necessarily the sum of the areas of the two objects, as they might
overlap).
The traditional solution to all this has been to use specialized database
systems that use proprietary spatial indexing, and proprietary interfaces.
These systems are complex though, and not only that, any GIS stores some
non-GIS data (for example, a meteorological database might store a temperature
measurement at a given geographical location, height and time as a
DECIMAL
). And for non-GIS data, SQL has become the standard. As
applications for GIS have expanded beyond what goes on in labs and high-powered
scientific institutions and are found in more traditional applications, such a
taxicab scheduling systems, the need for a more standardized solution has
exploded. Another factor here is the availability of inexpensive and easy to
use GPS systems which may be combined with mobile communication systems, so
that getting a real-time position from, a taxi-cab, a ship on the sea or a
police car is no longer complex or expensive.
The desire to combine SQL with GIS was initially driven by the developers of GIS applications desiring to expand into other areas, more than by SQL database vendors getting into the GIS market.
The driving organization behind opening up GIS to a broader market and making GIS technologies available everywhere the Open Geospatial Consortium (OGC). OGC has been around for about 10 years and is a non-profit organization that works on many areas of GIS and provides specifications of interoperability with many other standards. One such standard that OGC has provided is a specification for interoperability SQL databases.
This specification, in short, defines extensions to a SQL based relational database to allow for GIS objects and operations. There are four important areas here:
In addition to the above, there is also a need for GIS metadata, and in some cases for using different coordinate systems. MySQL currently supports a planar coordinate system. The other major coordinate system in use is the geocentric one, i.e. a coordinate system on the Earth's surface, which is not yet supported by MySQL.
There are uses for spatial extensions outside of the pure GIS world. Spatial data is not limited to maps or to the Earth's geography. The definition of spatial data and operations is wide, and even though MySQL follows the OpenGIS specification to a large extent, this specification does not limit the use of spatial data to GIS applications.
Any type of data that has more than one dimension can be treated as a
spatial entity. The specification does not limit the axis on the coordinate
system to any particular unit, such as inches or centimeters, and each axis
does not necessarily have to represent the same unit: they are just numbers. Of
course, certain functionality assumes that the X and Y axis have similar
properties, such as the Area()
function.
Fact is, a lot of real-world data has two dimensions if we look at it this
way, in particular if we assume that one dimension is a date. For example we
could store the maximum and minimum price for a stock as a POINT
,
where the X axis is the date and the Y axis is the price. We can then use the
spatial functions to check for overlaps and intersections with other
stocks.
This specification is what is used as the ground for almost all implementations of GIS-functions within an SQL-based relational database. This standard defines the data types, operations, input and output format, functions and much more. This is the standard that is followed by almost all SQL databases with spatial extensions, including MySQL.
MySQL GIS Datatypes (abstract types in gray)
The data types start from the most generic at the top of the hierarchy,
GEOMETRY
, to a number of specific types, such as
POINT
and LINESTRING
. Some of the data types are
"abstract" in the sense that you cannot create objects of this type, such as
the GEOMETRY
type. This does not mean that you cannot have a
column of the type GEOMETRY
, just that you cannot have any value
in that column of the type GEOMETRY
, but you can have values of
any other spatial type in that column. Among the abstract object types, only
GEOMETRY
can be used as a column type.
Among the functions that can be performed on spatial objects are functions
that evaluate the proximity of objects, such as Overlap()
,
functions that combine two spatial objects to create another spatial object,
such as Envelope()
and functions that perform conversion to and
from text and binary formats, such as AsText()
and
GeomFromText()
. As any spatial object can be treated as a
GEOMETRY
object, any function that operate on a
GEOMETRY
can operate on any other type of spatial object, such as
a POINT
, which doesn't necessarily mean that this operation always
is meaningful.
The specification includes the definition of two formats for external representation of spatial data, WKB (Well Known Binary) and WKT (Well Known Text). This allows data to be exported and imported in binary and text formats. To complement these formats, there are functions that convert between the WKB and WKT formats and all of the spatial data types.
Spatial data may be indexed, just like other data in MySQL. To make this effective, a special type of indexing is used for spatial data called R-tree ("R" stands for Region) indexing. The involves organizing the minimum bounding rectangle (MBR) of the spatial objects in a tree structure that is then used by the different spatial functions. There are a few variations of R-tree indexing, MySQL uses R-trees with quadratic splitting, which is one of the standard methods of building an R-tree index.
An R-tree index is similar to a B+-tree in many ways, and organizes the indexed nodes in a hierarchy where the nodes in the index represent the MBR of the objects in the node. The leaf nodes in the index contain references to the row that contain the data, just like a B+-tree index.
In the current implementation of spatial extensions in MySQL, parts of the OpenGIS Simple Features have been left out. Among these are:
Except the above, all of the features of the OpenGIS Simple Features specification are included in MySQL 4.1 Spatial Extensions, support for the missing features will be added in future MySQL versions.
The first example is a very simple database with just two tables: one that
contains addresses and the location of the address as a POINT
, and
a second table that contains information on taxi cabs, including the location
of the cab, again as a POINT
.
Note that we also create a spatial index on both tables.
CREATE TABLE address ( address CHAR(80) NOT NULL, address_loc POINT NOT NULL, PRIMARY KEY(address), SPATIAL KEY(address_loc) ); CREATE TABLE cab ( cab_id INT AUTO_INCREMENT NOT NULL, cab_driver CHAR(80) NOT NULL, cab_loc POINT NOT NULL, PRIMARY KEY(cab_id), SPATIAL KEY(cab_loc) );
Now we need to insert some data into these tables to test the spatial
features. To do this, the easiest way is to use the GeomFromText()
function, that takes a string in WKT format and converts that to a spatial
object.
INSERT INTO address VALUES('Foobar street 12', GeomFromText('POINT(2671 2500)')); INSERT INTO address VALUES('Foobar street 56', GeomFromText('POINT(2971 2520)')); INSERT INTO address VALUES('Foobar street 78', GeomFromText('POINT(3171 2510)')); INSERT INTO address VALUES('Foobar street 97', GeomFromText('POINT(5671 2530)')); INSERT INTO address VALUES('Foobar street 99', GeomFromText('POINT(6271 2460)')); INSERT INTO address VALUES('Bloggs lane 10', GeomFromText('POINT(5673 3520)')); INSERT INTO address VALUES('Bloggs lane 20', GeomFromText('POINT(5665 3550)')); INSERT INTO address VALUES('Bloggs lane 45', GeomFromText('POINT(5571 3510)')); INSERT INTO cab VALUES(0, 'Joe Bloggs', GeomFromText('POINT(2262 2100)')); INSERT INTO cab VALUES(0, 'Bill Bloggs', GeomFromText('POINT(2441 1980)')); INSERT INTO cab VALUES(0, 'Sam Spade', GeomFromText('POINT(5400 3200)'));
Now we have some data to work with, let's try a spatial query: We will query
for the closest cab to a given location. As both of the columns involved in the
query have the POINT
data type, the standard OpenGIS function
Distance()
could be used. Regrettably, this is not yet implemented
in the MySQL Spatial Extensions, so we have to take a different route. (The
Distance()
function will be implemented in a later release of
MySQL.) We convert the two POINT
values to a
LINESTRING
and then compute the length of that, which will be the
same as the distance between the two points. The finished query then looks like
this:
SELECT c.cab_driver, ROUND(GLength(LineStringFromWKB(LineString(AsBinary(c.cab_loc), AsBinary(a.address_loc))))) AS distance FROM cab c, address a WHERE a.address = 'Foobar street 110' ORDER BY distance ASC LIMIT 1; +------------+----------+ | cab_driver | distance | +------------+----------+ | Sam Spade | 1143 | +------------+----------+ 1 row in set (0.00 sec)
A lot more data than you might think has multi-dimensional properties; it's
just that we do not often think of it as such. If we introduce time as one
dimension, a lot of data may fit in the multi-dimensional model. Let's have a
look at the example mentioned earlier with stock prices. To being with, we need
a table to store the data. This will contain the stock ticker symbol (such as
GOOG), and the starting and ending value of the ticker over a day, represented
as a LINESTRING
, where the X-axis is the date, and the Y-axis is
the value. The start point of each ticker value is the day's date and the
opening value, the end point is the date of the following day and the value of
the stock by the end of the day. In this case, a date is just an 8-digit
integer.
CREATE TABLE stock ( ticker CHAR(5) NOT NULL, ticker_date DATE NOT NULL, ticker_range LINESTRING NOT NULL, PRIMARY KEY(ticker, ticker_date) );
The next step is to insert some data:
INSERT INTO stock VALUES('FOO', '20040401', GeomFromText('LINESTRING(20040101 50,20040102 55)')); INSERT INTO stock VALUES('FOO', '20040402', GeomFromText('LINESTRING(20040102 55,20040103 67)')); INSERT INTO stock VALUES('FOO', '20040403', GeomFromText('LINESTRING(20040103 67,20040104 63)')); INSERT INTO stock VALUES('FOO', '20040404', GeomFromText('LINESTRING(20040104 63,20040105 57)')); INSERT INTO stock VALUES('FOO', '20040405', GeomFromText('LINESTRING(20040105 57,20040106 58)')); INSERT INTO stock VALUES('BAR', '20040401', GeomFromText('LINESTRING(20040101 37,20040102 38)')); INSERT INTO stock VALUES('BAR', '20040402', GeomFromText('LINESTRING(20040102 38,20040103 42)')); INSERT INTO stock VALUES('BAR', '20040403', GeomFromText('LINESTRING(20040103 42,20040104 45)')); INSERT INTO stock VALUES('BAR', '20040404', GeomFromText('LINESTRING(20040104 45,20040105 62)')); INSERT INTO stock VALUES('BAR', '20040405', GeomFromText('LINESTRING(20040105 62,20040106 56)')); INSERT INTO stock VALUES('BLG', '20040401', GeomFromText('LINESTRING(20040101 54,20040102 85)')); INSERT INTO stock VALUES('BLG', '20040402', GeomFromText('LINESTRING(20040102 85,20040103 87)')); INSERT INTO stock VALUES('BLG', '20040403', GeomFromText('LINESTRING(20040103 87,20040104 93)')); INSERT INTO stock VALUES('BLG', '20040404', GeomFromText('LINESTRING(20040104 93,20040105 62)')); INSERT INTO stock VALUES('BLG', '20040405', GeomFromText('LINESTRING(20040105 65,20040106 52)'));
Now, the query we are to issue will find the stocks and dates that have values that are within a range of a given stock's value. For this example, the date is ignored. To make this query a bit more interesting, let's issue it as a single query, using another new MySQL 4.1 feature, subqueries.
SELECT s2.ticker, s2.ticker_date FROM stock s2 WHERE s2.ticker != 'FOO' AND MBRINTERSECTS(s2.ticker_range, (SELECT LineStringFromWKB(LineString( Point(MIN(X(STARTPOINT(s1.ticker_range))), MIN(IF(Y(STARTPOINT(s1.ticker_range)) < Y(ENDPOINT(s1.ticker_range)), Y(STARTPOINT(s1.ticker_range)), Y(ENDPOINT(s1.ticker_range))))), Point(MAX(X(STARTPOINT(s1.ticker_range))), MAX(IF(Y(STARTPOINT(s1.ticker_range)) > Y(ENDPOINT(s1.ticker_range)), Y(STARTPOINT(s1.ticker_range)), Y(ENDPOINT(s1.ticker_range))))))) FROM stock s1 WHERE s1.ticker = 'FOO')) = 1;
This query constructs a LINESTRING
in the subquery which as its
first point has the start date, and the lowest value of the stock during the
period. The second point is the end date, and the highest value of the stock
during the period. The MBRINTERSECT()
will check if the MBR of the
objects passed as arguments intersect, and if so, will return 1, else 0. The
result of the query above is:
+--------+-------------+ | ticker | ticker_date | +--------+-------------+ | BAR | 2004-04-04 | | BAR | 2004-04-05 | | BLG | 2004-04-01 | | BLG | 2004-04-04 | | BLG | 2004-04-05 | +--------+-------------+ 5 rows in set (0.00 sec)
This just scratches the surface of what you can do with the MySQL Spatial Extensions in MySQL 4.1, and hopefully gets you thinking about ways you can use it with even non-spatial data. You can get all of the details about the spatial functionality in the "Spatial Extensions in MySQL" chapter of the MySQL Reference Manual, and learn more about the OpenGIS standards the Open Geospatial Consortium (OGC) website.