Mike Hillyer is a Technical Writer for MySQL AB and lives in Alberta, Canada.
By Mike Hillyer
One of the least commonly used of the MySQL datatypes has to be the SET datatype. The SET datatype is a string type, but is often referred to as a complex type due to the increased complexity involved in implementing them. A SET datatype can hold any number of strings from a predefined list of strings specified during table creation. The SET datatype is similar to the ENUM datatype in that they both work with predefined sets of strings, but where the ENUM datatype restricts you to a single member of the set of predefined strings, the SET datatype allows you to store any of the values together, from none to all of them.
The MySQL SET datatype is stored as an integer value within the MySQL tables, and occupies from one to eight bytes, depending on the number of elements available.
Number of Elements | Number of Bytes |
---|---|
1 - 8 | 1 |
9 - 16 | 2 |
17 - 24 | 3 |
25 - 32 | 4 |
33 - 64 | 8 |
The SET elements are stored in the MySQL table as a bitmap: each element is represented by a single bit. To demonstrate this, lets look at the following sample table, which will be the basis for the rest of this tutorial. (NOTE: In the examples that follow, there are never whitespaces between elements and their comma seperators. Also note that only during the definition of a set are the elements individually wrapped in single quotes, otherwise the set as a whole gets only one set of quotes. Please do likewise.)
CREATE TABLE set_test( rowid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, myset SET('Travel','Sports','Dancing','Fine Dining') );
Each element in our set is assigned a single bit in our bitmap (note that in spite of there only being 4 elements our set will still occupy 1 byte). If our row incorporates a given element, the associated bit will be a one. Because of this storage approach, each element also has an associated decimal value.
Element | SET Value | Decimal Value |
---|---|---|
Travel | 00000001 | 1 |
Sports | 00000010 | 2 |
Dancing | 00000100 | 4 |
Fine Dining | 00001000 | 8 |
Multiple elements can be represented by adding their decimal values (for example, a person's interests in Travel and Fine Dining can be represented as 00001001 or the decimal value of 8 + 1 = 9).
The MySQL SET datatype is not commonly used for a few reasons; First, using the MySQL SET datatype limits you to 64 elements. While you could get around this by using multiple SETs, this still represents a loss of versatility. Second, you cannot include commas in your set elements, as the comma is an element separator. Third, using a set means your data is not normalized. In our above example, we are tracking a person's interests for a hypothetical dating site. In a normalized schema, there should be three tables: one for the person, one for all possible interests, and one that links a person to their particular interests. Fourth, an INDEX on a set datatype is going to refer to the set as a whole and will not be used for searching individual elements (this may or may not be a problem for certain applications).
So why do we use the MySQL SET datatype? Well there are a few reasons; The
MySQL SET datatype allows us to handle multiple values with a lot less trouble
than if we had our table fully normalized. Our schema is simplified because we
only need one column instead of three tables to store our list of interests.
The MySQL SET datatype allows us to compare multiple values without using
complex JOIN
operations. We can manipulate the set with binary
functions to do complex comparisons by comparing bit values on a single column
instead of comparing multiple rows of multiple tables.
You may find that the MySQL SET datatype is something that you never need to use. If you find that it's advantages outweigh the disadvantages, read on to discover how to use it.
There are two ways to INSERT
data into MySQL SET columns: by
decimal value or by using strings. To INSERT
the combination of
Travel and Sports into our sample tables, we can use the
following:
INSERT INTO set_test(myset) VALUES(3); INSERT INTO set_test(myset) VALUES('Travel,Sports');
Because the decimal value of Travel is 1 and the decimal value of Sports is 2, the combination of the two can be inserted at the sum of the decimal values, or 3. Another example of matching INSERT statements:
INSERT INTO set_test(myset) VALUES(7); INSERT INTO set_test(myset) VALUES('Travel,Sports,Dancing');
There are three kinds of UPDATE
that we need to deal with:
complete replacement of the SET data, addition of set members, and removal of
set members. We will deal with these in the order listed above.
To completely replace the data in a SET, we simply specify the new values:
UPDATE set_test SET myset = 'Travel,Dining' WHERE rowid = 1; UPDATE set_test SET myset = 2 WHERE rowid = 2;
To add an element to an existing SET we use the CONCAT()
function to add the new element to our comma separated list. To work with
decimal values, we can use the bitwise OR operator |
.
UPDATE set_test SET myset = CONCAT(myset,",Travel") WHERE rowid = 3; UPDATE set_test SET myset = myset | 1 WHERE rowid = 3;
We can also use the CONCAT_WS()
function, which handles list separators for
us:
UPDATE set_test SET myset = CONCAT_WS(',',myset,'Dancing') WHERE rowid = 6;
To remove set elements from an existing set, we use the REPLACE
function to remove the element. If using decimal values, we use a combination
of the bitwise AND operator &
with the bitwise NOT operator
~
.
UPDATE set_test SET myset = REPLACE(myset,'Dancing','') WHERE rowid = 6; UPDATE set_test SET myset = myset & ~2 WHERE rowid = 6;
With the bitwise operators we can add and remove multiple items at once by using the sum of the decimal values of the elements we wish to remove.
There are several ways we can select data from a SET column, but they come
down to two distinct types: SELECT
queries that have a specific
WHERE
clause, and SELECT
queries that have pattern
matching in the WHERE
clause.
Before discussing the different WHERE
clauses, it would
probably be good to cover what the data will look like as it is returned from
the SELECT
query. If the column is merely SELECT
ed,
you will get a comma separated list of the elements involved in your SET. If
you add a +0 to the column name, you will see the integer values in the
columns.
mysql> SELECT rowid, myset, myset+0 -> FROM set_test; +-------+-----------------------+---------+ | rowid | myset | myset+0 | +-------+-----------------------+---------+ | 1 | Sports | 2 | | 2 | Travel,Sports | 3 | | 3 | Travel,Dancing | 5 | | 4 | Travel,Sports | 3 | | 5 | Travel,Sports,Dancing | 7 | | 6 | Travel,Dancing | 5 | | 7 | Sports | 2 | | 8 | Travel,Dancing | 5 | +-------+-----------------------+---------+ 8 rows in set (0.00 sec)
All of the SELECT
s we will be doing will be based on one of
these two values.
Let's look at WHERE
clauses with specific values. When
searching against a specific set of values, we can specify them in our where
clause:
SELECT * FROM set_test WHERE myset = 'Travel,Sports'; SELECT * FROM set_test WHERE myset = 3;
To check for items with a specific element, we can use a couple of methods:
SELECT * FROM set_test WHERE myset LIKE '%Sports%'; SELECT * FROM set_test WHERE myset & 2;
It should be noted that the bitwise AND exhibits an OR-like behavior when used with combinations of values. Take for example the following query, which involves Travel (1) and Sports (2).
SELECT * FROM set_test WHERE myset & 3;
This query actually returns rows that contain Travel OR Sports, and not just
rows that contain both. For rows with have both Travel AND Sports, we need to
also check the result of our bitwise AND (&
) operation against
the value we are performing the bitwise AND (&
) against, or
use an AND clause in our query:
SELECT * FROM set_test WHERE myset & 3 = 3; SELECT * FROM set_test WHERE myset & 1 AND myset & 2;
This returns all rows that contain Travel AND contain Sports, and also
returns the combination of the two. The & 3 = 3
syntax is
preferable as it does not require an increasing number of AND clauses.
We can also use multiple wildcard queries like the following:
SELECT * FROM set_test WHERE myset LIKE '%Travel%Sports%';
It should be noted that the elements must be in order (that is, in the order they are listed in the initial table creation) as order is important in the multiple wildcard queries. For example '%Travel%Sports%' will return rows, but '%Sports%Travel%' will not.
Another way to retrieve rows is to use the FIND_IN_SET()
function as follows:
SELECT * FROM set_test WHERE FIND_IN_SET('Sports',myset) > 0;
FIND_IN_SET()
will return 0 if there is no match and a number
greater than zero if there is a match. NULL will be returned if either the set
or the search string is NULL.
We can take advantage of the bitmap approach to storing SET data to rank our users in order of matching interests to help our users find people who best match their own interests, thus allowing you to find users who also like Sports, Travel, and Dancing (or at least like some of the things you do).
Here's how it works: each of our users has their interests stored in the SET
datatype defined above. We want to rank their fellow users by how well their
interests overlap. First let's look at the interests in binary form using the
BIN()
function (we use LPAD()
to add leading
zeroes):
mysql> SELECT rowid, myset, LPAD(BIN(myset+0),4,'0') AS binset -> FROM set_test; +-------+-----------------------------------+--------------+ | rowid | myset | binset | +-------+-----------------------------------+--------------+ | 1 | Travel,Sports,Dancing | 0111 | | 2 | Travel,Dancing | 0101 | | 3 | Travel | 0001 | | 4 | Dancing | 0100 | | 5 | Dancing | 0100 | | 6 | Sports,Dancing | 0110 | | 7 | Travel,Sports,Dancing,Fine Dining | 1111 | | 8 | Travel,Fine Dining | 1001 | | 9 | Sports,Fine Dining | 1010 | | 10 | Travel,Dancing,Fine Dining | 1101 | +-------+-----------------------------------+--------------+ 10 rows in set (0.00 sec)
What we want to do is order our table by the number of matching interests. For our example, we will order our users by how well they match with user 8, who is interested in Travel, Dancing, and Fine Dining.
mysql> SELECT A.rowid, A.myset, BIN(A.myset+0 & B.myset+0) -> FROM set_test AS A, set_test AS B -> WHERE b.rowid = 8 -> AND A.rowid != 8 -> ORDER BY BIT_COUNT(A.myset & B.myset) DESC; +-------+-----------------------------------+---------+ | rowid | myset | matches | +-------+-----------------------------------+---------+ | 7 | Travel,Sports,Dancing,Fine Dining | 2 | | 10 | Travel,Dancing,Fine Dining | 2 | | 1 | Travel,Sports,Dancing | 1 | | 2 | Travel,Dancing | 1 | | 3 | Travel | 1 | | 9 | Sports,Fine Dining | 1 | | 4 | Dancing | 0 | | 5 | Dancing | 0 | | 6 | Sports,Dancing | 0 | +-------+-----------------------------------+---------+ 9 rows in set (0.00 sec)
This results in our users being organized by how well they match with user
8. The BIT_COUNT()
function in the above example will return the
number of set bits in the expression passed to it. Our first two matches (users
7 and 10) have two overlapping interests, the next four have one match, and the
remaining three users have no overlapping interests (they could be trimmed out
by adding the following to the WHERE
clause:
AND BIT_COUNT(A.myset & B.myset)
Because of the functions being performed on the rows, this type of query cannot benefit from an index. This means that the entire table will have to be scanned to find matches to user 8. However, as this is a bitwise operation, it should be quite efficient in spite of the full table scan (I was able to perform this operation on a randomly generated table of 500,000 rows and have 250,000 rows returned in 1.89 seconds on my 1.6 Ghz PC).
By comparison, performing this operation on the normalized three table
schema I described above would require a three table JOIN
with
N AND clauses in the WHERE
section of the query, one for
every interest user 8 possessed.
To retrieve a list of the set members, use the DESCRIBE
syntax
as follows:
mysql> DESCRIBE set_test myset; +-------+-------------------------------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+-------------------------------------------+------+-----+---------+-------+ | myset | set('Travel','Sports','Dancing','Dining') | YES | | NULL | | +-------+-------------------------------------------+------+-----+---------+-------+ 1 row in set (0.00 sec)
If working programmically, you need to use the second column and strip the 'set(' and ')' to have a comma separated list, which most programming languages can break into an array automatically (in Visual Basic you would use the split function).
The MySQL SET datatype is not the perfect solution for all MySQL databases, but can be quite powerful when appropriate. If you need to track less than 64 attributes for a given entity and make comparisons between different entities, the MySQL SET datatype may be ideal for your needs.