Trudy Pelzer is a senior software architect with MySQL AB, co-author of SQL Performance Tuning, and lives in Edmonton, Alberta.
This is the sixth of our on-going series of articles that explain some the new features in MySQL 4.1, which is now available as a generally-available (GA, or production) release.
By Trudy Pelzer
Effective with MySQL version 4.1, there are two ways in which one can get data from multiple tables in a single query: with a join and with a subquery. For example, assume you have the following tables:
CREATE TABLE clients ( clno INT, fname VARCHAR(15), lname VARCHAR(15), job VARCHAR(15), account_balance DECIMAL(7,2)); INSERT INTO clients VALUES (10, 'sam','smith','auditor',5525.75), (20,'james','jones','manager',8960.25); CREATE TABLE firms ( clno INT, company VARCHAR(15), city VARCHAR(15)); INSERT INTO firms VALUES (10,'abc co','leduc'), (20,'def ltd','nisku'), (30,'ghi inc','nisku');
The following query uses a join to get all available information for client number 10:
SELECT fname, lname, city, job, company, account_balance FROM clients c, firms f WHERE c.clno = f.clno AND c.clno = 10;
But it isn't always possible to use a join to get the information you may need. For example, suppose you need all available information on the client with the largest account balance. The following query, which may look as if it should return the required information, instead returns an error:
SELECT fname, lname, city, job, company, account_balance FROM clients c, firms f WHERE c.clno = f.clno AND c.account_balance = MAX(c.account_balance);
The reason for the error — invalid use of group function
— is that the aggregate function, MAX
, is disallowed in the
WHERE
clause as shown. This is where the second method of getting
data from multiple tables in a single query — the subquery — comes
to the rescue. In this article, I'll briefly describe the subquery
functionality added to MySQL in version 4.1
Simply put, a subquery is a SELECT
statement that is written
inside another SQL statement (which is often, but does not have to be, another
SELECT
). To distinguish the subquery (or inner query) from its
enclosing query (or outer query), it must be enclosed within parentheses. Here
is an example:
SELECT * FROM clients WHERE clno IN -- outer query (SELECT clno FROM firms WHERE city = 'leduc'); -- inner query
This query will return all rows of the clients
table which have
the same clno
value as the rows of the firms
table
having a city
value equal to 'leduc'
. To get the
result, the DBMS first evaluates the inner query, to find the clno
value for every row in the firms
table where city
is
equal to 'leduc'
. It then compares these clno
values
to the rows of the clients
table, returning every row where the
clno
values match. Since there is only one row in
firms
which matches the subquery condition, the subquery example
— in effect — is equivalent to this query:
SELECT * FROM clients WHERE clno = 10;
The subquery example can also, of course, be written as a join:
SELECT c.clno, fname, lname, job, account_balance FROM clients c INNER JOIN firms f USING (clno) WHERE city = 'leduc';
However, as already noted, the same can not be said of this subquery
(translation: which client has the highest clno
value?):
SELECT fname, lname FROM clients WHERE clno = (SELECT MAX(clno) FROM firms);
If the inner query returns an empty set, the result of a subquery may appear to be incorrect. For example, consider this query:
SELECT * FROM clients WHERE clno = (SELECT clno FROM firms WHERE city = 'gibbons');
If the inner query is run alone, it is clear that the result is zero rows:
there are no rows in the firms
table with a city
equal to 'gibbons'
. But "empty set" cannot be compared to the
values of a column. The SQL Standard thus requires that the result of a
subquery, which evaluates to zero rows, is NULL
. And since nothing
is equal to NULL
, the query returns an "Empty set"
message (i.e. zero rows).
It is common to say that the subquery is nested in the outer query. MySQL supports the nesting of subqueries within other subqueries, to a great depth.
There are three types of subqueries, distinguished from one another by the result — how many rows and columns — they return.
If a subquery can return exactly one column and one row, it is known as
a scalar subquery. A scalar subquery is legal everywhere that a
regular scalar value (e.g. a column value or literal) is legal in an SQL
statement. It is usually found in a WHERE
clause, immediately
after a comparison operator.
If a subquery can return multiple columns and exactly one row, it is known as a row subquery. A row subquery is a derivation of a scalar subquery and can thus be used anywhere that a scalar subquery can be used.
Finally, if a subquery can return multiple columns and multiple rows,
it is known as a table subquery. A table subquery is legal everywhere
that a table reference is legal in an SQL statement, including the
FROM
clause of a SELECT
. It, too, is usually found in
a WHERE
clause, immediately after an IN
or
EXISTS
predicate or a quantified comparison operator. (A
quantified comparison operator is a comparison operator used with either the
SOME
, ALL
, or ANY
quantifiers.)
The difference between scalar and table subqueries can be subtle. Here's a problem that arises when a subquery is written as a scalar subquery, but the subquery result contains multiple rows. Assume our two tables have only these rows:
INSERT INTO clients VALUES (10, 'sam','smith','auditor',5525.75); INSERT INTO firms VALUES (10,'abc co','leduc'),(30,'ghi inc','nisku');
Since the firms
table has two rows, this query:
SELECT * FROM clients WHERE clno < (SELECT clno FROM firms);
fails with:
"Subquery returns more than 1 row"
There are two solutions to this. The first is to change the query to include
a table subquery quantified by ANY
, to compare the outer query
results with any subquery value:
SELECT * FROM clients WHERE clno < ANY (SELECT clno FROM firms);
In this case, the comparison for the first client is false (10 < 10), but
is true for the second client (10 < 30), and so the subquery result is
"true" for clno
10. The rules for ANY
are as
follows:
ANY
returns "true" if the comparison operator is "true" for at least one row returned by the subquery.ANY
returns "false" if the subquery returns zero rows or if the comparison operator is "false" for every row returned by the subquery.SOME
is a synonym for ANY
; using IN
is equivalent to using = ANY
.
The second solution to the problem is to change the query to include a table
subquery quantified by ALL
, to compare the outer query results
with every subquery value:
SELECT * FROM clients WHERE clno < ALL (SELECT clno FROM firms);
In this case, the comparison is once again false for the first client and
true for the second client — but this time, the subquery result is
"false" and so the query returns zero rows. The rules for ALL
are:
ALL
returns "true" if the subquery returns zero rows or if
the comparison operator is "true" for every row returned by the subquery.ALL
returns "false" if the comparison operator is "false" for
at least one row returned by the subquery.Sometimes, the only information needed from a subquery is whether it returns
any rows at all. The [NOT] EXISTS
predicate tests for a non-empty
set. EXISTS
returns "true" if the subquery returns at least one
row; otherwise, it returns "false". NOT EXISTS
is the complement
— it returns "true" if the subquery returns zero rows; otherwise, it
returns "false". By tradition, a subquery following [NOT] EXISTS
begins with SELECT *
. In this case, the asterisk (*
)
is not a shorthand for "list of all columns", instead it stands for
"some column" — and the result returned by each subquery is
normally correlated with the result of the outer query to which it belongs.
Here's an extremely trivial example, which returns all client values:
SELECT * FROM clients WHERE EXISTS (SELECT * FROM firms);
The WHERE
clause in this example is "true" only because the
firms
table is not empty. But [NOT] EXISTS
is usually used to form
more complicated queries. Assume you have the following tables:
CREATE TABLE passengers ( name VARCHAR(15), compartment INT); INSERT INTO passengers VALUES ('smith',20); INSERT INTO passengers VALUES ('jones',25); CREATE TABLE cars ( compartment INT, class VARCHAR(10)); INSERT INTO cars VALUES (20,'first');
Here's an example of the classic FORALL
question:
SELECT * FROM cars c1 WHERE NOT EXISTS (SELECT * FROM passengers p1 WHERE NOT EXISTS (SELECT * FROM cars c2 WHERE c2.compartment = p1.compartment AND c2.compartment = c1.compartment));
This query is asking for the car in which all existing passengers are
riding. To understand the result, consider that Smith
is in
car 20
and Jones
in car 25
— but
that the cars
table doesn't contain a row for car 25
.
This means that there is one passenger — Jones
— who
is riding in a non-existent car. (Of course, in a properly set up database,
this situation couldn't exist; one would define primary key/foreign key
relationships between the two tables to ensure data integrity). The second
NOT EXISTS
subquery in the example, thus, is always "true" for
passenger Jones.
In addition, of course, there is one passenger — Jones
,
again — who is not riding in car 20
, and therefore the first
NOT EXISTS
subquery in the example is "false". And since there are
no other cars to check, the result of the query is an empty set (zero rows) --
there are no cars in which every passenger is riding.
The SQL Standard, effective with SQL:1999, requires increased subquery support, which MySQL provides. The row subqueries alluded to earlier are an example. Thus, it is now possible to compare multiple columns at a time:
SELECT ROW ('smith', 'auditor') = (SELECT lname, job FROM clients WHERE clno = 10);
The subquery in this example returns a row containing the values
'smith'
and 'auditor'
. When these values are compared
to the ROW
values in the outer query, they are found to be equal
and so the query returns 1 (true).
You can also put a subquery, rather than a simple table name, in the
FROM
clause of a query (those of you familiar with Oracle will
recognize this usage as an inline view):
SELECT * FROM (SELECT * FROM clients WHERE job LIKE 'a%') AS cl;
To get the result of this query, the MySQL server first evaluates the
subquery and then associates the alias (cl
, in this case) with the
result set. It then evaluates the outer SELECT
. In effect, the
above example ends up being interpreted as:
SELECT * FROM cl;
with table cl
being a temporary result set created with the
subquery:
SELECT * FROM clients WHERE job LIKE 'a%';
When a subquery is placed in the FROM
clause, the AS
<alias>
portion of the syntax is mandatory; the interim result
table must be named because it is referenced by the containing
query.
Subqueries have one other use: they can be used to change the data in the
database. That is, you can put a subquery in a DELETE
,
INSERT
, UPDATE
, or REPLACE
statement.
Here is an example:
UPDATE clients SET account_balance = (SELECT SUM(amount) FROM accounts where clno=clients.clno);
This UPDATE
changes the account_balance
for each
client to the sum of the amounts recorded for that client in the
accounts
table.
There is one caveat: It is not currently possible to modify a table and select from the same table in a subquery.
= <> < <= > >=
— can be used with subqueries, as can the [NOT] IN
and [NOT] EXISTS
predicates.ANY/SOME
or ALL
.MySQL has added structure to SQL!