Table 11.20. Aggregate (GROUP BY
)
Functions
Name | Description |
---|---|
AVG() |
Return the average value of the argument |
BIT_AND() |
Return bitwise and |
BIT_OR() |
Return bitwise or |
BIT_XOR() (v4.1.1) |
Return bitwise xor |
COUNT(DISTINCT) |
Return the count of a number of different values |
COUNT() |
Return a count of the number of rows returned |
GROUP_CONCAT() (v4.1) |
Return a concatenated string |
MAX() |
Return the maximum value |
MIN() |
Return the minimum value |
STD() |
Return the population standard deviation |
STDDEV_POP() (v5.0.3) |
Return the population standard deviation |
STDDEV_SAMP() (v5.0.3) |
Return the sample standard deviation |
STDDEV() |
Return the population standard deviation |
SUM() |
Return the sum |
VAR_POP() (v5.0.3) |
Return the population standard variance |
VAR_SAMP() (v5.0.3) |
Return the sample variance |
VARIANCE() (v4.1) |
Return the population standard variance |
This section describes group (aggregate) functions that operate
on sets of values. Unless otherwise stated, group functions
ignore NULL
values.
If you use a group function in a statement containing no
GROUP BY
clause, it is equivalent to grouping
on all rows. For more information, see
Section 11.12.3, “GROUP BY
and HAVING
with Hidden
Columns”.
For numeric arguments, the variance and standard deviation
functions return a DOUBLE
value.
The SUM()
and
AVG()
functions return a
DECIMAL
value for exact-value
arguments (integer or DECIMAL
),
and a DOUBLE
value for
approximate-value arguments
(FLOAT
or
DOUBLE
).
The SUM()
and
AVG()
aggregate functions do not
work with temporal values. (They convert the values to numbers,
losing everything after the first nonnumeric character.) To work
around this problem, you can convert to numeric units, perform
the aggregate operation, and convert back to a temporal value.
Examples:
SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(time_col
))) FROMtbl_name
; SELECT FROM_DAYS(SUM(TO_DAYS(date_col
))) FROMtbl_name
;
Functions such as SUM()
or
AVG()
that expect a numeric
argument cast the argument to a number if necessary. For
SET
or
ENUM
values, the cast operation
causes the underlying numeric value to be used.
Returns the average value of
. The
expr
DISTINCT
option can be used to return the
average of the distinct values of
expr
.
AVG()
returns
NULL
if there were no matching rows.
mysql>SELECT student_name, AVG(test_score)
->FROM student
->GROUP BY student_name;
Returns the bitwise AND
of all bits in
expr
. The calculation is
performed with 64-bit
(BIGINT
) precision.
This function returns
18446744073709551615
if there were no
matching rows. (This is the value of an unsigned
BIGINT
value with all bits
set to 1.)
Returns the bitwise OR
of all bits in
expr
. The calculation is
performed with 64-bit
(BIGINT
) precision.
This function returns 0
if there were no
matching rows.
Returns the bitwise XOR
of all
bits in expr
. The calculation is
performed with 64-bit
(BIGINT
) precision.
This function returns 0
if there were no
matching rows.
Returns a count of the number of non-NULL
values of expr
in the rows
retrieved by a SELECT
statement. The result is a
BIGINT
value.
COUNT()
returns
0
if there were no matching rows.
mysql>SELECT student.student_name,COUNT(*)
->FROM student,course
->WHERE student.student_id=course.student_id
->GROUP BY student_name;
COUNT(*)
is somewhat
different in that it returns a count of the number of rows
retrieved, whether or not they contain
NULL
values.
COUNT(*)
is optimized to
return very quickly if the
SELECT
retrieves from one
table, no other columns are retrieved, and there is no
WHERE
clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to MyISAM
tables only, because an exact row count is stored for this
storage engine and can be accessed very quickly. For
transactional storage engines such as
InnoDB
, storing an exact row count is
more problematic because multiple transactions may be
occurring, each of which may affect the count.
COUNT(DISTINCT
expr
,[expr
...])
Returns a count of the number of rows with different
non-NULL
expr
values.
COUNT(DISTINCT)
returns
0
if there were no matching rows.
mysql> SELECT COUNT(DISTINCT results) FROM student;
In MySQL, you can obtain the number of distinct expression
combinations that do not contain NULL
by
giving a list of expressions. In standard SQL, you would
have to do a concatenation of all expressions inside
COUNT(DISTINCT ...)
.
This function returns a string result with the concatenated
non-NULL
values from a group. It returns
NULL
if there are no
non-NULL
values. The full syntax is as
follows:
GROUP_CONCAT([DISTINCT]expr
[,expr
...] [ORDER BY {unsigned_integer
|col_name
|expr
} [ASC | DESC] [,col_name
...]] [SEPARATORstr_val
])
mysql>SELECT student_name,
->GROUP_CONCAT(test_score)
->FROM student
->GROUP BY student_name;
Or:
mysql>SELECT student_name,
->GROUP_CONCAT(DISTINCT test_score
->ORDER BY test_score DESC SEPARATOR ' ')
->FROM student
->GROUP BY student_name;
In MySQL, you can get the concatenated values of expression
combinations. To eliminate duplicate values, use the
DISTINCT
clause. To sort values in the
result, use the ORDER BY
clause. To sort
in reverse order, add the DESC
(descending) keyword to the name of the column you are
sorting by in the ORDER BY
clause. The
default is ascending order; this may be specified explicitly
using the ASC
keyword. The default
separator between values in a group is comma
(“,
”). To specify a
separator explicitly, use SEPARATOR
followed by the string value that should be inserted between
group values. To eliminate the separator altogether, specify
SEPARATOR ''
.
The result is truncated to the maximum length that is given
by the group_concat_max_len
system variable, which has a default value of 1024. The
value can be set higher, although the effective maximum
length of the return value is constrained by the value of
max_allowed_packet
. The
syntax to change the value of
group_concat_max_len
at
runtime is as follows, where val
is an unsigned integer:
SET [GLOBAL | SESSION] group_concat_max_len = val
;
See also CONCAT()
and
CONCAT_WS()
:
Section 11.4, “String Functions”.
Returns the maximum value of
expr
.
MAX()
may take a string
argument; in such cases, it returns the maximum string
value. See Section 7.4.4, “How MySQL Uses Indexes”. The
DISTINCT
keyword can be used to find the
maximum of the distinct values of
expr
, however, this produces the
same result as omitting DISTINCT
.
MAX()
returns
NULL
if there were no matching rows.
mysql>SELECT student_name, MIN(test_score), MAX(test_score)
->FROM student
->GROUP BY student_name;
For MAX()
, MySQL currently
compares ENUM
and
SET
columns by their string
value rather than by the string's relative position in the
set. This differs from how ORDER BY
compares them. This is expected to be rectified in a future
MySQL release.
Returns the minimum value of
expr
.
MIN()
may take a string
argument; in such cases, it returns the minimum string
value. See Section 7.4.4, “How MySQL Uses Indexes”. The
DISTINCT
keyword can be used to find the
minimum of the distinct values of
expr
, however, this produces the
same result as omitting DISTINCT
.
MIN()
returns
NULL
if there were no matching rows.
mysql>SELECT student_name, MIN(test_score), MAX(test_score)
->FROM student
->GROUP BY student_name;
For MIN()
, MySQL currently
compares ENUM
and
SET
columns by their string
value rather than by the string's relative position in the
set. This differs from how ORDER BY
compares them. This is expected to be rectified in a future
MySQL release.
Returns the population standard deviation of
expr
. This is an extension to
standard SQL. The standard SQL function
STDDEV_POP()
can be used
instead.
This function returns NULL
if there were
no matching rows.
Returns the population standard deviation of
expr
. This function is provided
for compatibility with Oracle. The standard SQL function
STDDEV_POP()
can be used
instead.
This function returns NULL
if there were
no matching rows.
Returns the population standard deviation of
expr
(the square root of
VAR_POP()
). You can also use
STD()
or
STDDEV()
, which are
equivalent but not standard SQL.
STDDEV_POP()
returns
NULL
if there were no matching rows.
Returns the sample standard deviation of
expr
(the square root of
VAR_SAMP()
.
STDDEV_SAMP()
returns
NULL
if there were no matching rows.
Returns the sum of expr
. If the
return set has no rows, SUM()
returns NULL
. The
DISTINCT
keyword can be used in MySQL
5.4 to sum only the distinct values of
expr
.
SUM()
returns
NULL
if there were no matching rows.
Returns the population standard variance of
expr
. It considers rows as the
whole population, not as a sample, so it has the number of
rows as the denominator. You can also use
VARIANCE()
, which is
equivalent but is not standard SQL.
VAR_POP()
returns
NULL
if there were no matching rows.
Returns the sample variance of
expr
. That is, the denominator is
the number of rows minus one.
VAR_SAMP()
returns
NULL
if there were no matching rows.
Returns the population standard variance of
expr
. This is an extension to
standard SQL. The standard SQL function
VAR_POP()
can be used
instead.
VARIANCE()
returns
NULL
if there were no matching rows.
User Comments
For medians in almost any SQL:
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)
assumes (COUNT(*)+1)/2 returns an int, if not
then use INT((COUNT(*)+1)/2)
There are variations for the varied defs of
median too.
See Rozenshtein's book on Optimizing T-SQL
Advanced Programming Techniques. Search Amazon.
Learn how to use encoded characteristic functions
(delta[]), then you can do all sorts of fun stuff
quickly - for instance you could pivot a table
into summary columns of totals where each column
counted different things in one table pass.
I have utilized this technique with great success
on large tables: a multi million row table using
the existing summary code took minutes while this
SQL trimmed that down to less than 3 seconds!
Once you get used to Delta functions, the code is
very readable too!
For what it's worth, the SUM() function does not
work on TIME values directly, although it appears
to work at first glance. To add a bunch of TIME
values (intervals, presumably), you need to do
something like this:
SEC_TO_TIME(SUM(TIME_TO_SEC(events.length)))
This will return another TIME value, suitable for
TIME_FORMAT() or whatever else.
A function for doing Base64 encoding would be a
good thing to have. This can be used with
encryption, but also, binary files can be
converted to a convenient universal character
format and stored neatly as text.
I would like to see a GROUP BY function 'ANY',
which just returns any of the values and optimises
the query instead. This could be quite useful.
I haven't tested this much, but I can say for
sure that in at least one single-table query
using SQL_CALC_FOUND_ROWS/FOUND_ROWS() is MUCH
slower than running the query with a LIMIT and
then running it again with a COUNT(*). This
particular query is run from PHP and returns
16720 rows with a LIMIT of 25. Running it with
two queries (i.e., one with LIMIT and the other
with COUNT(*)) gives a query time of around
450ms, while running it with SQL_CALC_FOUND_ROWS
(LIMIT 25)/FOUND_ROWS() takes slightly over
600ms. Pentium II 400, 512MB, InnoDB....
I agree that the limitations of GROUP BY when
used along with ORDER BY are very frustrating.
They've caused me quite a lot of grief. I think
the standard way to get around this limitation
is with a sub-query, which MySQL only supports in version 4.1 or greater. However, there is a way around this
problem that's already documented:
http://www.mysql.com/doc/en/example-Maximum-column-group-row.html
I don't know why it's not mentioned here. Even
though it's not a simple solution, it does work.
If you sort things the way you need them while
populating the temporary table (without GROUP
BY), then subsequent queries that join the
temporary table will provide suitable results.
Thus, the GROUP BY can in a round-about way be
used with ORDER BY.
NULL handling seems to be a bit iffy in the
HAVING clause, which you might compensate for by
using IFNULL:
Suppose you have a Users table with column
(of unique) Id and a Logins table with Id,
LastUse colums such that Id's in Logins may be
repeated or might not appear at all. Then if you
wanted to know only about users not active in the
last three weeks you might think to write:
SELECT Users.Id, UNIX_TIMESTAMP()-UNIX_TIMESTAMP
(MAX(LastUse)) AS Diff FROM Users LEFT JOIN
Logins USING (Id) GROUP BY Users.Id HAVING
(Diff IS NULL) OR (Diff>504*3600);
which will give you nothing. Solve
the problem by using IFNULL in either the SELECT
or HAVING clause:
SELECT Users.Id, IFNULL(UNIX_TIMESTAMP()-
UNIX_TIMESTAMP(MAX(LastUse)),0) AS Diff FROM
Users LEFT JOIN Logins USING (Id) GROUP BY
Users.Id HAVING (Diff=0 OR Diff>504*3600);
<a href="mailto:kangus@acats.com?
Subject=DUPLICATE"> Kevin A. </a>
<font color='red'>DUPLICATE RECORDS How to
find
them</font>
While DISTINCT works for finding duplicates <font
color='red'> IN ONE COLUMN
</front><font
color='black'>it does not
allow you to add other information to your query,
try:
CREATE TABLE Distinct_Address AS SELECT
StreetAddress, COUNT(StreetAddress)
AS
HouseHoldCount,First,Middle,Last,StreetAddress,City,
State,Zip,Phone
FROM Voters2002
GROUP BY StreetAddress
HAVING COUNT(Voters2002.StreetAddress)
>=1;
This will create a table with a HOUSE HOLD count
and only ONE address. This can be used as a mailing
list where you only want to send ONE piece of mail
to a house hold.</font>
FINDIND DUPLICATED RECORDS
------------------------------------------------
Well, I don't know toomuch about this but it works
for me, and fast enough.
SELECT columnname, COUNT(columnname)
FROM
tablename GROUP BY columnname HAVING COUNT
(columnname)>1
Bye community!
To select the latest 5 unique topics, using a datetime
field called time, use something like: "select topic,
max(time) as m from nuke_stories group by topic
order by m desc limit 5;"
Another take on MEDIAN: The SELECT statement
listed in a comment above does not work for all
lists of values (especially when the median value
is not unique in the list, or when the median
should be the average of the two middle-most
values in an even-sized list). Here is a
three-step process that has worked for me:
CREATE TEMPORARY TABLE medians SELECT x.val FROM
data x, data y GROUP BY x.val HAVING
((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val))))) <=
floor((COUNT(*) +1)/2)) and
((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val))))) <=
floor((COUNT(*) +1)/2));
The resulting medians table will either have 1 or
2 rows. If the size of your original data is an
odd number or the two middle-most data values are
the same there will
only be 1 row. If the size of
your original data is an even number and the
middle data values are not equal, there will be 2
rows. Take the average to get the true median:
SELECT AVG(val) AS median FROM medians;
The last step is to drop your temporary table
medians:
DROP TABLE medians;
I couldn't think of another way to compute the
MEDIAN without using a temporary table. If you
know how to optimize my technique, I would love to
hear from you.
Is it just me that finds the (lacking) description
of GROUP BY hard to follow?
> MySQL has extended the use of GROUP BY.
(okay)
> You can use columns or calculations in the
SELECT expressions that don't appear in the GROUP
BY part.
(fine)
> This stands for any possible value for this group.
(what?)
> Don't use this feature if the columns you omit
from the GROUP BY part aren't unique in the group!
You will get unpredictable results.
What does this last bit mean? "aren't unique in
the group" .. should that not read "aren't
IDENTICAL in the group"???
Good feature though, the ANSI GROUP BY is too
redundant, and limits you to 10 GROUP columns,
which while high, is still a limit.
It seems that MySQL 4.0.12 cannot handle
HAVING COUNT(DISTINCT col1)>=5 correctly because it returns 'Unknown column col1 ...', however
HAVING COUNT(col1)>=5 is ok.
Fortunately one can get the intended behaviour with
SELECT ... , COUNT(DISTINCT col1) AS cnt
...
GROUP BY ...
HAVING cnt>=5
Just a suggestion for the developers to add a non-distinct function, which is something I am always using a perl script to 'grep out' deletes from tables with. Having something like:
Delete from some_table where some_column is not distinct.
That would leave a single occurance and get rid of the duplicates.
I do it alot. I have a hunch others do as well.
Jeff
I was trying to get the percentage of a number rows contained in a table in one query. Normally you could do this with 2 queries:
SELECT COUNT(*) as total FROM table WHERE poll=1231 and question=2
SELECT COUNT(*) as result FROM table WHERE poll=1231 and question=2 and answer = 1
and then you get the percentage getting (result/total)*100
I looked over and over all the MySql documentation, and finally I tried this and works (thanks to Tim Ferrer):
SELECT (COUNT(DISTINCT T2.id)/COUNT(DISTINCT T1.id)*100) as percentage FROM table T1, table T2 WHERE T1.poll=1231 and T1.question=2 and T2.poll=1231 and T2.question=2 and T2.answer=1;
That works for me.
If you want to perform a GROUP BY, but you want values in rows that you aren't grouping by and don't appear first in the table, you can do so by joining the table to itself.
For example, in a table that has the fields "tile_id", "name", "orientation" and "imagename":
1) I want the fields "tile_id", "name" and "imagename" to be returned.
2) I want to group by the "name".
3) I want the rows with an orientation of "up" to show up.
4) I only want rows in which there is more than one "orientation" for a given "name".
You can do this with the following query:
SELECT t1.tile_id, t1.name, t1.imagename, count(*) as thecount
FROM `tiles` AS t1, tiles AS t2
WHERE t1.orientation != t2.orientation AND t1.orientation='up' AND t1.name = t2.name
GROUP BY t1.name
Because it's joining where the two tables' orientations are not equal AND where the names ARE equal, that automatically finds results in which there is more than one.
The SAME QUERY could be written as follows:
SELECT t1.tile_id, t1.name, t1.imagename, count(*) as thecount FROM `tiles` AS t1, tiles AS t2 WHERE t1.orientation='up' AND t1.name = t2.name GROUP BY t1.name HAVING thecount > 1
If you aren't picky and don't mind resuts with only one orientation, of course, you can simply remove the HAVING clause from the example above.
A wonderful alternative to group_concat for versions prior 4.1 is MyGroupConcat: A MySQL UDF aggregate function for string concatenation By Emmanuel Kartmann
http://www.codeproject.com/useritems/MyGroupConcat.asp
It only works for Windows though. :(
I found the above MEDIAN calculations to be far too slow. I don't understand how it can be efficient since its an O(n) join...
Here is a far more efficient method of finding medians (uses temporary tables, but some things you can't do in one sql statement). For my particular application, I needed to find medians for each group of rows. Here is a simple example (uses a temp table for the original data, so you can copy-paste it into a mysql term without needing to cleanup anything):
====================================
# setup example data
CREATE TEMPORARY TABLE data (group_id INT, value FLOAT);
INSERT INTO data VALUES (1,1),(1,5),(1,6),(2,1),(2,3),(2,4),(2,20);
# get all data we need to search
CREATE TEMPORARY TABLE temp_median (id INT AUTO_INCREMENT PRIMARY KEY) SELECT group_id, value FROM data ORDER BY group_id, value;
# find the row-id of all median values
CREATE TEMPORARY TABLE temp_median_ids SELECT ROUND(AVG(id)) AS id FROM temp_median GROUP BY group_id;
# select all median values
SELECT group_id, value FROM temp_median_ids LEFT JOIN temp_median USING (id);
===================
If you want to see both median values:
# setup example data
CREATE TEMPORARY TABLE data (group_id INT, value FLOAT);
INSERT INTO data VALUES (1,1),(1,5),(1,6),(2,1),(2,3),(2,4),(2,20);
# get all data we need to search
CREATE TEMPORARY TABLE temp_median (id INT AUTO_INCREMENT PRIMARY KEY) SELECT group_id, value FROM data ORDER BY group_id, value;
# find the row-id of all median values
CREATE TEMPORARY TABLE temp_median_ids SELECT FLOOR(AVG(id)) AS id_1, CEILING(AVG(id)) AS id_2 FROM temp_median GROUP BY group_id;
# select all median values
SELECT group_id, value FROM temp_median_ids LEFT JOIN temp_median ON (id=id_1 OR id=id_2);
Great extension indeed, except that the installation instruction is erroneous:
instead of:
CREATE FUNCTION group_concat RETURNS STRING SONAME "MyGroupConcat.dll";
do:
CREATE AGGREGATE FUNCTION group_concat RETURNS STRING SONAME "MyGroupConcat.dll";
I've looked at the code for the Windows version of the GROUP_CONCAT plugin for MySQL 4.0 and it was basically a direct copy of the udf_example.cc plugin that comes with the distribution... or at least, the header is.
If you simply take the MySQL_UDF.cpp, comment out:
#include "stdafx.h"
And then remove the '__declspec(dllexport)' from the function prototypes, and also remove the 'extern "C" __declspec(dllexport)' from all the function declarations, well, you've got a compilable module for your mysql server for unix :-)
All the rest of that crap in the zip file musta been added by MSVC++ or whatever.
Once you've edited the file, just pop it into the sql/ folder in your source distribution, rename it to MyGroupConcat.cc, and then edit the make file and copy any section having to do with udf_example.cc and adjust it to MyGroupConcat.cc.
Then, just type "make MyGroupConcat.so" and follow the instructions to install it into the server :-)
If you need something to calculate the median via MySQL, you should have a look at the UDFs provided here:
http://mysql-udf.sourceforge.net/
By embedding an IF() function within a COUNT() function, you can gather several counts on the same table at once. For instance, say you wanted to count the number of rows in a table, and also the number of rows that meet a certain criteria. Normally you'd have to use two queries:
SELECT COUNT(*) FROM mytable WHERE something="this";
SELECT COUNT(*) FROM mytable;
If you use an IF() function to evaluate the condition, and return NULL if it's false, you can collect both totals in a single query:
SELECT COUNT(IF(something="this",1, NULL)) AS somethingcount, COUNT(*) AS total FROM mytable;
Note that this isn't going to be optimised. If the column you're evaluating is indexed, it would be faster to use two queries. But if it's not indexed, SQL would have to evaluate each row individually anyway.
You can take the idea a stage further by using a WHERE clause on the query too. For instance, a discussion forum may want to count the number of items for a given topic ID, and the number of those items that have your userid against them:
SELECT COUNT(IF(userid="steve",1,NULL)) AS mypostcount, COUNT(*) AS total FROM forumitem WHERE topicid=123;
If this were done as two queries, SQL would have to evaluate topicid=123 twice. But in the query shown above, it only needs to evaluate it once.
Another way to count rows that match a criteria is to use SUM() instead of COUNT(). If you do this, you need to change the IF() function to return 0 instead of NULL on false:
SELECT SUM(IF(userid="steve",1,0)) AS mypostcount, COUNT(*) AS total FROM forumitem WHERE topicid=123;
Iranian users can visit the persian description of group by in Mysql on http://www.sanjeshearta.com/home.php
It isn't clear when the specific 'order by' for the group_concat function is needed. I.e. would it be needed if we added an explicit order by in the main sql statement for the concatenated column?
If the column is already indexed do we need to add order by to either the main sql statement or the group_concat function?
What is the overhead?
'GROUP BY' can be applied to more than 1 column at once. This does not only save time but also allows you to achieve things that you thought were impossible.
Example:
Table `bookstores_sales` with fields `id`, `store_id`, `book_id`
So, each sold book is stored in the table (as book_id) along with the store id (store_id).
Now, we want to know how many sales each store generated for each book ordered by the biggest sales first!
SELECT store_id, book_id, COUNT(id) AS sales
FROM bookstores_sales
GROUP BY store_id, book_id
ORDER BY sales DESC
Which will show something like:
store_id | book_id | sales
1 | 2 | 210
3 | 5 | 190
1 | 5 | 185
A couple of years ago, someone requested a way to delete duplicates. Subselects make it possible with a query like this in MySQL 4.1:
DELETE FROM some_table WHERE primaryKey NOT IN
(SELECT MIN(primaryKey) FROM some_table GROUP BY some_column)
Of course, you can use MAX(primaryKey) as well if you want to keep the newest record with the duplicate value instead of the oldest record with the duplicate value.
To understand how this works, look at the output of this query:
SELECT some_column, MIN(primaryKey) FROM some_table GROUP BY some_column
As you can see, this query returns the primary key for the first record containing each value of some_column. Logically, then, any key value NOT found in this result set must be a duplicate, and therefore it should be deleted.
If you want to COUNT FILLED IN products related to DISTINCT customers from Appointments,
USE this :
SELECT SUM(prod.id)/COUNT(app.customid) AS sum_prod
FROM appointments app, products prod
WHERE prod.customid = app.customid
GROUP BY app.customid
HAVING COUNT(app.customid) > 0
AND SUM(prod.id) > 0
COUNT(*) is internally converted to COUNT(0), however any other number or string produce the same result.
Here's a workaround for the lack of SUM(distinct expr) in MySQL v4.x:
SUM(distinct expr) should sum unique values of a group.
Since version 4 do not allow it, what you can do is divide a regular SUM(expr) by all count(distinct expr2) of other tables in the join which initially casued a cartezian result set.
for example, the following query
select a.a, sum(distinct b.b), sum(distinct c.c)
from a join b on a.id = b.aid
join c on a.id = c.id
where a.id = 1
can be implemented as :
select a.a, sum(b.b)/count(distinct c.id), sum(c.c)/count(distinct b.id)
from a join b on a.id = b.aid
join c on a.id = c.id
where a.id = 1
One gotcha with GROUP_CONCAT: At first glance, the ORDER BY part of the GROUP_CONCAT functions seems unneccessary, because GROUP BY should imply an ordering. But this is not the case; GROUP_CONCAT acts on a column which is not ordered by GROUP BY.
Median -- This technique works for strings or numbers.
It does not do a costly self-join.
Adjust the type of 'value' as needed.
Note that it finds both medians if there are an even number of items.
CREATE TEMPORARY TABLE tmp (
n INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
value VARCHAR(99) NOT NULL );
INSERT INTO tmp (value)
SELECT value FROM tbl ORDER BY 1;
SELECT @count := COUNT(*) FROM tmp;
SELECT DISTINCT value FROM tmp
WHERE n IN (FLOOR((@count+1)/2), CEIL((@count+1)/2));
DROP TEMPORARY TABLE tmp;
To make more explicit something that was already mentioned in the manual above, count(*) will include a row in its count even if ALL of the fields in that row contain NULL values.
5 rows in set (0.00 sec)For example, given the table "stocks":
You will get the following:
mysql> select count(*) from stocks;
1 row in set (0.00 sec)
However, if you include all of the column names explicitly in count() you will get the following:
mysql> select count(symbol or price or quantity) from stocks;
1 row in set (0.00 sec)
Consequently, I think it may be valuable to avoid thinking of the wildcard * as meaning "look for matches in all columns" and rather look at it as having a special meaning, i.e., "return the total number of rows regardless of whether the content is NULL or not," which of course is what the manual says in less words (and less explicitly) above. This might be confusing for some users since * in a SELECT statement does actually mean "return all columns." In general, the wildcard * seems to have different meanings in different contexts.
Also, I have tried this with both MyISAM and InnoDB tables, so I know that this is not an artifact of the MyISAM practice of storing the row count as an explicit value in the table.
GROUP_CONCAT can be used for tasks not easily achieved otherwise. For example, suppose you have a table created with:
CREATE TABLE 'orders' (
OrderID int,
CustomerID int,
OrderType CHAR(1),
PRIMARY KEY OrderID
)
You want to find, for each customer, the type of the first order (the order with the minimal ID).
What we would like to do is:
SELECT CustomerID, FIRST(OrderType ORDER BY OrderID) GROUP BY CustomerID
However, as far as I know (correct me if I'm wrong), there is no "FIRST" aggregate function in MySql.
So we can do:
SELECT CustomerID, SUBSTR(GROUP_CONCAT(OrderType ORDER BY OrderID) FROM 1 FOR 1) GROUP BY CustomerID
I will be happy to know if there is a better way to do it...
SUM(DISTINCT) is not as useful as it sounds, because MySQL applies the distinct AFTER converting strings to integers.
I want to join two tables, but sum only over one of the tables, but I can't find a way to do it.
A simple example (using MySQL 5.0.22):
create temporary table t1 (ID int, value int);
insert into t1 values (1,3),(2,3);
create temporary table t2 (ID int, value int);
insert into t2 values (1,4),(2,5);
select sum(value) from t1;
-> 6
select sum(distinct concat(t1.value,',',t1.ID)) from t1 JOIN t2;
-> 3
How can I get the answer 6 after joining the two tables?
I cannot do the sum in a subquery before joining because I want to use a more complicated query with GROUP BY WITH ROLLUP.
For example the statement :
SELECT
group_concat(distinct `object_plan`.`object_plan_id` separator ',') AS `object_plan_id`,
group_concat(distinct `object_plan`.`name` separator ',') AS `object_plan_name`
FROM object_plan;
returns object_plan_id as BLOB and object_plan_name as VARCHAR ...
So be carefull when using GROUP_CONCAT, especially with ColdFusion and its list processing functions ... It can lead to "ByteArray Object cannot be converted to String" errors since BLOBS are hadled by CF as binary objects ...
Trying to pick out the number of posts and then first AND last poster in a discussion forum using GROUP BY in a single select.
TABLE posts
post_id
head_id
post_text
author_name
whenAdded
try this:
select count(head_id), max(concat(whenAdded,","author_name)) as lastAuthor, min(concat(whenAdded,",",author_name) as firstAuthor from posts group by post_id
Splitting the date from the author_name is relatively trivial
Saul Dobney
The method to find the median of a set of data posted by Barbara Kreaseck is a very very good method. Base on her method, the following one is a modified version.
SELECT AVG(val) FROM (
SELECT x.val FROM
data x, data y GROUP BY x.val HAVING
((SUM(SIGN(1-SIGN(y.val-x.val))))>=floor((COUNT(*)+1)/2)) and
((SUM(SIGN(1+SIGN(y.val-x.val))))>=floor((COUNT(*)+1)/2)));
Table bodyfat format:
bid INT :autoinc PK record ID
current_fat INT
points INT : if user lowers fat awarded a point
entry_date DATE
uid INT : User ID
For a company wide report we needed the starting body fat, the current body fat and the sum of the points for the month group by user.
SELECT `uid`, bid,sum(miles) as miles, `current_fat`, mid(max(concat(bid,":",current_fat)),instr(max(concat(bid,":",current_fat)),":")+1) as max_uid, mid(min(concat(bid,":",current_fat)),instr(min(concat(bid,":",current_fat)),":")+1) as min_uid
FROM `bodyfat`
WHERE `uid` IN (2,47,48,52,54,57,58,59,60,65,66,67,68,69,70,71,72,73,88,90,96,108,117,123,164,177)
AND date_format(`entry_date`,'%m/%Y') = '05/2007'
GROUP BY uid;
The key was finding the post showing concat. Thanks
Another median example using ranking and variables, took 0.35 secs for 200K values.
SET @R:=0;select val from (select val,(@R:=@R+1) r from table order by val asc) t where r=((select count(*)+1 from table) DIV 2);
A litte problem I had with the group_concat function was when selecting integers only I was getting a blob instead of a string to solve this I used a cast to convert the integers first:
SELECT GROUP_CONCAT(CAST(myInt as CHAR)) myInts FROM aTable;
If you need to SUM specific group:
select month,
if(job="1" or job="3" ,"Sum of job 1 and 3",
if(job="2" or job="6" ,"Sum of job 2 and 6",
if(job="4" or job="7" ,"Sum of job 4 and 7",
if(job="5" or job="9" ,"Sum of job 5 and 9",
if(job="8" ,"Sum of job 8"," "))))) as name_of_job,
sum(if((pl_porezi.vrstaporeza="Porez") ,saldo ,0)) as Porez ,
"15%" as Percent,
sum(if((pl_porezi.vrstaporeza="Prirez") ,saldo ,0)) as Prirez
from pl_porezi
left outer join kadarp on (pl_porezi.sifra=kadarp.sifra)
where pl_porezi.month="11"
group by month,
(job="1" or job="3"),
(job="2" or job="6"),
(job="4" or job="7"),
(job="5" or job="9"),
(job="8" )
The following method has been proposed for calculating medians but it does not allow the calculation of medians per group:
SELECT AVG(fld) FROM (
SELECT x.fld FROM
tbl AS x, tbl AS y GROUP BY x.fld HAVING
((SUM(SIGN(1-SIGN(y.fld-x.fld))))>=floor((COUNT(*)+1)/2)) and
((SUM(SIGN(1+SIGN(y.fld-x.fld))))>=floor((COUNT(*)+1)/2))
But there is another way that is easy to understand and implement:
1) Create table ordered by the grouping variable and the frequency.
2) Add counter to it. That is enough information to allow the median per group to be calculated.
3) We filter to retain the middle record(s)
4) Then get the AVG of that per group (if one record, no harm done but returns correct result if a draw).
e.g. with City as the grouping variable and value as the variable we want the median of
Counter City Value
11 New York 12
12 New York 26
13 New York 111
14 New York 123
The middle record(s) for the New York group are 11 + FLOOR((max-min)/2) AND 11 + CEIL((max-min)/2)
i.e. 11 + FLOOR((14-11)/2) AND 11 + CEIL((14-11)/2)
i.e. 11 + FLOOR((1.5) AND 11 + CEIL(1.5)
i.e. 11 + 1 AND 11 + 2
i.e. 12 and 13, which is correct
And it works if there is an odd number of records too:
e.g. if there was a fifth record
15 New York 176
11 + FLOOR((15-11)/2) AND 11 + CEIL((15-11)/2)
11 + 2, 11 + 2 which is record 13, which is correct
Implementation:
1) Make table but NB to sort by grouping variable, and by value we want median of. For this example the id we have added is intID.
2) Add counter:
Run "SET @cnt := 0;"
Run "UPDATE your_table_name_here
SET intID = @cnt := @cnt + 1"
3) Run a nested query
Text of Inner Query:
"SELECT grouping_var_name_here,
value_we_want_median_of,
MIN(intID) + FLOOR((MAX(intID) - MIN(intID))/2) As tag1,
MIN(intID) + CEIL((MAX(intID) - MIN(intID))/2) As tag2
FROM your_table_name_here
GROUP BY grouping_var_name_here"
Text of Outer Query:
"SELECT your_table_name_here.grouping_var_name_here,
FORMAT(AVG(value_we_want_median_of),1) AS median_of_values
FROM (text_of_inner_query_goes_here) AS qryid_tags INNER JOIN
your_table_name_here
USING(grouping_var_name_here)
WHERE intID IN(tag1, tag2)
GROUP BY grouping_var_name_here
ORDER BY grouping_var_name_here"
The best thing about this method is you can check it step by step and see why it should work.
-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:
DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;
INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);
-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
@a := (@a + 1) mod o.c AS shifted_n,
IF(@a mod o.c=0, o.c, @a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,
-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;
-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1
-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4
-- from there we can select the n-th element on the position: count div 2 + 1
When you need to calculate average value of items stored in a row (where some values can be NULLs) you would find COALESCE() function very useful.
Exemplary usage when we have three items per row would be:
SELECT *,
(COALESCE(V.rank_0, 0)
+ COALESCE(V.rank_1, 0)
+ COALESCE(V.rank_2, 0))
/
(3 -
(COALESCE(V.rank_0 - V.rank_0, 1)
+ COALESCE(V.rank_1 - V.rank_1, 1)
+ COALESCE(V.rank_2 - V.rank_2, 1))
) AS row_avg FROM voting V
I talk about it in more detail on my blog post at:
http://tech-blog.borychowski.com/index.php/2009/02/mysql/average-value-in-a-row/
As is stated here, the default number of characters returned by GROUP_CONCAT is 1024.
On our Linux system, we were able to change the setting easily by altering the file located at:
/etc/my.cnf
And adding this line under the [mysqld] group heading:
group_concat_max_len=99999
"99999" is the new number of characters that will be returned.
Also, a restart of mysql is required.
I have a new way to calculate the median:
select a.rownum, a.id, a.value
FROM
(SELECT @rownum:=@rownum+1 AS rownum,id, value
FROM (SELECT @rownum:=0) r, DATA
order by value) a,
(select round(count(*)/2) medio FROM DATA) b
WHERE a.rownum= b.medio;
If you need find the median by some agrupation:
select b.group_id,a.rownum, a.id, a.value
FROM
(SELECT if(@group=group_id,@rownum:=@rownum+1,@rownum:=0) AS rownum,
@group:=group_id as group,id, value
FROM DATA
order by group_id, valor ) a,
(select group_id,round(count(*)/2) medio
FROM DATA
group by group_id) b
WHERE a.rownum= b.medio and a.group=b.group_id;
If you want more information about this in spanish: http://senderodigital.wordpress.com
Further to Pepe Juarez's code, you can do the following to average 2 values to get the median where there are an even number of rows (also works as expected with an odd number of rows). The row number will end with .5 if there's an even number of rows.
select avg(a.rownum), avg(a.value)
FROM (SELECT @rownum:=@rownum+1 AS rownum,value
FROM (SELECT @rownum:=0) r, DATA
order by value
) a,
(select count(*)/2 median FROM DATA) b
WHERE a.rownum between (b.median - 0.5) and (b.median +0.5)
Terry's code is not quite right. change
select count(*)/2 median
to
select 0.5+count(*)/2 median
so that an odd number of rows yields an integer rownum.
simple median in two selects : (median date in this example)
// get number of rows that fits our filtering need (bigger then a certain date for example)
select count(*) as counter
from table
where date_field > 1/1/2000
// now save the row count in count variable lets say $rows_count in php
<?php
...
$rows_count = $row['counter'];
?>
// now get the median row (the one the is in the middle of the list, filtered and ordered by our desired field
select date_field as median_date
from table
where date_field > 1/1/2000
order by date_field
limit '.$rows_count/2.',1
$row['median_date'] will contain the median date :)
enjoy.
Add your own comment.