Jay Pipes is the Community Relations Manager for North America at MySQL. He joined MySQL in January, 2006, and is co-author of Pro MySQL (Apress, 2005). Frequently speaking at conferences about MySQL, Jay lives in Columbus, Ohio, with his wife Julie, and his extremely needy two cats and two dogs.
Damien Seguy had an interesting observation on my use of self joins for generating rolling sums:
What about using a user variable to make the rolling sum from the beginning? Something simple enough like this:
set @sum := 159; // to start with the original number of post, or 0 if you prefer. set @mn := 0; // for the month number select @mn := @mn + 1 as MonthNumber Month, Added, @sum := @sum + Added as RunningTotal from ResultListing1;
Damien, of course, is correct, that I could have used a user variable like the above to generate the rolling sum, instead of using a self join. However, there are some complications. First, let's look at what happens if I rewrite Lenz' original GROUP BY statement to include a user variable to calculate running totals:
mysql> SELECT COUNT(*) INTO @running_total -> FROM feeds WHERE created < '2007-01-01'; Query OK, 1 row affected (0.00 sec) mysql> SELECT -> MONTHNAME(created) AS Month -> , COUNT(*) AS Added -> , @running_total := @running_total + COUNT(*) AS RunningTotal -> FROM feeds -> WHERE created >= '2007-01-01' -> GROUP BY MONTH(created); +----------+-------+--------------+ | Month | Added | RunningTotal | +----------+-------+--------------+ | January | 1 | 160 | | February | 1 | 160 | | March | 11 | 170 | | April | 8 | 167 | | May | 18 | 177 | | June | 4 | 163 | +----------+-------+--------------+ 6 rows in set (0.00 sec)
As you can see, we have a bit of a problem. Clearly, the @running_total user variable is not being calculated correctly when used both as an output and in an assignment within the same column. Let's see what happens if I break the user variable assignment out into a separate column...
mysql> SELECT COUNT(*) INTO @running_total -> FROM feeds WHERE created < '2007-01-01'; Query OK, 1 row affected (0.00 sec) mysql> SELECT @running_total; +----------------+ | @running_total | +----------------+ | 159 | +----------------+ 1 row in set (0.00 sec) mysql> SELECT -> MONTHNAME(created) AS Month -> , COUNT(*) AS Added -> , @running_total := @running_total + COUNT(*) -> , @running_total AS RunningTotal -> FROM feeds -> WHERE created >= '2007-01-01' -> GROUP BY MONTH(created); +----------+-------+---------------------------------------------+--------------+ | Month | Added | @running_total := @running_total + COUNT(*) | RunningTotal | +----------+-------+---------------------------------------------+--------------+ | January | 1 | 160 | 159 | | February | 1 | 160 | 159 | | March | 11 | 170 | 159 | | April | 8 | 167 | 159 | | May | 18 | 177 | 159 | | June | 4 | 163 | 159 | +----------+-------+---------------------------------------------+--------------+ 6 rows in set (0.00 sec)
Still no dice. However, in the above output, you can more clearly see what is actually happening in the statement: the COUNT(*) calculation is made during the GROUP BY phase, which occurs after the selection of data rows. Unfortunately, the calculation of user variable assignment is done during the row selection phase, and since the assignment of @running_total to itself plus the grouped COUNT(*) requires information not available during the row selection, the assignment is not done according to our expectations. To further demonstrate this, take a look at what happens if we add a constant number, and not the COUNT(*) to the @running_total variable in a separate column:
mysql> SELECT COUNT(*) INTO @running_total -> FROM feeds WHERE created < '2007-01-01'; Query OK, 1 row affected (0.00 sec) mysql> SELECT @running_total; +----------------+ | @running_total | +----------------+ | 159 | +----------------+ 1 row in set (0.00 sec) mysql> SELECT -> MONTHNAME(created) AS Month -> , COUNT(*) AS Added -> , @running_total := @running_total + 1 -> , @running_total AS RunningTotal -> FROM feeds -> WHERE created >= '2007-01-01' -> GROUP BY MONTH(created); +----------+-------+--------------------------------------+--------------+ | Month | Added | @running_total := @running_total + 1 | RunningTotal | +----------+-------+--------------------------------------+--------------+ | January | 1 | 160 | 159 | | February | 1 | 161 | 159 | | March | 11 | 162 | 159 | | April | 8 | 163 | 159 | | May | 18 | 164 | 159 | | June | 4 | 165 | 159 | +----------+-------+--------------------------------------+--------------+ 6 rows in set (0.01 sec)
As you can see above, adding a constant to the user variable works as expected, resulting in an incrementing running total. But since the COUNT(*) expression is evaluated during the GROUP BY phase, the running total assignment is done per group, and therefore does not correspond to our expectation of a running total.
So, how do we get around this issue, but still use a user variable to do running sums? Well, we could go back to our old friend, the derived table, to force the GROUP BY calculations to first be performed, and then apply Damien's technique. And, as you can see below, this works just fine:
mysql> SELECT COUNT(*) INTO @running_total -> FROM feeds WHERE created < '2007-01-01'; Query OK, 1 row affected (0.01 sec) mysql> SELECT @running_total; +----------------+ | @running_total | +----------------+ | 159 | +----------------+ 1 row in set (0.00 sec) mysql> SELECT -> Month -> , Added -> , @running_total := @running_total + Added AS RunningTotal -> FROM ( -> SELECT -> MONTHNAME(created) AS Month -> , COUNT(*) AS Added -> FROM feeds -> WHERE created >= '2007-01-01' -> GROUP BY MONTH(created) -> ) AS t; +----------+-------+--------------+ | Month | Added | RunningTotal | +----------+-------+--------------+ | January | 1 | 160 | | February | 1 | 161 | | March | 11 | 172 | | April | 8 | 180 | | May | 18 | 198 | | June | 4 | 202 | +----------+-------+--------------+ 6 rows in set (0.00 sec)
Whether you consider the above to be easier or more readable than the self join technique is entirely up to you. Both methods produce identical results.
But, there's one more reason I like to stick with the self-join technique, and that is for situations where you want more than just a running sum, but also want other running aggregates. A common example would be if I wanted to also see the running average of feeds added to PlanetMySQL over the lifetime of the Planet's history. Using a self-join technique, the code looks like this:
mysql> SELECT -> x1.YearMonth -> , x1.MonthName -> , x1.Added -> , SUM(x2.Added) AS RunningTotal -> , FLOOR(AVG(x2.Added)) AS RunningAvg -> FROM -> ( -> SELECT -> DATE_FORMAT(created, '%Y-%m') AS YearMonth -> , MONTHNAME(created) AS MonthName -> , COUNT(*) AS Added -> FROM feeds -> GROUP BY DATE_FORMAT(created, '%Y-%m') -> ) AS x1 -> INNER JOIN ( -> SELECT -> DATE_FORMAT(created, '%Y-%m') AS YearMonth -> , MONTH(created) AS MonthNo -> , MONTHNAME(created) AS MonthName -> , COUNT(*) AS Added -> FROM feeds -> GROUP BY DATE_FORMAT(created, '%Y-%m') -> ) AS x2 -> ON x1.YearMonth >= x2.YearMonth -> GROUP BY x1.YearMonth; +-----------+-----------+-------+--------------+------------+ | YearMonth | MonthName | Added | RunningTotal | RunningAvg | +-----------+-----------+-------+--------------+------------+ | 2005-07 | July | 32 | 32 | 32 | | 2005-08 | August | 10 | 42 | 21 | | 2005-09 | September | 15 | 57 | 19 | | 2005-10 | October | 6 | 63 | 15 | | 2005-11 | November | 5 | 68 | 13 | | 2005-12 | December | 7 | 75 | 12 | | 2006-01 | January | 1 | 76 | 10 | | 2006-02 | February | 4 | 80 | 10 | | 2006-03 | March | 20 | 100 | 11 | | 2006-04 | April | 7 | 107 | 10 | | 2006-05 | May | 5 | 112 | 10 | | 2006-06 | June | 12 | 124 | 10 | | 2006-07 | July | 5 | 129 | 9 | | 2006-08 | August | 12 | 141 | 10 | | 2006-09 | September | 6 | 147 | 9 | | 2006-10 | October | 5 | 152 | 9 | | 2006-11 | November | 4 | 156 | 9 | | 2006-12 | December | 3 | 159 | 8 | | 2007-01 | January | 1 | 160 | 8 | | 2007-02 | February | 1 | 161 | 8 | | 2007-03 | March | 11 | 172 | 8 | | 2007-04 | April | 8 | 180 | 8 | | 2007-05 | May | 18 | 198 | 8 | | 2007-06 | June | 4 | 202 | 8 | +-----------+-----------+-------+--------------+------------+ 24 rows in set (0.00 sec)
The above SQL statement is a slightly modified version of the running totals SQL with a self join that I demonstrated in yesterday's article. The only changes are that I removed the WHERE clause in the subqueries which limited the sets to only feeds from 2007 and I changed MONTH(created) to DATE_FORMAT(created, '%Y-%m') to also include the year.
Notice that other than the above unrelated changes, all I had to do to get a running average with the same self join was to add this line to the SELECT clause:
, FLOOR(AVG(x2.Added)) AS RunningAvg
The FLOOR() function simply truncates any decimal points in the average and rounds down. But, it's just that simple to add a running average to this kind of query. How would we have done this using user variables? Well, it gets fairly complicated, since averages rely on the total number of items being averaged, which needs to be stored in a separate user variable. Here is how to do it using user variables:
mysql> SET @num_rows := 0; Query OK, 0 rows affected (0.00 sec) mysql> SET @running_total := 0; Query OK, 0 rows affected (0.00 sec) mysql> SELECT -> YearMonth -> , Added -> , @running_total := @running_total + Added AS RunningTotal -> , @running_avg := FLOOR(((@running_total - Added) + Added) -> / (@num_rows := @num_rows + 1)) AS RunningAvg -> FROM ( -> SELECT -> DATE_FORMAT(created, '%Y-%m') AS YearMonth -> , COUNT(*) AS Added -> FROM feeds -> GROUP BY DATE_FORMAT(created, '%Y-%m') -> ) AS t; +-----------+-------+--------------+------------+ | YearMonth | Added | RunningTotal | RunningAvg | +-----------+-------+--------------+------------+ | 2005-07 | 32 | 32 | 32 | | 2005-08 | 10 | 42 | 21 | | 2005-09 | 15 | 57 | 19 | | 2005-10 | 6 | 63 | 15 | | 2005-11 | 5 | 68 | 13 | | 2005-12 | 7 | 75 | 12 | | 2006-01 | 1 | 76 | 10 | | 2006-02 | 4 | 80 | 10 | | 2006-03 | 20 | 100 | 11 | | 2006-04 | 7 | 107 | 10 | | 2006-05 | 5 | 112 | 10 | | 2006-06 | 12 | 124 | 10 | | 2006-07 | 5 | 129 | 9 | | 2006-08 | 12 | 141 | 10 | | 2006-09 | 6 | 147 | 9 | | 2006-10 | 5 | 152 | 9 | | 2006-11 | 4 | 156 | 9 | | 2006-12 | 3 | 159 | 8 | | 2007-01 | 1 | 160 | 8 | | 2007-02 | 1 | 161 | 8 | | 2007-03 | 11 | 172 | 8 | | 2007-04 | 8 | 180 | 8 | | 2007-05 | 18 | 198 | 8 | | 2007-06 | 4 | 202 | 8 | +-----------+-------+--------------+------------+ 24 rows in set (0.00 sec)
As you can see, it is possible to do running averages with user variables, but it involves a tricky double user variable assignment in the column calculation of the RunningAvg column:
, @running_avg := FLOOR(((@running_total - Added) + Added) / (@num_rows := @num_rows + 1)) AS RunningAvg
The first assignment is actually done for the @num_rows user variable in an internal parentheses. The second (outer) assignment is done for the @running_avg user variable. We have to do this because to correctly calculate the average, we need to divide the total added feeds by the number of months (stored in the incrementing @num_rows) variable...
Personally, for reasons such as this, I prefer to use the self-join technique as it leads to slightly more readable code in the long run. In a later article, I'll look at benchmarking the preformance of each method to see if there's any appreciable difference...
If you'd like to comment on this article, Jay welcomes comments on his blog, http://jpipes.com.
Read and post comments on this article in the MySQL Forums. There are currently 2 comments.