Aug. 26, 2008 at 8:45amStats in MySQL Pt. I: Outliers

Finding spikes in your normal data

With most large sets of data, especially numerical data, statistical analysis plays a key role. You can't be bothered to look at every record yourself; that's what computers are for.

One useful tool in any statistical analysis is the identification of outliers. Assuming you have a normally distributed set of data, outliers can help to identify user error in the data entry process, or genuine spikes in the data. Once found, these numbers can be set aside for closer analysis or eliminated to normalize the data set.

There are many different methods for identifying outliers, with varying levels of rigor. Here I'll just demonstrate one of the simplest definitions: an outlier is any value greater than three standard deviations away from the mean.First we'll need some data. To demonstrate, I've created the simplest of tables:
CREATE TABLE randomNumbers (
    value INT(11)
);
And populated it with something vaguely resembling normal data, about 10,000 rows of:
INSERT INTO randomNumbers(value) VALUES
(
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000)
)

The formula for finding the number of standard deviations a point is from the sample's mean (in terms of MySQL aggregate functions) is (value - AVG(value) ) / STDDEV(value). Unfortunately, due to the way aggregate functions work, this doesn't quite fly:
mysql> SELECT value, (value - AVG(value)) / STDDEV(value)
-> FROM randomNumbers;
ERROR 1140 (42000): Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no
GROUP columns is illegal if there is no GROUP BY clause
Since we need to work with the full data set, we can't reasonably use GROUP BY, so we need to get the mean and standard deviation by some other method. Assuming that the data set is still growing, we don't want to risk grabbing this information separately, so instead we will join it in via a derived table (a.k.a. a subquery in the FROM clause):
mysql> SELECT num.value, agro.mean, agro.dev
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> ORDER BY num.value;
+-------+-----------+----------+
| value | mean      | dev      |
+-------+-----------+----------+
|  1462 | 5006.7004 | 908.8870 |
|  1827 | 5006.7004 | 908.8870 |
|  1980 | 5006.7004 | 908.8870 |

...

|  8087 | 5006.7004 | 908.8870 |
|  8094 | 5006.7004 | 908.8870 |
|  8212 | 5006.7004 | 908.8870 |
+-------+-----------+----------+
10166 rows in set (0.04 sec)
Now that this aggregate information is available to each row in the result set, we can calculate the number of deviations for each value:
mysql> SELECT num.value, (value - mean) / dev AS num_devs
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |

...

|  5006 | -0.00077061 |
|  5007 |  0.00032963 |

...

|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
10166 rows in set (0.07 sec)
Since we've defined an outlier as any value that is more than three standard deviations away from the mean, we filter on the absolute value of num_devs to get all outliers in both directions:
mysql> SELECT num.value, (num.value - agro.mean) / agro.dev AS num_devs 
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> WHERE ABS( num.value - agro.mean ) / agro.dev > 3
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |
|  2071 | -3.22999493 |
|  2142 | -3.15187741 |
|  2187 | -3.10236630 |
|  2254 | -3.02864977 |
|  2254 | -3.02864977 |
|  7816 |  3.09092285 |
|  7888 |  3.17014062 |
|  7958 |  3.24715790 |
|  7962 |  3.25155888 |
|  7997 |  3.29006752 |
|  8076 |  3.37698702 |
|  8077 |  3.37808726 |
|  8087 |  3.38908973 |
|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
18 rows in set (0.05 sec)
If you're just looking for spikes on one extreme or the other, you can ditch the ABS in the WHERE clause (and change it to < -3 to find the low end). And of course you can adjust the starting point (currently 3) if you want a larger or smaller definition of an outlier.

Other methods

There are certain weaknesses to this method. If memory serves me, use of aggregate functions does not work with the query cache, so the AVG and STDDEV are calculated at every row, despite being the same all the way through. If you are working with a transactional storage engine using the REPEATABLE READ isolation level, you can start a transaction, assign the mean and deviation to user variables, and work with the saved values rather than relying on a subquery.

There are also other mathematical methods. If you calculate the interquartile range of the data (the difference between the quartiles, or the 75th and the 25th percentiles), you can treat any data more than 1.5*IQR outside of the aforementioned quartiles as an outlier. I won't get too deeply into that, as I'd really just be copying and pasting Roland Bouman's article on finding the nth percentile.

Anybody have any other interesting techniques to share?

There are no issues between aggregates and the query cache.
The query cache works on entire result sets, not an execution plan.

Btw try SELECT 1 AS blah, ... FROM .... GROUP BY blah
See if that'll work without the error. MySQL is actually quite loose (by default) with its group by, it will allow aggregates and non-aggregates unless you have the ONLY_FULL_GROUP_BY setting in sql_mode.
The above is just a quick tweak to give you a column to group by, while still making the whole set as the one group.
Tell what you find!

Left by Arjen Lentz | Aug. 26, 2008 at 3:31pm

That's what I get for talking about query cache functionality on the day I left my trusty High Performance MySQL 2nd Edition at home ;)

As for your suggestion, It does run without the error, but doesn't give the results that I'm looking for.

The first try, using SELECT 1 AS blah, ... FROM .... GROUP BY blah resulted in only one row. Since every row had the same value for blah, the GROUP BY blah condensed everything down to one row, essentially serving the same function as if I had called AVG and STDDEV without any non-aggregate columns (there was a column for value, but it was just the value of the first row in the table, since nothing was explicitly ordered).

From there I tried having different values for blah, like so:

SET @blah = 1;
SELECT (@blah := @blah+1) AS blah, value, AVG(value), STDDEV(value)
FROM randomNumbers
GROUP BY blah

But as you can probably guess, having all different values for blah served me no better. Each row was grouped with only itself, meaning that for each result row, the mean was equal to the value, and standard deviation was zero.

Left by Joe at SC | Aug. 26, 2008 at 4:15pm

Oh, and as an interesting side note, in that last query the blah column was incrementing by two, rather than 1. I wonder if this is something in the way aggregated queries are executed, which causes the @blah := @blah+1 to get executed twice before the row is output. Adding more aggregate functions into the mix, the incrementing stays at +2, but once they are stripped out (just SELECT (@blah := @blah+1) AS blah, value FROM ...) it goes back to the expected incrementing by 1.

Left by Joe at SC | Aug. 26, 2008 at 4:46pm

Nice example, but I found this did not work until I added AS before the aliases:

SELECT num.value, agro.mean, agro.dev
FROM randomNumbers AS num
    CROSS JOIN (
        SELECT AVG(value) AS mean, STDDEV(value) AS dev
        FROM randomNumbers
    ) AS agro
ORDER BY num.value;

Left by John | Feb. 21, 2010 at 11:57am

actually, it may just have been a formatting issue, both seem to work now :)

Left by John | Feb. 21, 2010 at 12:00pm

Leave a Comment

Remember me

Name:

Email:

URL:

Comment: * No HTML, http:// will auto-link
* required
Comment Guidelines