Stats in MySQL Pt. I: Outliers

Aug26
Missing Image
By Joe Izenman

With most large sets of data, especially numerical data, statistical analysis plays a key role. You can't be bothered to look at every record yourself; that's what computers are for.

One useful tool in any statistical analysis is the identification of outliers. Assuming you have a normally distributed set of data, outliers can help to identify user error in the data entry process, or genuine spikes in the data. Once found, these numbers can be set aside for closer analysis or eliminated to normalize the data set.

There are many different methods for identifying outliers, with varying levels of rigor. Here I'll just demonstrate one of the simplest definitions: an outlier is any value greater than three standard deviations away from the mean.

First we'll need some data. To demonstrate, I've created the simplest of tables:
CREATE TABLE randomNumbers (
    value INT(11)
);
And populated it with something vaguely resembling normal data, about 10,000 rows of:
INSERT INTO randomNumbers(value) VALUES
(
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000)
)

The formula for finding the number of standard deviations a point is from the sample's mean (in terms of MySQL aggregate functions) is (value - AVG(value) ) / STDDEV(value). Unfortunately, due to the way aggregate functions work, this doesn't quite fly:
mysql> SELECT value, (value - AVG(value)) / STDDEV(value)
-> FROM randomNumbers;
ERROR 1140 (42000): Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no
GROUP columns is illegal if there is no GROUP BY clause
Since we need to work with the full data set, we can't reasonably use GROUP BY, so we need to get the mean and standard deviation by some other method. Assuming that the data set is still growing, we don't want to risk grabbing this information separately, so instead we will join it in via a derived table (a.k.a. a subquery in the FROM clause):
mysql> SELECT num.value, agro.mean, agro.dev
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> ORDER BY num.value;
+-------+-----------+----------+
| value | mean      | dev      |
+-------+-----------+----------+
|  1462 | 5006.7004 | 908.8870 |
|  1827 | 5006.7004 | 908.8870 |
|  1980 | 5006.7004 | 908.8870 |

...

|  8087 | 5006.7004 | 908.8870 |
|  8094 | 5006.7004 | 908.8870 |
|  8212 | 5006.7004 | 908.8870 |
+-------+-----------+----------+
10166 rows in set (0.04 sec)
Now that this aggregate information is available to each row in the result set, we can calculate the number of deviations for each value:
mysql> SELECT num.value, (value - mean) / dev AS num_devs
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |

...

|  5006 | -0.00077061 |
|  5007 |  0.00032963 |

...

|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
10166 rows in set (0.07 sec)
Since we've defined an outlier as any value that is more than three standard deviations away from the mean, we filter on the absolute value of num_devs to get all outliers in both directions:
mysql> SELECT num.value, (num.value - agro.mean) / agro.dev AS num_devs 
    -> FROM randomNumbers num
    -> CROSS JOIN (
-> SELECT AVG(value) AS mean, STDDEV(value) AS dev
-> FROM randomNumbers
-> ) agro
    -> WHERE ABS( num.value - agro.mean ) / agro.dev > 3
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |
|  2071 | -3.22999493 |
|  2142 | -3.15187741 |
|  2187 | -3.10236630 |
|  2254 | -3.02864977 |
|  2254 | -3.02864977 |
|  7816 |  3.09092285 |
|  7888 |  3.17014062 |
|  7958 |  3.24715790 |
|  7962 |  3.25155888 |
|  7997 |  3.29006752 |
|  8076 |  3.37698702 |
|  8077 |  3.37808726 |
|  8087 |  3.38908973 |
|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
18 rows in set (0.05 sec)
If you're just looking for spikes on one extreme or the other, you can ditch the ABS in the WHERE clause (and change it to < -3 to find the low end). And of course you can adjust the starting point (currently 3) if you want a larger or smaller definition of an outlier.

Other methods

There are certain weaknesses to this method. If memory serves me, use of aggregate functions does not work with the query cache, so the AVG and STDDEV are calculated at every row, despite being the same all the way through. If you are working with a transactional storage engine using the REPEATABLE READ isolation level, you can start a transaction, assign the mean and deviation to user variables, and work with the saved values rather than relying on a subquery.

There are also other mathematical methods. If you calculate the interquartile range of the data (the difference between the quartiles, or the 75th and the 25th percentiles), you can treat any data more than 1.5*IQR outside of the aforementioned quartiles as an outlier. I won't get too deeply into that, as I'd really just be copying and pasting Roland Bouman's article on finding the nth percentile.

Anybody have any other interesting techniques to share?

Dev

Back To Feed