Aug 26

With most large sets of data, especially numerical data, statistical analysis plays a key role. You can't be bothered to look at every record yourself; that's what computers are for.

One useful tool in any statistical analysis is the identification of outliers. Assuming you have a normally distributed set of data, outliers can help to identify user error in the data entry process, or genuine spikes in the data. Once found, these numbers can be set aside for closer analysis or eliminated to normalize the data set.

There are many different methods for identifying outliers, with varying levels of rigor. Here I'll just demonstrate one of the simplest definitions: an outlier is any value greater than three standard deviations away from the mean.

First we'll need some data. To demonstrate, I've created the simplest of tables:

CREATE TABLE randomNumbers (
    value INT(11)
);

And populated it with something vaguely resembling normal data (summing ten uniform random numbers gives an approximately normal distribution, courtesy of the central limit theorem), about 10,000 rows of:

INSERT INTO randomNumbers(value) VALUES
(
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000) +
    ROUND(RAND() * 1000)
);

The formula for finding the number of standard deviations a point is from the sample's mean (in terms of MySQL aggregate functions) is (value - AVG(value)) / STDDEV(value). Unfortunately, due to the way aggregate functions work, this doesn't quite fly:

mysql> SELECT value, (value - AVG(value)) / STDDEV(value)
    -> FROM randomNumbers;
ERROR 1140 (42000): Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no
GROUP columns is illegal if there is no GROUP BY clause

Since we need to work with the full data set, we can't reasonably use GROUP BY, so we need to get the mean and standard deviation by some other method. Assuming that the data set is still growing, we don't want to risk grabbing this information separately, so instead we will join it in via a derived table (a.k.a. a subquery in the FROM clause):

mysql> SELECT num.value, agro.mean, agro.dev
    -> FROM randomNumbers num
    -> CROSS JOIN (
    ->     SELECT AVG(value) AS mean, STDDEV(value) AS dev
    ->     FROM randomNumbers
    -> ) agro
    -> ORDER BY num.value;
+-------+-----------+----------+
| value | mean      | dev      |
+-------+-----------+----------+
|  1462 | 5006.7004 | 908.8870 |
|  1827 | 5006.7004 | 908.8870 |
|  1980 | 5006.7004 | 908.8870 |
...
|  8087 | 5006.7004 | 908.8870 |
|  8094 | 5006.7004 | 908.8870 |
|  8212 | 5006.7004 | 908.8870 |
+-------+-----------+----------+
10166 rows in set (0.04 sec)

Now that this aggregate information is available to each row in the result set, we can calculate the number of deviations for each value:

mysql> SELECT num.value, (value - mean) / dev AS num_devs
    -> FROM randomNumbers num
    -> CROSS JOIN (
    ->     SELECT AVG(value) AS mean, STDDEV(value) AS dev
    ->     FROM randomNumbers
    -> ) agro
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |
...
|  5006 | -0.00077061 |
|  5007 |  0.00032963 |
...
|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
10166 rows in set (0.07 sec)

Since we've defined an outlier as any value that is more than three standard deviations away from the mean, we filter on the absolute value of num_devs to get all outliers in both directions:

mysql> SELECT num.value, (num.value - agro.mean) / agro.dev AS num_devs
    -> FROM randomNumbers num
    -> CROSS JOIN (
    ->     SELECT AVG(value) AS mean, STDDEV(value) AS dev
    ->     FROM randomNumbers
    -> ) agro
    -> WHERE ABS( num.value - agro.mean ) / agro.dev > 3
    -> ORDER BY num.value;
+-------+-------------+
| value | num_devs    |
+-------+-------------+
|  1462 | -3.90004522 |
|  1827 | -3.49845514 |
|  1980 | -3.33011739 |
|  2071 | -3.22999493 |
|  2142 | -3.15187741 |
|  2187 | -3.10236630 |
|  2254 | -3.02864977 |
|  2254 | -3.02864977 |
|  7816 |  3.09092285 |
|  7888 |  3.17014062 |
|  7958 |  3.24715790 |
|  7962 |  3.25155888 |
|  7997 |  3.29006752 |
|  8076 |  3.37698702 |
|  8077 |  3.37808726 |
|  8087 |  3.38908973 |
|  8094 |  3.39679146 |
|  8212 |  3.52662058 |
+-------+-------------+
18 rows in set (0.05 sec)

If you're just looking for spikes on one extreme or the other, you can ditch the ABS in the WHERE clause (and change the comparison to < -3 to find the low end). And of course you can adjust the threshold (currently 3) if you want a looser or stricter definition of an outlier.
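To illustrate that one-sided variation, here is a sketch of the same query with the ABS dropped and the comparison flipped to catch only the low-end spikes:

```sql
SELECT num.value, (num.value - agro.mean) / agro.dev AS num_devs
FROM randomNumbers num
CROSS JOIN (
    -- Same derived table as before: one row holding the aggregates.
    SELECT AVG(value) AS mean, STDDEV(value) AS dev
    FROM randomNumbers
) agro
WHERE (num.value - agro.mean) / agro.dev < -3   -- low end only; use > 3 for the high end
ORDER BY num.value;
```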

Other methods

There are certain weaknesses to this method. If memory serves me, use of aggregate functions does not work with the query cache, so the AVG and STDDEV are calculated at every row, despite being the same all the way through. If you are working with a transactional storage engine using the REPEATABLE READ isolation level, you can start a transaction, assign the mean and deviation to user variables, and work with the saved values rather than relying on a subquery.
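As a sketch of that user-variable approach (assuming InnoDB under REPEATABLE READ, so both statements read from the same snapshot):

```sql
START TRANSACTION;

-- Compute the aggregates once and stash them in user variables.
SELECT AVG(value), STDDEV(value) INTO @mean, @dev
FROM randomNumbers;

-- Reuse the saved values instead of recomputing them in a derived table.
SELECT value, (value - @mean) / @dev AS num_devs
FROM randomNumbers
WHERE ABS(value - @mean) / @dev > 3
ORDER BY value;

COMMIT;
```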

There are also other mathematical methods. If you calculate the interquartile range of the data (the difference between the quartiles, or the 75th and the 25th percentiles), you can treat any data more than 1.5*IQR outside of the aforementioned quartiles as an outlier. I won't get too deeply into that, as I'd really just be copying and pasting Roland Bouman's article on finding the nth percentile.
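Still, a rough sketch of the IQR test, built on the same GROUP_CONCAT trick Roland Bouman describes, might look like the following (note that GROUP_CONCAT output is capped by group_concat_max_len, so that variable must be raised enough to hold the full sorted list):

```sql
-- Grab the 25th and 75th percentile values into user variables.
SELECT
    SUBSTRING_INDEX(SUBSTRING_INDEX(vals, ',', CEIL(0.25 * cnt)), ',', -1),
    SUBSTRING_INDEX(SUBSTRING_INDEX(vals, ',', CEIL(0.75 * cnt)), ',', -1)
INTO @q1, @q3
FROM (
    SELECT GROUP_CONCAT(value ORDER BY value) AS vals, COUNT(*) AS cnt
    FROM randomNumbers
) sorted;

-- Anything more than 1.5 * IQR outside the quartiles is an outlier.
SELECT value
FROM randomNumbers
WHERE value < @q1 - 1.5 * (@q3 - @q1)
   OR value > @q3 + 1.5 * (@q3 - @q1)
ORDER BY value;
```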

Anybody have any other interesting techniques to share?
