Stats In MySQL Pt. III: Alphabetical Distribution
Not as snazzy-looking as my last two, but a useful query nonetheless. One the applications we built over the years, among other things, takes a list of names and addresses (currently a little over 2600), and selects around 200 of them at random. The client has been keeping track of the alphabetical distribution of last names with each random batch, and was concerned that the apparent weighting of the list toward the beginning of the alphabet was evidence that the lists were not truly random.
Fortunately I was able to get a quick statistical breakdown of the overall list, using the following:
SELECT SUBSTRING(ln.last_name, 1, 1) AS lInitial,
COUNT(ln.id) AS lTotal,
100*COUNT(ln.id)/ncount.total AS lPercent
FROM names ln
CROSS JOIN (
SELECT COUNT(*) AS total
FROM names
) AS ncount
GROUP BY lInitial
SUBSTRING(ln.last_name, 1, 1) gets the first letter of the last name. Grouping on this expression via the alias initial ensures that subsequent COUNT(ln.id) result is divided up on a by-initial basis.
If you just want totals this is enough, but for statistical information like percentage you need the total COUNT, which you can't grab directly because of the GROUP BY that was needed for the earlier part. In comes the derived table, via CROSS JOIN (SELECT COUNT(*) AS total FROM names) AS ncount. In this case, the table is MyISAM, so COUNT(*) is a pre-stored constant, preventing any significant performance hit from calculating the count repeatedly. From here it is a simple step to make a readable percentage, using ROUND(100*COUNT(ln.id)/ncount.total, 2), which produces numbers like 25.34, 1.20, etc, which is generally nicer than 0.015823 and the like.
Just because that was a little boring, here's an added problem: statistical distribution of each letter for both first and last initial in one shot.
This one gets a bit more complex, as you need another set of COUNT statements, this time grouping on the first initial of the first name. This means another subquery
LEFT JOIN (I had hoped to use the lInitial alias in the ON clause, but that didn't fly, unfortunately, so I was forced to recreate the SUBSTRING expression.
SELECT SUBSTRING(n.first_name, 1, 1) AS fInitial,
COUNT(p.id) AS fTotal
FROM names n
GROUP BY fInitial
) fn ON fn.fInitial = SUBSTRING(ln.last_name, 1, 1)
Now that we have the numerical distribution of first initials, we can perform the same mathematical operations as we did on the last initials, and come out with this final query:
SELECT SUBSTRING(ln.last_name, 1, 1) AS lInitial,
COUNT(ln.id) AS lTotal,
ROUND((COUNT(on.id)/ncount.total)*100, 2) AS lPercent,
fn.fTotal,
ROUND((fn.fTotal/pcount.total)*100, 2) AS fPercent
FROM names fn
LEFT JOIN (
SELECT SUBSTRING(n.first_name, 1, 1) AS fInitial,
COUNT(n.id) AS ftotal
FROM names n
GROUP BY fInitial
) fn ON fn.fInitial = SUBSTRING(ln.last_name, 1, 1)
CROSS JOIN (
SELECT COUNT(*) total FROM names
) AS ncount
GROUP BY lInitial
It's not terribly pretty, but it does the job. Since none of the subqueries are correlated, they shouldn't impact performance significantly. Feel free to contribute thoughts on cleaning it up a bit.
(Oh, in case you were wondering, J was the most common first initial, and S generally dominated last names. Guess we've got a strong John Smith population here in western Washington.)



Could you pkease direct me to an analysis of freqency of first letter for surnames?
Kind regards
Left by Ray Christmas | Jun. 12, 2009 at 10:27pm
That is, for all known surnames.
Left by Ray Christmas | Jun. 12, 2009 at 10:27pm
Leave a Comment