SiteCrafting, Inc.
Dumping Duplicates
One of the more recent additions to the SiteCrafting CMS arsenal is a comprehensive error logger, tracking all PHP and MySQL errors (by default... other error types can be created on a case by case basis) that occur in new sites we build. Errors are stored in our own intranet system with a timestamp, error body and a site ID (assigned to each client at a different stage of our project workflow). The table looks something like this:
+----+---------------------+---------------+---------+
| id | logTime | text | project |
+----+---------------------+---------------+---------+
| 2 | 2008-05-14 14:42:15 | A PHP Error | 1 |
| 3 | 2008-05-14 14:42:26 | A PHP Error | 1 |
| 4 | 2008-05-14 14:42:34 | A PHP Error | 1 |
| 5 | 2008-05-14 14:42:47 | A MySQL Error | 1 |
| 6 | 2008-05-14 14:42:56 | A MySQL Error | 1 |
| 7 | 2008-05-14 14:43:05 | A PHP Error | 2 |
| 8 | 2008-05-14 14:43:10 | A PHP Error | 2 |
| 9 | 2008-05-14 14:43:21 | A MySQL Error | 2 |
+----+---------------------+---------------+---------+
8 rows in set (0.00 sec)
Obviously sometimes we get duplicate errors coming through. Aside from being mere mortals who aren't always fast enough to correct an error before it recurs, one of the first stages of debugging is to try and replicate the error. These are often pretty easy to manage. It's pretty trivial to search for all matching errors, check them all, and delete them. Sometimes, though, this just doesn't cut it. Like when there are a few different errors with 10,000 occurrences apiece.
First off: how does an error show up 10,000 times before we catch it? Infinite loops. Recently this bit of code gave us some trouble:
while(!feof($handle)) {
$line = fread($handle);
}
This is great until your $handle file fails to open, or closes unexpectedly, at which point each execution of feof throws an error AND returns false, allowing fread to execute and throw a slightly different error, and so on.
There are at least two ways to handle this. We can just update the timestamp of an error every time it is repeated rather than inserting. This turns out to be impractical because it is sometimes important to see the breadth of timing for a given error. Did it happen all at once? Was it clustered at a few moments of downtime? Or was it an ongoing problem?
So instead we need a way to, at our discretion, delete all but one copy of a given (or all) errors. I say all but one because it can be worthwhile to keep a copy for the records, or if you want to clear the logger of duplicates of an as yet unresolved query (for example, if a sudden batch of 20,000 errors starts clogging the system and slowing down your queries). My first inclination was to use a subquery, like so:
DELETE FROM errors
WHERE (id, text) NOT IN (
SELECT id, text
FROM errors
GROUP BY text
)
However, as I was quickly reminded, MySQL does not allow you to delete rows in a table from which you are currently selecting in a subquery. Which is fine, because I'm sure this is far from the quickest solution.
My next try had much more luck:
DELETE e2.*
FROM errors e
INNER JOIN errors e2 ON
e2.text = e.text AND
e2.project = e.project AND
e2.id < e.id
This works best with an auto-increment errors.id field, because that guarantees that we are saving the most recent instance of the error. The timestamp is unreliable for this purpose, since in examples like the infinite loop problem, errors are rolling in far too quickly to have unique stamps.
From an administrative standpoint, there are faster and more robust possibilities. You can populate a new table with the results of:
SELECT *
FROM errors
GROUP BY text, project
and then delete the original table and rename them, for example. However, this doesn't work from an application user standpoint, as the account coming in from PHP should never, at least in this context, have CREATE TABLE privileges.
As long as order of entries doesn't matter to you, this method should work with any non-incrementing primary key as well.
by Joe Izenman | 5/14/2008 4:15pm | Comments (0)
No comments found.