SiteCrafting Blah Blah Blog
Oct. 27, 2006 at 11:34am
Cross Table Content Search
One of the more interesting adventures working with data storage is trying to aggregate information meaningfully from multiple very different data structures. Imagine you've got a website filled with content (say, a few hundred pages). All the content is stored and output dynamically - who wants to create and maintain 500 static html pages, anyway? And of course, you need a bit of variety, so all this content is spread across five different page designs, each requiring its own template and data structure.
Now you say you want to search your site? All the content? And you want the results all together in one big happy sorted-by-relevancy list? How on earth am I going to do that?
Well... like this.
Background
Our basic data structures are pretty well cemented for dynamic front-end content. Every page has an entry in a core table (we'll call it page). The page entry has all the data that is universal. ID number, meta information, title, as well as parent and order numbers (for navigation hierarchy. In addition, we have a template id and a content id.
The template id points to an entry in page_templates. Here you find the template names and, more importantly, the content table names. Remember that each template needs its own table as content block layout varies. With content table name and content id handy, you can leap to the actual page content.
The Search
Searching a single table's content is a relatively easy matter. Put a FULLTEXT index on the relevant fields and run
SELECT id
FROM table
WHERE MATCH(search_columns) AGAINST (search_string)
Since you designed the table, you know what fields to search on, right? Heck, this would work even for multiple tables. Just loop through the searches, writing a custom set of search indices for each straight into your queries.
This is all well and good when you're working on a search for a single website. But working for a fairly prolific development company, who conveniently use the same basic table layout for the vast majority of their content managed front-end sites, the smart move is to write a function that discovers these indices on its own. This can be achieved using SHOW INDEX FROM content_table. From there, parse the relevant indices like so:
while($row = mysql_fetch_assoc($res))
{
if($row['Index_type'] == 'FULLTEXT')
{
$search_cols .= $row['Column_name'].', ';
}
}
Stick $search_cols up there in your single-table search, and suddenly you've got a dynamic FULLTEXT search that you can use on any table.
Aggregating the Data
Unfortunately, on a seven template site you've still got seven different sets of search results. The FULLTEXT search will return each result set in weighted order, but the user doesn't care what's using what template. They want the whole list to be in weighted order. To handle this, I tossed in a bit of array processing. First, restructure the query a bit to actually return the weight value itself:
SELECT id, MATCH(search_columns) AGAINST('search_string') AS weight
FROM content_table
WHERE MATCH(search_columns) AGAINST('search_string')
I ran this for each content table, and appended all the results onto a single array. From there I just wrote a simple php usort comparison function to sort on the table weight (and obviously the queries were expanded a bit to draw some actual useful information, beyond just the page id).
Of course, with anything that works, the big question is always "How can I make it better?" So... how can I make it better? Right now our production servers are running MySQL 4.0 (to paraphrase Peter Zaistev, as long as it works, why put a whole lot of data in danger for an upgrade that isn't necessary?). I've got some notions regarding putting more of the load on the database side, but those will take stored routines (5.0) and/or subqueries (4.1). But with where we're at now, does anyone else have a notion of how to improve this solution? Or perhaps a completely different solution to the same problem?
Posted in MySQL, PHP by Joe Izenman
Comments (0)
Add your comment below