MaxxCAT Pro Tips

Mar16
By Phil Price, Software Architect

Recently, SiteCrafting implemented a MaxxCAT search appliance to do the heavy lifting for websites with deep-search needs.

Today, our MaxxCAT hardware is crawling and querying results for 59,000+ documents across 25 client websites. Along the path of getting everything configured for success, we worked with MaxxCAT and discovered some very helpful bits of functionality we’d like to share with you here.

 

Tip #1: Limit your crawls

Some links are static and predictable, but others are dynamically generated by code - and that’s where a crawl designed to follow links can get unpredictable. A good example of this is the pair of previous/next links on a calendar; a perfectly usable interface for a human, but when a crawl designed to follow all links gets in there, it can click “next” virtually forever. This condition creates a runaway crawl that never quite finishes because it’s following your infinite loop, resulting in a crawl that cannot be queried or diagnosed.

That’s where the CONFIG.MAX_DOCUMENTS crawl setting comes in handy. It limits the crawl to the number of links you specify, thus avoiding runaway crawls altogether. For example, if you know a client is not very likely to have more than 5000 valid links on their website, you would specify:

CONFIG.MAX_DOCUMENTS=5000

Using this setting, you can diagnose crawls that may contain an infinite loop or some other condition that prevents a good crawl. If you observe a crawl bumping up against its limit (e.g. ‘4999 Docs Crawled’), you can take the following steps to diagnose it:

  1. Submit a query through the browser URL bar, using a search term that is expected to return as many results as possible.
    http://maxxcat.domain.com/query.cgi?&query=city&collection=colid7&resStart=0&resLength=5000&callBack=NULL

  2. View the page source to see the returned JSON in a readable form.

  3. Scroll and scan through the results, looking for undesired URL patterns. Configure your crawl to ignore these links by adding ‘EXCLUDE’ configurations. For example:

a. EXCLUDE=events/events
...would sidestep infinite loops created by our erroneous relative links
b. EXCLUDE=events/?m=
...would sidestep infinite loops created by our ‘Previous month’ / ‘Next month’ links
c. EXCLUDE=twitter.com
...would sidestep erroneous links that were considered valid because the client domain was found in the link as a referrer parameter (e.g. http://www.twitter.com/somepage/?refer=http://www.clientdomain.com/somepage/)

By following this process, you can configure clean crawls that return essential and accurate search results.
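The diagnosis steps above can be partly scripted. The following is a minimal Python sketch, not an official MaxxCAT client: the function names are hypothetical, the host is the placeholder from the example URL, and the exclude preview assumes plain substring matching, which may not be how the appliance actually evaluates EXCLUDE. Because the shape of the JSON response is not described here, the helpers work on a plain list of result URLs that you have already pulled out of it.

```python
from collections import Counter
from urllib.parse import urlencode, urlsplit

# Placeholder host copied from the example URL; replace with your appliance.
BASE_URL = "http://maxxcat.domain.com/query.cgi"


def diagnostic_query_url(term, collection, limit):
    """Build a query URL that asks for up to `limit` results."""
    params = {
        "query": term,
        "collection": collection,
        "resStart": 0,
        "resLength": limit,
        "callBack": "NULL",
    }
    return BASE_URL + "?" + urlencode(params)


def top_path_prefixes(urls, depth=2, n=5):
    """Count result URLs by their first `depth` path segments.

    A prefix with a suspiciously high count is a candidate for EXCLUDE.
    """
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        counts["/" + "/".join(segments[:depth])] += 1
    return counts.most_common(n)


def preview_excludes(urls, patterns):
    """Return the URLs that would remain, ASSUMING substring matching."""
    return [u for u in urls if not any(p in u for p in patterns)]
```

Running `top_path_prefixes` over the result URLs before and after adding an exclude pattern gives a quick sense of whether the pattern removes the runaway links without touching legitimate pages.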

 

Tip #2: Specify your snippets

When MaxxCAT returns search results, each result comes with four pieces of information: url, title, meta, and snippet (a preview of some of the text found at the link). By default, MaxxCAT formulates a snippet by parsing the document, extracting content, and assembling a snippet out of that content. This works well for binary documents (PDF, Word, etc.), but for webpages you want to trim out the content that is repeated on every page (e.g. navigation, header, footer) so search results are as accurate as possible.

To accomplish this, start by specifying marker labels in the crawl’s advanced settings. You can use ‘mc-header’ and ‘mc-footer’ because they are simple and memorable.

Next, update your website software to surround relevant page content with corresponding comment markers. You may find it easiest to place them in the website page templates, surrounding the page-specific content like so:

 

<!-- mc-header -->
Content we want to show in the snippet
<!-- mc-footer -->

           

Now when this page is crawled, MaxxCAT will trim out everything above and everything below before determining the snippet of text to associate with this page.
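Before re-crawling, it can help to check that your templates really do wrap the right content. The sketch below is a local sanity check only, not MaxxCAT's own trimming logic: `trim_to_markers` is a hypothetical helper, and it assumes the markers are written as HTML comments such as `<!-- mc-header -->` and `<!-- mc-footer -->`.

```python
import re


def trim_to_markers(html, start="mc-header", end="mc-footer"):
    """Return the markup between the start and end comment markers.

    Assumes markers look like <!-- mc-header --> ... <!-- mc-footer -->.
    Returns the input unchanged when the marker pair is not found.
    """
    pattern = re.compile(
        r"<!--\s*" + re.escape(start) + r"\s*-->(.*?)<!--\s*" + re.escape(end) + r"\s*-->",
        re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else html
```

Feeding a rendered page through this function and comparing the output with what you expect to see in the snippet is a quick way to catch a marker that landed in the wrong template.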

 

Tip #3: Implement meta-tag filtering

With MaxxCAT, it is possible to filter your search results by meta tag, similar to how Google will break down a set of results by Web, Images, Videos, News, Books, etc. For example, one website we built allows visitors to refine their search results further by clicking buttons labeled “Events”, “Tourism”, “Restaurants”, or “Hotels”. When one is clicked, the site re-submits the query to MaxxCAT and only pages that have been tagged with that particular meta tag will be returned. There are a few steps to setting this up:

  1. Open your crawl’s advanced settings and add:
    CONFIG.TEXT_PART_0=searchMetaTag||1|1|/|

  2. Decide on your meta filter values. We chose simple one-word values and prefixed them with ‘fv’ (for filter value) so that the filter value would never be confused with any words in the content, making: fvevents, fvtourism, fvrestaurants, and fvhotels.

  3. Update your web software to include the meta tags in the head markup. If there are multiple tags associated with a page, simply separate them with commas.

  4. Update your query call in the back end to include the filter value. It works by inserting a key=value string like this...
    searchMetaTag=fvrestaurants+
    … in between the query= and your search term, making:
    http://maxxcat.domain.com/query.cgi?&query=searchMetaTag=fvrestaurants+barbeque&collection=colid1&snippetStart=mc-header&snippetEnd=mc-footer&resStart=0&resLength=250&callBack=NULL

This turns a basic search for the word “barbeque” everywhere on a site into a search that only returns actual restaurant pages tagged with the ‘fvrestaurants’ filter value. One thing to note is that you can give a page multiple filter values, but as far as I know you cannot query for multiple filter values.
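Steps 3 and 4 might look something like this in back-end code. This is a hedged sketch in Python rather than our production code: both function names are hypothetical, the host is the placeholder from the example URL, and the meta tag markup (a `name` attribute equal to `searchMetaTag`, matching the first field of the CONFIG.TEXT_PART_0 line) is an assumption you should confirm against your own appliance.

```python
from html import escape
from urllib.parse import quote_plus

# Placeholder host copied from the example URL; replace with your appliance.
BASE_URL = "http://maxxcat.domain.com/query.cgi"


def filter_meta_tag(values, name="searchMetaTag"):
    """Render a meta tag holding comma-separated filter values (assumed markup)."""
    content = ",".join(values)
    return f'<meta name="{escape(name)}" content="{escape(content)}">'


def filtered_query_url(term, filter_value, collection, length=250):
    """Build a filtered query URL in the same shape as the example above."""
    # The filter is placed between "query=" and the visitor's search term.
    query = f"searchMetaTag={filter_value}+{quote_plus(term)}"
    return (
        f"{BASE_URL}?query={query}"
        f"&collection={collection}"
        "&snippetStart=mc-header&snippetEnd=mc-footer"
        f"&resStart=0&resLength={length}&callBack=NULL"
    )
```

The query string is assembled by hand instead of with a generic URL encoder so that the `=` and `+` inside the `query` value stay unescaped, as they appear in the example URL; only the visitor's search term is encoded.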


That’s all we have for now. Happy developing!

