Matcher Design Notes

This document is incomplete at present. It lacks explanation of the min-heap used to keep the best N M-set items (Managing Gigabytes describes this technique well), and there's as yet no documentation of sort_bands.

The PostList Tree

Pairs of PostLists are merged into a "virtual" PostList. This process is repeated until a single virtual PostList remains. The Match object accesses this one PostList, behind which hangs a tree of PostLists.
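The pairwise merging can be illustrated with a minimal sketch. The class and method names here (LeafPostList, OrPostList, drain) are illustrative assumptions rather than the real interface, and weights are left out entirely. The point is only that a merged pair exposes the same interface as a leaf, so merged pairs can themselves be merged.

```python
class LeafPostList:
    """Leaf posting list over a sorted list of document ids (hypothetical)."""

    def __init__(self, docids):
        self._docids = sorted(docids)
        self._pos = -1  # positioned before the first entry until next() is called

    def next(self):
        self._pos += 1

    def at_end(self):
        return self._pos >= len(self._docids)

    def docid(self):
        return self._docids[self._pos]


class OrPostList:
    """Virtual posting list merging two children; it looks like a PostList itself."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        self._started = False

    def next(self):
        if not self._started:
            # First call: move both children onto their first entry.
            self.left.next()
            self.right.next()
            self._started = True
            return
        current = self.docid()
        # Advance every child that is positioned on the docid just returned.
        for child in (self.left, self.right):
            if not child.at_end() and child.docid() == current:
                child.next()

    def at_end(self):
        return self.left.at_end() and self.right.at_end()

    def docid(self):
        heads = [c.docid() for c in (self.left, self.right) if not c.at_end()]
        return min(heads)


def drain(postlist):
    """Read a posting list to the end, as the Match object would."""
    out = []
    postlist.next()
    while not postlist.at_end():
        out.append(postlist.docid())
        postlist.next()
    return out
```

For instance, drain(OrPostList(OrPostList(LeafPostList([1, 4, 9]), LeafPostList([2, 4])), LeafPostList([3, 9, 10]))) reads the whole tree through its root and yields each matching document id once, in ascending order.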

This tree of PostLists is a binary tree. This is more efficient than an n-ary tree in terms of the number of comparisons which need to be performed: <insert proof>. The proof may only be valid for equal-sized posting lists without optimisations, in which case there may be a more efficient way to do this and we may wish to change the code.

The tree is deliberately built in an uneven way, such that we minimise the likely number of times a posting has to be passed up a level. For a group of OR operations, the PostLists with fewest entries are furthest down the tree, minimising the amount of information needing to be passed up the tree. For AND operations the PostLists with most entries are furthest down, allowing maximally sized skip_to's to be performed.
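One way to obtain such an uneven tree for a group of OR operations is to repeatedly pair the two smallest lists, in the manner of Huffman coding. The sketch below assumes that approach; it is not necessarily how the real code builds the tree. The function name build_or_tree, the nested-tuple tree representation, and the use of the sum of sizes as an upper bound on the size of a merged list are all assumptions made for illustration.

```python
import heapq
from itertools import count


def build_or_tree(sizes):
    """Build an uneven OR tree from a non-empty dict of term -> posting count.

    Returns (tree, cost), where tree is a nested (left, right) tuple and cost
    is an upper bound on the number of postings handed up out of OR nodes.
    """
    tiebreak = count()  # keeps heap entries comparable when sizes are equal
    heap = [(n, next(tiebreak), term) for term, n in sizes.items()]
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        n1, _, smaller = heapq.heappop(heap)
        n2, _, larger = heapq.heappop(heap)
        merged = n1 + n2  # "a OR b" can have at most this many postings
        cost += merged
        # Larger branch on the left, smaller on the right.
        heapq.heappush(heap, (merged, next(tiebreak), (larger, smaller)))
    return heap[0][2], cost
```

With sizes {"a": 1000, "b": 10, "c": 20, "d": 300} this gives the tree ("a", ("d", ("c", "b"))) with a cost bound of 1690, whereas the balanced tree (("a", "b"), ("c", "d")) would have a bound of 2660 under the same measure. The sketch covers only the OR case; for AND the ordering is reversed, as described above.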

There are several types of virtual PostList. Each type can be treated as boolean or probabilistic - the only difference is whether the weights are ignored or not. The types are:

[Note: You can use AndNotPostList to apply an inverted boolean filter to a probabilistic query]

All the symmetric operators (i.e. OR, AND, XOR) are coded for maximum efficiency when the right branch has fewer postings than the left branch.

There are 2 main optimisations which the matcher performs: autoprune and operator decay.

autoprune

When one branch of an operator runs out, the operator can sometimes be replaced by its remaining branch. For example, if a branch in the match tree is "A OR B", when A runs out then "A OR B" is replaced by "B". Similar reductions occur for XOR, ANDNOT, and ANDMAYBE (if the right branch runs out). Other operators (AND, FILTER, and ANDMAYBE when the left branch runs out) simply return "at_end" and this is dealt with somewhere further up the tree as appropriate.

An autoprune is indicated by the next or skip_to method returning a pointer to the PostList object which should replace the PostList being read.
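The replacement mechanism might look roughly like the following sketch for the OR case. The names Leaf, Or, advance and run are hypothetical, None stands in for a null pointer, and weights and skip_to are omitted. The convention assumed here is that a returned replacement is already positioned on its current entry, so the caller reads from it without calling next again.

```python
class Leaf:
    """Leaf posting list over sorted docids (hypothetical stand-in)."""

    def __init__(self, docids):
        self.docids = sorted(docids)
        self.pos = -1

    def next(self):
        self.pos += 1
        return None  # a leaf never asks to be replaced

    def at_end(self):
        return self.pos >= len(self.docids)

    def docid(self):
        return self.docids[self.pos]


def advance(postlist):
    """Call next() and follow any replacement it hands back."""
    replacement = postlist.next()
    return postlist if replacement is None else replacement


class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.started = False

    def next(self):
        if not self.started:
            self.started = True
            self.left = advance(self.left)
            self.right = advance(self.right)
        else:
            # Both children are live here, otherwise this node would
            # already have been pruned.
            current = self.docid()
            if self.left.docid() == current:
                self.left = advance(self.left)
            if self.right.docid() == current:
                self.right = advance(self.right)
        # Autoprune: if one branch has run out, hand back the other one.
        if self.left.at_end():
            return self.right
        if self.right.at_end():
            return self.left
        return None

    def at_end(self):
        return False  # an Or is replaced before it can reach its end

    def docid(self):
        return min(self.left.docid(), self.right.docid())


def run(root):
    """Read the tree to the end, recording which node type served each docid."""
    seen = []
    root = advance(root)
    while not root.at_end():
        seen.append((root.docid(), type(root).__name__))
        root = advance(root)
    return seen
```

Running run(Or(Leaf([1, 2]), Leaf([2, 5, 7]))) serves documents 1 and 2 from the Or node; once the left branch runs out, the remaining documents 5 and 7 are read directly from the surviving Leaf.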

operator decay

The matcher tracks the minimum weight needed for a document to make it into the m-set (this increases monotonically as the m-set forms, since a new entry can only displace a lower-weighted one). This can be used to replace one boolean operator with a stricter one. E.g. consider A OR B - when maxweight(A) < minweight and maxweight(B) < minweight then only documents matching both A and B can make it into the m-set so we can replace the OR with an AND. Operator decay is flagged using the same mechanism as autoprune, by returning the replacement operator from next or skip_to.
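The reasoning behind the OR to AND decay can be written out as a small sketch. The function name decay_or is hypothetical, and the sketch assumes that a document matching both branches is weighted by the sum of the two branch weights.

```python
def decay_or(max_a, max_b, min_weight):
    """Decide whether "A OR B" can decay to "A AND B".

    max_a and max_b are upper bounds on the weight each branch can
    contribute; min_weight is the weight currently needed to enter the m-set.
    """
    can_enter = {
        "A only": max_a >= min_weight,
        "B only": max_b >= min_weight,
        # Assumption: a document matching both gets the sum of the weights.
        "A and B": max_a + max_b >= min_weight,
    }
    if not can_enter["A only"] and not can_enter["B only"]:
        # Only documents matching both branches can still qualify.
        return "AND"
    return "OR"
```

For example, with maxweight(A) = 2.0 and maxweight(B) = 1.5 the operator stays an OR while minweight is 1.0, but decays to an AND once minweight has risen to 2.5.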

Possible decays:

A related optimisation is that the Match object may terminate early if maxweight for the whole tree is less than the smallest weight in the m-set.
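The early termination test fits naturally with the min-heap of best items mentioned in the introduction. The following is a minimal sketch under stated assumptions, not a description of the real code: the function name best_n is hypothetical, and the input is assumed to be a stream of (docid, weight, tree_maxweight) triples, where tree_maxweight is the largest weight the tree can still produce at that point (it shrinks as PostLists run out or operators decay).

```python
import heapq


def best_n(stream, n):
    """Keep the n highest-weighted documents from a stream of triples.

    Returns (ranked, examined): the kept (docid, weight) pairs, best first,
    and the number of documents considered before stopping.
    """
    if n <= 0:
        return [], 0
    heap = []  # min-heap of (weight, docid); heap[0] is the weakest kept item
    examined = 0
    for docid, weight, tree_maxweight in stream:
        # Early termination: nothing the tree can still produce would
        # displace the weakest item in a full m-set.
        if len(heap) == n and tree_maxweight < heap[0][0]:
            break
        examined += 1
        if len(heap) < n:
            heapq.heappush(heap, (weight, docid))
        elif weight > heap[0][0]:
            heapq.heapreplace(heap, (weight, docid))
    ranked = sorted(heap, reverse=True)
    return [(docid, weight) for weight, docid in ranked], examined
```

The weight at the root of the heap plays the role of minweight above: once the heap is full, it is the weight a new document must exceed to enter the m-set.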