This document is incomplete at present. It lacks explanation of the min-heap used to keep the best N M-set items (Managing Gigabytes describes this technique well), and there's as yet no documentation of sort_bands.
Pairs of PostLists are merged into a "virtual" PostList. This process is repeated until a single virtual PostList remains. The Match object accesses this PostList, behind which hangs a tree of PostLists.
This tree of PostLists is a binary tree. This is more efficient than an n-ary tree in terms of the number of comparisons which need to be performed: <insert proof> - the proof may only be valid for equal-sized posting lists without optimisations, in which case there may be a more efficient way to do this and we may wish to change the code.
The tree is deliberately built in an uneven way, such that we minimise the likely number of times a posting has to be passed up a level. For a group of OR operations, the PostLists with fewest entries are furthest down the tree, minimising the amount of information needing to be passed up the tree. For AND operations the PostLists with most entries are furthest down, allowing maximally sized skip_to's to be performed.
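The uneven construction for a group of OR operations can be sketched as follows. This is an illustration only, not the matcher's actual code: the function name build_or_tree, the tuple representation of a node, and the use of the sum of the two sizes as the estimate for a merged node are all assumptions. It repeatedly merges the two smallest lists, so the lists with fewest entries end up furthest down, and it puts the smaller of each pair on the right.

```python
import heapq
from itertools import count


def build_or_tree(sizes):
    """Build a nested ("OR", left, right) tuple from {name: estimated postings}.

    Hypothetical sketch: the two smallest (sub)lists are merged first, so the
    shortest lists sit deepest in the tree. Needs at least one entry.
    """
    tiebreak = count()  # keeps heap entries comparable when sizes are equal
    heap = [(n, next(tiebreak), name) for name, n in sizes.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        small = heapq.heappop(heap)
        large = heapq.heappop(heap)
        # Larger branch on the left, smaller on the right.
        node = ("OR", large[2], small[2])
        # Assumed estimate: an OR can return at most the sum of both sizes.
        heapq.heappush(heap, (small[0] + large[0], next(tiebreak), node))
    return heap[0][2]


tree = build_or_tree({"a": 1000, "b": 10, "c": 20})
# tree == ("OR", "a", ("OR", "c", "b"))
```

For AND operations the ordering would be reversed, with the largest lists furthest down.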
There are several types of virtual PostList. Each type can be treated as boolean or probabilistic - the only difference is whether the weights are ignored or not. The types are:
[Note: You can use AndNotPostList to apply an inverted boolean filter to a probabilistic query]
All the symmetric operators (i.e. OR, AND, XOR) are coded for maximum efficiency when the right branch has fewer postings than the left branch.
There are two main optimisations which the best match performs: autoprune and operator decay.
As an example of autoprune, if a branch in the match tree is "A OR B", when A runs out then "A OR B" is replaced by "B". Similar reductions occur for XOR, ANDNOT, and ANDMAYBE (if the right branch runs out). Other operators (AND, FILTER, and ANDMAYBE when the left branch runs out) simply return "at_end" and this is dealt with somewhere further up the tree as appropriate.
An autoprune is indicated by the next or skip_to method returning a pointer to the PostList object which should replace the PostList being read.
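The replacement mechanism can be sketched like this. It is a simplified model, not the real classes: ListPostList, OrPostList, advance and run are hypothetical names, weights are ignored, and None stands in for a null pointer meaning "no replacement".

```python
class ListPostList:
    """Leaf postlist over a sorted list of document ids (illustrative only)."""

    def __init__(self, docids):
        self.docids = list(docids)
        self.pos = -1

    def next(self):
        self.pos += 1
        return None  # a leaf never asks to be replaced

    def at_end(self):
        return self.pos >= len(self.docids)

    def get_docid(self):
        return self.docids[self.pos]


def advance(pl):
    """Call next() and return whichever postlist should be used from now on."""
    replacement = pl.next()
    return pl if replacement is None else replacement


class OrPostList:
    """A OR B; hands back the surviving branch once the other runs out."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        self.started = False
        self.docid = None

    def next(self):
        if not self.started:
            self.started = True
            self.left, self.right = advance(self.left), advance(self.right)
        else:
            current = self.docid
            if self.left.get_docid() == current:
                self.left = advance(self.left)
            if self.right.get_docid() == current:
                self.right = advance(self.right)
        if self.left.at_end():
            return self.right  # autoprune: "A OR B" becomes "B"
        if self.right.at_end():
            return self.left  # autoprune: "A OR B" becomes "A"
        self.docid = min(self.left.get_docid(), self.right.get_docid())
        return None

    def at_end(self):
        return False  # this node is replaced before it can run out

    def get_docid(self):
        return self.docid


def run(pl):
    """Drain a postlist; return the docids seen and the final root object."""
    docids = []
    pl = advance(pl)
    while not pl.at_end():
        docids.append(pl.get_docid())
        pl = advance(pl)
    return docids, pl
```

Running run(OrPostList(ListPostList([1, 3]), ListPostList([2, 3, 5, 8]))) yields the docids [1, 2, 3, 5, 8], and the root that remains at the end is the second leaf, because the OR node replaced itself once the first leaf ran out.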
The matcher tracks the minimum weight needed for a document to make it into the m-set (this increases monotonically as the m-set forms, since a new document only displaces a lower-weighted one). This can be used to replace one boolean operator with a stricter one. E.g. consider A OR B - when maxweight(A) < minweight and maxweight(B) < minweight then only documents matching both A and B can make it into the m-set so we can replace the OR with an AND. Operator decay is flagged using the same mechanism as autoprune, by returning the replacement operator from next or skip_to.
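The OR to AND test described above amounts to the following check. This is only a sketch of the decision: the function name decay_or and the use of strings to name operators are assumptions made for illustration, and only the one decay spelled out in the text is covered.

```python
def decay_or(max_weight_a, max_weight_b, min_weight):
    """Return the operator to use for "A OR B" given the m-set threshold.

    Hypothetical helper: if neither branch can reach min_weight on its own,
    only documents matching both branches can enter the m-set, so the OR
    can be replaced by an AND.
    """
    if max_weight_a < min_weight and max_weight_b < min_weight:
        return "AND"
    return "OR"


decay_or(1.0, 2.0, 0.0)  # "OR": the m-set is not yet demanding
decay_or(1.0, 2.0, 1.5)  # "OR": B alone can still reach the threshold
decay_or(1.0, 2.0, 2.5)  # "AND": neither branch alone is enough
```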
Possible decays:
A related optimisation is that the Match object may terminate early if maxweight for the whole tree is less than the smallest weight in the m-set.
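A rough sketch of how a min-heap of the best N items and this early termination test might fit together is given below. It is not taken from the matcher: the function name top_n and the input format are assumptions. In particular, each hit is assumed to arrive with the maxweight of the whole tree at that moment, which in the real matcher would shrink as autoprune and operator decay simplify the tree.

```python
import heapq


def top_n(hits, n):
    """Keep the n best hits from (docid, weight, tree_max_weight) triples.

    Hypothetical sketch. Returns the kept (weight, docid) pairs, best first,
    and the number of hits examined before stopping.
    """
    heap = []  # min-heap: heap[0] is the weakest item currently in the m-set
    examined = 0
    for docid, weight, tree_max_weight in hits:
        # Early termination: nothing still to come can beat the weakest item.
        if len(heap) == n and tree_max_weight < heap[0][0]:
            break
        examined += 1
        if len(heap) < n:
            heapq.heappush(heap, (weight, docid))
        elif weight > heap[0][0]:
            heapq.heapreplace(heap, (weight, docid))
    return sorted(heap, reverse=True), examined


hits = [(1, 5.0, 9.0), (2, 7.0, 9.0), (3, 8.0, 9.0), (4, 3.0, 6.0), (5, 2.0, 6.0)]
mset, examined = top_n(hits, 2)
# mset == [(8.0, 3), (7.0, 2)] and examined == 3
```

Here the smallest weight in the m-set is 7.0 once document 3 has been added, so when the tree's maxweight has dropped to 6.0 the remaining hits need not be examined.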
One of the properties of a query is the collapse_key, which can be used to refer to document values (quick-access document properties, useful for query-time operations such as sorting, collapsing etc). If the collapse_key is set, then as documents are added to the m-set a check is made to see if another document in the m-set has the same value for this key. If this is the case, then the highest weighted hit is kept, and the other is discarded.
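The check made as each document is added can be sketched as below. This is an illustration under assumptions: add_with_collapse is a hypothetical name, and the m-set is modelled as a plain dictionary from key value to the best hit seen so far, which is not how the real m-set is stored.

```python
def add_with_collapse(best_by_key, docid, weight, key_value):
    """Add a hit, keeping only the highest weighted hit per key value.

    Hypothetical sketch. best_by_key maps key value -> (weight, docid).
    Returns True if the new hit was kept, False if it was discarded.
    """
    current = best_by_key.get(key_value)
    if current is not None and current[0] >= weight:
        return False  # an equal or better hit with this value is already kept
    best_by_key[key_value] = (weight, docid)
    return True


best = {}
add_with_collapse(best, 1, 0.9, "dup-a")   # kept
add_with_collapse(best, 2, 0.5, "dup-a")   # discarded, document 1 is better
add_with_collapse(best, 3, 0.7, "dup-b")   # kept, different value
add_with_collapse(best, 4, 0.95, "dup-a")  # kept, displaces document 1
```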
One use for this is to remove from results duplicate documents which may be available under different URLs. A document id can be indexed as one of the document values, and then when the search is made, the collapse_key is set to this key number. Duplicate results will then be removed. The omega cgi parameter COLLAPSE can be used to pass collapse_key during matching.
Another use for this is to group under one hit results of a similar type. In some scenarios one type of document appears very often in the results, but all documents of that type are not of interest to the user. In such a case the results of that type could be said to obscure other results. An example might be a case where documents from different information providers are indexed together, but where the range of information from one provider is very large. Thus, whatever is searched for, this provider may have many hits near the top by relevance. It is reasonable for the user to think "But I don't want information from that provider, I want some OTHER results." Another case might be where there are many different documents in different sections with the same terms; for example there may be many PC Java games, many PDA Java games, and also documentation on writing Java games. If all results were returned together it may be hard to see results from one section, as they are masked by results from another section with slightly more relevance due to accidents of wording.
To overcome these problems in the presentation of results, it makes sense to perform result collapsing on a common key value, perhaps information provider or site section. However, instead of merely throwing away documents sharing the same value, we want to be aware of these documents and count them. In the results we want to know if there were other documents of that type, and be able to provide a full search limited to documents of that type.
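Collapsing while counting the discarded documents might look like the following. The function name collapse_with_counts and the dictionary layout of each result are assumptions for illustration. The count here covers every hit passed in, whereas a real search may stop before all of them have been seen.

```python
def collapse_with_counts(hits):
    """Group (docid, weight, key_value) hits by key value.

    Hypothetical sketch. For each key value the best hit is kept and
    collapse_count records how many other hits shared that value.
    Results are returned best first.
    """
    groups = {}
    for docid, weight, key in hits:
        group = groups.get(key)
        if group is None:
            groups[key] = {"docid": docid, "weight": weight,
                           "key": key, "collapse_count": 0}
        else:
            group["collapse_count"] += 1
            if weight > group["weight"]:
                group["docid"], group["weight"] = docid, weight
    return sorted(groups.values(), key=lambda g: g["weight"], reverse=True)


results = collapse_with_counts([
    (1, 0.9, "provider-x"),
    (2, 0.8, "provider-x"),
    (3, 0.7, "provider-y"),
    (4, 0.6, "provider-x"),
])
# results[0]: docid 1 for "provider-x" with collapse_count 2
# results[1]: docid 3 for "provider-y" with collapse_count 0
```

A result with a non-zero count is one where a link to a search limited to that key value would be worth showing.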
Providing an expanded search is simple enough: the common key value used to collapse on also needs to be indexed as a prefixed term. Then, to provide the expanded search over all documents of this type, a boolean filter on the prefixed term is added to the terms of the previous search.
The OmMSetIterator, used for iterating over results, provides get_collapse_count(), which returns 0 if there were no relevant collapses and a value greater than 0 if there was at least one. The value is the number of collapses which actually took place, without respect to relevance; further collapses might have taken place had the search been exhaustive. A collapse_count of 0 does not mean there would be no collapses, it just means the search terminated (was optimised) before any were found. If a larger m-set is produced (as it will be for subsequent result pages), further results may be obtained and also collapsed, producing a collapse count greater than 0. This will be of little benefit or consolation if the new m-set is used to show subsequent result pages which do not contain the top collapsed document: the user will never know there were any collapsed documents of that type, and further results will also be hidden, as they will still be collapsed when found.
To sum up the dangers of group-collapsing for later expanded searches: