Once the Quartz backend is stable and this document has caught up, this warning should go away.
Xapian can access information stored in various formats. For example, there are the legacy formats which can be read by Xapian but not written to (at present, the Muscat 3.6 formats "DA" and "DB"), and the InMemory format, which holds its data entirely in memory.
Each of these formats is implemented by a set of classes providing an interface to a Database object and to several other related objects (PostList, TermList, etc.).
Quartz is simply the name of the backend currently under development, which draws on all our past experience to satisfy the following criteria:
Different backends can be optionally compiled into the Xapian library (by specifying appropriate options to the configure script). Quartz is compiled by default.
Why do we call it Quartz - where does the name come from?
Well, we had to call it something, and Quartz was simply the first name we came up with which we thought we could live with...
These tables consist of a set of key-tag pairs, which I shall often refer to as items or entries. Items may be accessed randomly, by specifying a key and reading the item pointed to, or in sorted order, by creating a cursor pointing to a particular item. The sort order is a lexicographical ordering based on the contents of the keys. Only one instance of a key may exist in a single table: inserting a second item with the same key as an existing item will overwrite the existing item.
Cursors may be positioned even when a full key isn't known, by attempting to access an item which doesn't exist: the cursor will then be set to point to the nearest item whose key precedes the one requested.
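To make these semantics concrete, here is a minimal sketch in C++ (this is not the QuartzTable API; std::map merely stands in for a table, since it shares the lexicographical key ordering and the overwrite-on-duplicate-key behaviour):

    #include <iostream>
    #include <map>
    #include <string>

    // Model of a table: key-tag pairs, lexicographic key order,
    // at most one entry per key.
    int main() {
        std::map<std::string, std::string> table;

        table["apple"] = "tag1";
        table["banana"] = "tag2";
        table["banana"] = "tag2b";   // overwrites the existing item

        // Random access by exact key.
        std::cout << table.at("apple") << '\n';

        // Cursor positioning with a key which doesn't exist: step to
        // the nearest item whose key precedes the one requested.
        auto cur = table.upper_bound("banana2");  // first key > "banana2"
        if (cur != table.begin()) --cur;          // now at "banana"
        std::cout << cur->first << " -> " << cur->second << '\n';
    }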
The QuartzTable class defines the standard interface to tables. It has two subclasses: QuartzDiskTable and QuartzBufferedTable. The former provides direct access to the table as stored on disk, while the latter provides access via a large in-memory buffer, using memory as a write cache to greatly speed up indexing.
A Quartz database comprises six tables.
It also stores a couple of special fields: one containing the next document ID to use when adding a document (document IDs are allocated in increasing order, starting at 1, and are currently never reused), and one containing the total length of the documents in the database. This latter quantity is used to calculate the average document length, and hence normalised document lengths.
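For illustration, the average and normalised lengths fall straight out of that stored total (a sketch only; the function names are invented here, and it assumes the number of documents is also known):

    #include <cstdint>

    // Combine the stored total document length with the document count
    // to get the average length, and hence a normalised length.
    double average_doclen(uint64_t total_doclen, uint64_t doc_count) {
        return doc_count ? double(total_doclen) / double(doc_count) : 0.0;
    }

    double normalised_doclen(uint32_t doclen,
                             uint64_t total_doclen,
                             uint64_t doc_count) {
        double avg = average_doclen(total_doclen, doc_count);
        return avg > 0.0 ? double(doclen) / avg : 0.0;
    }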
Currently, there is one item for each document in the database, consisting of a list of keynos and attribute values for that document. An alternative implementation would be to store an item for each attribute, whose key is a combination of the document ID and the keyno, and whose tag is the attribute value.
Which implementation is better depends on the access pattern: if a document is being passed across a network link, all of its attributes are read; if a document is being dealt with locally, usually only some of the attributes will be read.
Documents will usually have very few attribute values, so the current implementation may actually be the most suitable.
For each term in the database, this table has an entry whose key is the term, and whose tag holds the term ID and the term frequency for that term. It is intended to be a fast-access table, so that processes can quickly determine the term ID or weight of a term, whether for performing an expand or for selecting important terms.
The term IDs stored in the lexicon are used (only) in the keys used to access items in the PostList and PositionList tables. The term IDs allow the size of the keys for these tables to be bounded, and may also allow better compression. It is debatable whether the term IDs are a good idea, however, or whether it would be better simply to use the termnames instead.
This table also has a special entry, which stores the next term ID to allocate. Like document IDs, term IDs are allocated in increasing order, starting at 1, and are currently never reused.
Speed of access to this table is likely to be critical: ensuring that the most frequently used part of the lexicon is kept cached in memory is likely to give large rewards.
The list first stores the document length, and the number of entries in the termlist (this latter value is stored for quick access - it could also be determined by running through the termlist). It then stores a set of entries: each entry in the list consists of a term (as a string), and the wdf (within document frequency - how many times the term appears in the document) of that term.
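The logical layout just described might be captured by a structure like the following sketch (the names are invented, and the byte-level encoding is not shown):

    #include <cstdint>
    #include <string>
    #include <vector>

    // One entry per distinct term in the document.
    struct TermListEntry {
        std::string term;   // the term, as a string
        uint32_t wdf;       // within document frequency
    };

    // Logical contents of a termlist tag.
    struct TermList {
        uint32_t doclen;                     // document length, stored first
        uint32_t n_entries;                  // stored for quick access;
                                             // equal to entries.size()
        std::vector<TermListEntry> entries;
    };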
In a non-modifiable database, the term frequency could be stored in the termlist for each entry in each list. This would enable query expansion operations to run significantly faster, by avoiding the need for a large number of extra lookups; however, it cannot be implemented in a writable database without each modification touching a very large proportion of the database.
(The current implementation doesn't make use of any real compression. However, it has been designed to scan through lists of data in such a way as to facilitate introducing proper compression.)
It is planned to use various compression techniques, for example:
The only compression currently performed is, wherever an unsigned integer is stored in a table, to represent it in a simply encoded form as:
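As a rough illustration of this kind of scheme (a common variable-length encoding, not necessarily the exact form Quartz uses), seven bits of the value can be stored per byte, with the top bit of each byte flagging whether more bytes follow:

    #include <cstdint>
    #include <string>

    // Sketch of a variable-length unsigned integer encoding: small
    // values take fewer bytes than a fixed-width representation would.
    std::string encode_uint(uint64_t value) {
        std::string out;
        while (value >= 0x80) {
            out += char((value & 0x7f) | 0x80);  // 7 bits, continuation set
            value >>= 7;
        }
        out += char(value);                      // final byte, top bit clear
        return out;
    }

    uint64_t decode_uint(const char*& p) {
        uint64_t value = 0;
        int shift = 0;
        unsigned char byte;
        do {
            byte = (unsigned char)*p++;
            value |= uint64_t(byte & 0x7f) << shift;
            shift += 7;
        } while (byte & 0x80);
        return value;
    }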
To deal with this, we store posting lists in small chunks, each the right size to be stored in a single B-tree block, and hence to be accessed with a minimal amount of disk latency.
The key for the first chunk in a posting list is the term ID of the term whose posting list it is. The key in subsequent chunks is the term ID followed by the document ID of the first document in the chunk. This allows the cursor methods to be used to scan through the chunks in order, and also to jump to the chunk containing a particular document ID.
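As an illustration (the fixed-width big-endian packing here is invented, chosen so that byte-wise key comparison agrees with numeric order; the real encoding differs), the chunk keys and the jump to a given document ID might look like this:

    #include <cstdint>
    #include <map>
    #include <string>

    // Pack a 32-bit value big-endian so lexicographic order on the
    // resulting bytes agrees with numeric order.
    static std::string pack(uint32_t v) {
        std::string s(4, '\0');
        for (int i = 0; i < 4; ++i) s[i] = char((v >> (24 - 8 * i)) & 0xff);
        return s;
    }

    // First chunk of a posting list: the key is just the term ID.
    std::string first_chunk_key(uint32_t termid) {
        return pack(termid);
    }

    // Subsequent chunks: the term ID followed by the first document ID
    // in the chunk.
    std::string chunk_key(uint32_t termid, uint32_t first_did) {
        return pack(termid) + pack(first_did);
    }

    // Jump to the chunk which would contain a given document ID: the
    // last chunk whose key is not greater than (termid, did).
    std::map<std::string, std::string>::const_iterator
    find_chunk(const std::map<std::string, std::string>& table,
               uint32_t termid, uint32_t did) {
        auto it = table.upper_bound(chunk_key(termid, did));
        if (it != table.begin()) --it;
        return it;  // caller should check the key still begins pack(termid)
    }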
It is quite possible that data in other tables (e.g. termlists, and possibly position lists) would benefit from being split into chunks in this way.
(The code currently uses a temporary hack to implement tables by reading the contents of an unstructured file into memory, modifying it, and writing it all back again. Martin is working on a B-tree manager, which is complete apart from some API issues and will be added into the CVS repository within the next couple of weeks.)
In some situations, the use of a different structure would be appropriate - in particular for the lexicon, where key ordering is irrelevant and a hashing scheme would be likely to provide more memory- and time-efficient access. This will be investigated once the initial version is fully functional.
A B-tree is a fairly standard structure for storing this kind of data, so I will not describe it in detail - see a reference book on database design and algorithms for that. The essential points are that it is a block-based multiply branching tree structure, storing keys in the internal blocks and key-tag pairs in the leaf blocks.
Our implementation is fairly standard, except for its revision scheme, which allows modifications to be applied atomically whilst other processes are reading the database. This scheme involves copying each block in the tree which is involved in a modification, rather than modifying it in place, so that a complete new tree structure is built up whilst the old structure is unmodified (although this new structure will typically share a large number of blocks with the old structure). The modifications can then be atomically applied by writing the new root block and making it active.
After a modification is applied successfully, the old version of the table is still fully intact, and can be accessed. The old version only becomes invalid when a second modification is attempted (and it becomes invalid whether or not that second modification succeeds).
There is no need for a process which is writing the database to know whether any processes are reading previous versions of the database. As long as only one update is performed before the reader closes (or reopens) the database, no problem will occur. If more than one update occurs whilst the table is still open, the reader will notice that the database has been changed whilst it has been reading it by comparing a revision number stored at the start of each block with the revision number it was expecting. An appropriate action can then be taken (for example, to reopen the database and repeat the operation).
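A sketch of the idea (this models the scheme; it is not the real B-tree code):

    #include <cstdint>

    // Blocks are never modified in place: a writer copies every block
    // it touches, building a new tree which shares unmodified blocks
    // with the old one, then publishes the whole modification by
    // switching to the new root block.
    struct Block {
        uint32_t revision;   // revision under which this block was written
        // ... keys and tags, or pointers to child blocks ...
    };

    // A reader records the revision number it expects when it opens the
    // table, and compares it against the revision stored at the start
    // of each block it reads.
    bool block_is_valid(const Block* b, uint32_t expected_revision) {
        // A block written under a later revision means the version being
        // read has been overwritten: reopen the table and retry.
        return b->revision <= expected_revision;
    }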
An alternative approach would be to obtain a read-lock on the revision being accessed. A write would then have to wait until no read-locks existed on the old revision before modifying the database.
As a result, the QuartzBufferedTable object is available. This simply stores a set of modified entries in memory, applying them to disk only when the apply method is called.
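The behaviour might be modelled as follows (a sketch, not the real QuartzBufferedTable interface):

    #include <map>
    #include <optional>
    #include <string>

    class BufferedTable {
        std::map<std::string, std::string> disk;  // stands in for the disk table
        // Buffered changes; an empty optional marks a pending deletion.
        std::map<std::string, std::optional<std::string>> changes;

    public:
        void set(const std::string& key, const std::string& tag) {
            changes[key] = tag;                   // buffered in memory only
        }
        void del(const std::string& key) {
            changes[key] = std::nullopt;
        }
        // Reads consult the buffer first, falling back to disk.
        std::optional<std::string> get(const std::string& key) const {
            auto it = changes.find(key);
            if (it != changes.end()) return it->second;
            auto d = disk.find(key);
            if (d != disk.end()) return d->second;
            return std::nullopt;
        }
        // Write all buffered modifications out in one go.
        void apply() {
            for (const auto& [key, tag] : changes) {
                if (tag) disk[key] = *tag;
                else disk.erase(key);
            }
            changes.clear();
        }
    };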
In fact, a QuartzBufferedTable object has to have two handles open on the table - one for reading and one for writing. This is simply because the interface for writing a table is more limited than that for reading a table (in particular, cursor operations are not available).
(Note: the current temporary implementation of quartz tables doesn't use two handles. It doesn't implement Cursor operations yet either.)
The revisioning scheme described earlier comes to the rescue! By carefully making sure that we open all the tables at the same revision, and by ensuring that at least one such consistent revision always exists, we can extend the scope of atomicity to cover all the tables. In detail: