The Quartz database backend

Status

WARNING: This document describes a piece of code which is still in development. As such, some of the descriptions within it will be inconsistent with the code to be found in our releases and CVS repository. In such cases, the code is out of date if written before this document (ie, before 2000-10-20), and this document is out of date if the code was written after this document. If you find such inconsistencies, especially this document being out of date, please alert us to the problem using the discussion mailing list, at <xapian-discuss@lists.sourceforge.net>.

Once the Quartz backend is stable and this document has caught up, this warning should go away.

Why Quartz?

What is this thing called Quartz? How does it fit in with the Xapian library?

Xapian can access information stored in various different formats. For example, there are the legacy formats which can be read by Xapian but not written to (at present, the Muscat 3.6 formats "DA" and "DB"), and the InMemory format which is held in memory.

Each of these formats is implemented by a set of classes providing an interface to a Database object and several other related objects (PostList, TermList, etc...).

Quartz is simply the name of the backend which is currently being developed; it draws on all our past experience to satisfy the following criteria:

Different backends can be optionally compiled into the Xapian library (by specifying appropriate options to the configure script). Quartz is compiled by default.

Why do we call it Quartz - where does the name come from?

Well, we had to call it something, and Quartz was simply the first name we came up with which we thought we could live with...

Tables

A Quartz database consists of several tables, each of which stores a different type of information: for example, one table stores the user-defined data associated with each document, and another table stores the posting lists (the lists of documents which particular terms occur in).

These tables consist of a set of key-tag pairs, which I shall often refer to as items or entries. Items may be accessed randomly by specifying a key and reading the item pointed to, or in sorted order by creating a cursor pointing to a particular item. The sort order is a lexicographical ordering based on the contents of the keys. Only one instance of a key may exist in a single table - inserting a second item with the same key as an existing item will overwrite the existing item.

Cursors may be positioned even when a full key isn't known, by attempting to access an item which doesn't exist: the cursor will then be set to point to the nearest item with a key before the one requested.
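The key ordering, overwrite-on-duplicate behaviour, and cursor positioning described above can be modelled with a sorted map. This is only an illustrative sketch (the names `Table` and `position_cursor` are inventions for this document, not the Quartz classes):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of a Quartz table: a lexicographically sorted set of
// key-tag pairs.  Assigning to an existing key overwrites the item.
using Table = std::map<std::string, std::string>;

// Model of cursor positioning: return an iterator to the item with the
// given key if it exists, otherwise the nearest item with a key before
// the one requested (or end() if no such item exists).
Table::const_iterator position_cursor(const Table& table,
                                      const std::string& key) {
    // upper_bound finds the first key strictly greater than the request;
    // stepping back gives the last key less than or equal to it.
    auto it = table.upper_bound(key);
    if (it == table.begin()) return table.end();  // no preceding item
    return --it;
}
```

A request for a key such as "cherry" in a table holding "banana" and "date" would thus position the cursor on "banana".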

The QuartzTable class defines the standard interface to tables. It has two subclasses: QuartzDiskTable and QuartzBufferedTable. The former provides direct access to the table as stored on disk; the latter provides access via a large buffer, using memory as a write cache to greatly speed up indexing.

The contents of the tables

We shall worry about the implementation of the tables later, but first we shall look at what is stored within each table.

There are six tables comprising a Quartz database.

(Note that the PositionList and PostList are not yet implemented.)

Representation of integers, strings, etc.

It is well known that in modern computers there are many, many CPU cycles for each disk read, or even memory read. It is therefore important to minimise disk reads, and it can be advantageous to do so even at the expense of a large amount of computation. In other words, compression is good. ;-)

(The current implementation doesn't make use of any real compression. However, it has been designed to scan through lists of data in such a way as to facilitate introducing proper compression.)

It is planned to use various compression techniques, for example:

The only compression currently performed applies to unsigned integers: wherever one is stored in a table, it is represented in a simple encoded form as follows:

  1. First byte: if integer is < 128, store integer, otherwise store integer modulo 128, but with top bit set.
  2. Shift integer right 7 places.
  3. Second byte: if integer is < 128, store integer, otherwise store integer modulo 128, but with top bit set.
  4. Shift integer right 7 places.
  5. Continue in this way until a byte with the top bit clear has been stored (ie, the remaining value is less than 128).

PostLists and chunks

Posting lists can grow to be very large - some terms occur in a very large proportion of the documents, and their posting lists can represent a significant fraction of the size of the whole database. Therefore, we do not wish to read an entire posting list into memory at once. (Indeed, we'd rather only read a small portion of it at all, but that's a different story - see the documentation on optimisation of the matcher [which doesn't exist yet, ed.]).

To deal with this, we store posting lists in small chunks, each the right size to be stored in a single B-tree block, and hence to be accessed with a minimal amount of disk latency.

The key for the first chunk in a posting list is the term ID of the term whose posting list it is. The key in subsequent chunks is the term ID followed by the document ID of the first document in the chunk. This allows the cursor methods to be used to scan through the chunks in order, and also to jump to the chunk containing a particular document ID.
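The chunk key layout can be sketched as below. Note one detail the sketch makes explicit: for cursor scans to visit chunks in document-ID order, the IDs within the key must be encoded in a form whose lexicographic order matches their numeric order, so this hypothetical sketch uses a fixed-width big-endian encoding for keys (unlike the variable-width form used for values in tags):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode an ID so that lexicographic comparison of the encoded bytes
// matches numeric comparison of the IDs (big-endian, fixed width).
std::string encode_ordered(uint32_t value) {
    std::string result;
    for (int shift = 24; shift >= 0; shift -= 8)
        result += char((value >> shift) & 0xff);
    return result;
}

// First chunk of a posting list: the key is the term ID alone.
std::string first_chunk_key(uint32_t term_id) {
    return encode_ordered(term_id);
}

// Subsequent chunks: the term ID followed by the document ID of the
// first document in the chunk.
std::string chunk_key(uint32_t term_id, uint32_t first_did) {
    return encode_ordered(term_id) + encode_ordered(first_did);
}
```

Because the first-chunk key is a prefix of every other chunk key for the same term, a cursor positioned at the term ID naturally reaches the first chunk, and the later chunks follow in document-ID order.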

It is quite possible that data in other tables (eg, termlist and possibly position lists) would benefit from being split into chunks in this way.

Btree implementation

The tables are currently all planned to be implemented as B-trees.

(The code currently uses a temporary hack to implement tables by reading the contents of an unstructured file into memory, modifying it, and writing it all back again. Martin is working on a B-tree manager, which is complete apart from some API issues and will be added into the CVS repository within the next couple of weeks.)

In some situations, the use of a different structure would be appropriate - in particular for the lexicon, where key ordering is irrelevant and a hashing scheme would likely provide more memory- and time-efficient access. This will be investigated once the initial version is fully functional.

A B-tree is a fairly standard structure for storing this kind of data, so I will not describe it in detail - see a reference book on database design and algorithms for that. The essential points are that it is a block-based multiply branching tree structure, storing keys in the internal blocks and key-tag pairs in the leaf blocks.

Our implementation is fairly standard, except for its revision scheme, which allows modifications to be applied atomically whilst other processes are reading the database. This scheme involves copying each block in the tree which is involved in a modification, rather than modifying it in place, so that a complete new tree structure is built up whilst the old structure is unmodified (although this new structure will typically share a large number of blocks with the old structure). The modifications can then be atomically applied by writing the new root block and making it active.
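The block-copying just described can be illustrated with a toy copy-on-write tree. This is only the sharing structure, under the assumption of an in-memory tree of pointers; real Quartz blocks are fixed-size and disk-resident:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy copy-on-write node: modifying a leaf copies every node on the
// path up to the root, so the old root still describes a complete,
// unmodified tree, while unmodified subtrees are shared.
struct Node {
    std::string value;
    std::shared_ptr<const Node> left, right;
};

using NodePtr = std::shared_ptr<const Node>;

// Return a new root in which the leftmost leaf's value is replaced.
// Every node off the modified path is shared with the old tree.
NodePtr set_leftmost(const NodePtr& root, const std::string& value) {
    if (!root)
        return std::make_shared<const Node>(Node{value, nullptr, nullptr});
    if (!root->left)
        return std::make_shared<const Node>(Node{value, nullptr, root->right});
    return std::make_shared<const Node>(
        Node{root->value, set_leftmost(root->left, value), root->right});
}
```

After the call, the old root and the new root coexist: readers holding the old root see the unmodified tree, and "committing" amounts to making the new root the active one.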

After a modification is applied successfully, the old version of the table is still fully intact, and can be accessed. The old version only becomes invalid when a second modification is attempted (and it becomes invalid whether or not that second modification succeeds).

There is no need for a process which is writing the database to know whether any processes are reading previous versions of the database. As long as only one update is performed before the reader closes (or reopens) the database, no problem will occur. If more than one update occurs whilst the table is still open, the reader will notice that the database has been changed whilst it has been reading it by comparing a revision number stored at the start of each block with the revision number it was expecting. An appropriate action can then be taken (for example, to reopen the database and repeat the operation).
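The reader's staleness check can be sketched as follows. The structure names are hypothetical; the point is only the comparison of per-block revision numbers against the revision the reader opened:

```cpp
#include <cassert>

// Toy model of the revision check described above.  Every block carries
// the revision at which it was last written; a reader remembers the
// revision it opened at.
struct Block {
    unsigned revision;
    // ... block contents would live here ...
};

struct Reader {
    unsigned opened_at;

    // A block written after the revision we opened at means the version
    // of the tree we were reading may have been overwritten; the caller
    // should then reopen the database and retry the operation.
    bool block_is_valid(const Block& b) const {
        return b.revision <= opened_at;
    }
};
```

A single update after the reader opened never trips this check, because the old blocks are left untouched; only a second update can overwrite them and surface a newer revision number.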

An alternative approach would be to obtain a read-lock on the revision being accessed. A write would then have to wait until no read-locks existed on the old revision before modifying the database.

Buffered tables

If each change to a table (ie, modification of a key-tag pair) were immediately written to disk, there would be two problems:
  1. The system would be very slow. If a disk read and then a write were required each time an item was changed, the indexing process would spend most of its time waiting for the disk's write head to seek to the appropriate block. By buffering up a large set of changes and then writing them all out in sorted order, seeking is minimised.
  2. Operations which involve modifying more than one key-tag pair would not be atomic. For example, when adding a document to a database the record table has three items updated - one containing the data stored in the document, one which stores the next document ID to allocate, and one which stores the sum of all the document lengths in the database. These three items must be updated at the same time, so that a consistent state is always seen by other processes, and a consistent state remains if the system is terminated unexpectedly.

As a result, the QuartzBufferedTable object is available. This simply stores a set of modified entries in memory, applying them to disk only when the apply method is called.

In fact, a QuartzBufferedTable object has to have two handles open on the table - one for reading and one for writing. This is simply because the interface for writing a table is more limited than that for reading a table (in particular, cursor operations are not available).

(Note: the current temporary implementation of quartz tables doesn't use two handles. It doesn't implement Cursor operations yet either.)
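The buffering behaviour described above can be sketched like this (the class and method names are illustrative, modelled loosely on the QuartzBufferedTable interface, and the "disk" is simulated by a second map):

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of a buffered table: changes accumulate in a sorted
// in-memory map and reach the on-disk table only when apply() is
// called, so a group of related changes lands as one batch.
class BufferedTable {
public:
    void set_entry(const std::string& key, const std::string& tag) {
        pending_[key] = tag;  // buffered; not yet visible on "disk"
    }

    // Flush all buffered changes in sorted key order, then start a
    // fresh batch.  (A real implementation would commit a new table
    // revision here; a failure before this point leaves the on-disk
    // table completely untouched.)
    void apply() {
        for (const auto& entry : pending_)
            disk_[entry.first] = entry.second;
        pending_.clear();
    }

    const std::map<std::string, std::string>& on_disk() const {
        return disk_;
    }

private:
    std::map<std::string, std::string> pending_;
    std::map<std::string, std::string> disk_;
};
```

Because pending_ is itself sorted by key, the flush naturally writes entries in key order, which is what minimises seeking in point 1 above.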

Applying changes to all the tables simultaneously

This is all wonderful: we have tables storing arbitrary bits of useful data, we can update bits of them as we like, and we can then call a method and have all the modifications applied to the table atomically. Unfortunately, we need more than that - we need to be able to apply modifications as a single atomic transaction across multiple tables, so that the tables are always accessed in a mutually consistent state.

The revisioning scheme described earlier comes to the rescue! By carefully making sure that we open all the tables at the same revision, and by ensuring that at least one such consistent revision always exists, we can extend the scope of atomicity to cover all the tables. In detail:

This scheme guarantees that modifications are atomic across all the tables - essentially, the modification is committed only when the final table is committed.
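One way to model this (an assumption-laden sketch, not the Quartz code) is to record which revisions each table holds on disk and, on opening, pick the highest revision present in every table. If a crash interrupts a commit partway through, some tables lack the new revision, so the opener falls back to the previous one, which every table still holds:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of one table's on-disk state: the set of committed
// revisions it currently holds.
struct TableState {
    std::vector<unsigned> revisions;

    bool has(unsigned r) const {
        return std::find(revisions.begin(), revisions.end(), r)
               != revisions.end();
    }
};

// Find the highest revision at or below `latest` which every table
// holds; opening all tables at this revision gives a mutually
// consistent view of the database.
unsigned consistent_revision(const std::vector<TableState>& tables,
                             unsigned latest) {
    for (unsigned r = latest; ; --r) {
        bool all_have_it = true;
        for (const auto& t : tables)
            if (!t.has(r)) { all_have_it = false; break; }
        if (all_have_it) return r;
    }
}
```

For example, if a crash left two tables at revisions {1, 2} but a third at only {1}, the opener would settle on revision 1 for all three.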

Items to be added to this document

Endnote

The system as described could, no doubt, be improved in several ways. If you can think of such ways, suggest them to us so that we can discuss whether the improvement would help: if it would, we will add it to the design (and eventually the code); if not, we'll add a discussion of it to this document.