Xapian can access information stored in various different formats. For example, there are the legacy formats which can be read by Xapian but not written to (at present, the Muscat 3.6 formats "DA" and "DB"), and the InMemory format which is held in memory.
Each of these formats is implemented by a set of classes providing an interface to a Database object and several other related objects (PostList, TermList, etc...).
Quartz is simply the name of the only currently supported disk-based backend which allows databases to be dynamically updated. The design of Quartz draws on all our past experience to satisfy the following criteria:
Different backends can be optionally compiled into the Xapian library (by specifying appropriate options to the configure script). Quartz is compiled by default.
Why do we call it Quartz - where does the name come from?
Well, we had to call it something, and Quartz was simply the first name we came up with which we thought we could live with...
These tables consist of a set of key-tag pairs, which I shall often refer to as items or entries. Items may be accessed randomly by specifying a key and reading the item pointed to, or in sorted order by creating a cursor pointing to a particular item. The sort order is a lexicographical ordering based on the contents of the keys. Only one instance of a key may exist in a single table - inserting a second item with the same key as an existing item will overwrite the existing item.
Positioning of cursors may be performed even when a full key isn't known, by attempting to access an item which doesn't exist: the cursor will then be set to point to the last item whose key sorts before the one requested.
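The table semantics described above can be sketched with a standard ordered map. This is only an illustration, not the Quartz implementation: the names Table, set_item and position_cursor are made up for this example, and std::map stands in for the on-disk structure.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a table: std::map keeps string keys in sorted order.
using Table = std::map<std::string, std::string>;

// Insert an item; a second item with the same key overwrites the first.
void set_item(Table& table, const std::string& key, const std::string& tag) {
    table[key] = tag;
}

// Position a "cursor" on the item with the given key or, if there is no such
// item, on the item with the largest key that sorts before it.
// Returns table.end() if no item sorts at or before the requested key.
Table::const_iterator position_cursor(const Table& table, const std::string& key) {
    Table::const_iterator it = table.upper_bound(key);  // first key > requested
    if (it == table.begin()) return table.end();
    --it;
    assert(it->first <= key);  // sanity check on the position found
    return it;
}
```

Stepping the returned iterator forwards then visits the remaining items in key order, which is the access pattern a cursor provides.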
The QuartzTable class defines the standard interface to tables. This has two subclasses - QuartzDiskTable and QuartzBufferedTable. The former provides direct access to the table as stored on disk, and the latter provides access via a large buffer in order to use the memory as a write cache and greatly speed the process of indexing.
There are six tables comprising a quartz database.
Key: lsb ... msb of the docid, until all remaining bytes are zero
The record table also holds a couple of special values, stored under the key consisting of a single zero byte (this isn't a valid encoded docid). The first value is the next document ID to use when adding a document (document IDs are allocated in increasing order, starting at 1, and are currently never reused). The other value is the total length of the documents in the database, which is used for calculation of the average document length, and hence for calculation of normalised document lengths.
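As an illustration of the key format "lsb ... msb of the docid, until all remaining bytes are zero", here is a minimal sketch. The function name docid_to_key and the 32-bit width of the document ID are assumptions made for this example. Because the last byte emitted is always non-zero, the encoding can never produce the single zero byte used for the special values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a document ID as a key: least significant byte first, stopping
// once all remaining (more significant) bytes are zero.
std::string docid_to_key(std::uint32_t docid) {
    assert(docid != 0);  // document IDs start at 1
    std::string key;
    while (docid != 0) {
        key += static_cast<char>(docid & 0xFF);
        docid >>= 8;
    }
    return key;
}
```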
Key: lsb ... msb of the docid, until all remaining bytes are zero
Currently, there is one B-tree entry for each document in the database that has one or more values associated with it. This entry consists of a list of value_no-s and values for that document.
An alternative implementation would be to store an item for each value, whose key is a combination of the document ID and the value number (value_no), and whose tag is the value. Which implementation is better depends on the access pattern: if a document is being passed across a network link, all the values for a document are read - if a document is being dealt with locally, usually only some of the values will be read.
Documents will usually have very few values, so the current implementation may actually be the most suitable.
Key: lsb ... msb of the docid, until all remaining bytes are zero
The list first stores the document length, and the number of entries in the termlist (this latter value is stored for quick access - it could also be determined by running through the termlist). It then stores a set of entries: each entry in the list consists of a term (as a string), and the wdf (within document frequency - how many times the term appears in the document) of that term.
In a non-modifiable database, the term frequency could be stored in the termlist for each entry in each list. This would enable query expansion operations to occur significantly faster by avoiding the need for a large number of extra lookups - however, this cannot be implemented in a writable database without each modification having to update a very large proportion of the database.
Key: pack_uint(did) + tname
The current implementation uses simple compression - we're investigating more effective schemes - these are (FIXME: this is slightly out of date now):
To deal with this, we store posting lists in small chunks, each the right size to be stored in a single B-tree block, and hence to be accessed with a minimal amount of disk latency.
The key for the first chunk in a posting list is the term ID of the term whose posting list it is. The key in subsequent chunks is the term ID followed by the document ID of the first document in the chunk. This allows the cursor methods to be used to scan through the chunks in order, and also to jump to the chunk containing a particular document ID.
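The chunk lookup can be sketched as follows. This is a simplified in-memory model, not the real key encoding: the key is modelled as a (term ID, first document ID) pair, the first chunk of each list uses document ID 0 to stand in for "the term ID on its own", and the names ChunkKey, ChunkTable and find_chunk are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical model: key = (term ID, first docid in chunk); the first chunk
// of a posting list is keyed with docid 0 so that it sorts before the others.
using ChunkKey = std::pair<std::uint32_t, std::uint32_t>;
using ChunkTable = std::map<ChunkKey, std::string>;  // tag = opaque chunk data

// Find the chunk of `term` which would contain `docid`: the last chunk whose
// key is <= (term, docid). Returns end() if the term has no posting list.
ChunkTable::const_iterator find_chunk(const ChunkTable& table,
                                      std::uint32_t term, std::uint32_t docid) {
    assert(docid != 0);  // document IDs start at 1
    ChunkTable::const_iterator it = table.upper_bound(ChunkKey(term, docid));
    if (it == table.begin()) return table.end();
    --it;
    return it->first.first == term ? it : table.end();
}
```

Scanning a whole posting list is then a matter of starting at the first chunk for the term and stepping forwards until the term ID in the key changes.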
It is quite possible that data in other tables (eg, termlist and possibly position lists) would benefit from being split into chunks in this way.
In some situations, the use of a different structure could be appropriate - in particular for the lexicon where key ordering is irrelevant, and a hashing scheme might provide more space and time efficient access. This is an area for future investigation.
A B-tree is a fairly standard structure for storing this kind of data, so I will not describe it in detail - see a reference book on database design and algorithms for that. The essential points are that it is a block-based multiply branching tree structure, storing keys in the internal blocks and key-tag pairs in the leaf blocks.
Our implementation is fairly standard, except for its revision scheme, which allows modifications to be applied atomically whilst other processes are reading the database. This scheme involves copying each block in the tree which is involved in a modification, rather than modifying it in place, so that a complete new tree structure is built up whilst the old structure is unmodified (although this new structure will typically share a large number of blocks with the old structure). The modifications can then be atomically applied by writing the new root block and making it active.
After a modification is applied successfully, the old version of the table is still fully intact, and can be accessed. The old version only becomes invalid when a second modification is attempted (and it becomes invalid whether or not that second modification succeeds).
There is no need for a process which is writing the database to know whether any processes are reading previous versions of the database. As long as only one update is performed before the reader closes (or reopens) the database, no problem will occur. If more than one update occurs whilst the table is still open, the reader will notice that the database has been changed whilst it has been reading it by comparing a revision number stored at the start of each block with the revision number it was expecting. An appropriate action can then be taken (for example, to reopen the database and repeat the operation).
An alternative approach would be to obtain a read-lock on the revision being accessed. A write would then have to wait until no read-locks existed on the old revision before modifying the database.
As a result, the QuartzBufferedTable object is available. This simply stores a set of modified entries in memory, applying them to disk only when the apply method is called.
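The idea of buffering modifications in memory and writing them out on apply can be sketched like this. It is a toy model under stated assumptions: the class name BufferedTable, its methods set, del and get, and the use of std::map as the "disk" table are all invented for illustration and are not the QuartzBufferedTable interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the table as stored on disk.
using DiskTable = std::map<std::string, std::string>;

class BufferedTable {
    // A pending change: either a new tag for the key, or a deletion.
    struct Change { bool deleted; std::string tag; };
    DiskTable& disk;
    std::map<std::string, Change> pending;
public:
    explicit BufferedTable(DiskTable& disk_) : disk(disk_) {}

    void set(const std::string& key, const std::string& tag) {
        pending[key] = Change{false, tag};
    }
    void del(const std::string& key) {
        pending[key] = Change{true, std::string()};
    }
    // Reads see buffered changes first, then fall back to the disk table.
    bool get(const std::string& key, std::string* tag) const {
        assert(tag != nullptr);
        std::map<std::string, Change>::const_iterator p = pending.find(key);
        if (p != pending.end()) {
            if (p->second.deleted) return false;
            *tag = p->second.tag;
            return true;
        }
        DiskTable::const_iterator d = disk.find(key);
        if (d == disk.end()) return false;
        *tag = d->second;
        return true;
    }
    // Write all buffered changes through to the disk table.
    void apply() {
        for (const auto& entry : pending) {
            if (entry.second.deleted) disk.erase(entry.first);
            else disk[entry.first] = entry.second.tag;
        }
        pending.clear();
    }
};
```

Because the pending changes are held in a sorted map, apply presents them to the underlying table in key order.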
In fact, a QuartzBufferedTable object has to have two handles open on the table - one for reading and one for writing. This is simply because the interface for writing a table is more limited than that for reading a table (in particular, cursor operations are not available).
(Note: the current temporary implementation of quartz tables doesn't use two handles. It doesn't implement Cursor operations yet either.)
The revisioning scheme described earlier comes to the rescue! By carefully making sure that we open all the tables at the same revision, and by ensuring that at least one such consistent revision always exists, we can extend the scope of atomicity to cover all the tables. In detail:
I'm not sure about the name 'Btree' that runs through all this, since the fact that it is all implemented as a B-tree is surely irrelevant. I have not been able to think of a better name though ...
Some of the constants mentioned below depend upon a byte being 8 bits, but this assumption is not built into the code.
In the B-tree key-tag pairs are ordered, and the order is the ASCII collating order of the keys. Very precisely, if key1 and key2 point to keys with lengths key1_len, key2_len, key1 is before/equal/after key2 according as the following procedure returns a value less than, equal to or greater than 0,
    static int compare_keys(const byte * key1, int key1_len,
                            const byte * key2, int key2_len)
    {
        int smaller = key1_len < key2_len ? key1_len : key2_len;
        for (int i = 0; i < smaller; i++) {
            int diff = (int) key1[i] - key2[i];
            if (diff != 0) return diff;
        }
        return key1_len - key2_len;
    }
[This is okay, but none of the code fragments below have been checked.]
Any large-scale operation on the B-tree will run very much faster when the keys have been sorted into ASCII collating order. This fact is critical to the performance of the B-tree software.
A key-tag pair is called an 'item'. The B-tree consists therefore of a list of items, ordered by their keys:
I1 I2 ... Ij-1 Ij Ij+1 ... In-1 In
Item Ij has a 'previous' item, Ij-1, and a 'next' item, Ij+1.
When the B-tree is created, a single item is added in with null key and null tag. This is the 'null item'. The null item may be searched for, and it's possible, although perhaps not useful, to replace the tag part of the null item. But the null item cannot be deleted, and an attempt to do so is merely ignored.
A key must not exceed 252 bytes in length.
A tag may have length zero. There is an upper limit on the length of a tag, but it is quite high. Roughly, the tag is divided into items of size L - kl, where L is a few bytes less than a quarter of the block size, and kl is the length of its key. You can then have 64K such items. So even with a block size as low as 2K and key length as large as 100, you could have a tag of about 25 megabytes. More realistically, with a 16K block size, the upper limit on the tag size is about 256 megabytes.
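The rough limit described above can be written out as a small calculation. This is only an approximation under the stated assumptions: it takes L to be exactly a quarter of the block size (ignoring the "few bytes" of overhead), and the function name approx_max_tag_size is invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Rough upper bound on the tag size in bytes: 64K pieces, each holding
// (block_size / 4 - key_len) bytes of the tag.
std::uint64_t approx_max_tag_size(std::uint64_t block_size, std::uint64_t key_len) {
    assert(block_size / 4 > key_len);  // each piece must have room for some tag data
    const std::uint64_t pieces = 64 * 1024;
    return pieces * (block_size / 4 - key_len);
}
```

With a 16K block size and a short key this gives a figure close to 64K * 4096 bytes, i.e. about 256 megabytes.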
The B-tree has a revision number, and each time it is updated, the revision number increases. In a single transaction on the B-tree, it is first opened, its revision number, R, is found, updates are made, and then the B-tree is closed with a supplied revision number. The supplied revision number will typically be R+1, but any R+k is possible, where k > 0.
If this sequence fails to complete for some reason, revision R+k of the B-tree will not, of course, be brought into existence. But revision R will still exist, and it is that version of the B-tree that will be the starting point for later revisions.
If this sequence runs to a successful termination, the new revision, R+k, supplants the old revision, R. But it is still possible to open the B-tree at revision R. After a successful revision of the B-tree, in fact, it will have two valid versions: the current one, revision R+k, and the old one, revision R.
You might want to go back to the old revision of a B-tree if it is being updated in tandem with a second B-tree, and the update on the second B-tree fails. Suppose B1 and B2 are two such B-trees. B1 is opened and its latest revision number is found to be R1. B2 is opened and its latest revision number is found to be R2. If R1 > R2, it must be the case that the previous transaction on B1 succeeded and the previous transaction on B2 failed. Then B1 needs to be opened at its previous revision number, which must be R2.
The calls using revision numbers described below are intended to handle this type of contingency.
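The choice of a revision at which two B-trees updated in tandem can both be opened might be sketched as follows. This is a hypothetical helper, not part of the Btree interface: the struct TreeRevisions and the function common_revision are invented here, and the sketch assumes that both B-trees are always committed with the same revision numbers.

```cpp
#include <cassert>
#include <cstdint>

// Revisions available for one B-tree (a model made up for this sketch).
struct TreeRevisions {
    std::uint32_t latest;    // most recent revision
    std::uint32_t previous;  // the older revision, if there is one
    bool has_previous;       // true if an older valid revision exists
};

// Find a revision at which both trees can be opened. Returns false if the
// trees have no revision in common.
bool common_revision(const TreeRevisions& b1, const TreeRevisions& b2,
                     std::uint32_t* revision) {
    assert(revision != nullptr);
    if (b1.latest == b2.latest) {
        *revision = b1.latest;
        return true;
    }
    // One tree got ahead: its last update succeeded but the other's failed.
    const TreeRevisions& ahead = b1.latest > b2.latest ? b1 : b2;
    const TreeRevisions& behind = b1.latest > b2.latest ? b2 : b1;
    if (!ahead.has_previous || ahead.previous != behind.latest) return false;
    *revision = behind.latest;  // fall back to the older revision of `ahead`
    return true;
}
```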
Corresponding to baseA and baseB are two files bitmapA and bitmapB. Bit n is set in the bitmap if block n is in use in the corresponding revision of the B-tree.
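A block-usage bitmap of this kind can be sketched in a few lines. The class name BlockBitmap and its methods are invented for the example, and the bit order within each byte (least significant bit first) is an assumption, not something the on-disk format is known to guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-memory bitmap: bit n is set if block n is in use.
class BlockBitmap {
    std::vector<unsigned char> bits;
public:
    explicit BlockBitmap(std::size_t n_blocks) : bits((n_blocks + 7) / 8, 0) {}

    bool in_use(std::size_t n) const {
        assert(n / 8 < bits.size());
        return ((bits[n / 8] >> (n % 8)) & 1) != 0;
    }
    void mark_in_use(std::size_t n) {
        assert(n / 8 < bits.size());
        bits[n / 8] = static_cast<unsigned char>(bits[n / 8] | (1u << (n % 8)));
    }
    void mark_free(std::size_t n) {
        assert(n / 8 < bits.size());
        bits[n / 8] = static_cast<unsigned char>(bits[n / 8] & ~(1u << (n % 8)));
    }
};
```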
void Btree::create(const string & name, int block_size);
Creates a new B-tree with the given name and block size. On error, throws an exception.
    Btree::create("/home/martin/develop/btree/", 8192);
    Btree::create("X-", 6000); /* files will be X-bitmapA, X-DB etc */
The block size must be less than 64K, where K = 1024. It is unwise to use a small block size (less than 1024 perhaps), but it is not at present forbidden.
Thereafter there are two modes for accessing the B-tree: update and retrieval.
void Btree::open_to_write(const string & name);
The name is the same as the one used in creating. Throws an exception in case of error. E.g:
    Btree::open_to_write("/home/martin/develop/btree/");
string::size_type Btree::max_key_len;
This gives the upper limit of the size of a key that may be presented to the B-tree (252 bytes with the present implementation).
uint4 Btree::revision_number;
This gives the revision number of the B-tree. (The type uint4 is an unsigned integral type at least 4 bytes in size).
uint4 Btree::other_revision_number;
Similarly, this gives the revision number held in the other base file of the B-tree, or is zero if there is only one base file, and
bool Btree::both_bases;
is true iff both files baseA and baseB exist as valid bases.
bool Btree::open_to_write(const string & name, unsigned long revision);
Like Btree::open_to_write(name), but opens the B-tree at the given revision number, and returns false if it couldn't be opened at the requested revision.
bool Btree::add(const string &key, const string &tag);
Adds the given key-tag pair to the B-tree. e.g.
    ok = B->add("TODAY", "Mon 9 Oct 2000");
    ok = B->add(string(k + 1, k[0]), string(t + 1, t[0]));
The key and tag are C++ strings. If key and tag are empty, then the null item is replaced. If key.length() exceeds the limit on key sizes, an error condition occurs. The result is false if the key is already in the B-tree, otherwise true. Treated as an integer, the result also measures the increase in the total number of keys in the B-tree.
bool Btree::del(const string &key);
If key.empty(), nothing happens, and the result is false. Otherwise this deletes the key and its tag from the B-tree, if found. e.g.
    ok = B->del("TODAY");
    ok = B->del(string(k + 1, k[0]));
The result is then false if the key is not in the B-tree, true if it is. Treated as an integer, the result also measures the decrease in the total number of keys in the B-tree.
bool Btree::find_key(const string &key);
The result is true iff the specified key is found.
bool Btree::find_tag(const string &key, string * tag);
The same result, but when the key is found the tag is copied to tag. If the key is not found tag is left unchanged. e.g.
    string t;
    B->find_tag("TODAY", &t); /* get today's date */
int4 Btree::item_count;
This returns the number of items in the B-tree, not including the ever-present item with null key.
    void Btree::open_to_read(const string & name);
    uint4 Btree::revision_number;
    void Btree::open_to_read(const string & name, uint4 revision);
These are the same as for update mode, except that the opened B-tree is not modifiable.
Bcursor::Bcursor(Btree *B);
Creates a cursor, which can be used to remember a position inside the B-tree. The position is simply the item (key and tag) to which the cursor points. A cursor is either positioned or unpositioned, and is initially unpositioned.
NB: You must make sure that the Bcursor is destroyed before the Btree it is attached to.
bool Bcursor::find_key(const string & key);
The result is true iff the specified key is found in the B-tree.
If found, the cursor is made to point to the item with the given key, and if not found, it is made to point to the last item in the B-tree whose key is <= the key being searched for. Either way, the cursor is then set as 'positioned'. Since the B-tree always contains a null key, which precedes everything, a call to Bcursor::find_key always results in a valid key being pointed to by the cursor.
bool Bcursor::next();
If cursor BC is unpositioned, the result is simply false.
If cursor BC is positioned, and points to the very last item in the B-tree, the cursor is made unpositioned, and the result is false. Otherwise the cursor BC is moved to the next item in the B-tree, and the result is true.
Effectively, Bcursor::next() loses the position of BC when it drops off the end of the list of items. If this is awkward, one can always arrange for a key to be present which has a rightmost position in a set of keys, e.g.
B->add("\xFF", ""); /* all other keys have first char < xF0, and a fortiori < xFF */
bool Bcursor::prev();
This is like Bcursor::next, but BC is taken to the previous rather than next item.
bool Bcursor::get_key(string * key);
If cursor BC is unpositioned, the result is simply false.
If BC is positioned, the key of the item is copied into key and the result is then true.
For example,
    Bcursor BC(&B);
    string key;
    /* Now we'll print all the keys in the B-tree (assuming they have a simple form) */
    BC.find_key(""); // must give result true
    while (BC.next()) { // skipping the null item
        BC.get_key(&key);
        cout << key << endl;
    }
bool Bcursor::get_tag(string * tag);
If cursor BC is unpositioned, the result is simply false.
If BC is positioned, the tag of the item at cursor BC is copied into tag. BC is then moved to the next item as if Bcursor::next() had been called - this may leave BC unpositioned. The result is true iff BC is left positioned.
For example,
    Bcursor BC(&B);
    string key;
    string tag;
    /* Now do something to each key-tag pair in the B-tree */
    BC.find_key(""); // must give result true
    while (BC.get_key(&key)) {
        BC.get_tag(&tag);
        do_something_to(key, tag);
    }
    /* when BC is unpositioned by Bcursor::get_tag, Bcursor::get_key
       gives result false the next time it is called */
Bcursor::~Bcursor();
Loses cursor BC.
void BtreeCheck::check(const string & name, int opts);
BtreeCheck::check(s, opts) is essentially equivalent to
    Btree B;
    B.open_to_write(s);
    {
        // do a complete integrity check of the B-tree,
        // reporting according to the bitmask opts
    }
The option bitmask may consist of any of the following values |-ed together:
- OPT_SHORT_TREE - short summary of entire B-tree
- OPT_FULL_TREE - full summary of entire B-tree
- OPT_SHOW_BITMAP - print the bitmap
- OPT_SHOW_STATS - print the basic information (revision number, blocksize etc.)
The options control what is reported - the entire B-tree is always checked as well as reporting the information.
Let us say an item is 'new' if it is presented for addition to the B-tree and its key is not already in the B-tree. Then presenting a long run of new items ordered by key causes the B-tree updating process to switch into a mode where much higher compaction than 75% is achieved - about 90%. This is called 'sequential' mode. It is possible to force an even higher compaction rate with the procedure
    void Btree::full_compaction(bool parity);
So
    B.full_compaction(true);
switches full compaction on, and
    B.full_compaction(false);
switches it off. Full compaction may be switched on or off at any time, but it only affects the compaction rate of sequential mode. In sequential mode, full compaction gives around 98-99% block usage - it is not quite 100% because keys are not split across blocks.
The downside of full compaction is that block splitting will be heavy on the next update. However, if a B-tree is created with no intention of being updated, full compaction is very desirable.
To make a really fast structure for retrieval therefore, create a new B-tree, open it for updating, set full compaction mode, and add all the items in a single transaction, sorted on keys. After closing, do not update further. Further updates can be prevented quite easily by deleting (or moving) the bitmap files. These are required in update mode but ignored in retrieval mode.
Here is a program fragment to unload B-tree B/ and reform it in Bnew/ as a fully compact B-tree with revision number 1:
    {
        Btree B;
        B.open_to_read("B/");
        Bcursor BC(&B);
        string key, tag;
        Btree::create("Bnew/", 8192);
        Btree new_B;
        new_B.open_to_write("Bnew/");
        new_B.set_full_compaction(true);
        BC.find_key("");
        while (BC.get_key(&key)) {
            BC.get_tag(&tag);
            new_B.add(key, tag);
        }
        new_B.commit(1);
    }
For a complete program which does this, see the quartzcompact utility.
This may change in the future with code redesign, but meanwhile note that a K term query that needs k <= K cursors open at once to process will demand 2KB bytes of memory in the B-tree manager.
It is possible to do retrieval while the B-tree is being updated. If the updating process overwrites a part of the B-tree required by the retrieval process, the flag
    bool Btree::overwritten;
is set to true. This may be detected, and suitable action taken. Here is a model scheme:
    static Btree * reopen(Btree * B, const char * s)
    {
        // Get the revision number. This will return the correct value, even when
        // B->overwritten is detected during opening
        uint4 revision = B->revision_number;
        delete B; /* close the B-tree */
        B = new Btree;
        B->open_to_read(s); /* and reopen */
        if (revision == B->revision_number) {
            // The revision number ought to have gone up from last time,
            // so if we arrive here, something has gone badly wrong ...
            printf("Possible database corruption - complain to Xapian\n");
            exit(1);
        }
        return B;
    }
    ....
    const char * s = "database/";
    Btree * B = new Btree;
    B->open_to_read(s); /* open the B-tree */
    string t;
    do {
        while (B->overwritten) {
            B = reopen(B, s);
        }
        ...
        B->find_tag("brunel", &t); /* look up some keyword */
    } while (B->overwritten);
    ...
It may happen that B->overwritten is set to true in updating mode. This would mean that there were two updating processes at work. If the code is correct this does not need to be tested for, and in any case simultaneous updating is an error that cannot generally be trapped in this way.
In retrieval mode B->overwritten should be tested after the following procedures,
    Btree::open_to_read(name);
    Btree::open_to_read(name, revision);
    Bcursor::next();
    Bcursor::prev();
    Bcursor::find_key(const string &key);
    Bcursor::get_tag(string * tag);
The test is not required after any of the following:
    Bcursor::Bcursor(Btree * B);
    Bcursor::get_key(string * key);
Note particularly that opening the B-tree can set B->overwritten, and that Bcursor::get_key(..) will not set B->overwritten.
The procedures described above report errors in two ways. (A) A non-zero result. Btree::close() returns an int result which is 0 if successful, otherwise an error number. (B) The error is placed in B->error, where B is the Btree structure used in the call, or the Btree structure from which the Bcursor structure used in the call derives. Then B->error == 0 means no error, otherwise it is a positive number (greater than 2) giving the error number.
Some procedures cannot give an error. Here is a summary:
    Error method
    (A)(B)    procedure
    ---------------------------------------------------
      *       n = Btree::find_key(key)
      *       n = Btree::find_tag(key, kt)
      *       n = Btree::add(key, tag)
      *       n = Btree::del(key)
      *       B = Btree::open_to_write(s)
      *       B = Btree::open_to_write(s, rev)
      *       n = Btree::close(B, rev)
      *       n = Btree::create(s, block_size)  [throws exception]
      *       B = Btree::open_to_read(s)
      *       B = Btree::open_to_read(s, rev)
              Bcursor::Bcursor()
      *       n = Bcursor::find_key(key)
      *       n = Bcursor::next()
      *       n = Bcursor::prev()
      *       n = Bcursor::get_key(kt)
      *       n = Bcursor::get_tag(kt)
              Btree::full_compaction(parity)
    error condition given by:
    (A) non-zero result
    (B) B.error == true
B.error is not cleared after being set true. B.error can, as a side effect, set B.overwritten to true, but this should not matter since the test for B.error should always be done first. The procedures that give no error can still have the test for error (B) applied. Most of these errors should be rare, except for creating and opening the Btree, which will of course fail when the necessary files cannot be created or found.
Errors have a consistent set of values, defined in btree.h. They are:
    BTREE_ERROR_BLOCKSIZE       block size too large or too small during creation
    BTREE_ERROR_SPACE           malloc, calloc failure
    BTREE_ERROR_BASE_CREATE     For the base files, failure to create
    BTREE_ERROR_BASE_DELETE       - failure to delete
    BTREE_ERROR_BASE_READ         - failure to read
    BTREE_ERROR_BASE_WRITE        - failure to write
    BTREE_ERROR_BITMAP_CREATE   For the bit map files, failure to create
    BTREE_ERROR_BITMAP_READ       - failure to read
    BTREE_ERROR_BITMAP_WRITE      - failure to write
    BTREE_ERROR_DB_CREATE       For the DB file, failure to create
    BTREE_ERROR_DB_OPEN           - failure to open
    BTREE_ERROR_DB_CLOSE          - failure to close
    BTREE_ERROR_DB_READ           - failure to read
    BTREE_ERROR_DB_WRITE          - failure to write
    BTREE_ERROR_KEYSIZE         key_len too large (programmer error)
    BTREE_ERROR_TAGSIZE         tag_len too large
    BTREE_ERROR_REVISION        rev too small in Btree_close (programmer error)
See 'Keys and tags' above for the upper limit on tag sizes.