Priority |
Difficulty |
Area |
Task |
Release |
Owner |
|
|
|
Update the PLATFORMS file
|
* |
James |
|
|
|
Make sure the non-autogenerated docs are kept up to date
|
* |
James |
Medium |
3 |
API |
Implement methods to iterate through all the documents in the database.
Possibly via a special term which indexes all documents.
|
0.6 |
|
Medium |
3 |
Databases |
Change all internal references to net/network backend to remote backend (in
step with external naming)
|
0.6 |
|
Medium |
2 |
General |
Check for zero byte cleanness wherever strings are used. There are a
number of c_str()s in the code, but I believe all in the core library
(excluding the bindings) are harmless as of 2002-04-29. There may be other zero
byte issues though. xapian-applications/dbtools also uses c_str() where it
should probably use data() and length().
|
0.6 |
|
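A minimal sketch of the zero byte problem noted in the task above, using plain std::string rather than any Om class. The helper names copy_via_c_str and copy_via_data are made up for illustration. Anything that takes only the c_str() pointer has to find the end of the string by looking for a zero byte, so an embedded zero byte truncates the data; passing data() and length() together keeps every byte.

```cpp
#include <cassert>
#include <string>

// Copy a term the unsafe way: constructing from c_str() alone stops at the
// first zero byte, because the length has to be rediscovered with strlen().
std::string copy_via_c_str(const std::string& term) {
    return std::string(term.c_str());
}

// Copy a term the zero-byte-clean way: pointer and length travel together.
std::string copy_via_data(const std::string& term) {
    std::string copy(term.data(), term.length());
    assert(copy.length() == term.length());  // no bytes were lost
    return copy;
}
```

With the 7 byte string "abc\0def", copy_via_c_str gives back only the 3 bytes before the zero byte, while copy_via_data gives back all 7.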
Medium |
2 |
OmQuery |
Move all serialisation of OmQuery into OmQuery (out of socketcommon.cc and
localmatch): at present, modifying OmQuery requires changes in 3 separate
parts of the code.
|
0.6 |
|
Medium |
3 |
Stemming |
Replace our own stemming code with Martin Porter's snowball stemmers (with a
thin OmStem wrapper).
|
0.6 |
|
Low |
4 |
API |
Allow custom weighting functions.
|
0.6 |
|
Low |
3 |
Quartz |
Make quartz database autoflush when enough changes have been performed based
on the memory used up as a proportion of that available, rather than simply
when a count of changes is reached. Remove hardcoded count of 1000 changes.
|
0.6 |
|
Verylow |
4 |
General |
Make backends / weighting schemes / indexer modules register themselves
automatically. At runtime / linktime? (ie, replace current conditional
compilation scheme) - actually, we can use sub-classing
and factory classes to do this more cleanly.
|
0.6 |
|
|
|
|
Fix up examples and make sure they are actually instructive.
I've made a start. delve is a reasonable example. msearch probably needs
simplifying to just do a probabilistic search, or to use OmQueryParser.
|
0.6 |
Olly |
|
|
|
Replace all uses of OmSettings.
The arguments for OmSettings are not as compelling as we
originally thought, and it has definite drawbacks - one major one being
that there's no easy way to check for typos which can lead to users of
the library spending hours trying to sort out a bug which is just a
typo in an OmSettings value. Another argument for it was to allow passing
values to user weighting objects, etc, but I think it's best just to
implement these with a clone() method, and pass an example one in. And
we might as well make built-in weighting objects work the same way rather
than being a special case. Backends can be done similarly, though an
explicit factory is needed as there's more than one class to build.
|
0.6 |
Olly |
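A rough sketch of the clone() approach described in the task above, assuming C++11. The class names Weight, ScaledWeight and Matcher are hypothetical and not the actual Om API. Parameters are ordinary constructor arguments, so the compiler catches a misspelt name, and the library copies the example object it is given rather than building one from string settings.

```cpp
#include <cassert>
#include <memory>

// Hypothetical base class for weighting schemes (illustrative only).
class Weight {
  public:
    virtual ~Weight() {}
    // Return a fresh copy carrying the same parameters as this object.
    virtual std::unique_ptr<Weight> clone() const = 0;
    // Toy scoring function: wdf is the within-document frequency,
    // ndl the normalised document length.
    virtual double score(unsigned wdf, double ndl) const = 0;
};

// A user-defined scheme: its parameter is a typed constructor argument,
// not a string key looked up at runtime.
class ScaledWeight : public Weight {
    double k;
  public:
    explicit ScaledWeight(double k_) : k(k_) {}
    std::unique_ptr<Weight> clone() const override {
        return std::unique_ptr<Weight>(new ScaledWeight(*this));
    }
    double score(unsigned wdf, double ndl) const override {
        return wdf / (wdf + k * ndl);
    }
};

// Hypothetical matcher: keeps a private copy of the example object and
// clones it whenever a weight object is needed for a term.
class Matcher {
    std::unique_ptr<Weight> prototype;
  public:
    explicit Matcher(const Weight& example) : prototype(example.clone()) {
        assert(prototype.get() != nullptr);
    }
    std::unique_ptr<Weight> weight_for_term() const {
        return prototype->clone();
    }
};
```

Built-in weighting objects could go through the same path, which is the "not a special case" point made above. Backends would still need an explicit factory, as the task says.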
|
|
|
indexgraph -> extra (needs to build as a support library?)
|
0.6 |
|
|
|
Documentation |
Finish reading through generated docs to ensure they read well in collated form.
|
0.6 |
Olly |
Medium |
5 |
Documentation |
Ensure that API documentation covers entirety of API (i.e. that all methods and
classes in the API have documentation comments) -- see doxygen generated file
docs/doxygen_api_warnings for a list of undocumented methods. Then read
through generated API docs, and rewrite doc comments to improve clarity and
make them more coherent.
|
1.0 |
|
Medium |
4 |
General |
Allow setting of the document length in OmDocument. (Currently defined to
be the sum of the wdfs).
|
1.0 |
|
Medium |
5 |
Porting |
Produce Microsoft Windows version, probably cross-compiling to mingw.
|
1.0 |
James |
Medium |
2 |
Quartz |
Ensure that quartz databases don't have a problem if there is no positional
information entry available for a term / document combination.
|
1.0 |
|
Low |
3 |
Documentation |
Add notes about catching exceptions throughout userman, particularly in
examples (eg, search engine example)
|
1.0 |
|
|
|
|
.deb built, control files via autoconf
|
1.0 |
Olly |
High |
3 |
API |
Put Om into its own namespace, to ensure lack of symbol conflicts.
|
|
|
High |
5 |
API |
Write bindings for other languages (Java, C, perl, python, php4, etc.). [In
progress, apart from C].
|
|
C |
High |
2 |
Matcher |
Pass around partially created postlists and termlists as AutoPtrs?
(for exception safety)
|
|
|
High |
5 |
Performance |
Write (speed) performance test suite.
|
|
|
Medium |
5 |
Bindings |
Ensure that (Java in particular) bindings throw correct exception types.
|
|
|
Medium |
5 |
DA backend |
Autodetect heavy-duty vs flimsy (3 byte vs 2 byte)
|
|
|
Medium |
1 |
DA/DB |
Add get_all_terms to DB databases. Needs extra code in dbread.[ch].
|
|
|
Medium |
5 |
Debug |
Try to find some way to write a thread identifier into the debug log, while
not depending on pthreads. Try dlsym() on pthread_self?
(pthread_t pthread_self(void)).
|
|
|
Medium |
5 |
Documentation |
Document backend API (database, postlist, termlist, document, etc) in same
way as enquire API.
|
|
|
Medium |
3 |
Documentation |
Patch doxygen, so that todo items in the body of methods get displayed.
|
|
|
Medium |
5 |
Exceptions |
Check that it is safe for an exception to be thrown and caught within a
destructor, when that destructor is being called due to an exception
unwinding the stack. eg, a database is destroyed due to an exception,
database's destructor calls internal_end_session() which throws an exception
(which is caught and handled by the destructor): is this safe - two
exceptions exist simultaneously.
|
|
|
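A small test case for the question in the task above. The Database struct here is a hypothetical stand-in, not the real class. As far as the C++ rules go, std::terminate is only called if an exception escapes a destructor while the stack is being unwound; an exception thrown and caught entirely inside the destructor should leave the original exception intact. The sketch checks both halves of that.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a database whose cleanup can fail.
struct Database {
    bool* cleanup_failed;  // set when the cleanup exception has been handled

    explicit Database(bool* flag) : cleanup_failed(flag) {
        assert(cleanup_failed != nullptr);
    }

    void internal_end_session() {
        throw std::runtime_error("flush failed");
    }

    ~Database() {
        try {
            internal_end_session();
        } catch (const std::exception&) {
            // Handled here, so nothing escapes the destructor.
            *cleanup_failed = true;
        }
    }
};

// Destroys a Database by unwinding the stack with a second exception in
// flight. Returns true if the original exception reached the outer handler.
bool unwind_through_database(bool* cleanup_failed) {
    try {
        Database db(cleanup_failed);
        throw std::logic_error("original error");
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

If the catch block in the destructor is removed, the same code terminates the program, which is the case the real destructors need to avoid.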
Medium |
3 |
Exceptions |
Make exceptions work with shared libs on solaris / find an alternative. (gcc
=> DISABLE_SHARED on Solaris)
|
|
|
Medium |
2 |
General |
Make all errors return a context if appropriate.
|
|
|
Medium |
1 |
Iterators |
Write tests for copying term and postlist iterators.
|
|
|
Medium |
3 |
Matcher |
Add synonym postlists. Need to be able to take underlying postlists which
aren't necessarily just postlists for single terms, and to be able to
estimate termfrequency of combined postlists.
|
|
|
Medium |
4 |
Matcher |
Allow negative relevance judgements? Will need to check that this doesn't
cause assumptions to be violated. (eg, unsigned integers going negative.)
|
|
|
Medium |
3 |
Matcher |
Check that negative term weights don't mess up matcher's optimisations - if
they do we need to either disallow negative term weights, or fix/disable the
optimisations for the case of negative term weights.
|
|
|
Medium |
4 |
Matcher |
Create a synonym postlist, which represents a set of postlists merged together,
such that each document that occurs in any of the sublists occurs in the list,
the term frequency is the number of documents that one or more of the terms
occurs in, and the term weight corresponds.
Will need approximation schemes for determining the term frequency.
|
|
|
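One possible approximation scheme for the synonym term frequency, sketched under assumptions. The exact value is bounded below by the largest individual termfreq and above by the sum of the termfreqs (capped at the collection size). The estimate shown treats the terms as statistically independent, which is an assumption made here for illustration and not something this list specifies; real synonyms tend to co-occur, so it may overestimate. The names TermFreqEstimate and estimate_synonym_termfreq are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct TermFreqEstimate {
    double lower;     // the largest individual termfreq
    double upper;     // the sum of termfreqs, capped at the collection size
    double estimate;  // value under the independence assumption
};

// Estimate how many documents contain at least one of the given terms.
TermFreqEstimate estimate_synonym_termfreq(const std::vector<unsigned>& termfreqs,
                                           unsigned collection_size) {
    assert(collection_size > 0);
    double max_tf = 0.0;
    double sum_tf = 0.0;
    double none = 1.0;  // probability that a document contains none of the terms
    for (unsigned tf : termfreqs) {
        assert(tf <= collection_size);
        max_tf = std::max(max_tf, static_cast<double>(tf));
        sum_tf += tf;
        none *= 1.0 - static_cast<double>(tf) / collection_size;
    }
    TermFreqEstimate result;
    result.lower = max_tf;
    result.upper = std::min(sum_tf, static_cast<double>(collection_size));
    result.estimate = collection_size * (1.0 - none);
    return result;
}
```

For two terms that each occur in 50 of 100 documents this gives bounds of 50 and 100 and an estimate of 75.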
Medium |
3 |
Matcher |
Implement collapse keys for duplicate removal - which only fire if the
two documents have the same weight.
|
|
|
Medium |
4 |
Matcher |
Treat FILTER and AND as equivalent from the point of view of building
optimal AND trees. Also add a variant on FilterPostList where the left
branch is boolean and the right probabilistic. Resist urge to call
it RETFIL.
|
|
|
Medium |
5 |
Matcher |
Write tests to check that setting the parameters used in the BM25 and
traditional weighting schemes works.
|
|
|
Medium |
3 |
Postlists |
Add OP_FILTER_TERM_WITH_EXACT_WEIGHT query operator (with better name), which
will perform a restriction of the LHS term based on the RHS query, but use the
exact termfrequency for the combined term to calculate the weight. This will
share some techniques from implementing synonym postlists.
|
|
|
Medium |
5 |
Postlists |
Add get_termfreq_exact() methods, for calculating the exact termfreq. This
will be particularly useful when trying to do evaluations to check up on the
approximations being made.
Also, add get_termfreq_better_est() methods, which give an approximation to the
exact termfreq based on the first N items in the postlist.
This may require adding a reset() method, to move a postlist's position back to the beginning.
|
|
|
Low |
4 |
API |
Provide explicit support for range searches.
|
|
|
Low |
4 |
API |
Provide fake term which indexes all documents. This would be used for a
real "NOT" operator, and also for allowing searches to be scored based on
location (would give weight from location for this term, with a custom
weighting scheme).
|
|
|
Low |
3 |
API |
Re-implement OmBatchEnquire, and add back into the system.
|
|
|
Low |
5 |
General |
Audit for exception safety.
|
|
c |
Low |
5 |
Matcher |
Clustering algorithms.
|
|
|
Low |
5 |
Matcher |
OP_ELITE_SET should never select groups of terms which don't match any
documents. (Currently, will exclude those for which termfreq_max() is 0,
but this may still result in a bad choice)
|
|
|
Low |
4 |
Matcher |
OP_ELITE_SET should probably reduce the querysize by the number of terms
removed. When making a contribution to querysize, could just use the lesser
of the number of terms, and elite_set_size.
|
|
|
Low |
5 |
Positional |
Passage retrieval.
|
|
|
Low |
3 |
Quartz |
Clean up interaction of AllTermsIterator for quartz with QuartzPostList.
Need QuartzPostListTermsIterator class? (But with a snappier name. ;-) )
|
|
|
Low |
2 |
Website |
Put PS/PDF documentation on website.
|
|
|
Low |
3 |
Weighting |
Allow for a non-zero minimum value for the ndl (normalised doc len).
|
|
|
Verylow |
3 |
Backends |
Split database definition files into database/postlist/termlist files.
|
|
|
Verylow |
4 |
General |
A couple of classes get copied a lot - look into doing copy-on-write for
them. Notably ExpandBits and term names (currently strings so this happens,
but may change)
|
|
|
Verylow |
5 |
General |
Improve performance using SIMD instructions
|
|
|
Dubious |
3 |
API |
Do allow boolean subqueries in OmQuery constructors, where
it makes sense (or note in documentation to use FILTER).
|
|
|
Dubious |
3 |
Decision functors |
Return a sensible value for OmMSet::matches_lower_bound when a decision
functor is present. This has to be the number of documents that the decision
functor tested and approved, as we know there are at least that many and
can't know if there are more. matches_upper_bound can be reduced by the
number of documents that the functor rejected, and matches_estimated
can be adjusted somehow - perhaps look at the reject rate of the functor?
Partly done I believe.
|
|
|
Dubious |
3 |
Exceptions |
Add error handlers to (at least) OmDatabase. Implement more carefully in
MergePostlist.
|
|
|
Dubious |
5 |
Matcher |
Boolean filters result in collection statistics being for the wrong set of
documents (should be appropriate subset). Hard (impossible?) to implement
efficiently.
|
|
|
|
|
|
"make install" on omega should install CGI binary somewhere more helpful
|
|
|
|
|
|
Check for swig version.
|
|
|
|
|
|
Clean up xapian-core/backends/quartz/z_make, etc.
|
|
|
|
|
|
Find paper about "illusion of control" that boolean operators give. It
makes some good points which ought to be more widely aired.
|
|
Olly |
|
|
|
Finish off automated testing across CF machines
|
|
James |
|
|
|
Get nightly snapshot builds set up again
|
|
James |
|
|
|
Investigate and find a proper fix for FILTER problem.
|
|
Olly |
|
|
|
Java bindings in com/muscat/om should probably move to org/xapian before the
bindings are actually released.
|
|
|
|
|
|
Language bindings: Python, PHP and Java (as a minimum IMHO).
All can be done using SWIG, which is probably the easier route simply
because it's less work overall, even though some languages (eg Python)
have better tools available.
|
|
Sam/James |
|
|
|
Look at getting the btree code to use pread and pwrite or similar calls
where available (e.g. on Linux and Solaris). These combine a seek and
read or write into a single syscall, which halves the syscall overhead and
can make an observable difference to performance.
|
|
Olly |
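A sketch of the difference described in the task above, using the POSIX calls directly. The function names and the scratch file are made up for illustration and have nothing to do with the actual btree code. For brevity a short read is treated as failure; real code would probably want to retry.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <sys/types.h>
#include <unistd.h>

// Read one block with a seek followed by a read: two syscalls per block.
bool read_block_seek(int fd, off_t offset, char* buf, size_t size) {
    assert(buf != nullptr);
    if (lseek(fd, offset, SEEK_SET) == static_cast<off_t>(-1)) return false;
    return read(fd, buf, size) == static_cast<ssize_t>(size);
}

// Read the same block with pread(): one syscall, and the file offset is
// left untouched.
bool read_block_pread(int fd, off_t offset, char* buf, size_t size) {
    assert(buf != nullptr);
    return pread(fd, buf, size, offset) == static_cast<ssize_t>(size);
}

// Write a small scratch file and check that both routines read the same block.
bool demo_reads_agree() {
    char path[] = "/tmp/btree_block_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    unlink(path);  // the file goes away once fd is closed
    const char data[] = "0123456789";
    bool ok = write(fd, data, 10) == 10;
    char a[3];
    char b[3];
    ok = ok && read_block_seek(fd, 4, a, 3) && read_block_pread(fd, 4, b, 3);
    ok = ok && std::memcmp(a, "456", 3) == 0 && std::memcmp(b, "456", 3) == 0;
    close(fd);
    return ok;
}
```

Where pread is not available, the seek-and-read version can stay as the fallback behind the same interface.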
|
|
|
Look at replacing btree implementation, or at least tidying it
|
|
Richard |
|
|
|
Look at reworking StatsGatherer mechanism to be simpler and clearer.
|
|
Olly |
|
|
|
Move "min_hits" into matcher?
|
|
Olly |
|
|
|
Replace Omega with a simpler PHP or Python-based
system, once the bindings are in place. Python would be good
because we could use it for omindex as well, and I suspect the code
would be much cleaner, easier to work with, and generally
understandable. For something that should be halfway between a
reasonably large-scale application for Xapian and a complex
example, this can only be a good thing. This probably needs a
query parser library, although it raises questions of consistency
of term generation (word breaks and stemming) across the index
and query tools ... we may want the query parser to have a callback
to deal with that, which can be done in the bindings although it's
a little fiddly in some languages I believe.
|
|
Sam |
|
|
|
Should Omega have a make static target? Or just document configure runes?
Need some more stuff in
omindex (--add-term, --add-field) IMHO. Also could do with more
fields as standard, and probably support for subsite as key for
collapse.
|
|
|
|
|
|
Tests for bindings.
|
|
|
|
|
|
Think about using hashing instead of a btree for the backend? Long term
project.
|
|
Olly |
|
|
|
Use valgrind in the testsuite - waiting for Julian Seward to add the hooks
needed.
|
|
Olly |
|
|
|
We talked about use of local vs global databases, and decided it
would be useful to support Unix sockets for local machine databases
so the library can select() on all databases in complex cases. This
is probably something we can leave for a while, and probably
doesn't need to be automatic - so the local process can be fired
by the application, not the library - but at some point should be
thought through and documented properly. A longer term project.
|
|
|
|
|
|
xapian.org: schema pages (not crucial, but would be nice)
|
|
James |
|
|
API |
Should OmWritableDatabase have a default ctor? Consider for all API classes...
|
|
Olly |
|
|
API |
Should it be possible to specify an arbitrary docid for a document (perhaps to
match numeric docids in another system?) Currently replace_document() fails
if the document id doesn't exist already (at least with quartz).
|
|
Olly |
|
|
Documentation |
Convert text docs to HTML? Link together docs aimed at those developing
Xapian itself.
|
|
Olly |
|
|
Matcher |
Check that sort bands are correct for borderline cases (for e.g. 2 bands, the
bands are now 100% >= p > 50% and 50% >= p > 0%).
|
|
Olly |
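A sketch of a borderline check for the task above, assuming integer percentages and bands numbered upwards from 1 at the bottom (both are assumptions; sort_band is a hypothetical helper, not the matcher's code). With 2 bands it gives 100% >= p > 50% for the top band and 50% >= p > 0% for the bottom one, as described.

```cpp
#include <cassert>

// Map a percentage to a band number. Band `nbands` is the top band and
// band 1 the bottom; 0 means the percentage falls in no band (p == 0).
int sort_band(int percent, int nbands) {
    assert(percent >= 0 && percent <= 100 && nbands > 0);
    // Integer ceiling of percent * nbands / 100, so that a value exactly on
    // a boundary (eg 50% with 2 bands) lands in the lower band.
    return (percent * nbands + 99) / 100;
}
```

The cases worth testing are the edges: 100, 51, 50, 1 and 0 for 2 bands.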
|
|
Matcher |
Optimisation: Consider using hash_map instead of map in various places -
two possible such locations are i) doing collapse (in match.cc) and ii)
in the inmemory database.
|
|
|
|
|
Quartz |
Shouldn't stall just because a stale db_lock exists - instead of just an
empty file, put the hostname and pid in the file (or use a symlink with the
info in the target since that can be created atomically) and check the details
- that way we can spot a stale lock from a process on the same machine.
Or touch the lock periodically to keep it?
|
|
Olly |
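A sketch of the symlink variant suggested in the task above, using POSIX calls and assuming C++11. The function names, the LockState values and the "hostname:pid" target format are assumptions made for illustration, not the quartz implementation. symlink() fails if the name already exists, so taking the lock is atomic, and kill() with signal 0 tests whether the recorded process still exists without sending it anything.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

enum LockState { LOCK_FREE, LOCK_HELD, LOCK_STALE, LOCK_UNKNOWN };

// Build the owner string "hostname:pid" for the current process.
std::string lock_owner() {
    char host[256] = "";
    gethostname(host, sizeof(host) - 1);
    return std::string(host) + ":" + std::to_string(getpid());
}

// Try to take the lock by creating a symlink whose target names the owner.
bool acquire_lock(const std::string& path) {
    assert(!path.empty());
    return symlink(lock_owner().c_str(), path.c_str()) == 0;
}

// Inspect an existing lock and decide whether its owner is still alive.
LockState check_lock(const std::string& path) {
    char target[512];
    ssize_t len = readlink(path.c_str(), target, sizeof(target) - 1);
    if (len < 0) return errno == ENOENT ? LOCK_FREE : LOCK_UNKNOWN;
    std::string owner(target, static_cast<size_t>(len));
    std::string::size_type colon = owner.rfind(':');
    if (colon == std::string::npos) return LOCK_UNKNOWN;

    std::string me = lock_owner();
    std::string my_host = me.substr(0, me.rfind(':'));
    // A lock taken on another machine can't be checked from here.
    if (owner.substr(0, colon) != my_host) return LOCK_UNKNOWN;

    pid_t pid = static_cast<pid_t>(std::atol(owner.c_str() + colon + 1));
    if (pid <= 0) return LOCK_UNKNOWN;
    // Signal 0 performs the existence check only; ESRCH means no such process.
    if (kill(pid, 0) == -1 && errno == ESRCH) return LOCK_STALE;
    return LOCK_HELD;
}
```

One caveat: a pid can be reused by an unrelated process, so LOCK_HELD can be a false positive, whereas LOCK_STALE should be reliable on the same machine. Touching the lock periodically, the other idea above, would also cover locks held from other machines.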