|
xapian-core
1.5.1
|
Abstract base class for weighting schemes.
#include <weight.h>
Public Member Functions | |
| Weight () | |
| Default constructor, needed by subclass constructors. | |
| virtual | ~Weight () |
| Virtual destructor, because we have virtual methods. | |
| virtual Weight * | clone () const =0 |
| Clone this object. | |
| virtual std::string | name () const |
| Return the name of this weighting scheme, e.g. "bm25+". | |
| virtual std::string | serialise () const |
| Return this object's parameters serialised as a single string. | |
| virtual Weight * | unserialise (const std::string &serialised) const |
| Unserialise parameters. | |
| virtual double | get_sumpart (Xapian::termcount wdf, Xapian::termcount doclen, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax) const =0 |
| Calculate the weight contribution for this object's term to a document. | |
| virtual double | get_maxpart () const =0 |
| Return an upper bound on what get_sumpart() can return for any document. | |
| virtual double | get_sumextra (Xapian::termcount doclen, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax) const |
| Calculate the term-independent weight component for a document. | |
| virtual double | get_maxextra () const |
| Return an upper bound on what get_sumextra() can return for any document. | |
| virtual Weight * | create_from_parameters (const char *params) const |
| Create from a human-readable parameter string. | |
Static Public Member Functions | |
| static const Weight * | create (const std::string &scheme, const Registry &reg=Registry()) |
| Return the appropriate weighting scheme object. | |
Protected Types | |
| enum | stat_flags { COLLECTION_SIZE = 0 , RSET_SIZE = 0 , AVERAGE_LENGTH = 4 , TERMFREQ = 1 , RELTERMFREQ = 1 , QUERY_LENGTH = 0 , WQF = 0 , WDF = 2 , DOC_LENGTH = 8 , DOC_LENGTH_MIN = 16 , DOC_LENGTH_MAX = 32 , WDF_MAX = 64 , COLLECTION_FREQ = 1 , UNIQUE_TERMS = 128 , TOTAL_LENGTH = 256 , WDF_DOC_MAX = 512 , UNIQUE_TERMS_MIN = 1024 , UNIQUE_TERMS_MAX = 2048 , DB_DOC_LENGTH_MIN = 4096 , DB_DOC_LENGTH_MAX = 8192 , DB_UNIQUE_TERMS_MIN = 16384 , DB_UNIQUE_TERMS_MAX = 32768 , DB_WDF_MAX = 65536 , IS_BOOLWEIGHT_ = static_cast<int>(0x80000000) } |
| Stats which the weighting scheme can use (see need_stat()). | |
Protected Member Functions | |
| void | need_stat (stat_flags flag) |
| Tell Xapian that your subclass will want a particular statistic. | |
| virtual void | init (double factor)=0 |
| Allow the subclass to perform any initialisation it needs to. | |
| Weight (const Weight &) | |
| Don't allow copying. | |
| Xapian::doccount | get_collection_size () const |
| The number of documents in the collection. | |
| Xapian::doccount | get_rset_size () const |
| The number of documents marked as relevant. | |
| Xapian::doclength | get_average_length () const |
| The average length of a document in the collection. | |
| Xapian::doccount | get_termfreq () const |
| The number of documents which this term indexes. | |
| Xapian::doccount | get_reltermfreq () const |
| The number of relevant documents which this term indexes. | |
| Xapian::termcount | get_collection_freq () const |
| The collection frequency of the term. | |
| Xapian::termcount | get_query_length () const |
| The length of the query. | |
| Xapian::termcount | get_wqf () const |
| The within-query-frequency of this term. | |
| Xapian::termcount | get_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the shard. | |
| Xapian::termcount | get_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the shard. | |
| Xapian::termcount | get_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the shard. | |
| Xapian::totallength | get_total_length () const |
| Total length of all documents in the collection. | |
| Xapian::termcount | get_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the shard. | |
| Xapian::termcount | get_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the shard. | |
| Xapian::termcount | get_db_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the database. | |
| Xapian::termcount | get_db_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the database. | |
| Xapian::termcount | get_db_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the database. | |
| Xapian::termcount | get_db_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the database. | |
| Xapian::termcount | get_db_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the database. | |
Abstract base class for weighting schemes.
|
protected |
Stats which the weighting scheme can use (see need_stat()).
| Enumerator | |
|---|---|
| COLLECTION_SIZE | Number of documents in the collection. |
| RSET_SIZE | Number of documents in the RSet. |
| AVERAGE_LENGTH | Average length of documents in the collection. |
| TERMFREQ | How many documents the current term is in. |
| RELTERMFREQ | How many documents in the RSet the current term is in. |
| QUERY_LENGTH | Sum of wqf for terms in the query. |
| WQF | Within-query-frequency of the current term. |
| WDF | Within-document-frequency of the current term in the current document. |
| DOC_LENGTH | Length of the current document (sum wdf). |
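| DOC_LENGTH_MIN | Lower bound on (non-zero) document lengths. This bound is for the current shard and is suitable for using to calculate upper bounds to return from get_maxpart() and get_maxextra(). If you need a bound for calculating a returned weight from get_sumpart() or get_sumextra() then you should use DB_DOC_LENGTH_MIN instead. |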
| DOC_LENGTH_MAX | Upper bound on document lengths. This bound is for the current shard and is suitable for using to calculate upper bounds to return from get_maxpart() and get_maxextra(). If you need a bound for calculating a returned weight from get_sumpart() or get_sumextra() then you should use DB_DOC_LENGTH_MAX instead. |
| WDF_MAX | Upper bound on wdf. This bound is for the current shard and is suitable for using to calculate upper bounds to return from get_maxpart() and get_maxextra(). If you need a bound for calculating a returned weight from get_sumpart() or get_sumextra() then you should use DB_WDF_MAX instead. |
| COLLECTION_FREQ | Sum of wdf over the whole collection for the current term. |
| UNIQUE_TERMS | Number of unique terms in the current document. |
| TOTAL_LENGTH | Sum of lengths of all documents in the collection. This gives the total number of term occurrences. |
| WDF_DOC_MAX | Maximum wdf in the current document. |
| UNIQUE_TERMS_MIN | Lower bound on number of unique terms in a document. This bound is for the current shard and is suitable for using to calculate upper bounds to return from get_maxpart() and get_maxextra(). If you need a bound for calculating a returned weight from get_sumpart() or get_sumextra() then you should use DB_UNIQUE_TERMS_MIN instead. |
| UNIQUE_TERMS_MAX | Upper bound on number of unique terms in a document. This bound is for the current shard and is suitable for using to calculate upper bounds to return from get_maxpart() and get_maxextra(). If you need a bound for calculating a returned weight from get_sumpart() or get_sumextra() then you should use DB_UNIQUE_TERMS_MAX instead. |
| DB_DOC_LENGTH_MIN | Lower bound on (non-zero) document lengths. This is a suitable bound for calculating a returned weight from get_sumpart() or get_sumextra(). |
| DB_DOC_LENGTH_MAX | Upper bound on document lengths. This is a suitable bound for calculating a returned weight from get_sumpart() or get_sumextra(). |
| DB_UNIQUE_TERMS_MIN | Lower bound on number of unique terms in a document. This is a suitable bound for calculating a returned weight from get_sumpart() or get_sumextra(). |
| DB_UNIQUE_TERMS_MAX | Upper bound on number of unique terms in a document. This is a suitable bound for calculating a returned weight from get_sumpart() or get_sumextra(). |
| DB_WDF_MAX | Upper bound on wdf of this term. This is a suitable bound for calculating a returned weight from get_sumpart(). |
|
protected |
Don't allow copying.
This would ideally be private, but that causes a compilation error with GCC 4.1 (which appears to be a bug).
References Weight().
|
pure virtual |
Clone this object.
This method allocates and returns a copy of the object it is called on.
If your subclass is called FooWeight and has parameters a and b, then you would implement FooWeight::clone() like so:
FooWeight * FooWeight::clone() const { return new FooWeight(a, b); }
Note that the returned object will be deallocated by Xapian after use with "delete". If you want to handle the deletion in a special way (for example when wrapping the Xapian API for use from another language) then you can define a static operator delete method in your subclass as shown here: https://trac.xapian.org/ticket/554#comment:1
Implemented in Xapian::CoordWeight and Xapian::DiceWeight.
References Weight().
|
static |
Return the appropriate weighting scheme object.
| scheme | String containing a weighting scheme name, optionally followed by the parameters required by that weighting scheme, e.g. "bm25 1.0 0.8". |
| reg | Xapian::Registry object to allow users to add their own custom weighting schemes (default: standard registry). |
References Weight().
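The scheme string combines a scheme name with optional parameters. As a rough illustration of that layout (not Xapian's actual parsing code), the sketch below splits such a string at the first space. `SchemeSpec` and `split_scheme` are hypothetical names introduced here, and the sketch assumes the name itself never contains a space.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type: the two parts of a scheme string.
struct SchemeSpec {
    std::string name;    // e.g. "bm25"
    std::string params;  // e.g. "1.0 0.8" (empty if none were given)
};

// Split "name param1 param2 ..." at the first space.
// Assumption: the scheme name never contains a space.
inline SchemeSpec split_scheme(const std::string& scheme) {
    assert(!scheme.empty());
    std::string::size_type space = scheme.find(' ');
    if (space == std::string::npos) return {scheme, std::string()};
    return {scheme.substr(0, space), scheme.substr(space + 1)};
}
```

For "bm25 1.0 0.8" this yields the name "bm25" and the parameter string "1.0 0.8"; a scheme with no parameters yields an empty parameter string.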
|
virtual |
Create from a human-readable parameter string.
| params | string containing weighting scheme parameter values. |
Reimplemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
References Weight().
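A subclass that reimplements create_from_parameters() has to turn the human-readable string into its own parameter values. The following is a minimal sketch of one way to do that for a scheme with two numeric parameters, as in the FooWeight example under clone(). `FooParams`, `parse_foo_params`, the default values, and the choice to return false on malformed input are all assumptions of this sketch rather than Xapian behaviour.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical parameters for a FooWeight-style scheme with two doubles.
struct FooParams {
    double a = 1.0;  // Assumed default.
    double b = 0.5;  // Assumed default.
};

// Parse up to two whitespace-separated numbers from params.
// Values that are not supplied keep their defaults.
// Returns false if anything other than numbers and spaces is found.
inline bool parse_foo_params(const char* params, FooParams& out) {
    assert(params != nullptr);
    double* slots[] = {&out.a, &out.b};
    const char* p = params;
    for (double* slot : slots) {
        char* end = nullptr;
        double v = std::strtod(p, &end);
        if (end == p) break;  // No further number could be read.
        *slot = v;
        p = end;
    }
    // Only trailing spaces may remain.
    while (*p == ' ') ++p;
    return *p == '\0';
}
```

In a real subclass, the parsed values would then be passed to the subclass constructor, much as clone() does with the current values.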
|
inline protected |
A lower bound on the minimum length of any document in the database.
This bound does not include any zero-length documents.
|
inline protected |
An upper bound on the maximum length of any document in the database.
|
inline protected |
An upper bound on the number of unique terms in any document in the database.
|
inline protected |
A lower bound on the number of unique terms in any document in the database.
This bound does not include any zero-length documents.
|
inline protected |
An upper bound on the wdf of this term in the database.
|
inline protected |
A lower bound on the minimum length of any document in the shard.
This bound does not include any zero-length documents.
This should only be used by get_maxpart() and get_maxextra().
|
inline protected |
An upper bound on the maximum length of any document in the shard.
This should only be used by get_maxpart() and get_maxextra().
|
virtual |
Return an upper bound on what get_sumextra() can return for any document.
The default implementation always returns 0 (in Xapian < 2.0.0 this was a pure virtual method).
This information is used by the matcher to perform various optimisations, so strive to make the bound as tight as possible.
Reimplemented in Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, and Xapian::LMDirichletWeight.
References DOC_LENGTH.
|
pure virtual |
Return an upper bound on what get_sumpart() can return for any document.
This information is used by the matcher to perform various optimisations, so strive to make the bound as tight as possible.
Implemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
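To illustrate what a tight bound can look like, here is a sketch of a stand-in class whose per-document weight increases with wdf, so that evaluating the formula at the wdf upper bound gives a valid and reasonably tight maximum. The class is deliberately not derived from Xapian::Weight so that it compiles without Xapian; `SaturatingTfSketch`, the parameter `k`, and the formula itself are hypothetical, and the constructor argument `wdf_upper` stands in for the value a real subclass would obtain from get_wdf_upper_bound().

```cpp
#include <cassert>

// Stand-in for a Weight subclass (not derived from Xapian::Weight).
class SaturatingTfSketch {
    double factor_;       // Scaling factor, as would be passed to init().
    unsigned wdf_upper_;  // Stand-in for get_wdf_upper_bound().
    double k_;            // Hypothetical saturation parameter.

  public:
    SaturatingTfSketch(double factor, unsigned wdf_upper, double k)
        : factor_(factor), wdf_upper_(wdf_upper), k_(k) {
        assert(factor >= 0.0 && k > 0.0);
    }

    // Per-document contribution: grows with wdf but stays below factor_.
    double get_sumpart(unsigned wdf) const {
        return factor_ * wdf / (wdf + k_);
    }

    // The formula is increasing in wdf, so its value at the wdf upper
    // bound is an upper bound for every document, and it is tighter
    // than the trivial bound factor_.
    double get_maxpart() const {
        return get_sumpart(wdf_upper_);
    }
};
```

Returning `factor_` from get_maxpart() would also be correct, but a looser bound gives the matcher less room to skip documents.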
|
virtual |
Calculate the term-independent weight component for a document.
The default implementation always returns 0 (in Xapian < 2.0.0 this was a pure virtual method).
The parameters give information about the document which may be used in the calculations:
| doclen | The document's length (unnormalised). You need to call need_stat(DOC_LENGTH) if you use this value. |
| uniqterms | Number of unique terms in the document. You need to call need_stat(UNIQUE_TERMS) if you use this value. |
| wdfdocmax | Maximum wdf value in the document. You need to call need_stat(WDF_DOC_MAX) if you use this value. |
Reimplemented in Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, and Xapian::LMDirichletWeight.
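As a sketch of how a term-independent component and its bound fit together, the stand-in class below computes a hypothetical length-based extra weight that shrinks as documents get longer, and derives get_maxextra() from the shortest possible document. `LengthPriorSketch` and its formula are invented for illustration; the constructor arguments stand in for values a real subclass would obtain from get_average_length() and get_doclength_lower_bound().

```cpp
#include <cassert>

// Stand-in for a Weight subclass (not derived from Xapian::Weight).
class LengthPriorSketch {
    double scale_;           // Hypothetical parameter.
    double average_length_;  // Stand-in for get_average_length().
    unsigned doclen_lower_;  // Stand-in for get_doclength_lower_bound().

  public:
    LengthPriorSketch(double scale, double average_length,
                      unsigned doclen_lower)
        : scale_(scale),
          average_length_(average_length),
          doclen_lower_(doclen_lower) {
        assert(scale >= 0.0 && average_length > 0.0);
    }

    // Term-independent weight: only the document length is used here,
    // so a real subclass would need to call need_stat(DOC_LENGTH).
    double get_sumextra(unsigned doclen, unsigned /*uniqterms*/,
                        unsigned /*wdfdocmax*/) const {
        return scale_ / (1.0 + doclen / average_length_);
    }

    // The formula decreases as doclen grows, so the largest value
    // occurs at the lower bound on document length.
    double get_maxextra() const {
        return get_sumextra(doclen_lower_, 0, 0);
    }
};
```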
|
pure virtual |
Calculate the weight contribution for this object's term to a document.
The parameters give information about the document which may be used in the calculations:
| wdf | The within document frequency of the term in the document. You need to call need_stat(WDF) if you use this value. |
| doclen | The document's length (unnormalised). You need to call need_stat(DOC_LENGTH) if you use this value. |
| uniqterms | Number of unique terms in the document. You need to call need_stat(UNIQUE_TERMS) if you use this value. |
| wdfdocmax | Maximum wdf value in the document. You need to call need_stat(WDF_DOC_MAX) if you use this value. |
You can rely on wdf <= doclen if you call both need_stat(WDF) and need_stat(DOC_LENGTH) - this is trivially true for terms, but Xapian also ensures it's true for OP_SYNONYM, where the wdf is approximated.
Implemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
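The guarantee wdf <= doclen is useful because it lets a formula based on relative term frequency stay within a known range. The free function below is a minimal sketch of such a calculation; `relative_tf_sumpart` is a hypothetical name, and the formula is only an illustration, not one of Xapian's built-in schemes.

```cpp
#include <cassert>

// Relative term frequency scaled by factor.
// Because wdf <= doclen, the result never exceeds factor.
inline double relative_tf_sumpart(double factor, unsigned wdf,
                                  unsigned doclen) {
    assert(wdf <= doclen);
    if (doclen == 0) return 0.0;  // Zero-length document: nothing to score.
    return factor * wdf / doclen;
}
```

A scheme built on this formula would need both need_stat(WDF) and need_stat(DOC_LENGTH), and could safely return the scaling factor from get_maxpart().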
|
inline protected |
An upper bound on the number of unique terms in any document in the shard.
This should only be used by get_maxpart() and get_maxextra().
|
inline protected |
A lower bound on the number of unique terms in any document in the shard.
This bound does not include any zero-length documents.
This should only be used by get_maxpart() and get_maxextra().
|
inline protected |
An upper bound on the wdf of this term in the shard.
This should only be used by get_maxpart() and get_maxextra().
|
protected pure virtual |
Allow the subclass to perform any initialisation it needs to.
| factor | Any scaling factor (e.g. from OP_SCALE_WEIGHT). If the Weight object is for the term-independent weight supplied by get_sumextra()/get_maxextra(), then init(0.0) is called (starting from Xapian 1.2.11 and 1.3.1 - earlier versions failed to call init() for such Weight objects). |
Implemented in Xapian::CoordWeight.
References Weight().
|
virtual |
Return the name of this weighting scheme, e.g. "bm25+".
This is the name that the weighting scheme gets registered under when passed to Xapian::Registry::register_weighting_scheme().
As a result:
For 1.4.x and earlier we recommended returning the full namespace-qualified name of your class here, but now we recommend returning just the name in lower case, e.g. "foo" instead of "FooWeight", "bm25+" instead of "Xapian::BM25PlusWeight".
If you don't want to support creation via Weight::create() or the remote backend, you can use the default implementation which simply returns an empty string.
Reimplemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
|
inlineprotected |
Tell Xapian that your subclass will want a particular statistic.
Some of the statistics can be costly to fetch or calculate, so Xapian needs to know which are actually going to be used. You should call need_stat() from your constructor for each statistic needed by the weighting scheme you are implementing (possibly conditional on the values of parameters of the weighting scheme).
Some of the statistics are currently available by default and their constants above have value 0 (e.g. COLLECTION_SIZE). You should still call need_stat() for these (the compiler should optimise away these calls and any conditional checks for them).
Some statistics are currently fetched together and so their constants have the same numeric value - if you need more than one of these statistics you should call need_stat() for each one. The compiler should optimise this too.
Prior to 2.0.0, it was assumed that if get_maxextra() returned a non-zero value then get_sumextra() needed the document length even if need_stat(DOC_LENGTH) wasn't called - the logic was that get_sumextra() could only return a constant value if it didn't use the document length. However, this is no longer valid since it can also use the number of unique terms in the document, so now you need to specify explicitly.
| flag | The stat_flags value for a required statistic. |
Referenced by Xapian::BB2Weight::BB2Weight(), Xapian::BM25PlusWeight::BM25PlusWeight(), Xapian::BM25Weight::BM25Weight(), Xapian::BoolWeight::BoolWeight(), Xapian::DiceWeight::DiceWeight(), Xapian::DPHWeight::DPHWeight(), Xapian::IfB2Weight::IfB2Weight(), Xapian::IneB2Weight::IneB2Weight(), Xapian::InL2Weight::InL2Weight(), Xapian::LM2StageWeight::LM2StageWeight(), Xapian::LMAbsDiscountWeight::LMAbsDiscountWeight(), Xapian::LMDirichletWeight::LMDirichletWeight(), Xapian::LMJMWeight::LMJMWeight(), Xapian::PL2PlusWeight::PL2PlusWeight(), Xapian::PL2Weight::PL2Weight(), and Xapian::TfIdfWeight::TfIdfWeight().
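The remarks about zero-valued and shared constants can be seen in a small sketch of the bookkeeping. The enum below copies a few values from the stat_flags list on this page; `StatsNeededSketch` is a hypothetical stand-in that assumes requested flags are simply ORed into a bit mask, which is an assumption about the implementation rather than documented behaviour.

```cpp
#include <cassert>

// A few values copied from the stat_flags list on this page.
enum stat_flags_sketch : unsigned {
    COLLECTION_SIZE = 0,  // Available by default, so the value is 0.
    TERMFREQ = 1,
    COLLECTION_FREQ = 1,  // Fetched together with TERMFREQ.
    WDF = 2,
    AVERAGE_LENGTH = 4,
    DOC_LENGTH = 8
};

// Hypothetical stand-in for the bookkeeping behind need_stat().
class StatsNeededSketch {
    unsigned needed_ = 0;

  public:
    // Assumption: each request is ORed into a bit mask.
    void need_stat(stat_flags_sketch flag) { needed_ |= flag; }

    // Statistics with value 0 are always available.
    bool wants(stat_flags_sketch flag) const {
        return flag == 0 || (needed_ & flag) != 0;
    }

    unsigned mask() const { return needed_; }
};

// Mimics a constructor that requests every statistic it uses.
inline unsigned demo_mask() {
    StatsNeededSketch s;
    s.need_stat(COLLECTION_SIZE);  // No-op on the mask, still worth calling.
    s.need_stat(WDF);
    s.need_stat(TERMFREQ);
    s.need_stat(COLLECTION_FREQ);  // Same bit as TERMFREQ.
    assert(s.wants(WDF) && !s.wants(DOC_LENGTH));
    return s.mask();
}
```

Requesting each statistic by name keeps the subclass correct even if a future release gives these constants distinct or non-zero values.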
|
virtual |
Return this object's parameters serialised as a single string.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Reimplemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
|
virtual |
Unserialise parameters.
This method unserialises parameters serialised by the serialise() method and allocates and returns a new object initialised with them.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Note that the returned object will be deallocated by Xapian after use with "delete". If you want to handle the deletion in a special way (for example when wrapping the Xapian API for use from another language) then you can define a static operator delete method in your subclass as shown here: https://trac.xapian.org/ticket/554#comment:1
| serialised | A string containing the serialised parameters. |
Reimplemented in Xapian::BB2Weight, Xapian::BM25PlusWeight, Xapian::BM25Weight, Xapian::BoolWeight, Xapian::CoordWeight, Xapian::DiceWeight, Xapian::DLHWeight, Xapian::DPHWeight, Xapian::IfB2Weight, Xapian::IneB2Weight, Xapian::InL2Weight, Xapian::LM2StageWeight, Xapian::LMAbsDiscountWeight, Xapian::LMDirichletWeight, Xapian::LMJMWeight, Xapian::PL2PlusWeight, Xapian::PL2Weight, and Xapian::TfIdfWeight.
References Weight().
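The contract between serialise() and unserialise() is that the second exactly recovers the parameters written by the first. The sketch below demonstrates that round trip for two doubles using hexadecimal floating-point text ("%a"), which reproduces the values exactly. The text format, `FooParamsSketch`, `serialise_sketch`, and `unserialise_sketch` are all assumptions made for illustration and are not the encoding Xapian's own weighting schemes use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical parameters of a FooWeight-style scheme.
struct FooParamsSketch {
    double a;
    double b;
};

// Encode both doubles as hexadecimal floating point, separated by a space.
inline std::string serialise_sketch(const FooParamsSketch& p) {
    char buf[128];
    int len = std::snprintf(buf, sizeof buf, "%a %a", p.a, p.b);
    assert(len > 0 && static_cast<std::size_t>(len) < sizeof buf);
    return std::string(buf, static_cast<std::size_t>(len));
}

// Decode a string written by serialise_sketch().
// Returns false if two numbers cannot be read or trailing data remains.
inline bool unserialise_sketch(const std::string& s, FooParamsSketch& out) {
    const char* p = s.c_str();
    char* end = nullptr;
    out.a = std::strtod(p, &end);
    if (end == p) return false;
    p = end;
    out.b = std::strtod(p, &end);
    if (end == p) return false;
    return *end == '\0';
}
```

A real unserialise() would construct and return a new object from the decoded values, as described above.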