The default ordering of a result set is left up to the server, which inside Zebra means sorting in ascending document ID order. This is not always the order humans want to browse the sometimes quite large hit sets. Ranking and sorting comes to the rescue.
In cases where a good presentation ordering can be computed at
indexing time, we can use a fixed static ranking
scheme, which is provided for the alvis
indexing filter. This defines a fixed ordering of hit lists,
independently of the query issued.
There are cases, however, where relevance of hit set documents is
highly dependent on the query processed.
Simply put, dynamic relevance ranking
sorts a set of retrieved records such that those most likely to be
relevant to your request are retrieved first.
Internally, Zebra retrieves all documents that satisfy your
query, and re-orders the hit list to arrange them based on
a measurement of similarity between your query and the content of
each record.
Finally, there are situations where hit sets of documents should be
sorted
during query time according to the
lexicographical ordering of certain sort indexes created at
indexing time.
Zebra uses internally inverted indexes to look up term frequencies
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND
,
OR
and/or NOT
(which
is in fact a binary AND NOT
operation).
To ensure fast query execution
speed, all indexes have to be sorted in the same order.
The indexes are normally sorted according to document
ID
in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in
document
ID
order.
If one defines the
staticrank: 1
directive in the main core Zebra configuration file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
are ordered
first by ascending static rank,
then by ascending document ID
.
Zero
is the ``best'' rank, as it occurs at the
beginning of the list; higher numbers represent worse scores.
The experimental alvis
filter provides a
directive to fetch static rank information out of the indexed XML
records, thus making all hit sets ordered
after ascending static
rank, and for those doc's which have the same static rank, ordered
after ascending doc ID
.
See Chapter 8, ALVIS XML Record Model and Filter Module for the gory details.
In order to fiddle with the static rank order, it is necessary to invoke additional re-ranking/re-ordering using dynamic ranking or score functions. These functions return positive integer scores, where highest score is ``best''; hit sets are sorted according to descending scores (in contrary to the index lists which are sorted according to ascending rank number and document ID).
Dynamic ranking is enabled by a directive like one of the following in the zebra configuration file (use only one of these a time!):
rank: rank-1 # default TDF-IDF like rank: rank-static # dummy do-nothing
Dynamic ranking is done at query time rather than
indexing time (this is why we
call it ``dynamic ranking'' in the first place ...)
It is invoked by adding
the BIB-1 relation attribute with
value ``relevance'' to the PQF query (that is,
@attr 2=102
, see also
The BIB-1 Attribute Set Semantics, also in
HTML).
To find all articles with the word Eoraptor
in
the title, and present them relevance ranked, issue the PQF query:
@attr 2=102 @attr 1=4 Eoraptor
The default rank-1
ranking module implements a
TF/IDF (Term Frequecy over Inverse Document Frequency) like
algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only considers searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
Zebra does boolean queries and searches in specific addressed
indexes (there are inverted indexes pointing from terms in the
dictionary to documents and term positions inside documents).
It works like this:
First, the boolean query is dismantled into its principal components, i.e. atomic queries where one term is looked up in one index. For example, the query
@attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
is a boolean AND between the atomic parts
@attr 2=102 @attr 1=1010 Utah
and
@attr 2=102 @attr 1=1018 Springer
which gets processed each for itself.
Second, for each atomic query, the hit list of documents is computed.
In this example, two hit lists for each index
@attr 1=1010
and
@attr 1=1018
are computed.
Third, each document in the hit list is assigned a score (_if_ ranking is enabled and requested in the query) using a TF/IDF scheme.
In this example, both atomic parts of the query assign the magic
@attr 2=102
relevance attribute, and are
to be used in the relevance ranking functions.
It is possible to apply dynamic ranking on only parts of the PQF query:
@and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
searches for all documents which have the term 'Utah' on the body of text, and which have the term 'Springer' in the publisher field, and sort them in the order of the relevance ranking made on the body-of-text index only.
Fourth, the atomic hit lists are merged according to the boolean conditions to a final hit list of documents to be returned.
This step is always performed, independently of the fact that dynamic ranking is enabled or not.
Fifth, the total score of a document is computed as a linear combination of the atomic scores of the atomic hit lists
Ranking weights may be used to pass a value to a ranking
algorithm, using the non-standard BIB-1 attribute type 9.
This allows one branch of a query to use one value while
another branch uses a different one. For example, we can search
for utah
in the
@attr 1=4
index with weight 30, as
well as in the @attr 1=1010
index with weight 20:
@attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
The default weight is sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score is 1000 and the bottom score is 0, encoded in integers.
The ranking-weight feature is experimental. It may change in future releases of zebra.
Finally, the final hit list is re-ordered according to scores.
The rank-1
algorithm
does not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
Dynamic ranking
is not compatible
with estimated hit sizes
, as all documents in
a hit set must be accessed to compute the correct placing in a
ranking sorted list. Therefore the use attribute setting
@attr 2=102
clashes with
@attr 9=integer
.
Dynamic ranking can be enabled during sever side CQL
query expansion by adding @attr 2=102
chunks to the CQL config file. For example
relationModifier.relevant = 2=102
invokes dynamic ranking each time a CQL query of the form
Z> querytype cql Z> f alvis.text =/relevant house
is issued. Dynamic ranking can also be automatically used on specific CQL indexes by (for example) setting
index.alvis.text = 1=text 2=102
which then invokes dynamic ranking each time a CQL query of the form
Z> querytype cql Z> f alvis.text = house
is issued.
Zebra sorts efficiently using special sorting indexes
(type=s
; so each sortable index must be known
at indexing time, specified in the configuration of record
indexing. For example, to enable sorting according to the BIB-1
Date/time-added-to-db
field, one could add the line
xelm /*/@created Date/time-added-to-db:s
to any .abs
record-indexing configuration file.
Similarly, one could add an indexing element of the form
<z:index name="date-modified" type="s"> <xsl:value-of select="some/xpath"/> </z:index>
to any alvis
-filter indexing stylesheet.
Indexing can be specified at searching time using a query term
carrying the non-standard
BIB-1 attribute-type 7
. This removes the
need to send a Z39.50 Sort Request
separately, and can dramatically improve latency when the client
and server are on separate networks.
The sorting part of the query is separate from the rest of the
query - the actual search specification - and must be combined
with it using OR.
A sorting subquery needs two attributes: an index (such as a
BIB-1 type-1 attribute) specifying which index to sort on, and a
type-7 attribute whose value is be 1
for
ascending sorting, or 2
for descending. The
term associated with the sorting attribute is the priority of
the sort key, where 0
specifies the primary
sort key, 1
the secondary sort key, and so
on.
For example, a search for water, sort by title (ascending), is expressed by the PQF query
@or @attr 1=1016 water @attr 7=1 @attr 1=4 0
whereas a search for water, sort by title ascending, then date descending would be
@or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
Notice the fundamental differences between dynamic
ranking
and sorting
: there can be
only one ranking function defined and configured; but multiple
sorting indexes can be specified dynamically at search
time. Ranking does not need to use specific indexes, so
dynamic ranking can be enabled and disabled without
re-indexing; whereas, sorting indexes need to be
defined before indexing.