2. Main Components

The Zebra system is designed to support a wide range of data management applications. The system can be configured to handle virtually any kind of structured data. Each record in the system is associated with a record schema which lends context to the data elements of the record. Any number of record schemas can coexist in the system. Although it may be wise to use only a single schema within one database, the system imposes no such restriction.

The Zebra indexer and information retrieval server consists of the following main applications: the zebraidx indexing maintenance utility, and the zebrasrv information query and retrieval server. Both share some of the same main components, which are presented here.

The virtual Debian package idzebra-2.0 installs all the necessary packages to start working with Zebra - including utility programs, development libraries, documentation and modules.

2.1. Core Zebra Libraries Containing Common Functionality

The core Zebra module is the meat of the zebraidx indexing maintenance utility, and the zebrasrv information query and retrieval server binaries. In short, the core libraries are responsible for:

Dynamic Loading

of external filter modules, in case the application is not compiled statically. These filter modules define indexing, search and retrieval capabilities of the various input formats.

Index Maintenance

Zebra maintains Term Dictionaries and ISAM index entries in inverted index structures kept on disk. These are optimized for fast insert, update and delete, as well as good search performance.

Search Evaluation

by execution of search requests expressed in PQF/RPN data structures, which are handed over from the YAZ server frontend API. Search evaluation includes construction of hit lists according to boolean combinations of simpler searches. Fast performance is achieved by careful use of index structures, and by evaluating specific index hit lists in the correct order.

Ranking and Sorting

components call resorting/re-ranking algorithms on the hit sets. These may also be pre-sorted using not only the assigned document IDs, but also assigned static rank information.

Record Presentation

returns - possibly ranked - result sets, hit numbers, and similar internal data to the YAZ server backend API for shipping to the client. Each individual filter module implements its own specific presentation formats.
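The index maintenance and search evaluation responsibilities listed above can be pictured with a small in-memory model. The following is only a sketch under stated assumptions, not Zebra's on-disk dictionary or ISAM code: the class TinyInvertedIndex and the helpers intersect and search_and are hypothetical names, terms are produced by naive whitespace splitting, and "correct order" is interpreted here as intersecting the shortest hit lists first.

```python
from bisect import insort


class TinyInvertedIndex:
    """Toy in-memory inverted index: term -> sorted list of document IDs."""

    def __init__(self):
        self.postings = {}   # term -> sorted list of doc IDs
        self.documents = {}  # doc ID -> set of terms, needed for update/delete

    def insert(self, doc_id, text):
        # Re-inserting an existing ID first removes its old postings.
        if doc_id in self.documents:
            self.delete(doc_id)
        terms = set(text.lower().split())
        self.documents[doc_id] = terms
        for term in terms:
            insort(self.postings.setdefault(term, []), doc_id)

    def delete(self, doc_id):
        for term in self.documents.pop(doc_id, set()):
            hits = self.postings[term]
            hits.remove(doc_id)
            if not hits:
                del self.postings[term]

    def update(self, doc_id, text):
        self.delete(doc_id)
        self.insert(doc_id, text)

    def lookup(self, term):
        return self.postings.get(term.lower(), [])


def intersect(a, b):
    """Boolean AND of two sorted hit lists using a linear merge."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def search_and(index, *terms):
    """AND several terms, starting with the shortest hit lists (assumption)."""
    hit_lists = sorted((index.lookup(t) for t in terms), key=len)
    if not hit_lists:
        return []
    result = hit_lists[0]
    for hits in hit_lists[1:]:
        result = intersect(result, hits)
    return result
```

Keeping every hit list sorted by document ID is what makes the linear merge in intersect possible; processing short lists first keeps intermediate results small.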

The Debian package libidzebra-2.0 contains all run-time libraries for Zebra; the documentation in PDF and HTML is found in idzebra-2.0-doc; and idzebra-2.0-common includes common essential Zebra configuration files.

2.2. Zebra Indexer

The zebraidx indexing maintenance utility loads external filter modules used for indexing data records of different types, and creates, updates and drops databases and indexes according to the rules defined in the filter modules.

The Debian package idzebra-2.0-utils contains the zebraidx utility.

2.3. Zebra Searcher/Retriever

This is the executable which runs the Z39.50/SRU/SRW server and glues together the core libraries and the filter modules into one great Information Retrieval server application.

The Debian package idzebra-2.0-utils contains the zebrasrv utility.

2.4. YAZ Server Frontend

The YAZ server frontend is a full-fledged stateful Z39.50 server, taking client connections and forwarding search and scan requests to the Zebra core indexer.

In addition to Z39.50 requests, the YAZ server frontend acts as an HTTP server, honoring SRU SOAP requests and SRU REST requests. Moreover, it can translate incoming CQL queries to PQF queries, if correctly configured.
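To give a feel for what a CQL to PQF translation involves, here is a toy translator. It is a sketch under stated assumptions and not the YAZ implementation: the function cql_to_pqf and the INDEX_TO_ATTR table are hypothetical stand-ins for whatever the server configuration defines, only `index = term` clauses joined by and/or/not are handled, and parentheses, relations other than `=`, and modifiers are ignored. The attribute numbers used are the BIB-1 use attributes for Title (4), Author (1003) and Any (1016).

```python
import re

# Illustrative mapping from CQL index names to BIB-1 use attributes.
# In a real deployment this comes from configuration, not from code.
INDEX_TO_ATTR = {
    "cql.serverChoice": "1=1016",  # Any
    "dc.title": "1=4",             # Title
    "dc.creator": "1=1003",        # Author
}

# Tokens: a quoted string, an equals sign, or a bare word.
TOKEN = re.compile(r'"[^"]*"|=|[^\s="]+')


def cql_to_pqf(query):
    """Translate a very small CQL subset into prefix-notation PQF."""
    tokens = TOKEN.findall(query)
    pos = 0

    def clause():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        index = "cql.serverChoice"  # default index for a bare term
        if pos + 1 < len(tokens) and tokens[pos + 1] == "=":
            index = tokens[pos]
            pos += 2
            if pos >= len(tokens):
                raise ValueError("missing term after '='")
        if index not in INDEX_TO_ATTR:
            raise ValueError(f"unknown index: {index}")
        term = tokens[pos].strip('"')
        pos += 1
        return f'@attr {INDEX_TO_ATTR[index]} "{term}"'

    pqf = clause()
    while pos < len(tokens):
        operator = tokens[pos].lower()
        if operator not in ("and", "or", "not"):
            raise ValueError(f"expected boolean operator, got {tokens[pos]!r}")
        pos += 1
        # PQF is prefix notation: the operator precedes both operands.
        pqf = f"@{operator} {pqf} {clause()}"
    return pqf
```

Boolean operators are applied left to right, so `a and b or c` becomes `@or @and a b c` in this sketch.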

YAZ is an Open Source toolkit that allows you to develop software using the ANSI Z39.50/ISO23950 standard for information retrieval. It is packaged in the Debian packages yaz and libyaz.

2.5. Record Models and Filter Modules

The hard work of knowing what to index, how to do it, and which part of the records to send in a search/retrieve response is implemented in various filter modules. It is their responsibility to define the exact indexing and record display filtering rules.

The virtual Debian package libidzebra-2.0-modules installs all base filter modules.

2.5.1. DOM XML Record Model and Filter Module

The DOM XML filter uses a standard DOM XML structure as internal data model, and can thus parse, index, and display any XML document.

A parser for binary MARC records based on the ISO2709 library standard is provided; it transforms these records into the internal MARCXML DOM representation.
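The kind of conversion described here can be sketched in a few lines. This is an illustration only, not the parser shipped with Zebra: the function iso2709_to_marcxml is a hypothetical name, the directory is assumed to use the common MARC 21 layout (3-byte tag, 4-byte length, 5-byte start offset), field data is assumed to be UTF-8, and the MARCXML namespace is omitted.

```python
import xml.etree.ElementTree as ET

FIELD_END = b"\x1e"   # field terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def iso2709_to_marcxml(raw):
    """Convert one ISO 2709 record (bytes) into a MARCXML-like Element."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # offset where field data starts
    directory = raw[24:base - 1]     # directory ends with a field terminator
    record = ET.Element("record")
    ET.SubElement(record, "leader").text = leader
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length].rstrip(FIELD_END)
        if tag < "010":              # control fields carry no subfields
            field = ET.SubElement(record, "controlfield", {"tag": tag})
            field.text = data.decode("utf-8")
            continue
        chunks = data.split(SUBFIELD)
        indicators = chunks[0].decode("ascii")
        field = ET.SubElement(record, "datafield", {
            "tag": tag, "ind1": indicators[0:1], "ind2": indicators[1:2]})
        for chunk in chunks[1:]:
            if not chunk:
                continue             # skip empty subfields
            sub = ET.SubElement(field, "subfield", {"code": chr(chunk[0])})
            sub.text = chunk[1:].decode("utf-8")
    return record


# Hand-built sample record: control field 001 and a 245 field with $a.
SAMPLE = (b"00066nam a2200049 a 4500"       # 24-byte leader
          b"001000600000" b"245001000006"   # two directory entries
          b"\x1e"
          b"12345\x1e"
          b"10\x1faZebra\x1e"
          b"\x1d")                          # record terminator
```

Calling `ET.tostring(iso2709_to_marcxml(SAMPLE))` serializes the resulting tree, which could then be handed to XSLT-style processing like any other XML document.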

The internal DOM XML representation can be fed into four different pipelines, consisting of arbitrarily many successive XSLT transformations; these are for

  • input parsing and initial transformations,

  • indexing term extraction transformations,

  • transformations before internal document storage, and

  • retrieve transformations from storage to output format.

The DOM XML filter pipelines use XSLT (and, if supported on your platform, even EXSLT), thus bringing full XPATH support to the indexing, storage and display rules of not only XML documents, but also binary MARC records.
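The four pipelines can be pictured as four lists of transformation steps applied in sequence. The sketch below is a conceptual model only: plain Python callables on strings stand in for XSLT stylesheets, the class DomFilterSketch and its field names are hypothetical, and it assumes that both the extract and store pipelines consume the output of the input pipeline, while the retrieve pipeline consumes what was stored.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Transform = Callable[[str], str]  # stand-in for one XSLT transformation


def run_pipeline(steps: List[Transform], document: str) -> str:
    """Apply each step to the output of the previous one."""
    for step in steps:
        document = step(document)
    return document


@dataclass
class DomFilterSketch:
    input_steps: List[Transform] = field(default_factory=list)
    extract_steps: List[Transform] = field(default_factory=list)
    store_steps: List[Transform] = field(default_factory=list)
    retrieve_steps: List[Transform] = field(default_factory=list)

    def index_record(self, raw: str) -> Tuple[str, str]:
        """Return (index terms, stored form) for one incoming record."""
        parsed = run_pipeline(self.input_steps, raw)
        terms = run_pipeline(self.extract_steps, parsed)
        stored = run_pipeline(self.store_steps, parsed)
        return terms, stored

    def present(self, stored: str) -> str:
        """Turn a stored record into its output format."""
        return run_pipeline(self.retrieve_steps, stored)
```

An empty pipeline simply passes the document through unchanged, which mirrors the idea that each pipeline may hold arbitrarily many steps, including none.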

Finally, the DOM XML filter allows for static ranking at index time, and for sorting hit lists according to predefined static ranks.

Details on the experimental DOM XML filter are found in Chapter 7, DOM XML Record Model and Filter Module.

The Debian package libidzebra-2.0-mod-dom contains the DOM filter module.

2.5.2. ALVIS XML Record Model and Filter Module

Note

The functionality of this record model has been improved and replaced by the DOM XML record model. See Section 2.5.1, “DOM XML Record Model and Filter Module”.

The Alvis filter for XML files is an XSLT based input filter. It indexes element and attribute content of any conceivable XML format using full XPATH support, a feature which the standard Zebra GRS-1 SGML and XML filters lacked. The indexed documents are parsed into a standard XML DOM tree, which limits record size to what fits in available memory.

The Alvis filter uses XSLT display stylesheets, which let the Zebra DB administrator associate multiple different views with the same XML document type. These views are chosen on the fly at search time.

In addition, the Alvis filter configuration is not bound to the arcane BIB-1 Z39.50 library catalogue indexing traditions and folklore, and is therefore easier to understand.

Finally, the Alvis filter allows for static ranking at index time, and for sorting hit lists according to predefined static ranks. This imposes no overhead at all: both search and indexing still perform in O(1), irrespective of document collection size. This feature resembles Google's pre-ranking using their PageRank algorithm.
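One way to see why static ranking need not cost anything at search time is to store each hit list already ordered by rank, so that the best hits are simply the first entries. The following is a minimal sketch of that idea, not Zebra's implementation: StaticRankIndex is a hypothetical class, tokenization is naive whitespace splitting, and it assumes that a higher static rank means a better document.

```python
from bisect import insort


class StaticRankIndex:
    """Toy index whose hit lists are kept in static-rank order."""

    def __init__(self):
        self.postings = {}  # term -> sorted list of (-rank, doc_id)

    def insert(self, doc_id, static_rank, text):
        # Negating the rank makes higher-ranked documents sort first;
        # the document ID breaks ties deterministically.
        key = (-static_rank, doc_id)
        for term in set(text.lower().split()):
            insort(self.postings.setdefault(term, []), key)

    def search(self, term, limit=10):
        # No sorting happens here: the list is already in rank order,
        # so the top hits are just a prefix of the stored list.
        hits = self.postings.get(term.lower(), [])
        return [doc_id for _, doc_id in hits[:limit]]
```

Because the ordering work is done when a record is indexed, returning the top hits for a term is a slice of an existing list rather than a sort over the whole hit set.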

Details on the experimental Alvis XSLT filter are found in Chapter 8, ALVIS XML Record Model and Filter Module.

The Debian package libidzebra-2.0-mod-alvis contains the Alvis filter module.

2.5.3. GRS-1 Record Model and Filter Modules

Note

The functionality of this record model has been improved and replaced by the DOM XML record model. See Section 2.5.1, “DOM XML Record Model and Filter Module”.

The GRS-1 filter modules described in Chapter 9, GRS-1 Record Model and Filter Modules are all based on the Z39.50 specifications, and it is absolutely mandatory to have the reference pages on BIB-1 attribute sets at hand when configuring GRS-1 filters. The GRS filters come in different flavors, and a short introduction is needed here. GRS-1 filters of various kinds have also been called ABS filters due to the *.abs configuration file suffix.

The grs.marc and grs.marcxml filters are suited to parse and index binary and XML versions of traditional library MARC records based on the ISO2709 standard. The Debian package for both filters is libidzebra-2.0-mod-grs-marc.

GRS-1 TCL scriptable filters for extensive user configuration come in two flavors: a regular expression filter, grs.regx, which uses TCL regular expressions, and a general scriptable TCL filter called grs.tcl. Both are included in the libidzebra-2.0-mod-grs-regx Debian package.

A general purpose SGML filter is called grs.sgml. This filter is not yet packaged, but is planned to be included in the libidzebra-2.0-mod-grs-sgml Debian package.

The Debian package libidzebra-2.0-mod-grs-xml includes the grs.xml filter, which uses Expat to parse records in XML and turn them into IDZebra's internal GRS-1 node trees. See also the Alvis XML/XSLT filter described in Section 2.5.2, “ALVIS XML Record Model and Filter Module”.
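The general pattern of turning Expat parse events into a node tree can be sketched with the Expat binding in the Python standard library. This is an illustration of the technique only: the function parse_to_tree and the dictionary-based node layout are hypothetical and are not IDZebra's internal GRS-1 structures.

```python
import xml.parsers.expat


def parse_to_tree(xml_text):
    """Build a nested node tree from XML using Expat callbacks."""
    # A synthetic root lets the start handler always attach to a parent.
    root = {"tag": None, "attrs": {}, "text": "", "children": []}
    stack = [root]

    def start(name, attrs):
        node = {"tag": name, "attrs": dict(attrs), "text": "", "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        # Expat may deliver text in several pieces, so accumulate it.
        stack[-1]["text"] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return root["children"][0]
```

Expat is a streaming parser, so the tree exists only because the handlers build it; the stack tracks which element is currently open.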

2.5.4. TEXT Record Model and Filter Module

Plain ASCII text filter. TODO: add information here.