As mentioned above, there can be only one indexing pipeline, and configuring the indexing process amounts to writing an XSLT stylesheet that produces XML output containing the magic processing instructions or elements discussed in Section 2.5, “Canonical Indexing Format”. There are many different ways to accomplish this task, so some comments and code snippets are in order to guide the unwary.
Stylesheets can be written in the pull or the push style. Pull means that the output XML structure is taken as the starting point for the internal structure of the XSLT stylesheet, and portions of the input XML are pulled out and inserted into the right spots of the output XML structure. By contrast, push XSLT stylesheets call their template definitions recursively, a process which is driven by the input XML structure and which produces output XML whenever certain conditions in the input XML are met. The pull type is well suited to input XML with strong and well-defined structure and semantics, like the following OAI indexing example, whereas the push type might be the only possible way to sort out deeply recursive input XML formats.
A pull stylesheet example used to index OAI harvested records could use some of the following template definitions:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.com/zebra-2.0"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- Example pull and magic element style Zebra indexing -->
  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- disable all default recursive element node traversal -->
  <xsl:template match="node()"/>

  <!-- match only on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- you may use z:rank="{some XSLT function here}" -->

      <!-- explicitly pull in the nodes to index; a bare xsl:apply-templates
           would stop at oai:record, which only the empty node() template matches -->
      <xsl:apply-templates select="oai:record/oai:header/*
                                   | oai:record/oai:metadata/oai_dc:dc/*"/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier:0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

</xsl:stylesheet>
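For contrast with the pull example above, a push style stylesheet for the same kind of input could look roughly like the following. This is only a minimal sketch: the index name dc_title:w is chosen for illustration, and the stylesheet relies on the built-in XSLT template rules to walk down the input tree, emitting output whenever a dc:title element is met, wherever it occurs.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.com/zebra-2.0"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- suppress the built-in copying of text nodes to the output -->
  <xsl:template match="text()"/>

  <!-- wrap the whole result in one record element -->
  <xsl:template match="/">
    <z:record>
      <!-- no select: the input structure drives the recursion -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- fires for every dc:title at any depth (illustrative index name) -->
  <xsl:template match="dc:title">
    <z:index name="dc_title:w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

Elements without a matching template are handled by the built-in element rule, which simply recurses into their children, so the stylesheet does not need to know how deeply the interesting elements are nested.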
The DOM filter allows indexing of both binary MARC records and MARCXML records, depending on its configuration. A typical MARCXML record might look like this:
<record xmlns="http://www.loc.gov/MARC21/slim">
  <rank>42</rank>
  <leader>00366nam 22001698a 4500</leader>
  <controlfield tag="001"> 11224466 </controlfield>
  <controlfield tag="003">DLC </controlfield>
  <controlfield tag="005">00000000000000.0 </controlfield>
  <controlfield tag="008">910710c19910701nju 00010 eng </controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a"> 11224466 </subfield>
  </datafield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">DLC</subfield>
    <subfield code="c">DLC</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">123-xyz</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2="0">
    <subfield code="a">Jack Collins</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">How to program a computer</subfield>
  </datafield>
  <datafield tag="260" ind1="1" ind2=" ">
    <subfield code="a">Penguin</subfield>
  </datafield>
  <datafield tag="263" ind1=" " ind2=" ">
    <subfield code="a">8710</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">p. cm.</subfield>
  </datafield>
</record>
String manipulation is easy in the DOM filter. For example, if you want to drop leading articles when indexing sort fields, you can use the MARCXML indicator attributes to chop off leading substrings. If the above XML example had an indicator ind2="8" in the title field 245, i.e.
<datafield tag="245" ind1="1" ind2="8"> <subfield code="a">How to program a computer</subfield> </datafield>
one could write a template that takes this information into account and starts the sorting index title:s at character position 8 of the title (XPath's substring() counts from 1, so this drops the first 7 characters), like this:
<xsl:template match="m:datafield[@tag='245']">
  <xsl:variable name="chop">
    <xsl:choose>
      <xsl:when test="not(number(@ind2))">0</xsl:when>
      <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <z:index name="title:w title:p any:w">
    <xsl:value-of select="m:subfield[@code='a']"/>
  </z:index>
  <z:index name="title:s">
    <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
  </z:index>
</xsl:template>
The output of the above MARCXML and XSLT excerpt would then be:
<z:index name="title:w title:p any:w">How to program a computer</z:index>
<z:index name="title:s">program a computer</z:index>
and the record would be sorted in the title index under 'P', not 'H'.
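Note that substring(s, 8) returns the string starting at its eighth character, so it removes seven characters, not eight. MARC 21 defines the second indicator of field 245 as the number of nonfiling characters, so if you want to skip exactly that many characters, the start position has to be the indicator value plus one. The following is a minimal sketch of such a variant, assuming the m prefix is bound to the MARC21 slim namespace; the variable name skip is chosen for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.com/zebra-2.0"
                xmlns:m="http://www.loc.gov/MARC21/slim"
                version="1.0">

  <xsl:template match="m:datafield[@tag='245']">
    <!-- number of leading characters to skip; a blank or
         non-numeric indicator yields NaN and falls back to 0 -->
    <xsl:variable name="skip">
      <xsl:choose>
        <xsl:when test="number(@ind2) &gt; 0">
          <xsl:value-of select="number(@ind2)"/>
        </xsl:when>
        <xsl:otherwise>0</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <!-- substring() positions are 1-based, hence the + 1 -->
    <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], $skip + 1)"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

With this variant, a title field carrying ind2="7" and the subfield "How to program a computer" would be sorted as "program a computer", since "How to " is seven characters long.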
The names and types of the indexes can be defined dynamically in the indexing XSLT stylesheet, according to the content of the original XML records. This offers great power and wizardry, and equally great potential for disaster.
The following excerpt of a push stylesheet may be a good idea if you have strict control of the XML input format (through rigorous validation against well-defined and tight RELAX NG or XML Schema definitions, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}:w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
This template creates indexes named after the current node of any input XML file, and assigns the value '1' to each index. The example query find @attr 1=xyz 1 finds all files which contain at least one xyz XML element. If you cannot control which element names the input files contain, you are asking for disaster and bad karma by using this technique.
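Since the template above is a named template, it only produces output when it is called explicitly. A minimal sketch of how it might be wired into a push stylesheet follows. It uses local-name() instead of name(), on the assumption that a namespace prefix such as oai: in the index name could be confused with the name:type separator; that concern is an assumption, not documented behaviour.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.com/zebra-2.0"
                version="1.0">

  <!-- suppress the built-in copying of text nodes to the output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- every element emits its name index, then recursion continues -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- the current node is still the calling element here -->
  <xsl:template name="element-name-indexes">
    <z:index name="{local-name()}:w">1</z:index>
  </xsl:template>

</xsl:stylesheet>
```

The number of distinct indexes created this way equals the number of distinct element names in the input, which is why tight schema validation of the input matters.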
One variation on the theme of dynamically created indexes is definitely unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record>
    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}:w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>
  </z:record>
</xsl:template>
Don't be tempted to play overly clever tricks with the power of XSLT. The above example will create zillions of indexes with unpredictable names, resulting in severe Zebra index pollution.
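If index names must be derived from the input at all, one way to keep them bounded is to derive them only from a fixed list of element names rather than from element content. The following is a minimal sketch under that assumption; the three Dublin Core elements and the dc_ name prefix are chosen for illustration only.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.com/zebra-2.0"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- suppress the built-in copying of text nodes to the output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- only these three elements can ever produce an index, so the
       possible names are dc_title, dc_creator and dc_subject -->
  <xsl:template match="dc:title | dc:creator | dc:subject">
    <z:index name="dc_{local-name()}:w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

The index name is still computed at run time, but the match pattern caps the set of possible names, so no record content can introduce a new index.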
A DOM filter setup can be very hard to debug because of the many successive MARC syntax translations, XML stream splitting steps and XSLT transformations involved. As an aid, you always have the -s command line switch to the zebraidx indexing command at hand:
zebraidx -s -c zebra.cfg update some_record_stream.xml
This command line simulates indexing and dumps a lot of debug information into the logs, showing exactly which transformations have been applied, what the documents look like after each transformation, and which record ids and terms are sent to the indexer.