3. ICU Chain Files

An ICU chain file defines a chain of rules that specify how each record string is converted for indexing.

Both searching and sorting are based on the sort normalization that ICU provides. This means that scan and sort return terms in the sort order given by ICU.

Zebra uses the YAZ ICU wrapper. Refer to the yaz-icu man page for documentation about the ICU chain rules.

Tip

Use the yaz-icu program to test your icuchain rules.

Example 10.2. Indexing Greek text

Consider a system where all "regular" text is to be indexed as Greek (locale: EL). We would have to change our index type file to read:

      # Index greek words
      index w
      completeness 0
      position 1
      alwaysmatches 1
      firstinfield 1
      icuchain greek.xml
      ..
     

The ICU chain file greek.xml could look as follows:

      <icu_chain locale="el">
      <transform rule="[:Control:] Any-Remove"/>
      <tokenize rule="l"/>
      <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
      <display/>
      <casemap rule="l"/>
      </icu_chain>
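
One way to follow the tip above is to run the chain through yaz-icu before reindexing. The following is a minimal sketch, not a definitive procedure. It assumes that yaz-icu is installed and on the PATH, that it accepts -c for the chain file and -s to also print the sort key (as described in the yaz-icu man page), and that it reads text from standard input. The scratch directory, the file name sample.txt and the sample sentence are made up for illustration.

```shell
# Work in a scratch directory so nothing in the Zebra setup is touched.
workdir=$(mktemp -d)

# Same chain as the greek.xml example above.
cat > "$workdir/greek.xml" <<'EOF'
<icu_chain locale="el">
  <transform rule="[:Control:] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <display/>
  <casemap rule="l"/>
</icu_chain>
EOF

# Hypothetical sample input (UTF-8 Greek text).
printf '%s\n' 'Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.' > "$workdir/sample.txt"

# Run the chain only if yaz-icu is available (assumed options: -c, -s).
if command -v yaz-icu >/dev/null 2>&1; then
  yaz-icu -c "$workdir/greek.xml" -s < "$workdir/sample.txt"
else
  echo "yaz-icu not found; install YAZ with ICU support to run this check" >&2
fi
```

The tokens printed should give a preview of the terms Zebra would generate with this chain, so mistakes in a rule can be caught before a full reindex.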
     


Zebra ships with a field types file, icu.idx, which is an ICU chain version of default.idx.

Example 10.3. MARCXML indexing using ICU

The directory examples/marcxml includes a complete sample with MARCXML records that are DOM XML indexed using ICU chain rules. Study the README in the marcxml directory for details.