USI (Universal Search Interface) Profile

Version 4, 18th January 2013.
Index Data ApS

Version 4:

Version 3:

Version 2:

Version 1:


1. Introduction

This document describes the use of Index Data's Metaproxy software to provide a uniform front-end to searching many different kinds of resources using a consistent query format and obtaining the results in a consistent record format. The combination of software that makes this possible is known as the Universal Search Interface (USI). (The USI architecture and its components are described elsewhere in the document USI (Universal Search Interface) Architecture.)

Using a consistent protocol-compliant front-end, the USI provides access to all kinds of Connectors (Z39.50, APIs, screen scraping) through a single interface.

2. Example

An instance of the USI is running on for testing and demonstration purposes. Here is a sample request on that USI:

This searches for records containing the word "dinosaur" in PLoS Medicine, using the SRU web-service protocol. This SRU URL may be accessed in any web browser, and the XML response viewed.

3. Client Programming

The USI provides access to searchable resources by means of the SRU protocol (a de facto international standard which is in the process of ratification as an OASIS standard).

Since the SRU protocol has freely available specifications, client software may be constructed in any language. However, the use of client toolkits simplifies access, and Index Data's YAZ toolkit is a particularly convenient option. It is available in 32-bit and 64-bit versions on Windows and POSIX platforms. Although the toolkit itself is written in C, various glue layers enable its ZOOM API for clients to be accessed from a variety of languages including C/C++, Perl, Python, Java, Visual Basic, Scheme, Tcl, Ruby, .NET and Squeak.

Alternatively, since SRU is a simple protocol based on URL-encoded queries and XML responses, new implementations may be built from the ground up using only HTTP and XML libraries. (The example URL above is an SRU request; the XML that a browser displays in response is an SRU response.)

SRU is a stateless protocol and has no notion of a persistent result set. However, Metaproxy's implementation efficiently shares and retains result sets, re-using them when a subsequent request is received with the same query. If a search is submitted and the first ten records retrieved, then the client comes back and requests the next ten records for the same search, the server returns records from the result-set that it built previously.


Consider once more the example URL given at the beginning of this document. It breaks down as follows:

This is an SRU search-and-retrieve request, and is made up of the following components:

After obtaining the first three matching records, further records may subsequently be retrieved by sending the same query but also specifying startRecord=4, startRecord=7, etc. (This parameter is 1-based.)
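
The paging scheme described above can be sketched in a few lines of Python. This is only an illustration: the base URL is a hypothetical placeholder (not a real USI node), and the choice of SRU version 1.1 is an assumption. The parameter names (version, operation, query, startRecord, maximumRecords) are the standard SRU ones.

```python
from urllib.parse import quote, urlencode


def sru_search_url(base_url, query, start_record=1, maximum_records=3):
    """Build an SRU searchRetrieve URL for one page of results.

    start_record is 1-based, as in the SRU protocol.
    """
    params = {
        "version": "1.1",  # assumption: the node speaks SRU 1.1
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # quote_via=quote encodes spaces as %20 rather than "+"
    return base_url + "?" + urlencode(params, quote_via=quote)


def page_urls(base_url, query, page_size, pages):
    """Return the URLs for the first `pages` pages of the same query."""
    return [
        sru_search_url(base_url, query, 1 + i * page_size, page_size)
        for i in range(pages)
    ]


if __name__ == "__main__":
    # Hypothetical node and database path, for illustration only.
    for url in page_urls("http://usi.example.org/somedatabase", "dinosaur", 3, 3):
        print(url)
```

Because the query string is identical on every page, a server that retains result sets (as described above) can answer the later requests from the set it has already built.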

Queries may be much richer than a single term. They are expressed in CQL, although different Connectors will support different subsets of the full expressive power of that notation. In particular, while many targets support AND searches, few support OR.

Many Connectors support searching against specific indexes. The most widely supported are those listed in the Query Schema section.

For example, to find papers written by Liza Gross with "dinosaur" occurring in the title, use the query


In the URL:

4.1. Rich database names

The database names passed into the USI may be augmented by various parameters using the syntax basename,param1=val1&param2=val2&param3=val3... -- that is, with a comma introducing an ampersand-separated sequence of one or more name=value pairs.

Parameter names and values may be URL-encoded in the standard way in order to prevent the significant characters =, & and ? from appearing in them. (Note that an unquoted ? will bring a premature end to the rich database name.)

The following parameters are recognised:

user A username to be used in authentication onto the back-end site, often but not always in conjunction with a password. For targets that are implemented by screen-scraping, this is passed through into the connector.
password A password to be used in authentication onto the back-end, together with the username.
proxy A comma-separated list of one or more IP address/port number specifications of a proxy to be used when accessing the back-end site. For example,
nocproxy Turn off the use of the Content Proxy to provide pre-authenticated links into the back-end web-site. (Takes no value.)
content-user If the content-proxy is in use and authentication onto the full-text system requires different credentials from authentication onto the discovery system, then the username and password for full-text authentication can be specified with this parameter and content-password.
content-password See content-user.
content-proxy When the content-proxy is in use, this parameter can be used to specify an HTTP proxy that it must use for IP-based authentication purposes.
x-anything Arbitrary extension parameters with names beginning x- may be specified; all such parameters are passed through into the Init task of the screen-scraping connector to be used in a connector-specific way.

Extension parameters may or may not have values. They may also have multiple values, in which case the values are separated by a vertical bar (|).
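
As a rough sketch of the syntax described in this section, the following Python function assembles a rich database name from a base name and a set of parameters. The function name and the way parameters are passed in are illustrative choices, not part of the USI. It percent-encodes every name and value so that =, & and ? cannot appear unescaped, which is an assumption about how strictly a client should encode.

```python
from urllib.parse import quote


def rich_database_name(basename, params):
    """Assemble basename,param1=val1&param2=val2...

    `params` maps each parameter name to one of:
      None          -> parameter with no value (e.g. nocproxy)
      a string      -> single value
      a list/tuple  -> multiple values, joined with a vertical bar
    """
    parts = []
    for name, value in params.items():
        key = quote(name, safe="")
        if value is None:
            parts.append(key)
            continue
        values = [value] if isinstance(value, str) else list(value)
        # Encode each value separately so the "|" separators stay literal.
        parts.append(key + "=" + "|".join(quote(v, safe="") for v in values))
    if not parts:
        return basename
    return basename + "," + "&".join(parts)


if __name__ == "__main__":
    print(rich_database_name("foo", {"x-bar": "baz"}))
    print(rich_database_name("foo", {"user": "u", "password": "p&q", "nocproxy": None}))
```

Note that this encodes every reserved character, including the colons and commas that appear in proxy values; whether a given deployment expects those to be left literal is not covered by this sketch.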

For example, imagine a screen-scraping connector which can provide access to multiple databases dependent on which URL it starts from. The start URL may be specified at run-time by an x-starturl parameter, whose value is made available as the x-starturl parameter of the connector. Such a connector would be invoked using a rich database name such as: Stat!Ref,x-starturl= If, as in this case, the start-URL contains a question mark, then when the rich database name is used as part of a USI URL it must be encoded as %3f. For example:!Ref,proxy=

Other possible extensions include:

5. Profiling

The SRU protocol specifies how searches are expressed in a query language, how requests are encoded as URLs and how responses are encoded as XML, but says nothing about what these can contain -- what fields are available for searching, what fields can be returned in records, how authentication is specified, etc. To make it useful, SRU must be used with a "profile" which specifies these things. That is the role of this document.

5.1. Authentication

The USI itself has no notion of what user it is serving, and simply passes credentials through to the targets it is working with.

Since the SRU specification does not indicate how authentication credentials should be transmitted, the web-site credentials are encoded in a rich database name as follows:


(It is possible to set up the software so that the target credentials are passed as HTTP/Basic Authentication instead: this can be done when the software is operated behind a firewall or with IP authentication. In general, the available strategies for authentication are strongly influenced by the disposition of the system's various components inside and outside firewalls.)

IP-based authentication can be achieved by instructing the USI to route HTTP requests through an appropriate proxy running on an IP address recognised by the back-end system. This can be done by adding a "proxy" element to the rich database name. The value of this element is one or more hostname:port addresses, separated by commas. So for example the following rich database names are suitable:


When multiple proxy addresses are provided, each is tried in turn until one yields a successful response. This facility should be used with care, as it can have an adverse effect on performance. It is not possible to determine whether any given failure is due to the failure of the nominated proxy or the back-end system; so if the back-end fails, then multiple attempts will be made at using it -- through each proxy -- and all will fail.

5.1.1. Pre-authenticated full-text links

Depending on how the USI is configured, the URL field in the response records from some targets is rewritten to direct traffic through a proxy. This is a rewriting proxy which allows the user to access the same HTTP session that the screen-scraping gateway was using, facilitating seamless access for users to the target system: the authentication credentials need not be resubmitted.

5.2. Query schema

Queries can be submitted using either the Prefix Query Format (PQF) of the Z39.50 Type-1 query, or the much friendlier Common Query Language (CQL). The latter is recommended.

In CQL, query terms may be matched against whole records or against specific indexes such as author, title or subject. CQL itself does not mandate a specific set of index names, but allows this to be profiled on an application basis, drawing indexes from established sets such as the Dublin Core and Bath Profile context sets, and if necessary creating additional private indexes. The following schema is used in the USI:

cql.anywhere Search for specified terms anywhere in the record. In practice, this is the only kind of search supported by many web-sites. Not specifying an index is equivalent to using cql.anywhere.
dc.creator Author search.
dc.title Title search.
dc.subject Subject search, including controlled keywords where supported.
bath.isbn ISBN. Note that there is currently no standard way to search by ISBN using CQL. The Bath profile for Z39.50 provides no ISBN search, and the SRU Bath Profile followed the Z39.50 profile in recommending the use of a "standard identifier" search for this, using the dc.identifier index. However, the semantics of this index are insufficiently precise -- for example, it also encompasses local identifiers -- so the USI profile uses bath.isbn, an informal addition to the Bath context set.
bath.issn ISSN. For some reason, this does exist in the Bath context set.
Date of publication. By using different operators (e.g. < 1800, >= 2010, = 1998) this can be used to specify open or closed date ranges or exact dates.
rec.identifier Search by unique identifier within the target database, as obtained from a record previously. As noted above, Dublin Core's identifier element is inappropriate, as its semantics are too broad.
net.path Narrow the search to a particular database. This is useful when expressing searches for some screen-scrapers, where a single web-site provides access to multiple databases. (This will often not be needed, as database names can be embedded in complex database names such as opinionarchives2?subdatabase=american_spectator, but is occasionally useful.)
dc.language Language of the full text. May be used for searching either by language name (e.g. "english") or by ISO three-letter code. Different back-ends will offer different levels of support.
dc.format Search by publication type (journal, book, etc.)
id.fullText May be used with search term "1" to limit a search to records describing resources for which the full text is available.
id.peerReviewed May be used with search term "1" to limit a search to records describing resources for which the full text was peer reviewed as part of the publication process.
dc.description Used for searching abstracts.
dc.source A related resource from which the described resource is derived in whole or in part. Used for searching publication name.
dc.publisher Publisher.
id.seriesTitle Series title.

This set of indexes is drawn from six CQL context sets:

In practice, screen-scraped targets will have limited searching functionality, while Z39.50/API-based ones will have more. The USI will fail with an explicit diagnostic when a search requires facilities that the underlying target can't handle: this may include unsupported indexes (e.g. author search on a site that only supports keyword searching) or unsupported boolean operators (e.g. the use of OR on a site that only implements AND).
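
To illustrate how queries against these indexes might be composed on the client side, here is a small Python sketch. The helper names are invented for this example. It combines clauses only with "and", since that is the boolean operator most targets support, and it quotes terms using the general CQL rule that a term containing whitespace or certain special characters must be enclosed in double quotes.

```python
def cql_term(term):
    """Return `term` as a CQL term, double-quoted when necessary."""
    needs_quoting = (not term) or any(
        c.isspace() or c in '"()=<>/' for c in term
    )
    if not needs_quoting:
        return term
    # Inside a quoted term, backslash and double quote are escaped.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def cql_clause(index, term, relation="="):
    """Build a single search clause such as: dc.title = dinosaur"""
    return f"{index} {relation} {cql_term(term)}"


def cql_and(*clauses):
    """Combine clauses with the 'and' boolean operator."""
    return " and ".join(clauses)


if __name__ == "__main__":
    query = cql_and(
        cql_clause("dc.creator", "Liza Gross"),
        cql_clause("dc.title", "dinosaur"),
    )
    print(query)
```

The resulting string still has to be URL-encoded before being placed in the query parameter of an SRU request, and a target may reject it with a diagnostic if it does not support one of the indexes used.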

5.3. Data schema

Records are returned as XML. SRU does not mandate the use of any particular XML schema: the USI profile defines an expressive schema, which is outlined here. Data from all targets is returned uniformly in that same schema.

author Author of the work. May be repeated to indicate multiple authors.
title Title of the work. For works that are part of a larger aggregate, such as articles in a journal or chapters in an edited volume, this field contains the title of the article rather than of the journal or book.
subject Uncontrolled subject keywords or controlled subject strings/keywords.
description Ideally an abstract of the work, but may be used for other related text such as notes or summaries.
publisher The publisher of a book; sometimes also used to specify the publisher of a journal.
date Date of publication. Where possible, this is provided in ISO standard format: YYYY-MM-DD when the exact date is known, YYYY-MM or YYYY otherwise.
isbn International Standard Book Number
issn International Standard Serial Number
journaltitle The title of a journal in which an article appears.
volume The volume of the journal in which an article is found.
issue The issue of the journal volume in which an article is found.
startpage First page of an article within an edited volume or within an issue of a journal.
endpage Last page of an article within an edited volume or within an issue of a journal.
citation An unparsed free-text citation of the document. Some web-sites provide this in a format that cannot be parsed to yield separate journaltitle, volume, issue, startpage and endpage. Can be displayed to help users track down the resource by other means.
url The web address of the content, if available. Permanent links are preferred when available.
thumburl The web address of a small image representing the described document, if available.
relevance Relevance score assigned by the searched database, relative to other documents in the same collection.
id Local identifier, such as an ILS or source database identifier: that is, an identifier for the document which is meaningful only within the database where the document was found. This is useful for subsequently re-finding the same document using a search on the rec.identifier index.
holding Structured information about an individual copy of the document described by the record, e.g. a physical book in a library. See below for details.
booktitle Title of a book from which a chapter or excerpt was taken. (In records describing complete books, the title will be in the title field.)
copyright Copyright statement for the item itself.
copyrightabstract Copyright statement for abstract/summary.
pubtype Source-level publication type, for example "journal" or "book".
doctype Item-level document type, for example "article" or "chapter".
extent Number of pages, length in any measurement, etc.
format Technical or encoding format, such as mp3, wma or cda.
medium Physical medium or carrier of the information, such as videodisc, DVD or film reel.
languageitem Language of published item
languageabstract Language of summary, abstract, etc.
permalink Permanent link to record on the back-end system.
fulltexturl Link to full text. Contrast with both url and permalink.

All of these fields are repeatable. Their values are all plain text with the exception of holding, which is a structure containing the following fields:

location The physical location where the item resides, e.g. the name of a library branch.
callno Where to find the item within the specified location, e.g. a classification number and shelf designation.
available A statement of availability for loans, study at the library, etc.
due If the item is on loan, the date when it is due back.

Some targets return very few fields by default. For some such targets, more fields may be obtained by specifying the "full record" schema using the SRU parameter recordSchema=F. Note, however, that this makes no difference with most targets, and that for targets where it does return more fields there is usually a large performance penalty. This is because the typical implementation in screen-scraping connectors is to load a "full record" page for each item in a result list, resulting in many additional HTTP round-trips behind the scenes. So this facility should be used with caution.

5.4. XML Format

In casting this abstract schema into XML, several approaches are possible. Initially, we envisaged constructing a schema that primarily uses elements from the basic Dublin Core Metadata Element Set, extended where necessary by elements from the broader Dublin Core Terms vocabulary, and expressed in a form that conforms to the Dublin Core XML guidelines. However, attempts to construct such a schema quickly revealed that even the expanded Dublin Core vocabulary is hopelessly inadequate for our purposes: for example, it lacks all elements related to journal-article chronology and holdings.

Instead, we settled on using the Library of Congress's MODS (Metadata Object Description Schema) as the basis: it is much more capable of representing different kinds of resources, and includes information such as holdings. The example MODS records include a book, book chapter, serial, article in a serial, serial special issue, serial supplement, electronic serial, web document, conference publication, map, motion picture, music, and sound recording.

Using the mapping exemplified below, a valid MODS 3.4 record can contain most of the fields listed in the tables above. In order to accommodate the remaining fields, it was necessary to add extension fields in a separate namespace (e.g. id:relevance, id:circ, id:available, id:due and id:citation).

The following sample record includes all fields, with data expressed as $FIELDNAME placeholders. If the <id:xxx> private elements are removed, this validates against the MODS 3.4 schema.

<?xml version='1.0' encoding='UTF-8' ?>
<mods version="3.4"
    <url usage="primary">$URL/</url>
    <url access="preview">$THUMBURL</url>
    <url access="fulltext">$FULLTEXTURL</url>
  <name type="personal">
      <roleTerm type="text">author</roleTerm>
  <abstract type="description">$DESCRIPTION</abstract>
  <!-- <location> is repeatable for multiple holdings -->
  <relatedItem type="host">
      <!-- or -->
      <detail type="volume">
      <detail type="issue">
      <extent unit="pages">
  <identifier type="issn">$ISSN</identifier>
  <identifier type="isbn">$ISBN</identifier>
  <identifier type="permalink">$PERMALINK</identifier>
  <accessCondition type="copyright">$COPYRIGHT</accessCondition>
  <accessCondition type="copyrightabstract">$COPYRIGHTABSTRACT</accessCondition>
  <language usage="primary">
    <languageTerm type="text">$LANGUAGEITEM</languageTerm>
  <language objectPart="summary">
    <languageTerm type="text">$LANGUAGEABSTRACT</languageTerm>
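
As an illustration of how a client might read a few of these fields back out of a returned record, the following Python sketch uses only the standard library. It assumes that the record is a single <mods> element in the standard MODS namespace (http://www.loc.gov/mods/v3) and that the elements sit directly under the root, as in the sample above. The function name and the small test record are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: records use the standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def extract_fields(mods_xml):
    """Pull a handful of USI fields out of one MODS record.

    Every field is returned as a list, since all fields are repeatable.
    """
    root = ET.fromstring(mods_xml)

    def texts(path):
        return [e.text for e in root.findall(path, MODS_NS) if e.text]

    return {
        "description": texts("mods:abstract[@type='description']"),
        "issn": texts("mods:identifier[@type='issn']"),
        "isbn": texts("mods:identifier[@type='isbn']"),
        "permalink": texts("mods:identifier[@type='permalink']"),
        "copyright": texts("mods:accessCondition[@type='copyright']"),
        "languageitem": texts(
            "mods:language[@usage='primary']/mods:languageTerm"
        ),
    }


if __name__ == "__main__":
    record = """<mods xmlns="http://www.loc.gov/mods/v3" version="3.4">
      <abstract type="description">An example abstract</abstract>
      <identifier type="issn">1234-5678</identifier>
      <language usage="primary">
        <languageTerm type="text">English</languageTerm>
      </language>
    </mods>"""
    print(extract_fields(record))
```

When records arrive inside an SRU response, the <mods> element first has to be located within the enclosing response structure before a function like this is applied.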

5.5. Error Reporting

USI errors are reported in standard SRU format, as a <diag:diagnostics> element in the namespace, as in this example:

<diag:diagnostic xmlns:diag="">
  <diag:message>Masking character not supported</diag:message>
  <diag:details>Right truncation not supported (backend)</diag:details>

In this structure, the three elements have the following meanings:

<diag:uri> A URI uniquely identifying the error that has occurred. In practice, these always consist of info:srw/diagnostic/1/ followed by a small integer, and the meanings of these diagnostic codes are described briefly in the SRU Diagnostics List.
<diag:message> A short English-language message describing the error condition identified by the URI. This generally corresponds closely to the text in the Diagnostics List referenced above.
<diag:details> Optionally, any additional information provided to elucidate the error. For some error URIs, the format that this information should take is specified by the Diagnostics List; for others, it is freeform. When the USI passes through a diagnostic from the back-end system, it appends the string "(backend)" to the details string, so that the source of the error is apparent.

The most important values of <diag:uri> that can occur are as follows:

Authentication error. Usually indicates that the supplied username/password pair was rejected at the back-end or that the IP address of the specified proxy server was not recognised.
Database does not exist.
Authentication succeeded, but the user is not authorised for the specific database that was requested.
Too many simultaneous users on a limited-seat licensed database.
An error that does not fall into any of these categories, but which is described in human-readable form in the <diag:details> element.

All other diagnostic URIs indicate a system error and should be reported as a support issue.
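
A client will typically want to pick diagnostics out of a response and tell back-end errors apart from those raised by the USI itself. The following Python sketch shows one way this might be done. It assumes the SRU 1.1 diagnostic namespace http://www.loc.gov/zing/srw/diagnostic/, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: diagnostics use the SRU 1.1 diagnostic namespace.
DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"


def parse_diagnostics(response_xml):
    """Return one dict per <diag:diagnostic> found anywhere in the response."""

    def child_text(parent, name):
        return (parent.findtext(f"{{{DIAG_NS}}}{name}") or "").strip()

    found = []
    root = ET.fromstring(response_xml)
    for diag in root.iter(f"{{{DIAG_NS}}}diagnostic"):
        details = child_text(diag, "details")
        found.append({
            "uri": child_text(diag, "uri"),
            "message": child_text(diag, "message"),
            "details": details,
            # The USI appends "(backend)" to diagnostics passed through
            # from the back-end system.
            "from_backend": details.endswith("(backend)"),
        })
    return found
```

An empty list means that no diagnostics were present. The from_backend flag relies only on the "(backend)" suffix convention described above.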

5.6. Web-service access to the Database Registry

5.6.1. Accessing the registry

It is possible to inspect the Database Registry (also known above as the Connector Database) by means of a web service. The Registry is presented as a database available via SRU, and conforming to an extended version of the ZeeRex profile.

It is provided as part of each USI node as the special database IR-Explain---1. For example, given a USI node running on port 9000 of, the registry is accessible at and can be searched using CQL like any other SRU database:

Returns information about databases with "jstor" in their title,

Returns information about the one database whose unique name is JSTOR, and

Returns information about all databases in the registry.

5.6.2. Format of registry records

Records are returned in a subset of the full ZeeRex format, since many of the fields in full ZeeRex records are technical in nature and of no interest when describing USI databases whose interfaces are normalised. Here is an example record:

<?xml version="1.0"?>
<sru:searchRetrieveResponse xmlns:sru=""
        <explain id="com.indexdata.explain" authoritative="true">
          <serverInfo protocol="SRU" version="1.1">
            <title lang="en" primary="true">NewsBank [World]</title>
            <description lang="en" primary="true">WhatEVER</description>
            <dateModified>Tue, 11 Oct 2011 16:49:41 GMT</dateModified>
            <recordSyntax identifier="text/xml">
                <field name="title"/>
                <field name="description"/>
                <field name="journaltitle"/>
                <field name="extent"/>
            <set name="cql" identifier="info:srw/cql-context-set/1/cql-v1.1"/>
            <set name="dc" identifier="info:srw/cql-context-set/1/dc-v1.1"/>
            <set name="bath" identifier=""/>
            <set name="id" identifier=""/>
            <index search="true">
              <title lang="en">Keyword</title>
                <name set="cql">anywhere</name>
            <index search="true">
              <title lang="en">Author</title>
                <name set="dc">creator</name>
            <index search="true">
              <title lang="en">Title</title>
                <name set="dc">title</name>
            <index search="true">
              <title lang="en">Source</title>
                <name set="dc">source</name>
	      query-full query-and query-truncation query-prox query-sort
	      index-keyword index-author index-title index-journaltitle
	      result-title result-description result-journaltitle result-extent
	    <extensions>starturl journals</extensions>
            <supports type="boolean">and</supports>
            <supports type="boolean">or</supports>
            <supports type="proximity"/>
            <supports type="maskingCharacter">*</supports>
            <supports type="maskingCharacter">?</supports>
            <supports type="sort"/>

In addition to ZeeRex elements described as being part of v2.0 of the standard, the returned records contain additional elements, not yet part of the standard. These include serviceProvider and field. They also use an additional type for the supports element -- boolean.

As is apparent from the sample record, the most important elements are:

host and port Connection details for the server.
database Short, machine-readable database name, used to identify the database when invoking the USI.
title Long, human-readable database name.
serviceProvider Service provider.
description Optional human-readable notes about the database or about the connector that implements it.
field (repeatable) Specifies the name of a field that the database is capable of returning.
index (repeatable) Specifies an index that the database is capable of searching. The contained title is human-readable and may be ignored. The name element's set attribute and content contain the context-set and index name respectively.
supports (repeatable) Indicates that support for specific functionality is available: boolean operators, proximity, masking (i.e. truncation and wildcarding) and sorting.

The capabilities field in the configInfo is an internal extension to ZeeRex that provides a simple whitespace-separated list of named capabilities provided by the target. The important capabilities are also expressed in a more machine-readable form elsewhere in the ZeeRex record, so this field usually need not be consulted.

The extensions field is also an internal extension to ZeeRex. It provides a whitespace-separated list of named extensions supported by the USI's connector for the target. Each such named extension may be invoked by including its name, prefixed by the sequence x-, as a component in the rich database name. For example, to invoke a target called "foo" with the extension "bar" having value "baz", use the rich database name foo,x-bar=baz.

5.6.3. Feature support indicators

The following <supports> elements may appear:

<supports type="boolean">and</supports> Boolean queries with the and operator are supported. No guarantees are made about how many terms may be combined with this operator, nor about what combinations of this operator together with others may be supported.
<supports type="boolean">or</supports> Boolean queries with the or operator are supported. Caveats as for and.
<supports type="boolean">not</supports> Boolean queries with the not operator are supported. Caveats as for and.
<supports type="proximity"/> Boolean queries with the prox operator are supported. Caveats as for and.
<supports type="maskingCharacter">*</supports> The specific masking character is recognised. However, this says nothing about what context it is supported in: for this, the truncation elements must be consulted.
<supports type="truncation">right</supports> Right-truncation is supported, i.e. wildcard terms of the form foo*.
<supports type="truncation">left</supports> Left-truncation is supported, i.e. wildcard terms of the form *foo.
<supports type="truncation">both</supports> Simultaneous left-and-right truncation is supported, i.e. wildcard terms of the form *foo*. Note that this is not the same as saying that both left- and right-truncation are supported. Some targets support both of these but not both-at-once truncation; conceivably, some targets may support both-at-once truncation but not one or other of the single truncations.
<supports type="truncation">embedded</supports> Embedded truncation is supported, i.e. wildcard terms of the form f*oo.
<supports type="sort"/> Sorting (using the CQL sortby keyword) is supported.
<supports type="daterange">start</supports> Date-range searching using the inequality relation >= is supported. (Support for the index alone, indicated by the <index> element, says nothing about which relations can be used: some targets support only exact-year searching.)
<supports type="daterange">end</supports> Date-range searching using the inequality relation <= is supported. When both start and end dateranges are supported, as well as the and boolean operator, these can be combined to form fully bounded ranges.
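
The following Python sketch shows how a client might collect these indicators from a registry record and check whether fully bounded date ranges can be searched. It matches elements by local name, ignoring namespaces, since this document does not fix the namespace used. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def supported_features(record_xml):
    """Map each <supports> type to the set of values declared for it.

    Empty elements such as <supports type="sort"/> yield a set
    containing the empty string; the presence of the key is what matters.
    """
    features = defaultdict(set)
    for elem in ET.fromstring(record_xml).iter():
        if local_name(elem.tag) == "supports":
            features[elem.get("type")].add((elem.text or "").strip())
    return dict(features)


def can_search_bounded_date_range(features):
    """True when start and end date ranges and the 'and' operator are all supported."""
    return (
        {"start", "end"} <= features.get("daterange", set())
        and "and" in features.get("boolean", set())
    )
```

A client could use the same dictionary to decide, for example, whether to offer truncated searches, by checking for "right" in the "truncation" entry before sending a term of the form foo*.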

6. An extension: live status reports

As well as the SRU protocol, Metaproxy provides a facility for live reporting on the status of a given USI node. The statistics reported include the average, minimum and maximum response times, and a breakdown of the number of responses by time taken. Current Metaproxy thread usage is also reported.

The report is accessible via an HTTP request to /status. For example,

The result is formatted in XML as in the following example:

<?xml version="1.0"?>
  <responses frequency="2995">
    <response duration_start="0.000100" duration_end="0.001000" frequency="358"/>
    <response duration_start="0.001000" duration_end="0.010000" frequency="6"/>
    <response duration_start="0.010000" duration_end="0.100000" frequency="40"/>
    <response duration_start="0.100000" duration_end="0.200000" frequency="2"/>
    <response duration_start="0.200000" duration_end="0.300000" frequency="10"/>
    <response duration_start="0.300000" duration_end="0.500000" frequency="110"/>
    <response duration_start="0.500000" duration_end="1.000000" frequency="533"/>
    <response duration_start="1.000000" duration_end="1.500000" frequency="428"/>
    <response duration_start="1.500000" duration_end="2.000000" frequency="449"/>
    <response duration_start="2.000000" duration_end="3.000000" frequency="525"/>
    <response duration_start="3.000000" duration_end="4.000000" frequency="152"/>
    <response duration_start="4.000000" duration_end="5.000000" frequency="46"/>
    <response duration_start="5.000000" duration_end="6.000000" frequency="44"/>
    <response duration_start="6.000000" duration_end="8.000000" frequency="38"/>
    <response duration_start="8.000000" duration_end="10.000000" frequency="25"/>
    <response duration_start="10.000000" duration_end="15.000000" frequency="97"/>
    <response duration_start="15.000000" duration_end="20.000000" frequency="72"/>
    <response duration_start="20.000000" duration_end="30.000000" frequency="38"/>
    <response duration_start="30.000000" frequency="22"/>
    <response duration_max="89.898670"/>
    <response duration_min="0.000177"/>
    <response duration_average="3.007569"/>
  <thread_info busy="0" total="50"/>
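
A monitoring script might summarise such a report, for instance by computing the share of responses that completed within a given time. The Python sketch below does this using the <response> elements and the duration_end and frequency attributes shown above. The function name is invented for the example, and the code makes no assumption about the name of the root element.

```python
import xml.etree.ElementTree as ET


def fraction_answered_within(status_xml, seconds):
    """Fraction of responses whose histogram bucket ends at or below `seconds`.

    Only <response> elements carrying a frequency attribute are counted,
    which skips the duration_max/min/average summary elements. The final
    open-ended bucket has no duration_end and so never counts as "within".
    """
    total = 0
    within = 0
    for elem in ET.fromstring(status_xml).iter():
        if elem.tag.rsplit("}", 1)[-1] != "response":
            continue
        if "frequency" not in elem.attrib:
            continue
        count = int(elem.get("frequency"))
        total += count
        end = elem.get("duration_end")
        if end is not None and float(end) <= seconds:
            within += count
    return within / total if total else 0.0
```

Because the report is a histogram, the result is exact only when the threshold coincides with a bucket boundary such as 1.0 or 5.0 seconds; for other thresholds it is a lower bound.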
