USI (Universal Search Interface) Architecture

Version 1, 26th July 2012.
Index Data ApS

Version 1:

New file, initial version extracted and modified from v15 of an older document that was specific to a customer.

1. Introduction
2. Example
3. Architecture
- 3.1. Components
- 3.2. Different Kinds of Connector

1. Introduction

This document describes the use of Index Data's Metaproxy software to provide a uniform front-end to searching many different kinds of resources using a consistent query format and obtaining the results in a consistent record format. The combination of software that makes this possible is known as the Universal Search Interface (USI). (The query and record formats are part of the MKC Profile, described elsewhere.)

Through a single, protocol-compliant front end, the USI provides access to all kinds of Connectors (Z39.50, APIs, screen scraping).

2. Example

An instance of the USI is running on mkc.indexdata.com for testing and demonstration purposes. Here is a sample request on that USI:

http://mkc.indexdata.com:9000/PLOS_MED?version=1.1&operation=searchRetrieve&query=dinosaur&maximumRecords=3

This searches for records containing the word "dinosaur" in PLoS Medicine, using the SRU web-service protocol. This SRU URL may be accessed in any web browser, and the XML response viewed.

See MKC (MasterKey Connect) Profile for more explanation of this URL and further examples.
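
The sample request can also be assembled and its response inspected programmatically. The following is a minimal Python sketch, not part of the USI itself: the function names build_sru_url and number_of_records are made up for illustration, and the sample response is a hand-written, abbreviated document rather than real output from the demonstration server. It assumes the response uses the standard SRU 1.1 namespace.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by SRU 1.1 response documents.
SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def build_sru_url(base_url, database, query, maximum_records=3, version="1.1"):
    """Build a searchRetrieve URL shaped like the sample request above."""
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-escapes the values, so multi-word queries are safe.
    return f"{base_url}/{database}?{urlencode(params)}"


def number_of_records(response_xml):
    """Return the hit count reported in an SRU searchRetrieve response."""
    root = ET.fromstring(response_xml)
    return int(root.findtext("srw:numberOfRecords", namespaces=SRW_NS))


# Hand-written, abbreviated response used only to exercise the parser;
# it is NOT real output from mkc.indexdata.com.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:version>1.1</srw:version>
  <srw:numberOfRecords>12</srw:numberOfRecords>
</srw:searchRetrieveResponse>"""

url = build_sru_url("http://mkc.indexdata.com:9000", "PLOS_MED", "dinosaur")
print(url)
print(number_of_records(SAMPLE_RESPONSE))
```

To run this against a live USI, the URL could be fetched with urllib.request.urlopen and the response body passed to number_of_records.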

3. Architecture

3.1. Components

Uniform access to searchable resources is implemented using a combination of three components: Metaproxy, the Connector Database and the Connector Engine. Client applications communicate directly only with Metaproxy, which consults the Connector Database and makes calls into the Connector Engine and the third-party services to be searched.

The roles of the components are as follows:

Metaproxy rewrites and forwards protocol packets. Its job in this context is to rewrite queries that the application supplies in a canonical format, mapping them as required by each resource; and conversely to map result records in various formats to a canonical format.
The Connector Database provides a resource registration database describing the capabilities and idiosyncrasies of the various resources. Metaproxy consults this database to determine what transformations are required for each resource. The data in the Connector Database is managed using a RESTful web-service (documentation supplied on request). We can optionally also supply MKAdmin, a web-based Admin Console that uses this web-service to allow maintenance of the Connector Database.
The Connector Engine provides programmatic web-service APIs to human-facing web sites by masquerading as a human user, submitting query forms, and parsing the HTML responses. Within this framework, the instructions for searching a given site are expressed as a "connector" (in a different and more specific sense from how that term is used elsewhere in this document). A "connector" in this sense is a small file describing how to log into the site, fill in and submit forms, step through pages of results, and parse out the relevant fields.
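
The division of labour between Metaproxy and the Connector Database can be pictured with a toy model. The Python sketch below is purely illustrative: the dictionary layout, the field names ("ti", "au") and the functions rewrite_query and normalise_record are assumptions made for this example, and do not reflect the actual Connector Database schema or Metaproxy's configuration.

```python
# Toy stand-in for the Connector Database: one entry per resource,
# describing how canonical names map to that resource's own names.
# The layout and field names are hypothetical.
CONNECTOR_DB = {
    "PLOS_MED": {
        "query_fields": {"title": "ti", "author": "au"},
        "record_fields": {"ti": "title", "au": "author"},
    },
}


def rewrite_query(target, field, term):
    """Map a canonical (field, term) query to the resource's own syntax."""
    entry = CONNECTOR_DB[target]            # consult the registry
    native_field = entry["query_fields"][field]
    return f"{native_field}={term}"


def normalise_record(target, native_record):
    """Map a resource-specific record back to canonical field names."""
    mapping = CONNECTOR_DB[target]["record_fields"]
    # Fields the registry does not know about are dropped.
    return {mapping[k]: v for k, v in native_record.items() if k in mapping}


print(rewrite_query("PLOS_MED", "title", "dinosaur"))
print(normalise_record("PLOS_MED", {"ti": "Dinosaurs", "au": "Smith", "x": "?"}))
```

The point of the sketch is that the per-resource knowledge lives in data rather than in code, so supporting a new resource means adding a registry entry rather than changing the front end.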

3.2. Different Kinds of Connector

Although Metaproxy provides a uniform front end to searchable resources, the resources themselves may be of several different types. The most important of these are:

Z39.50, SRU or Solr server - Metaproxy can interrogate these directly, as it implements these protocols natively. These are in general the fastest services, as there is little overhead involved in accessing them.
Proprietary protocol - Where vendors offer access to their data via a proprietary protocol, we can build a gateway that accepts Z39.50/SRU requests and runs them against that protocol. This transforms the proprietary service into a standards-compliant one that Metaproxy can deal with, but at the cost of another layer of software, which may have some effect on performance.
Human-facing web server - In cases where a vendor exposes no programmatic interface to its database, only a web site, we provide a screen-scraper, implemented within the generic scraping framework of the Connector Engine. In general, these will be the slowest Connectors, as human-facing web servers tend to carry much more overhead than API servers.

When implementing a Connector for a specific service, we might use a native Z39.50 server, a scraper, or some other kind of gateway. In general, we would choose to implement just one of these -- whichever provides the best results for that service.
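
The three kinds of Connector differ mainly in how many layers sit between Metaproxy and the data. The Python sketch below simply restates that as code; the enum ConnectorKind and the function access_path are hypothetical names invented for this illustration and are not part of any Index Data software.

```python
from enum import Enum


class ConnectorKind(Enum):
    """The three kinds of resource described above (names are illustrative)."""
    NATIVE = "Z39.50, SRU or Solr server"
    GATEWAY = "proprietary protocol behind a gateway"
    SCRAPER = "human-facing web site"


def access_path(kind):
    """Return the chain of components a search passes through."""
    if kind is ConnectorKind.NATIVE:
        # Metaproxy speaks these protocols itself.
        return ["Metaproxy", "resource"]
    if kind is ConnectorKind.GATEWAY:
        # One extra layer translates Z39.50/SRU to the vendor protocol.
        return ["Metaproxy", "Z39.50/SRU gateway", "proprietary service"]
    if kind is ConnectorKind.SCRAPER:
        # The Connector Engine drives the web site as a human user would.
        return ["Metaproxy", "Connector Engine", "web site"]
    raise ValueError(f"unknown connector kind: {kind!r}")


for kind in ConnectorKind:
    print(kind.name, " -> ".join(access_path(kind)))
```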
