USI (Universal Search Interface) Architecture
Version 1, 26th July 2012.
Index Data ApS
Version 1:
-
New file, initial version extracted and modified from v15 of
an older document that was specific to a customer.
This document describes the use of Index Data's
Metaproxy
software to provide a uniform front-end to searching many
different kinds of resources using a consistent query format and
obtaining the results in a consistent record format. The
combination of software that makes this possible is known as the
Universal Search Interface (USI). (The query and record formats
are part of the MKC Profile, described elsewhere.)
Using a consistent, protocol-compliant front-end, the USI
provides access to all kinds of Connectors (Z39.50, APIs, screen
scraping) through a single interface.
An instance of the USI is running on mkc.indexdata.com for
testing and demonstration purposes. Here is a sample request on
that USI:
http://mkc.indexdata.com:9000/PLOS_MED?version=1.1&operation=searchRetrieve&query=dinosaur&maximumRecords=3
This searches for records containing the word "dinosaur"
in PLoS Medicine, using the SRU web-service protocol.
This SRU URL may be accessed in any web browser, and the XML
response viewed.
See MKC (MasterKey Connect) Profile
for more explanation of this URL and further examples.
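As a minimal sketch of how a client might assemble the sample
request above, the following Python builds the same SRU URL from
its parts. The function name build_sru_url and its parameter names
are illustrative assumptions, not part of the USI; only the host,
database name and SRU parameters come from the sample URL.

```python
from urllib.parse import urlencode


def build_sru_url(base, database, query, maximum_records=3, version="1.1"):
    """Assemble an SRU searchRetrieve URL (hypothetical helper)."""
    # Parameter names and order follow the sample request in this document.
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return f"{base}/{database}?{urlencode(params)}"


# Host and database name are taken from the sample request.
url = build_sru_url("http://mkc.indexdata.com:9000", "PLOS_MED", "dinosaur")
print(url)
```

The printed URL can be pasted into a browser or fetched with any
HTTP client; queries containing spaces or punctuation are
percent-encoded by urlencode.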
Uniform access to searchable resources is implemented using a
combination of three components: Metaproxy, the Connector
Database and the Connector Engine. Client applications
communicate directly only with Metaproxy, which consults the
Connector Database and makes calls into the Connector Engine and
the third-party services to be searched.
The roles of the components are as follows:
-
Metaproxy
rewrites and forwards protocol packets. Its job in this
context is to rewrite queries that the application supplies in
a canonical format, mapping them as required by each resource;
and conversely to map result records in various formats to a
canonical format.
-
The Connector Database
provides a resource registration database describing the
capabilities and idiosyncrasies of the various resources.
Metaproxy consults this database to determine what
transformations are required for each resource. The data in
the Connector Database is managed using a RESTful web-service
(documentation supplied on request). We can optionally also
supply MKAdmin, a web-based Admin Console that uses this
web-service to allow maintenance of the Connector Database.
-
Connector Engine
provides programmatic web-service APIs to human-facing
web-sites by masquerading as a human user, submitting query
forms, and parsing the HTML responses. Within this framework,
the instructions for searching a given site are expressed as a
"connector" (in a different and more specific sense from how
that term is used elsewhere in this document). A "connector"
in this sense is a small file describing how to log into the
site, fill in and submit forms, step through pages of results,
and parse out the relevant fields.
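The division of labour described above can be illustrated with a
small Python sketch: a registry lookup (standing in for the
Connector Database) decides how a canonical query is rewritten and
how a resource's records are mapped back to a canonical form. This
is only a conceptual model. ResourceProfile, REGISTRY, EXAMPLE_DB
and the field names are all hypothetical, and the sketch does not
reflect Metaproxy's actual configuration or the Connector
Database's real schema.

```python
from dataclasses import dataclass


@dataclass
class ResourceProfile:
    """Hypothetical stand-in for one Connector Database entry."""
    query_fields: dict   # canonical index name -> resource-specific name
    record_fields: dict  # resource-specific field -> canonical field


# Hypothetical registry with a single made-up resource.
REGISTRY = {
    "EXAMPLE_DB": ResourceProfile(
        query_fields={"title": "ti", "author": "au"},
        record_fields={"ti": "title", "au": "author"},
    ),
}


def rewrite_query(resource, index, term):
    """Map a canonical (index, term) pair to the resource's own index name."""
    profile = REGISTRY[resource]
    return f'{profile.query_fields[index]}="{term}"'


def normalize_record(resource, raw):
    """Map a resource-specific record to canonical field names,
    dropping fields the profile does not know about."""
    profile = REGISTRY[resource]
    return {
        profile.record_fields[name]: value
        for name, value in raw.items()
        if name in profile.record_fields
    }


print(rewrite_query("EXAMPLE_DB", "title", "dinosaur"))
print(normalize_record("EXAMPLE_DB", {"ti": "Dinosaurs", "zz": "ignored"}))
```

Keeping the per-resource differences in data rather than in code
mirrors the design described here: adding a resource means adding
a registry entry, not changing the front-end.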
Although Metaproxy provides a uniform front end to searchable
resources, the resources themselves may be of several different
types. The most important of these are:
-
Z39.50, SRU or Solr server -
Metaproxy can interrogate these directly, as it implements
these protocols natively. These are in general the fastest
services, as there is little overhead involved in accessing
them.
-
Proprietary protocol -
Where vendors offer access to their data via a proprietary
protocol, we can build a gateway that accepts Z39.50/SRU
requests and runs them against that protocol. This transforms
the proprietary service into a standard-compliant one which
Metaproxy can deal with, but at the cost of another layer of
software which may have some effect on performance.
-
Human-facing web server -
In cases where a vendor exposes no programmatic interface to
its database, only a web site, we provide a
screen-scraper, implemented within the generic scraping
framework of the Connector Engine. In general, these will be
the slowest Connectors, as human-facing web servers tend to
carry much more overhead than API servers.
When implementing a Connector for a specific service, we might
use a native Z39.50 server, a scraper, or some other kind of
gateway. In general, we would choose to implement just one of
these -- whichever provides the best results for that service.