IndexGeo proposal for technology to support the Australian Spatial Data Directory (ASDD)

IndexGeo Pty Ltd proposed the Z39.50 protocol in response to the ANZLIC Metadata Working Group's call for proposals for technology to build a coordinated and distributed directory of geospatial datasets for Australia.

July 1997

Keywords: geospatial data directory, Z39.50 search and retrieve protocol, digital library, resource discovery, SGML, ASDD, Australia, spatial dataset description, metadata

New... The Z39.50 search and retrieve protocol has been chosen. Search the ASDD now.

Introduction

Many of the goals and strategies of the ANZLIC Strategic Plan 1997-2000 (Australia New Zealand Land Information Council, September 1997) discuss the need to maximise access to spatial information as part of building the Australian Spatial Data Infrastructure (ASDI). Developing and implementing an Australian Spatial Data Directory (ASDD) will help to address this need.

The ANZLIC Metadata Working Group has undertaken a project to "identify and evaluate applications of database and internet technology that can support a distributed directory system". In July 1997 ANZLIC invited technology providers, researchers or consultants to demonstrate the capability of available technology that can support its efforts to establish a distributed ASDD.

This document is a summary and elaboration of the proposal submitted by IndexGeo Pty Ltd. The proposals are being assessed by the Working Group during December 1997.

Summary of the proposal

The Z39.50 Search and Retrieve protocol is used to establish a distributed network of digital library servers.

The distributed servers are coordinated by a "clearinghouse", providing a consistent front-end interface to all servers.

Any custodian can join the clearinghouse by establishing a compliant Z39.50 server and adhering to technical and presentation guidelines.

Each custodian can utilise their existing document management system, because the Z39.50 server is capable of interfacing with any collection of documents.

HTTP to Z39.50 gateways enable access from both the Z39.50 digital libraries and from the World Wide Web.

The key advantages of this proposal are:

Why use the Z39.50 protocol

Z39.50 is the protocol used for library information services, especially bibliographic information systems. Any service built on this protocol is therefore immediately searchable from other existing library systems. The ASDD could be searched from the US Geospatial Data Clearinghouse or any Global Information Locator Service (GILS), and vice versa. This is important because there are US agency directories, such as the NASA Global Change Master Directory, that have datasets covering Australasian regions.

Z39.50 is capable of simultaneous query at a number of distributed servers.

Server software is available from a variety of vendors, commercial and public domain. Available servers are discussed below.

Gateways to other protocols mean that the directory is available to other realms. An HTTP to Z39.50 gateway enables collections to be searched using any World Wide Web browser. Electronic mail to Z39.50 gateways mean that no-one is disadvantaged. Of course, true Z39.50 clients can also be used. This diversity means that the directory can be searched by many means.

Front-end Z39.50 servers can connect to any back-end document management system. Documents can be stored in any brand of database, or even as a simple collection of SGML text files. Each particular server can interrogate many back-end document collections. Importantly, this enables the custodian to leave their existing document management systems in place and simply install a front-end interface to index and serve them.

Z39.50 insulates the user from the actual schema (structure and field names) of each document management system. Configuration files enable the standard interface Use Attribute numbers (the numbers behind the pick-list names) to be mapped to the actual field names in each particular document collection.
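The mapping idea can be sketched as follows. This is only an illustration, not the configuration format of any particular Z39.50 server: the Use Attribute numbers 4 (Title), 62 (Abstract) and 1003 (Author) come from the Bib-1 attribute set, while the node names, local field names, and the translate_query function are hypothetical.

```python
# Hypothetical per-node configuration: Use Attribute number -> local field name.
# The numbers are Bib-1 Use Attributes (4 = Title, 62 = Abstract, 1003 = Author);
# the field names on the right are invented for illustration.
NODE_A_FIELDS = {4: "dataset_title", 62: "abstract_text", 1003: "custodian"}
NODE_B_FIELDS = {4: "TITLE", 62: "ABSTRACT", 1003: "CUSTODIAN_ORG"}


def translate_query(use_attribute: int, term: str,
                    field_map: dict[int, str]) -> tuple[str, str]:
    """Translate an abstract (Use Attribute, term) pair into a local (field, term) pair."""
    try:
        return field_map[use_attribute], term
    except KeyError:
        raise ValueError(
            f"Use Attribute {use_attribute} is not supported by this node"
        ) from None
```

The same query, for example translate_query(4, "salinity", ...), resolves to a different local field at each node, so the user never needs to know either schema.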

The ZedWeb Project is currently being developed to create an HTTP to Z39.50 gateway service that "integrates Australian Z39.50 servers and provides a single web-interface to access the underlying data sources". The service will eventually integrate any Australian Z39.50 digital library server that is a "bibliographic" collection. For the purposes of the ZedWeb project, "bibliographic" means any database which supports the BIB-1 attribute set for querying and returns MARC records.

Using the Z39.50 protocol would give the ASDD a strong position in the Australian and international digital library communities.

Clearinghouse

Each server that participates in the distributed directory would abide by technical guidelines, so as to form a network of uniform access points to the directory. This group is called a "clearinghouse". Each server is called a "node" of the clearinghouse.

Each node would have exactly the same WWW page, being its front page to the clearinghouse. So both of the URLs http://www.indexgeo.com.au/asdd/ and http://www.environment.gov.au/asdd/ would present exactly the same page. This means that there are many access points to the clearinghouse and users can expect to be presented with exactly the same page, no matter at which node they start. If one particular server is not available then the clearinghouse can still be accessed via any other server.

The front page would list all of the clearinghouse nodes, with a couple of lines of description for each node. Each entry would have three links, with the following functions:

The nodes would be listed by jurisdiction and by thematic category.

The front page would also provide an interface to conduct a simultaneous search at a number of coordinating nodes. It is envisaged that there will be too many nodes for a simultaneous search of all of them to be practical.

Each front page would also link to a local set of HTML documentation about the clearinghouse.

When a node joins the clearinghouse it would be provided with a module of HTML pages and scripts to generate HTML interfaces. This module would be automatically updated by the coordinating node.

Examples of clearinghouses are:

Coordinating nodes

The clearinghouse would have one node that is the "coordinating node". This node has no hierarchical difference to the other nodes, as all nodes are considered peers.

The coordinating node maintains the authoritative front page for the clearinghouse, the user documentation, the scripts that generate the standard interfaces, the authority lists that define the spatial regions, the Z39.50 Use Attributes, and the SGML Document Type Definition (DTD). Whenever these documents are updated, they are automatically shipped to all of the nodes.

The coordinating node would also have a technical lead person who provides advice and guidance to the other nodes. This technical lead would keep abreast of the technology, attend relevant events and workshops, attend international standards fora, cooperate with other national clearinghouses, and then pass information on to our node managers through workshops and direct contact. The US FGDC has Doug Nebert in this role - we need a similar position.

Each thematic category and jurisdiction also has a coordinating node, as they are other clearinghouses in their own right, having other geographically-local and subject-specific nodes.

Custodians remain in control

With this model, all custodians remain in control of publishing their dataset descriptions. This follows two basic tenets of metadata management: custodians must be able to publish, update, and withdraw documents as they see fit, and publishing should be done close to the source.

Each custodian manages their document collections using whatever document management system they choose or may already have in place. All that they need to do to participate is to establish a compliant Z39.50 server and set up the configuration files that map the search interface attribute numbers to the field names in their document collections. This flexibility allows the custodian to upgrade the structure of their document management system as needed, yet still present a consistent interface to the clearinghouse.

Technical and presentation guidelines are the glue

To join the clearinghouse, each node must abide by technical and presentation guidelines. These would ensure that the various Z39.50 servers can interoperate and that each node of the clearinghouse presents a consistent interface.

Each node would:

Adhering to such guidelines will mean that our servers would also be compliant GILS servers.

Discussion group for node managers

A technical discussion group would be established as an email listserver. This would enable node managers to collaborate and would help to resolve day-to-day management issues of the clearinghouse. The coordinating nodes would also run a regular workshop series to assist all node managers to keep abreast of the changing technologies.

Jurisdiction and thematic coordinating nodes will gather data

Each jurisdiction and thematic coordinating node would regularly draw in the SGML files from each of the clearinghouse nodes that it coordinates. Consistent presentation documents (HTML, SUTRS, etc.) would be automatically generated from the SGML metadata.
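As a rough sketch of generating consistent presentations from one source, the example below renders a single record as plain text (in the spirit of SUTRS) and as HTML. It assumes the SGML has already been parsed into a simple mapping of element names to values; the function names and the sample record are hypothetical, and SGML parsing itself is not shown.

```python
from html import escape


def to_text(record: dict[str, str]) -> str:
    """Render a record as unstructured plain text, one 'Name: value' line per element."""
    return "\n".join(f"{name}: {value}" for name, value in record.items())


def to_html(record: dict[str, str]) -> str:
    """Render the same record as an HTML definition list, escaping markup characters."""
    rows = "\n".join(
        f"<dt>{escape(name)}</dt><dd>{escape(value)}</dd>"
        for name, value in record.items()
    )
    return f"<dl>\n{rows}\n</dl>"


# Hypothetical record, standing in for one parsed SGML dataset description.
record = {"Title": "Rivers & Lakes", "Custodian": "Example Agency"}
print(to_text(record))
print(to_html(record))
```

Because both presentations are derived from the same record, every node that uses the same rendering scripts presents its metadata identically.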

This enables the coordinating nodes to be searched by the simultaneous query system.

The gathering would be configured differently for each node, as they will have different publishing times and frequency.

A simple purpose-built HTTP robot could easily be developed to gather the SGML files from defined sites. All nodes would be provided with this agent, so that any node can easily host a collection for any other node that cannot provide a Z39.50 server.
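A minimal sketch of such a gathering robot is given here, assuming that each node publishes a known list of SGML file URLs. The function names are hypothetical, and the fetch step is passed in as a parameter so that the robot can be exercised without a network connection.

```python
from typing import Callable
from urllib.request import urlopen


def fetch_over_http(url: str) -> bytes:
    """Default fetcher: download one document over HTTP."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def gather(urls: list[str],
           fetch: Callable[[str], bytes] = fetch_over_http) -> dict[str, bytes]:
    """Fetch every listed SGML file, skipping any site that cannot be reached."""
    collected: dict[str, bytes] = {}
    for url in urls:
        try:
            collected[url] = fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses;
            # one unreachable node should not stop the whole run.
            continue
    return collected
```

A coordinating node could run gather on a schedule suited to each node's publishing frequency and then index the returned documents locally.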

Simultaneous search at distributed nodes

The Z39.50 protocol is capable of conducting a simultaneous search at a number of distributed nodes.

This is necessary because more than one server may have data that meets the user's search criteria, and users cannot be expected to search at each node separately.

There will be a large number of nodes, so we need a way to conduct an effective simultaneous search. The coordinating nodes would automatically gather dataset descriptions from the servers of their respective clearinghouses, as explained above. In this way, the coordinating nodes would hold a separate collection of all documents in their jurisdiction for the purpose of a simultaneous search.
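The fan-out pattern behind a simultaneous search can be sketched as below. This is an illustration only: each node is represented by a hypothetical search function that takes a query term and returns a list of hits, where a real system would open a Z39.50 session to the node instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A stand-in for a Z39.50 client session: takes a term, returns a list of hits.
SearchFn = Callable[[str], list[str]]


def search_all(nodes: dict[str, SearchFn], term: str) -> dict[str, list[str]]:
    """Send the same query to every node at once and collect the hits per node."""

    def run(item: tuple[str, SearchFn]) -> tuple[str, list[str]]:
        name, search = item
        try:
            return name, search(term)
        except Exception:
            return name, []  # a failed node contributes no hits

    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return dict(pool.map(run, nodes.items()))
```

Querying a handful of coordinating nodes this way keeps the number of parallel sessions small, which is the reason for gathering each jurisdiction's documents at its coordinating node first.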

The ZedWeb Project will be investigating distributed search at many nodes and will develop solutions. The ASDD should make strong linkages with that project. Ideally, the simultaneous search can be conducted at all relevant nodes.

Sophisticated search services

Z39.50 services are capable of sophisticated searching:

Spatial searching would be assisted by interface pick-lists of region names. Authority lists of well-known and well-defined spatial region names (states and territories, Local Government Areas (LGA), oceans and seas, major catchments, etc.) would be standard lookup tables for all servers, so that they can automatically resolve a region name into the coordinates of its minimum bounding rectangle, centroid, or bounding polygon.
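A small sketch of the lookup, assuming the authority list is held as a table of minimum bounding rectangles in decimal degrees. The names BoundingBox, REGIONS, resolve_region and overlaps are hypothetical, and the coordinates are rough placeholders rather than official values.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


# Illustrative authority list; coordinates are approximate placeholders.
REGIONS = {
    "Tasmania": BoundingBox(143.5, -44.0, 149.0, -39.0),
    "Victoria": BoundingBox(140.9, -39.2, 150.0, -34.0),
}


def resolve_region(name: str) -> BoundingBox:
    """Resolve a pick-list region name into its minimum bounding rectangle."""
    return REGIONS[name]


def overlaps(a: BoundingBox, b: BoundingBox) -> bool:
    """True if two rectangles intersect (rectangles crossing 180 degrees are not handled)."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)
```

A server could then match a region chosen from the pick-list against the bounding coordinates recorded in each dataset description, for example overlaps(resolve_region("Tasmania"), dataset_box).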

Low-cost systems

Both commercial and public domain implementations are available (explained below).

Different nodes will have different needs and budgets.

Z39.50 servers are capable of interrogating various back-end document management systems. A small organisation therefore does not need to install a costly database; it can simply manage a collection of SGML documents. The Z39.50 server can index the SGML documents and return HTML or text documents.

Importantly, this solution will also allow any custodian to host a collection of dataset descriptions for another agency. That collection can appear to be totally separate from the hosting agency's collections. Any agency can also build a public interface to allow personal users to contribute and manage metadata descriptions using online document management services.

Each node can seek support funding

Available server software

Any Z39.50 implementation can be used (for example: Isite, Zebra, Blue Angel, InterBasis). In fact, different nodes of the distributed directory can use different implementations. One node could use Isite, another node could use Zebra, while another node could use Blue Angel. Because they are all based on the Z39.50 protocol, they can all interoperate to form a virtual catalogue.

This diversity is a big advantage as some organisations can choose to use a polished commercial product while others can use a more affordable public domain solution. Some organisations also have a need to customise their interfaces, so they can choose an implementation that allows them that flexibility.

Some organisations may choose to develop their own customised server based on a toolkit, such as the YAZ toolkit from Index Data.

The following sites have more information and reviews of Z39.50 servers and gateways:

Accessed by a variety of client software

The directory would be able to be accessed by a wide variety of clients, thus extending its reach. These clients would include:

The following sites have more information and reviews of Z39.50 clients:

International cooperation enables interoperability

Other countries have already built similar distributed directories using Z39.50.

We should cooperate with them as both sides will then benefit. Our support is useful.

Such cooperation is extremely important, as our directories can then interoperate with others such as GELOS, GILS, and the FGDC Clearinghouse.

Standardisation processes are required

The following standardisation processes need to be in place in order to build the clearinghouse:

Demonstration services

The following services demonstrate the Z39.50 solution.

Australasia:

Other countries and international:

Other references


URL:http://www.indexgeo.com.au/tech/asdd/ig-proposal.html
Last Modified: 22 December 1997