CODATA / TDWG Task Group on Access to Biological Collection

ABCD Schema - Task Group on
Access to Biological Collection Data

A joint CODATA and TDWG initiative supported by GBIF

4th Workshop, Oeiras, October 2003

During the TDWG Annual Meeting.

Venue: Instituto Gulbenkian de Ciência, Oeiras, Lisboa, Portugal

Draft Notes from the Meeting of Data Definition Subgroup (Oct. 24)

[A report of the meeting of the Protocol Definition Subgroup on the 22nd of October will soon be made available.]

Introduction

[Introduction to the ABCD group and current state of development - Berendsohn.] The TDWG Accessions Subgroup had been attempting to provide a data dictionary for data exchange and database design since the early 1990s. TDWG picked up activities of external groups providing unit-level (specimen, observation) data standards, including those for botanical garden records (ITF) and herbaria (HISPID). These developments were paralleled by the creation of relational information models (ASC 1992, CDEFD 1995, BioCISE 1998, etc., see standards page) serving as reference models for database and standard development. During the TDWG meeting in 2000 (Frankfurt), the need for an XML standard data definition was generally accepted. Early in 2001 the group was assigned Working Group status by CODATA (in 2002 this was upgraded to Task Group status).

ABCD was devised as a comprehensive XML schema, providing maximum capacity to expose data contained in unit-level collection databases. Apart from the definition of XML tags and structures, it was an explicit aim to define the semantics of all elements, thus providing a basic ontology for the domain. The schema is extendible to support special requirements of specific communities.

ABCD is to provide a common data definition for all biological collections (living, dead, sets of observations, sounds, etc.) and it is to include full support for statements of provider rights, IPR, copyright etc. It allows variable atomisation, i.e. polymorphic content (e.g. names as one string or fully atomised) to support a wide range of provider database structures. In general, it does not take a minimal common denominator approach but supports rich data, thus looking beyond the obvious questions users have regarding information content of biological collections.

The ENHSIN and BioCASE projects drove the process in 2001/2002, providing drafts that were discussed during TDWG and other meetings and that were exposed in a Request for Comment process. An editorial meeting sponsored by GBIF led to the version currently used in reference implementations. In 2002 the ABCD Schema was accepted by GBIF, but GBIF went ahead with DiGIR and Darwin Core (DC) because provider and portal software was available for DiGIR, and that protocol was not able to handle complex schemas such as ABCD. A protocol supporting ABCD was provided by the BioCASE reference implementation in 2003, and in October GBIF decided to integrate the BioCASE network into the nascent GBIF network. The current version (and version history) of ABCD is available at http://www.bgbm.org/TDWG/CODATA/Schema/.

Discussion

General issues

The very broad coverage leaves it to the consumer to determine where to look. However, the schema looks far too complicated for the average user and thus will not normally be exposed. It is up to the programmer to reassemble the different uses of the structure into a presentation layer that supports user requirements. A reference portal implementation is under construction by the Paris BioCASE team; the Berlin team has implemented preliminary interfaces as an intermediate measure (see http://www.biocase.org/provider/).

As an effect of polymorphism, the challenge with ABCD from a client application developer’s perspective is that the same data can be in different places. BioCASE developers have to resort to tricks such as concatenating elements in order to present the data. They need more documentation.
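The client-side problem can be illustrated with a minimal Python sketch that looks for a scientific name first as a single string and then falls back to concatenating atomised parts. Only NameAuthorYearString is an element name taken from these notes; the Identification wrapper and the Genus, SpeciesEpithet and Author elements are hypothetical stand-ins, not actual ABCD element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical atomised element names, used for illustration only.
ATOMISED_PARTS = ("Genus", "SpeciesEpithet", "Author")


def resolve_name(identification):
    """Return one display string for a polymorphic identification element."""
    # 1. Prefer the single-string form if the provider filled it in.
    text = identification.findtext("NameAuthorYearString", default="").strip()
    if text:
        return text
    # 2. Otherwise concatenate whatever atomised parts are present.
    parts = [identification.findtext(tag, default="").strip() for tag in ATOMISED_PARTS]
    joined = " ".join(part for part in parts if part)
    return joined or None


# Two providers exposing the same name at different levels of atomisation.
single = ET.fromstring(
    "<Identification>"
    "<NameAuthorYearString>Bellis perennis L.</NameAuthorYearString>"
    "</Identification>"
)
atomised = ET.fromstring(
    "<Identification>"
    "<Genus>Bellis</Genus><SpeciesEpithet>perennis</SpeciesEpithet><Author>L.</Author>"
    "</Identification>"
)
empty = ET.fromstring("<Identification/>")
```

Both sample records resolve to the same display string, while a record without any name yields None, which matches the point that units may be provided without identification.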

Does ABCD support the PhyloCode and rankless naming? ABCD does not cover taxonomic data (synonyms etc.); it covers names as the result of the identification of a unit (a specimen or the observation of an organism in the field). ABCD does not currently support structured names following the PhyloCode; they would have to be entered as an informal name in the current schema. If there is a need for a structured response, the schema can be extended without affecting the overall structure.

Internationalisation – currently the entire documentation is in English. If certain parts of the schema (e.g. the full element name in the element annotation part of the schema) are to be used in the interface, the schema should support other languages as well. However, some participants consider this inclusion of interface elements in the schema a violation of XML rules. [The XML community doesn’t seem to have a simple answer to this. According to Bob there are strong proponents of the opinion that it is OK for annotation content to “escape from the schema”.]

Another aspect of polymorphism is the efficiency of importing data. For example, the Australian Virtual Herbarium interrogates the network to find out whether a duplicate record is available and, if so, imports it. Polymorphism may get in the way of this. However, it may also help, because people are not forced to de-atomize their data; moreover, nothing prevents groups of data providers from agreeing that certain elements are mandatory for information exchange.

Mandatory/Recommended elements

ABCD currently has only 4 elements that have to be provided in every XML document produced: SourceInstitutionCode, SourceName, UnitID, and SourceLastUpdatedDate. The combination of the first three provides the universally unique identifier of the unit. Note: XML Schema permits a mandatory element with simple content to have empty content, even though the element itself must be present if the document is to be valid against the Schema. However, the discussion focused on “mandatory” in the sense of required content.
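The difference between "present" and "has content" can be sketched in Python as follows. The four element names come from the notes above; the flat Unit wrapper and the sample values are invented for illustration and do not reflect where these elements actually sit in the ABCD schema.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("SourceInstitutionCode", "SourceName", "UnitID", "SourceLastUpdatedDate")
ID_PARTS = REQUIRED[:3]  # the first three together identify the unit


def missing_content(unit):
    """List required elements that are absent or present but empty."""
    return [tag for tag in REQUIRED if not (unit.findtext(tag) or "").strip()]


def unit_key(unit):
    """Build the identifying triple, refusing units with an incomplete key."""
    missing = [tag for tag in missing_content(unit) if tag in ID_PARTS]
    if missing:
        raise ValueError(f"cannot build unit key, no content in: {missing}")
    return tuple(unit.findtext(tag).strip() for tag in ID_PARTS)


# Hypothetical flat record: the date element is present but empty, which a
# schema validator would accept although no content is provided.
unit = ET.fromstring(
    "<Unit>"
    "<SourceInstitutionCode>XYZ</SourceInstitutionCode>"
    "<SourceName>Example Herbarium</SourceName>"
    "<UnitID>12345</UnitID>"
    "<SourceLastUpdatedDate></SourceLastUpdatedDate>"
    "</Unit>"
)
```

A check of this kind would have to live in provider or portal software, since, as noted, schema validation alone does not guarantee content.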

Should the name be mandatory? And which one? Scientific names are currently supported on three levels of atomisation: as a single string, as two strings (name and author/year, to align with Darwin Core), and fully atomised. In addition, informal names such as common names are supported as the result of an identification event. None of these are mandatory. ABCD supports the provision of unit data without identification, because researchers may be highly interested in unidentified material from a defined location.

After some discussion of alternative structures, it was agreed that, if a scientific name is provided, the simplest taxon string should be made mandatory, even if more atomised data is present. This element can be concatenated in a view at the level of the provider database, and it can be used for searches. Common names should not be included except as an anomaly.

This implies rearranging the identification type so that the NameAuthorYearString becomes mandatory whenever a scientific name is selected as the result of an identification. The name of this element led to the misconception that it must include all these items. This is not the case and the name should reflect that. At the very least, a comment must clarify that the content can be anything from a single genus name (or “Genus sp.” etc.) to a full botanical name (incl. basionym and combination author teams) or a full zoological name (including year of publication).

It must also be made clear whether scientific names of a rank higher than genus can be inserted here (e.g. a family name). It may be necessary to allow that because it reflects current use in many databases (e.g. the identification of a not further specified invertebrate in a vertebrate collection). The higher taxon elements provided by the schema are used to classify the identification result when such a classification is present in the source database. They can be used to simplify searches and provide metadata, but perhaps should not represent the actual result of the identification.

But what to do if the provider does not distinguish between common and scientific name and has a mix? This would be solved by the inclusion of all these identification results under informal names – and a strong recommendation to change the database to make it more useful for scientific purposes.

In a collection information system, cases where users want to search common names and scientific names in one go appear to be rare. If this is to be allowed, it can be taken care of at the portal level.

It is not the purpose of the collection schema to provide a place for all taxonomic data that may be present in a source database. For example, synonymic searches should use a Names Service to expand the query.
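How a portal might expand a query before searching unit data can be sketched in Python. This is an assumption-laden illustration: the dictionary stands in for a Names Service (no actual service interface is described in these notes), and the record layout is hypothetical.

```python
def expand_query(name, names_service):
    """Return the queried name followed by its synonyms, without duplicates."""
    terms = [name]
    for synonym in names_service.get(name, []):
        if synonym not in terms:
            terms.append(synonym)
    return terms


def search_units(units, name, names_service):
    """Find units identified under the name or any synonym of it."""
    terms = set(expand_query(name, names_service))
    return [unit["unit_id"] for unit in units if unit["name"] in terms]


# Stand-in for a Names Service: a plain lookup table of synonyms.
names_service = {"Solanum lycopersicum": ["Lycopersicon esculentum"]}

# Hypothetical unit records as they might arrive from two providers.
units = [
    {"unit_id": "A1", "name": "Solanum lycopersicum"},
    {"unit_id": "B7", "name": "Lycopersicon esculentum"},
    {"unit_id": "C3", "name": "Bellis perennis"},
]
```

The unit records themselves carry only the name resulting from the identification; the synonymy lives entirely outside the collection schema.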

There clearly is a need for more guidance on configuration: recommendations on how to map elements, preferred points for searches, what to do if an element is empty, etc. GBIF currently supports a project to produce a generic interface to map between database schemas and federation schemas such as ABCD. An effort should be made to provide a fully commented version, or rather a documentation of the semantics, of the ABCD schema.

Controlled vocabulary

Controlled vocabularies or authority files used as such would greatly enhance the usability of the data for data mining applications etc. Controlled vocabularies should only be used where there is a controlled list of values that does not depend on language or (scientific) opinion. Different communities could agree on different areas of the schema where controlled vocabularies are useful; however, these then cannot be enforced by the XML schema. Requiring the use of controlled vocabularies means requiring databases to adhere to them, which at present would be a major obstacle in the global effort to mobilise content. We should be influencing the community with best practices on how to capture data in the future. Best practices have to be documented, e.g. recommendations should be given to use fully descriptive values instead of abbreviations, even if in a foreign language.

A specific question to be addressed is that of taxonomic ranks for higher taxa. In Darwin Core (DC), there are elements for the names of different ranks (but not all ranks are defined). In ABCD, there is a single unbounded element with an attribute rank. To be able to convert ABCD data into DC, a controlled vocabulary should be provided for that attribute in ABCD. However, if we make it a controlled vocabulary, we are being prescriptive. The solution decided upon was to provide a controlled vocabulary of rank names but to leave the attribute optional. That is, taxa can be entered without rank.
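The conversion problem can be sketched in Python under stated assumptions: the element name HigherTaxon, the attribute name rank, and the list of DC rank fields in the lookup table are illustrative choices, not taken from either schema. The sketch shows why a controlled rank vocabulary helps conversion and what happens to taxa entered without a rank.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from controlled rank values to DC-style per-rank fields.
DC_RANK_FIELDS = {
    "kingdom": "Kingdom",
    "phylum": "Phylum",
    "class": "Class",
    "order": "Order",
    "family": "Family",
}


def higher_taxa_to_dc(identification):
    """Split higher taxa into DC-style fields and a list of unmappable names."""
    dc_record = {}
    unmapped = []
    for taxon in identification.findall("HigherTaxon"):
        name = (taxon.text or "").strip()
        if not name:
            continue
        rank = (taxon.get("rank") or "").strip().lower()
        field = DC_RANK_FIELDS.get(rank)
        if field:
            dc_record[field] = name
        else:
            # No rank given, or a rank with no per-rank field on the DC side.
            unmapped.append(name)
    return dc_record, unmapped


identification = ET.fromstring(
    "<Identification>"
    "<HigherTaxon rank='family'>Asteraceae</HigherTaxon>"
    "<HigherTaxon rank='tribe'>Astereae</HigherTaxon>"
    "<HigherTaxon>Asterales</HigherTaxon>"
    "</Identification>"
)
```

Taxa without a rank remain valid on the ABCD side, as decided, but they cannot be placed in a per-rank field and are therefore returned separately.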

Reusability of complex types

Some of the complex types used in ABCD could become part of other standards, thus helping to build a repository of XML data types for biodiversity data. Consensus was that we can agree on complex types that are relatively simple – like datetime, coordinates, etc. Where possible, types defined by other communities should be used. Links to geographic standards are discussed within the Spatial Data Subgroup. GML (Geographic Markup Language) types can be used to extend the present ABCD schema.

What would be the best way to structure such a type library? In GML this seems to be solved by creating separate schemas for the bits that can be reused and then including them in a file.

DateTimeType – we need to add an Explicit field. A value of Yes would indicate that the event lasted over the whole range of time indicated. A value of No would indicate that the event occurred at some point during the range of time.

In order to record a seasonal event with no known year, the DateTimeValue would be left empty and the Julian Dates would be given.
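The intended semantics of the proposed Explicit field can be sketched with a small Python data class. The class and field names are hypothetical, and the sketch assumes that "Julian Dates" means day-of-year numbers; neither is confirmed by the schema itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DateTimeRange:
    """Hypothetical stand-in for a date range with an Explicit flag."""

    begin: Optional[date]
    end: Optional[date]
    explicit: bool = False  # True: the event lasted over the whole range
    first_day_of_year: Optional[int] = None  # assumed meaning of "Julian Date"
    last_day_of_year: Optional[int] = None

    def describe(self) -> str:
        # No calendar dates at all: either undated or a seasonal event.
        if self.begin is None and self.end is None:
            if self.first_day_of_year is None or self.last_day_of_year is None:
                return "undated"
            return (
                f"seasonal event, days {self.first_day_of_year}-"
                f"{self.last_day_of_year} of an unknown year"
            )
        if self.explicit:
            return f"lasted from {self.begin} to {self.end}"
        return f"happened at some point between {self.begin} and {self.end}"


whole_range = DateTimeRange(date(2003, 10, 22), date(2003, 10, 24), explicit=True)
some_point = DateTimeRange(date(2003, 10, 1), date(2003, 10, 31))
seasonal = DateTimeRange(None, None, first_day_of_year=91, last_day_of_year=120)
```

Without the flag, the first two records would be indistinguishable, which is the ambiguity the proposed field is meant to remove.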

Identification / Name types – Reusability of these in the Descriptive Data standard and in the forthcoming Names and Concepts standard must be discussed. Where possible, these standards should at least use the same element definitions.

Alternative hierarchies for Collection and Units – The current schema was arbitrarily structured to represent a hierarchy Dataset – Unit – collection site of the unit. Entire communities exist where the data reverses the two latter items (the first DTD provided by Charles Copp in Sydney included a choice between the two hierarchies). Putting field data in some way parallel to the unit data type would support reuse of the unit and the field data type in both places. It must be further discussed whether it is possible to rearrange the unit subtype so as to exclude gathering event and location, and to include a choice that allows units to be treated either as multiple items from one gathering site or with one gathering site per unit.
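The two hierarchies carry the same information and differ only in nesting, as the following Python sketch suggests. The dictionary keys and sample values are hypothetical and serve only to show the inversion from "one gathering site per unit" to "multiple units per gathering site".

```python
from collections import defaultdict


def group_by_site(units):
    """Turn unit-centred records into a site-centred structure."""
    sites = defaultdict(list)
    for unit in units:
        sites[unit["site"]].append(unit["unit_id"])
    return dict(sites)


# Unit-centred view: every unit repeats its gathering site.
units = [
    {"unit_id": "U1", "site": "Site A"},
    {"unit_id": "U2", "site": "Site B"},
    {"unit_id": "U3", "site": "Site A"},
]
```

If unit and field data were defined as parallel, reusable types, both nestings could be expressed with the same two building blocks.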

Additions to the standard

It is relatively simple to add individual elements to existing structures. For example, an element for distribution of duplicate specimens needs to be added in the HerbariumUnitType. Furthermore, we are expecting additional domain-specific substructures under UnitCollectionDomain. It is relatively simple to extend current implementations to include such additions. However, changes to the overall structure will most probably not be picked up by the current reference implementation. In any case, a procedure for versioning needs to be implemented for the standard.

Standard process, publication, revisions

ABCD should be moved forward in the direction of a TDWG standard. The approach would be to stamp a version and put it into a state - the draft standard, then a proposed standard, and then the published standard. In the TDWG process, 60 days of review, comment, and debate are required before a vote to move from one state to another.

The issue is the organization of the work on it. One approach would be to open a new site for it on SourceForge, named TDWG Schemas or something similar. However, in view of the close collaboration between GBIF and TDWG, the appropriate thing to do may be to set up a GBIF-sponsored TDWG standards repository. CIRCA (the GBIF communications and repository tool) is not considered optimal for this purpose. The GBIF DADI programme officer indicates that it would be possible to have a CVS repository at the GBIF site. The group recommends setting it up.

Can we recommend proposing ABCD as a TDWG standard? It is presently at the draft standard state. As pointed out above, it needs further work done on it and documentation added. It also urgently needs an introduction for people who come new to it. People looking into it in detail need answers. One possibility would be a joint publication in the group to force collaboration.

Implementation issues

GBIF is currently building Java code to interpret DiGIR/DC and BioCASE/ABCD records from providers, particularly to get taxon names into a common form. That software will be freely available. The long-term perspective is to completely encapsulate the entire dataset.

Walter Berendsohn, Nov. 24, 2003.
Thanks to Mickaël Graf, Donald Hobern, Chuck Miller, Markus Döring, and Charles Copp for taking notes during the session. Thanks to Bob Morris for clarifying two XML issues (empty content in mandatory elements and use of content in schema annotation in applications).

Page hosted by the Department of Biodiversity Informatics and Laboratories of the Botanic Garden and Botanical Museum Berlin-Dahlem.
Page editor: Walter Berendsohn (w.berendsohn [at] bgbm.org).

This page last edited on 06.03.2005