TDWG Subgroup on Accession Data


Gregor Hagedorn on HISPID and the Accessions Standard
(in reply to convenor's report 1 and K. Savov)

HISPID and flat-file vs. relational

My most general critique of HISPID is that although data groups are distinguished, these do not relate to any meaningful, independent objects in the real world (entities or superentities), but are merely classifications of attributes.

As HISPID says, "It was agreed that the interchange format would be a flat file". The contention that this was decided because it would require minimum programming for import and export should be contested. In my experience, it is next to impossible to import flat-file data into a relational database without extensive revision of the data. Flat file is NOT the lowest common denominator; it only works well for flat-file systems.

Obviously, this is true the other way round as well. I therefore do not propose that data should be exchanged in a fully normalized structure. Yet those parts which relate to real-world objects, clearly distinguishable and understandable outside of the data modelling or implementation context, NEED to be specified separately in a data exchange format. They will be separate in any system, regardless of whether a partly relational, partly hierarchical, or a fully relational database model is chosen.
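To illustrate the difference, the following Python sketch contrasts a single flat-file record with the same information split into objects for the independently meaningful entities (collector, gathering event, taxon name). All field names and identifiers are hypothetical and not taken from HISPID.

    # A hypothetical flat-file exchange record: everything about one accession
    # is packed into a single row, so neither the collector nor the taxon name
    # can be referenced from any other record.
    flat_record = {
        "accession_number": "B 10 0012345",
        "collector_name":   "Smith, J.",
        "collection_date":  "1995-06-14",
        "locality":         "Brandenburg, near Potsdam",
        "taxon_name":       "Carex nigra",
    }

    # The same information with the independently meaningful objects separated.
    # Each object carries its own identifier and can be shared by many records.
    collector = {"id": "P0001", "name": "Smith, J."}
    taxon     = {"id": "T0001", "full_name": "Carex nigra"}

    gathering = {
        "id":           "G0001",
        "collector_id": collector["id"],   # pointer to the collector object
        "date":         "1995-06-14",
        "locality":     "Brandenburg, near Potsdam",
    }

    accession = {
        "accession_number": "B 10 0012345",
        "gathering_id":     gathering["id"],  # pointer to the gathering event
        "taxon_id":         taxon["id"],      # pointer to the taxon name
    }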

In my experience, since the further separation of items (relational normalization) is of a technical, implementation-oriented nature, THIS process can either be automated (where natural primary keys exist, denormalized data can be combined into single records) or ignored without too much concern (i.e. data are duplicated although they should be combined).

Which of these two options applies (in regard to a full model like CDEFD) should be specified. Compatible data models with the same normalization structure would be able to exchange data losslessly by adding their internal primary keys to the exchange data, thus allowing the recombination of duplicated information.
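A minimal sketch of such a recombination, assuming the exporting database includes its internal primary key (here a hypothetical "pub_key") in the otherwise denormalized exchange records:

    # Denormalized exchange records: the same publication is repeated in every
    # record that cites it, but because the exporter has included its internal
    # primary key ("pub_key"), the importer can collapse the duplicates again.
    exchange_records = [
        {"unit": "U1", "pub_key": "PUB-17", "pub_citation": "Example Flora 5: 290. 1980"},
        {"unit": "U2", "pub_key": "PUB-17", "pub_citation": "Example Flora 5: 290. 1980"},
        {"unit": "U3", "pub_key": "PUB-23", "pub_citation": "Example Journal 25: 59. 1995"},
    ]

    publications = {}   # one record per distinct publication
    units = []          # unit records refer to publications by key only

    for rec in exchange_records:
        publications.setdefault(rec["pub_key"], {"citation": rec["pub_citation"]})
        units.append({"unit": rec["unit"], "pub_key": rec["pub_key"]})

    print(publications)
    # {'PUB-17': {'citation': 'Example Flora 5: 290. 1980'},
    #  'PUB-23': {'citation': 'Example Journal 25: 59. 1995'}}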

Complexity

The data structure of any interchange format will be very complex. It is therefore most important to define broad data areas which can be specified without reference to an explicit data model. These data areas should be separated by well-described interfaces, allowing anything beyond the interface to be treated as a black box. This black box needs to have an intuitive meaning.

Ideally, the interface should consist of a single attribute (an object identifier sensu Berendsohn), namely the primary key of the data object in question. Such a key can always be a single number, a "pointer" to other information.

In the context of global data exchange, the issue arises that these identifiers must be unique across all relevant database systems. The database systems must have an a priori way of determining whether a key is suitable or not; it should not be necessary to do any arbitration, e.g. over the internet or in some other on-line way. Obviously, it is possible to administer blocks of numbers, and in the case of quadruples (8-byte cardinals) the chance that even random numbers collide is minimal.
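As a rough, purely illustrative sketch of why 8-byte random keys are unlikely to collide, the following Python code generates such identifiers and estimates the collision probability with the usual birthday approximation:

    import secrets

    def random_object_id():
        """Generate an 8-byte (64-bit) random identifier."""
        return secrets.randbits(64)

    def collision_probability(n_keys, key_space=2.0 ** 64):
        """Birthday-problem approximation: P(collision) is roughly n^2 / (2 * space)."""
        return (n_keys * n_keys) / (2.0 * key_space)

    print(hex(random_object_id()))            # e.g. 0x9f3a1c77d204be51
    print(collision_probability(10_000_000))  # about 2.7e-06 for ten million keys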

Yet artificial numeric keys do not satisfy two other requirements which I would like to see met.

1. Data relating to the same real-world object, e.g. a publication, should be recognized as such. With artificial keys, every occurrence of the same publication will be treated as a separate record, and duplicates will arise. If identical information is conveyed, the key should be identical.

2. For the user interface, a human-legible version needs to be presented. Although longer strings take up more space than compact numbers, such strings can be immediately understandable. The waste of storage should not be an issue (usually it will be 8 bytes per record versus approx. 20-40 bytes on average). In any event, it could be treated as an implementation-dependent question.

Thus the question of what these interface pointers should look like should be discussed in Toronto. I propose the use of calculated strings, which could default to the string representation of an artificial key (I use a coded date/time plus a 5-digit random number, which I consider preferable to a random number alone), and which could be replaced by a natural primary key wherever meaningful. If the rules for the calculation of this key are defined, duplicates across all databases using the same rules would be recognized.
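A minimal sketch of such a calculated key, assuming a coded date/time plus a 5-digit random number as the default, with a natural primary key overriding it wherever one is meaningful; the exact coding shown here is only an illustration and would have to be agreed upon:

    import random
    from datetime import datetime, timezone

    def calculated_key(natural_key=None):
        """Return the natural primary key where one is meaningful, otherwise a
        default artificial key built from a coded date/time plus a 5-digit
        random number."""
        if natural_key:
            return natural_key
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        return "%s-%05d" % (stamp, random.randint(0, 99999))

    # A publication can use a recognizable natural key, so two databases
    # exporting the same publication produce the same identifier ...
    print(calculated_key("Example Journal 25: 59. 1995"))

    # ... while an object without an obvious natural key falls back to the
    # artificial date/time + random-number form.
    print(calculated_key())   # e.g. "19961007143210-04821"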

Verification of implementations

As correctly proposed by K. Savov, it is necessary to propose ways to confirm that a given implementation conforms to the data model endorsed by TDWG.

I propose the use of example data sets and example transactions. These should NOT come from database developers, but from applied scientists intending to use the database system. They should develop model data and model transactions which are complex (not necessarily the rare bird, but the complexity encountered regularly) and which are vital to their work.

I do not think it is possible to verify an implementation of such a complex data model completely. With the proposed approach, however, the verification would ensure that the core parts, i.e. the vital functionality, are available in the implementation.
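A sketch of what verification by example data sets and example transactions might look like in practice; the transaction format and all names are hypothetical:

    # Each example transaction pairs input data prepared by the applied
    # scientists with the result they expect; an implementation passes if it
    # reproduces all expected results for these vital operations.
    example_transactions = [
        {
            "description": "record a new gathering with two collectors",
            "input":    {"collectors": ["Smith, J.", "Meyer, K."], "date": "1995-06-14"},
            "expected": {"collector_count": 2},
        },
        {
            "description": "re-determine an existing specimen",
            "input":    {"unit": "U1", "new_name": "Carex nigra"},
            "expected": {"determination_count": 2},
        },
    ]

    def verify(implementation, transactions):
        """Run every example transaction against an implementation (a callable
        taking the input and returning a result) and report the failures."""
        failures = []
        for t in transactions:
            if implementation(t["input"]) != t["expected"]:
                failures.append(t["description"])
        return failures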


Contact: Walter G. Berendsohn, subgroup convener, wgb@zedat.fu-berlin.de. This page last updated Oct. 7, 1996