Resource Identification for a Biological Collection Information Service in Europe
Results of the Concerted Action Project


Computerizing and networking biological collection data

Linda Olsvig-Whittaker and Walter G. Berendsohn

Pp. 5-12  in: Berendsohn, W. G. (ed.), Resource Identification for a Biological Collection Information Service in Europe (BioCISE). - Botanic Garden and Botanical Museum Berlin-Dahlem, Dept. of Biodiversity Informatics.

Capture of biological collection data

Over the last two decades, curators in natural history museums, as well as those responsible for ecological mapping or monitoring projects, have started databasing their collection and observation information. While databases were first used as a tool for collection management, in recent years awareness has grown that this digitised information is a most valuable source for biodiversity research. From the data provider's point of view, a tendency "from IT for audit and accountability to IT for access" (MDA 1997) has become obvious. In view of accelerating environmental changes, loss of biodiversity, and a pressing need for fast decisions in environmental politics, this information base must be utilized to the greatest possible extent. Collection information is of significant importance not only for the scientific community, but also for the whole of society (see Lane 1998). Databasing and facilitating access to collection and observation data has been recognized as one of the priorities of the OECD Megascience Forum GBIF initiative (Edwards 1999).

However, most biological and ecological information is not yet accessible on-line and therefore not fully useful for the community (OECD 1999), and in many cases not even available in electronic form at all. Right now, a variety of programs and initiatives form distributed networks of biological information, and show the degree to which information technology can serve as a tool for mastering the information mountain (Anderson 1998, Scott 1998); for an overview, see BIOSIS (1999).

BioCISE: The Vision:
A common electronic access system facilitating queries across the hundreds of millions of specimens and monitoring or mapping records held by institutions, projects and individual researchers in the EU and partner countries.

Data capture in existing collections

Due to technical advances, the digitisation of even huge collections and printed survey data has become feasible in principle, and the overall cost of digitisation is relatively small compared to the costs of gathering and maintaining specimens. The collections community generally agrees that computerisation can and should be achieved. However, laboratories at present often lack the resources even to digitise the data for newly acquired units, let alone to update their databases for older parts of their collections.

Additional difficulties arising in the digitisation of older stock relate to aspects of data quality. Historical collections may carry very little accurate geospatial information, while survey data records are rarely vouchered, thus not allowing verification of taxonomic identifications. Standard geospatial information for historical (and even some recent) collections is necessary to analyse these important information resources with modern methods. However, vague locations and casual observations often characterize older collection data, whereas records resulting from modern official surveys or comprehensive scientific research programs usually provide clear polygon or point-based data. The conversion of older locality information into standard geospatial data is error-prone for various reasons: location names may no longer exist, may have changed their circumscription, or may have been applied to more than one area in a certain region; national map grids may have been used that do not correspond with the general system of latitudes and longitudes; or geographical information may be very general or even lacking.

Generally, raw data from historic and many recent collections and surveys may be of low accuracy and thus cannot be used directly for analysis. Evaluating and - to some extent - standardizing the data greatly enhances their usability, but it must be done with experts and may require further checks and research, especially with respect to historic units. It is obviously not feasible to check each record individually; hence techniques must be developed to automatically identify "suspect" record sets (e.g. outliers) for expert re-examination (Chapman 1992). For natural history specimens, the original information should be made accessible together with the interpreted data to allow re-examination of labels, for example. Digital pictures of specimens, labels, collector's field books, accession ledgers, etc. provide a new possibility to achieve this, with the advantageous possibility of de-coupling the capture of quality data from the location and handling of the specimen itself (Berendsohn 1999b).
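The automatic identification of suspect records mentioned above could be sketched as follows. This is a minimal illustration in Python and not the method of Chapman (1992): it flags records whose coordinates lie far from the median position of the other records of the same taxon, measured in median absolute deviations. The function name, the record layout, the threshold and the sample records are all assumptions made for this example.

```python
from statistics import median


def flag_suspect_records(records, threshold=5.0, min_records=4):
    """Return the records whose coordinates are outliers within their taxon.

    records: list of dicts with the (hypothetical) keys 'taxon', 'lat', 'lon'.
    A coordinate is suspect when it lies more than `threshold` median absolute
    deviations (MAD) from the median of that taxon's records. Flagged records
    are candidates for expert re-examination, not for automatic correction.
    """
    by_taxon = {}
    for rec in records:
        by_taxon.setdefault(rec["taxon"], []).append(rec)

    suspects = []
    for group in by_taxon.values():
        if len(group) < min_records:  # too few records to judge
            continue
        flagged = set()  # positions within the group, to avoid duplicates
        for key in ("lat", "lon"):
            values = [rec[key] for rec in group]
            centre = median(values)
            mad = median([abs(v - centre) for v in values])
            if mad == 0:  # at least half the values equal the median; MAD is unusable
                continue
            for i, value in enumerate(values):
                if abs(value - centre) / mad > threshold:
                    flagged.add(i)
        suspects.extend(group[i] for i in sorted(flagged))
    return suspects


# Invented records: the last latitude carries a sign error.
records = [
    {"taxon": "Taxon A", "lat": 52.1, "lon": 13.1},
    {"taxon": "Taxon A", "lat": 52.3, "lon": 13.3},
    {"taxon": "Taxon A", "lat": 52.2, "lon": 13.2},
    {"taxon": "Taxon A", "lat": 52.4, "lon": 13.4},
    {"taxon": "Taxon A", "lat": 52.5, "lon": 13.5},
    {"taxon": "Taxon A", "lat": -52.3, "lon": 13.3},
]
suspects = flag_suspect_records(records)
```

A real service would probably combine several such checks (for example against known range maps or country boundaries); the point here is only that the records are selected for re-examination rather than changed.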

Collection Information System
Biological collection information systems provide information on the single "unit", i.e. the individual specimen or observation. Not directly part of such a system is synthesised information such as species-related information, e.g. information on migratory species, resistance to heavy metals, etc. The main data areas of a biological collection information system thus consist of:
  • Gathering event (who collected, when, in what context)
  • Gathering site (location, geographical and ecological features)
  • Determination (taxon, who did it)
  • Unit information (age, stage, sex, description, status as nomenclatural type, etc.)
  • Unit (specimen) location (laboratory, duplicate distribution)
  • Unit (mostly specimen) management (catalogue and accession number, preservation, storage, amount of material)
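
The data areas listed above can be pictured as a small data model. The following Python sketch is only one possible reading of the list; all class and attribute names are assumptions chosen for illustration, and a production model (see Chapter III) would be considerably more detailed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GatheringEvent:
    collectors: List[str]
    date: Optional[str] = None       # kept as text, as written on the label
    context: Optional[str] = None    # e.g. expedition or survey project


@dataclass
class GatheringSite:
    locality: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    ecology: Optional[str] = None


@dataclass
class Determination:
    taxon_name: str
    identified_by: Optional[str] = None


@dataclass
class Unit:
    """A single specimen or observation, the core of the information system."""
    event: GatheringEvent
    site: GatheringSite
    determinations: List[Determination] = field(default_factory=list)
    # Unit information
    sex: Optional[str] = None
    stage: Optional[str] = None
    is_nomenclatural_type: bool = False
    # Unit location and management (mostly relevant for specimens)
    holding_institution: Optional[str] = None
    accession_number: Optional[str] = None
    preservation: Optional[str] = None


# Placeholder values, not a real specimen.
unit = Unit(
    event=GatheringEvent(collectors=["A. Collector"]),
    site=GatheringSite(locality="Example locality"),
)
unit.determinations.append(Determination(taxon_name="Bellis perennis L."))
```

Keeping determinations as a list reflects the fact that a unit may be re-identified several times without losing earlier opinions.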

In user workshops and interviews, opinions differed widely on whether public access should be given to raw data. On one hand, raw data may be misleading and may even damage the reputation of the data provider; on the other hand, the decision of which data to use should be left to the users. Raw and validated data should be marked to be clearly distinguishable. Additional information on performed data validation becomes crucial, especially where datasets from a number of different sources are combined in a common access system. However, in many cases certain items in the raw data (such as the name of the collector and / or identifier) may help the informed user to assess data quality, without further effort on the provider's side. In survey records, the possibility to make raw data such as pictures or sound records available to the user will allow for the checking of taxonomical identifications, in some ways analogous (though not equal) to the citation of specimens.

Ease of capture vs. data quality

Progress in data capture has been greatly impeded by a lack of adequate software. The complexity of collection information (see Chapter III) was often underestimated, leading to much duplication of effort and to inadequate software solutions, which are often maintained by research scientists and curators. Those not involved in the development of the software often regarded databasing as an additional complication of the curator's task, sometimes even as an impediment to research, competing for scarce resources. An argument often heard against some of the advanced systems, which are only now becoming available, is that they are "too complicated". There is some truth in this point of view: a database application will always be less straightforward to handle than a word processor, since the information must be structured and data quality is usually at least partly checked at the point of entry. However, this is also the strength of data capture in a database: information quality is usually higher, and the information can be linked to other sources. The more atomised the data areas are and the more "fields" a user has to fill in, the easier it is to pursue aims like standardization, error tracking, and linking. Fortunately, the power of today's desktop computers increasingly allows interactive functions to be implemented that ease the task of user input. Locality data capture from on-screen maps is one example. Another is the input of scientific names etc. as text in a single "field", which is parsed into its atomic data elements, processed in the background and (if necessary) corrected by feedback mechanisms.
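
The single-field input of scientific names might work along the following lines. This is a deliberately simplified Python sketch: the pattern only covers a genus, a species epithet, one optional infraspecific rank and an optional author string, and it ignores hybrids, cultivars and many other cases that a real parser has to handle. The function and key names are assumptions.

```python
import re

# Genus, species epithet, optional infraspecific part, optional author string.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"\s+(?P<epithet>[a-z][a-z-]+)"
    r"(?:\s+(?P<rank>subsp\.|var\.|f\.)\s+(?P<infra_epithet>[a-z][a-z-]+))?"
    r"(?:\s+(?P<author>.+))?$"
)


def parse_scientific_name(text):
    """Split a name string into its atomic elements.

    Returns a dict containing only the elements that were found, or None if
    the text does not fit the simplified pattern. A None result is where a
    feedback mechanism would ask the user to correct the input.
    """
    normalised = " ".join(text.split())  # collapse stray whitespace
    match = NAME_PATTERN.match(normalised)
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}


simple = parse_scientific_name("Bellis  perennis L.")
infraspecific = parse_scientific_name("Beta vulgaris subsp. maritima (L.) Arcang.")
rejected = parse_scientific_name("bellis perennis")
```

The user types one string, while the database still receives separate elements that can be checked against name catalogues.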

At least for new collections, the problem of geographical locality input is increasingly eased by the availability of accurate co-ordinate data captured in the field using Global Positioning Systems. Where co-ordinates are not available, on-line gazetteers are evolving into valuable aids for data input (see Berendsohn 1999c). The input of taxonomic data (names) is also becoming less error-prone due to the availability of catalogues on the WWW (e.g. The Plant Names Project 1999, Farr & Zijlstra 1999, IOPI 1999).

Promises of networking

When talking about collection information systems, we start by thinking of an individual database of a floristic mapping project or a museum's collection. Even within a single database, applying the right tools to the data collection can give new insights. Simple analyses can provide geographical distribution patterns of species and population dynamics from original data. Sometimes biological data can be combined with records of environmental parameters, climate data, and taxon information for more complex statistics. Very often, though, databases are restricted to a special field of application: mapping project databases seldom include specimen information but usually just observation records, while museum collections are typically restricted to physical object data.

As queries range across wider organism groups, time, and geo-ecological criteria, answers are sought not only by combining information from different fields within one database, but also by integrating contents from a number of distinct databases in one information system. Interoperability of different collection information systems, and possibly a common search interface, are the natural extensions of combined queries.

Drawing together information from different sources is well worthwhile: common access to information now stored in dispersed, autonomous, and heterogeneous databases could considerably enlarge the information content, not only by expanding the number of accessible records, but also by adding value through new insights. Some examples demonstrate this:

Tracing of toxic agents. Decreases in some bird populations, caused by a thinning of eggshells and first observed in the field, could be traced back to the widespread use of DDT through analyses of egg collections in natural science museums (Duckworth et al. 1993). Similarly, dynamics in the concentration of heavy metals in the environment can be inferred from analyses of hair samples taken from museum objects. Given sufficient material to establish a time series, such data can be used to reconstruct the history and causes of species decline.

Biodiversity assessments in developing countries. Natural history collections in Europe hold huge amounts of data on the biodiversity of developing countries in the form of specimens and associated data (labels etc.). In these countries, which often have high alpha and beta diversity, the combination of such data with local information and expertise can be put to work for a variety of tasks, ranging from defining priority areas for conservation measures to research planning (Soberón & al. 1996).

Analysis of environmental change. Since higher plants adapt to different atmospheric concentrations of CO2 by adjusting the density of stomata on their leaves, this feature may be exploited to reconstruct changes in atmospheric composition. If some caution is applied in the interpretation, taking into account other possible influences, data and material collected over a period of time at a given location allow conclusions to be drawn on the development of the CO2 concentration (Sieders 1998, Wagner 1998).

Another example of environmental parameter reconstruction comes from the aquatic sector: analysing the community composition of diatoms, a group of unicellular, shell-bearing algae, is a well-established tool for inferring a number of hydrochemical and hydrophysical properties of water bodies, including acidity and nutrient availability. Because the silica shells are highly resistant to dissolution, a long timeline can be assembled from recent water samples combined with analyses of (sub-)fossil communities in sediment cores (Battarbee 1981, van Dam & Beljaars 1984, van Dam 1996, Juggins et al. 1996). Samples to be included in such an investigation can also be gathered from aquatic macrophytes in historical or recent herbarium collections (van Dam & Mertens 1993, ter Braak & van Dam 1989).

Chapman (1992) points out that a visualisation of environmental parameters (e.g. climatic profiles) may help to indicate apparently suitable locations for a species. Though this approach alone does not allow accurate predictions of the distribution of species, in combination with actual observation records it could be used to focus survey efforts on areas with a higher probability of finding additional populations, or to confirm predicted distribution boundaries. Information combined from several databases might form the basis for modelling biological interactions, for example how changes in bird distribution are likely to affect their role in pollination or seed dispersal, and what the consequences will be for the vegetation. This leads up to the use of a computerised information system in biodiversity research: "The more detailed knowledge is available, the more we can begin to ask complex questions" (Blackmore 1998).

Problems arising from combining information from different sources

When datasets from different databases are drawn together, some problems arise apart from the purely technical difficulties of interconnecting highly heterogeneous sources. We are facing four major problem areas:

Heterogeneous data definitions

How can datasets be combined that use a variety of different reference systems in geography (point data, several grid nets, polygon data, etc.) and taxonomy (classification schemes and taxon names varying widely with time of identification and opinion), to name but the two most important areas? These aspects will have to be solved by a general approach when setting up a collection information service: different reference systems in collection databases will have to be taken as they are and cross-referenced in a background structure (thesaurus and other representations and mappings of differing concepts, see Chapter XI).
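
For taxon names, the idea of leaving source data untouched and cross-referencing it in a background structure can be sketched as below. This Python example is a strong simplification: it treats the thesaurus as a plain mapping from name strings to a shared concept key, whereas real taxon concepts may overlap only partially. The class, the function and the concept key "tomato" are assumptions; the two names used are a well-known pair of synonyms.

```python
class NameThesaurus:
    """Background structure linking names used in source databases to concepts."""

    def __init__(self):
        self._concept_of = {}

    def add_concept(self, concept_id, names):
        for name in names:
            self._concept_of[name.lower()] = concept_id

    def concept_for(self, name):
        """Return the concept key for a name, or None if it is not covered."""
        return self._concept_of.get(name.lower())


def group_by_concept(records, thesaurus):
    """Group records by concept; the records themselves are left unchanged."""
    grouped = {}
    for rec in records:
        concept = thesaurus.concept_for(rec["name"])
        key = concept if concept is not None else ("unresolved", rec["name"])
        grouped.setdefault(key, []).append(rec)
    return grouped


thesaurus = NameThesaurus()
thesaurus.add_concept("tomato", ["Solanum lycopersicum", "Lycopersicon esculentum"])

# Invented records from two hypothetical source databases.
records = [
    {"db": "herbarium", "name": "Lycopersicon esculentum"},
    {"db": "survey", "name": "Solanum lycopersicum"},
    {"db": "survey", "name": "Bellis perennis"},
]
grouped = group_by_concept(records, thesaurus)
```

Names that the thesaurus does not cover are kept in an "unresolved" group instead of being dropped, so that the gap is visible to the user.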

Data quality

How can data quality be measured and indicated? This is a multi-faceted and multi-layered problem. Ideally, every data item should be traceable as to its source, the changes made, and the people who handled these processes. However, following this through would probably overburden any system, let alone a large network of systems. For catalogues of collections providing commercial materials, strict standardization of procedures may lead to far-reaching quality control and ensuing reliability (see Chapter VI). However, for most non-commercial collections, assembled in the process of scientific research in a variety of fields and on numerous subjects, this type of quality control would certainly not be achievable. Moreover, such measures cannot be applied to historical materials, which are undoubtedly a most valuable resource of natural history collections. Traditionally, personal data such as the collector's and the identifier's name are often the expert's criterion for assessing data reliability. However, large-scale publication of personal data, from which, e.g., itineraries can be reconstructed, may not be welcomed by all persons in question, and may even be illegal. A kind of peer review process for collection data, i.e. having the data scrutinized before publishing them, is also not feasible, because the expert personnel resources this would take are completely out of the question. However, the current mechanism of annotating specimens in natural history collections could be extended to annotate data on the network, thus providing additional information to the user.
In the short term, we need to use mechanisms to describe entire datasets as accurately as possible, so as to pass on information from which users may be able to judge the reliability of the data. Very useful first steps in this direction are standardized descriptions of the content of "documents" (in this case: databases). A set of such attributes has been published as the "Dublin Core" (Anon. 1998), which has been achieved by broad international consensus. Further standardization of the structure of individual databases participating in the network is also significant for data quality, because it facilitates an exact definition of the information content of individual attributes. Defining a set of attributes as a common denominator for access to data on individual specimens in natural history collections is a question that has been tackled by the ZBIG project in the United States (Vieglas 1999) and, more recently, by the ENHSIN project in Europe. (The European Natural History Specimen Information Network is an EU-financed project in which major European collection holders have united to solve problems related to common data access. ENHSIN goes back to an initiative by CETAF - the Consortium of European Large Scale Taxonomic Facilities - and BioCISE.) These efforts will hopefully result in a common convention, probably under the auspices of the IUBS Commission for Taxonomic Databases (TDWG).
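
A dataset-level description using the Dublin Core elements might look like the following Python sketch. The fifteen element names are those of the Dublin Core element set; the helper functions and every value in the example are assumptions made for illustration and do not describe an actual database.

```python
# The fifteen elements of the Dublin Core element set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def describe_dataset(**values):
    """Build a dataset description restricted to Dublin Core elements."""
    unknown = sorted(set(values) - set(DUBLIN_CORE_ELEMENTS))
    if unknown:
        raise ValueError("not Dublin Core elements: " + ", ".join(unknown))
    # Keep the canonical element order for easier comparison of descriptions.
    return {name: values[name] for name in DUBLIN_CORE_ELEMENTS if name in values}


def missing_elements(description):
    """List the elements a provider has not (yet) filled in."""
    return [name for name in DUBLIN_CORE_ELEMENTS if name not in description]


# Hypothetical description of a survey database.
description = describe_dataset(
    title="Example floristic mapping database",
    creator="Example mapping project",
    coverage="Example region, 1980-1995",
    description="Unvouchered observation records; identifications not validated",
    rights="Source must be cited",
)
```

Even such a short description tells the user whether the records are vouchered and validated, which is the kind of information needed to judge a dataset as a whole.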

Sensitive data

Advantages of access to a high number of detailed datasets notwithstanding, there are also some sensitive data to which access restrictions have to be closely observed. This includes the personal data of collectors, identifiers and other people associated with collection and evaluation, but also information on, e.g., endangered species or research in progress. Again, this touches aspects of general policy decisions to be made in the installation of an information service (who shall be given access to what data?), and autonomous decisions of the data owners, to hold back certain information (datasets or parts of them) from publication.
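
How such restrictions could be applied when a record leaves the provider's database is sketched below in Python. The policy shown (withholding personal fields, and for restricted taxa withholding the locality name and rounding the coordinates) is an assumption for this example, as are the field names; the actual rules would be set by the data owners and by the policy of the information service.

```python
PERSONAL_FIELDS = ("collector", "identified_by")


def public_view(record, restricted_taxa, decimals=1):
    """Return a copy of `record` that is considered safe to publish.

    - Personal fields are withheld entirely.
    - For taxa in `restricted_taxa`, the locality name is withheld and the
      coordinates are rounded to `decimals` decimal places.
    The original record is not modified.
    """
    view = {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}
    if record.get("taxon") in restricted_taxa:
        view.pop("locality", None)
        for key in ("lat", "lon"):
            if view.get(key) is not None:
                view[key] = round(view[key], decimals)
    return view


# Invented record.
record = {
    "taxon": "Taxon A",
    "collector": "A. Collector",
    "identified_by": "B. Expert",
    "locality": "Example valley",
    "lat": 52.4567,
    "lon": 13.3012,
}
restricted = public_view(record, restricted_taxa={"Taxon A"})
unrestricted = public_view(record, restricted_taxa=set())
```

Because the filter works on a copy, the full record remains available to the data owner and to users with wider access rights.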

Intellectual property rights

Above all with survey data, one reservation is often stated more or less openly: "If we make our observation data publicly available, there is nothing left - our data are all we have." Apart from the fact that most field surveys are publicly funded, the cost of contributing to a common knowledge base could be far outweighed by the increase in information to be derived from it, providing dense time-series through the constant accumulation of new data - an activity which will remain an essential part of environmental research. Protection of the (often considerable) investment information providers have made in their databases was discussed along several lines during the BioCISE workshops: outright sale of the data to a national agency is one possibility; a "Join the Club" approach could be another, with all members profiting from access to the others' data. The idea of federated databases, i.e. several databases that continue to be held and maintained by their providers but are accessible in a common system, is another option.
In any case, a common access system must adequately recognize the respective origin of all data, and oblige every user to cite correctly the sources used. Collections represent an enormous knowledge base on global biodiversity. This knowledge has been produced by human beings who work for or in research institutes or companies, or by motivated individuals. In natural history collections, the materials and data gathered stem from lands belonging to people under a certain local and national jurisdiction. The scope of problems that may relate to collections being published on networks is enormous. For example, how can benefits derived from using the system be properly shared with the countries of origin (Biodiversity Convention)? How can authors of individual data items, such as the person identifying a certain specimen, be properly credited for their work? How can collection institutions ensure their stake in the information made available?
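
The combination of federated databases with recognition of data origin can be illustrated by a small Python sketch. Each provider keeps its own data and merely exposes a function that returns records; the common system labels every returned record with its source and can list the providers a user has to cite. Provider names, record fields and function names are assumptions made for this example.

```python
def federated_query(providers, predicate):
    """Query all providers and label every hit with its origin.

    providers: dict mapping a provider name to a function that returns
               that provider's records (held and maintained by the provider).
    predicate: function deciding whether a record matches the query.
    """
    results = []
    for name, fetch in providers.items():
        for record in fetch():
            if predicate(record):
                results.append({**record, "source": name})
    return results


def sources_to_cite(results):
    """Return the providers whose data appear in a result set."""
    return sorted({record["source"] for record in results})


# Two hypothetical providers with invented records.
providers = {
    "Museum X": lambda: [{"taxon": "Taxon A", "year": 1902}],
    "Mapping project Y": lambda: [
        {"taxon": "Taxon A", "year": 1995},
        {"taxon": "Taxon B", "year": 1996},
    ],
}
hits = federated_query(providers, lambda rec: rec["taxon"] == "Taxon A")
citations = sources_to_cite(hits)
```

Attaching the source to each record at query time means that the attribution travels with the data, whatever the user does with the result set afterwards.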

These are largely unresolved questions, paralleled in many other fields where knowledge is being made accessible over the network. The ENHSIN (2000) project contains a work package dealing with these subjects. Finding pragmatic solutions is a priority, because IPR problems could considerably impede progress towards a common information service.


© BioCISE Secretariat. Email: biocise@, FAX: +49 (30) 841729-55
Address: Botanischer Garten und Botanisches Museum Berlin-Dahlem (BGBM), Freie Universität Berlin, Königin-Luise-Str. 6-8, D-14195 Berlin, Germany