Biodiversity knowledge network


Benefitting from the expertise of the whole global community

Progress: limited (needs further development)


Researchers in biodiversity have long had a culture of curating and annotating data — from identifying specimens to correcting and cleaning up entire downloaded datasets. These efforts are a key part of the data validation process: even with the best automated tools, identifying and correcting most errors still requires an expert, human eye. Yet these annotations are not always made available to the original data owners, and even when they are, there may be neither the resources nor the mechanisms in place to incorporate them. As a result, mistakes get replicated or have to be repeatedly corrected, duplicating effort, while there is little incentive for researchers to continue to correct and annotate records more widely.

Data aggregators generally encourage users to report mistakes. Several GBIF national nodes have developed data curation systems, including amateur networks that curate citizens’ observations, and the EU OpenUp! project includes a data quality toolkit for GBIF data. Some projects already use expert curation for aggregated data, for example the Encyclopedia of Life and the Fish Barcode of Life Initiative (FISH-BOL). Too often, however, these rely on ad hoc systems and demand extra effort from contributors, especially those who want to make corrections on many different sites, while data providers or publishers may not feel confident in trusting changes submitted over the Internet. The next step will be to agree with individual institutions and projects on how data cleanup efforts can be recognized and valued, putting incentives in place to ensure that annotations are made and fed back into the system. In combination with the fitness-for-use and annotation component, which considers the systems needed to integrate annotations into the data, this will be the first step towards making distributed data curation the norm.

In the short term, the priority should be developing a shared identity management system for contributors, whether professionals or citizen scientists, so that they can have a common identity and contribution history across platforms — particularly the key data networks and publishers.

In the medium term, key data networks will be able to trace any change back to the original contributor, and over time it will be possible to use metrics to value contributions automatically, based on the contributor’s past history.
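
As a rough illustration of what a history-based metric might look like, the Python sketch below attributes each annotation to a shared contributor identifier and computes a smoothed acceptance rate across platforms. Everything in it is an assumption made for illustration: the `Annotation` record, the `contribution_scores` function, the identifiers and the smoothing prior are hypothetical and do not describe any existing GBIF or aggregator system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One proposed correction, traceable to its contributor (hypothetical schema)."""
    contributor_id: str  # shared identity used across platforms (assumed)
    platform: str        # where the annotation was submitted
    record_id: str       # the record being corrected
    accepted: bool       # whether the data owner accepted the change


def contribution_scores(annotations, prior_accepted=1, prior_total=2):
    """Return a smoothed acceptance rate per contributor.

    The prior (an illustrative choice) keeps contributors with a very
    short history from receiving an extreme score of 0 or 1.
    """
    accepted = defaultdict(int)
    total = defaultdict(int)
    for a in annotations:
        total[a.contributor_id] += 1
        if a.accepted:
            accepted[a.contributor_id] += 1
    return {
        cid: (accepted[cid] + prior_accepted) / (total[cid] + prior_total)
        for cid in total
    }


# Example history spanning two platforms (made-up identifiers).
history = [
    Annotation("contributor-1", "platform_a", "rec-1", True),
    Annotation("contributor-1", "platform_a", "rec-2", True),
    Annotation("contributor-1", "platform_b", "rec-3", True),
    Annotation("contributor-1", "platform_b", "rec-4", False),
    Annotation("contributor-2", "platform_a", "rec-5", False),
]
scores = contribution_scores(history)
```

A real metric would probably also need to weigh the difficulty of each correction and the contributor's domain of expertise; the acceptance rate here is only the simplest possible starting point.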

In the long term, annotating data will become the norm and the curation of data will come to be considered a shared responsibility among the biodiversity community.