Monday, January 21, 2008

How TaxonRank works

Fossilized organisms are important tools in the study of stratigraphy, past climates and ecologies. However, the taxonomic classifications of organisms, and thereby their names, change frequently, and finding the correct synonymies for a given species is a considerable problem for non-taxonomists.
Computer-based knowledge management systems can help to make the existing wealth of taxonomic knowledge accessible and easier to interpret. TaxonRank is a ranking algorithm based on bibliometric analysis and Internet page ranking technologies. TaxonRank uses published synonymy list data stored in TaxonConcept, Snet's taxonomic information system.

Synonymy lists contain valuable taxonomic information. For decades, the Open Nomenclature notation has allowed authors to express uncertain classifications and to comment on other authors' identifications of a specimen. Synonymy lists contain occurrences of specimens in the literature that match the author's concept of a specific taxon, and so reflect the list author's taxonomic opinion on that taxon. Their highly formalized nature makes them ideally suited for information systems that analyze and describe relations between taxonomic concepts. Because synonymy lists contain references to other synonymy lists or to taxonomic descriptions, they represent a typical ontology.

The idea behind TaxonRank is that some authors' species identifications might have a stronger impact on a 'common taxon concept' than others. This can be the result of many factors, e.g. the quality of species illustrations, the reputation of the author or the availability of a publication. In analogy to PageRank we state that the rank of a synonymy list is determined by the ranks of the synonymies cited in that synonymy list.
The PageRank algorithm is based on the concepts and topology of the World Wide Web, so we first need to define 'pages' and 'links' between these pages.

To apply PageRank to synonymy lists, we define a synonymy list Si for a taxon t published by author i as the analogue of an Internet page. It contains an arbitrary number of pairs, each consisting of a synonymous name syn and the cited publication doc, listed by author i as l{syn, doc}. We call such pairs synonymy list entries.
The order of a synonymy list entry, o({syn, doc}, Si), is in turn defined by the publication year of the document containing the synonymy list.
A link from Si to Sj is present when the synonymy list entry l{syn, doc} exists in both Si and Sj, o({syn, doc}, Sj) > o({syn, doc}, Si), and syn is an element of P, i.e. a pair of synonym name and publication has previously been used by the author of an older publication.
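
The link rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SynonymyList` class, its field names and the placeholder entries are hypothetical and not taken from TaxonConcept, and the link direction follows the inequality literally (from the list with the lower order to the list with the higher order).

```python
from dataclasses import dataclass
from itertools import permutations


# Hypothetical container; field names are illustrative, not TaxonConcept's schema.
@dataclass(frozen=True)
class SynonymyList:
    author: str
    year: int            # publication year, used as the order o
    entries: frozenset   # set of (syn, doc) pairs


def find_links(lists):
    """Return (source, target) author pairs: a link Si -> Sj exists when
    both lists share at least one entry and o(Sj) > o(Si)."""
    links = []
    for si, sj in permutations(lists, 2):
        if si.entries & sj.entries and sj.year > si.year:
            links.append((si.author, sj.author))
    return links


# Placeholder data, not real taxonomic records.
lists = [
    SynonymyList("A", 1957, frozenset({("name_a", "doc_1")})),
    SynonymyList("B", 1979, frozenset({("name_a", "doc_1"), ("name_b", "doc_2")})),
    SynonymyList("C", 1999, frozenset({("name_b", "doc_2")})),
]
links = find_links(lists)
print(links)
```

If the order is meant to run the other way (newer lists pointing to older ones), only the comparison `sj.year > si.year` needs to be flipped.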

The set of all synonymy lists Sj of P is LSi, and the number of links from Sj is Nj. Further, any pair l{syn, doc} is defined as a synonymy list having itself as its only synonymy list entry. The distance distj between taxon tj and taxon ti is the distance between their nodes within the ontological graph network P; it determines the strength of a link l leading from Si to Sj, strength(Si, Sj) = 1/distj.
The SynonymyRank SR of a specific synonymy list Si is defined analogously to the PageRank algorithm and calculated recursively using equation (1):
(1)
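
Written out from the definitions in this post and its verbal description of the formula, equation (1) can be read roughly as follows. This is a reconstruction under assumptions, not a quotation: LSi is taken here as the set of lists linking to Si, and whether the original also contains a damping term or weights each summand by strength(Si, Sj) cannot be told from the text.

```latex
SR(S_i) \;=\; \sum_{S_j \in L_{S_i}} \frac{SR(S_j)}{N_j}
```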

The rank of a synonymy list Si is thus defined as the sum of the ranks of all synonymy lists Sj pointing to Si, each divided by the number of links on Sj.
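
A minimal sketch of this recursive calculation, assuming a fixed number of iterations. The function name `synonymy_rank` and its parameters are hypothetical. The damping term `(1 - d)` is borrowed from the classic PageRank formula and is an assumption; without some such base term, lists with no incoming links would all end up with rank zero.

```python
def synonymy_rank(links, damping=0.85, iterations=50):
    """Iterate SR(Si) = (1 - d) + d * sum(SR(Sj) / Nj) over all Sj linking to Si.

    links: iterable of (source, target) pairs.
    The damping factor d is an assumption taken from classic PageRank.
    """
    links = list(links)
    nodes = {n for pair in links for n in pair}
    out_count = {n: 0 for n in nodes}   # Nj: number of links leaving each list
    incoming = {n: [] for n in nodes}   # lists pointing to each list
    for src, dst in links:
        out_count[src] += 1
        incoming[dst].append(src)

    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Build the new ranks from the previous iteration's values.
        rank = {
            n: (1 - damping)
            + damping * sum(rank[s] / out_count[s] for s in incoming[n])
            for n in nodes
        }
    return rank


ranks = synonymy_rank([("A", "C"), ("B", "C"), ("C", "D")])
print(sorted(ranks.items()))
```

In this toy graph, A and B receive no links and keep the base rank, while C and D accumulate rank from the lists pointing to them.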

To calculate the rank of a specific taxon within a synonymy list, we included a pre-ranking derived from the Open Nomenclature notations used by both the synonymy list author and the citation author. In our ranking experiment we regard certain open nomenclature tags as indicators of confidence in a species identification and assign a scalar value to each tag.
This scalar value is used as a confidence factor for each species determination in a synonymy list and represents a measure of the taxonomic expert knowledge.
The rank of a taxon occurrence tik in a synonymy list is calculated as the product of this confidence factor and the synonymy list rank SR.
(2)
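
Read from the description in this post, equation (2) and the quantities derived from it can be written as below. The symbols c (confidence factor) and n (number of occurrences of taxon ti in list Sj) are notation assumed here for illustration; the original formula may be laid out differently.

```latex
TR(t_{ik}, S_j) = c(t_{ik}) \cdot SR(S_j), \qquad
TR(t_i, S_j) = \frac{1}{n} \sum_{k=1}^{n} c(t_{ik}) \cdot SR(S_j), \qquad
TR(t_i) = \sum_{j} TR(t_i, S_j)
```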

We can then calculate the rank of a taxon within a synonymy list, TR(ti, Sj), using equation (2) as the mean over all instances of the taxon, taking into account the confidence factor (e.g. 0.5 for open nomenclature tags such as cf, p, sp.).
In a first approach to determining the rank of a taxon as an element of P, we can calculate the total taxon rank as the sum of TR(ti, Sj) over all synonymy lists Sj.
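
The two steps, the mean within a list and the sum across lists, can be sketched as follows. The function names, the `CONFIDENCE` table and the sample ranks are hypothetical. Only the value 0.5 for tags such as cf, p and sp. comes from the text; the factor 1.0 for untagged identifications is an assumption.

```python
from statistics import mean

# Hypothetical confidence factors. Only the 0.5 for open nomenclature tags
# is from the text; 1.0 for an untagged identification is an assumption.
CONFIDENCE = {"": 1.0, "cf": 0.5, "p": 0.5, "sp.": 0.5}


def taxon_rank_in_list(tags, list_rank):
    """Mean of confidence * SR over all occurrences of one taxon in one list."""
    return mean(CONFIDENCE[tag] * list_rank for tag in tags)


def total_taxon_rank(occurrences, list_ranks):
    """Sum the per-list taxon ranks over every list in which the taxon occurs.

    occurrences: {list_id: [tag, ...]}, one tag per occurrence of the taxon
    list_ranks:  {list_id: SR of that synonymy list}
    """
    return sum(
        taxon_rank_in_list(tags, list_ranks[list_id])
        for list_id, tags in occurrences.items()
    )


# Placeholder values, not results from TaxonConcept.
list_ranks = {"S1": 2.0, "S2": 4.0}
occurrences = {"S1": ["", "cf"], "S2": ["sp."]}
total = total_taxon_rank(occurrences, list_ranks)
print(total)
```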

As an example, we calculated TaxonRank for Subbotina triangularis. The sizes of the circles reflect the ranks of the synonym candidate taxa. All highly ranked taxa plot in the cluster around the target species, which indicates that TaxonRank correctly identifies the most important taxa. Testing the quality of TaxonRank is simplified by the fact that the most recent literature has little influence on the rank of a specific taxon. And in fact we find all taxa for which a rank > 30 was calculated in the synonymy list of the Paleocene foraminifer working group (Olsson, 1999).

2 comments:

Anonymous said...

Very interesting!

Jens Klump said...

A preprint of this paper can be found at http://edoc.gfz-potsdam.de/gfz/display.epl?mode=doc&id=13007