Wednesday, January 30, 2008

TaxonConcept's Taxon Concepts

Currently, TaxonConcept's data exchange capabilities are quite limited. This is mainly because we have not been able to find an appropriate XML format that would let us represent most of TaxonConcept's information categories, such as concepts, descriptions, links to image objects, and references.
Commonly used formats such as DarwinCore and ABCD are mainly designed to represent metadata of collection objects or species observation data and are therefore not suited to our purposes. But at least TC's most important pieces of information, the taxonomic concepts and references, can be represented fairly well by the TDWG standard TCS (Taxonomic Concept Transfer Schema).
According to this TDWG standard, a Taxon Concept is a name plus a description of a taxon, a definition that fits what we do perfectly:
The taxonomic concepts stored in TC are entries of published synonymy lists. These essentially define individual taxonomic opinions by including or excluding other authors' descriptions. In other words, a synonymy list is a Taxon Concept that includes other Taxon Concepts.
The most difficult part of translating our data to TCS will be the translation of 'Open Nomenclature'. Fortunately, TCS allows us to represent our synonymy list entries as concept relationships, including pro parte relationships (p) and misapplied names (non), which should enable us to create adequate XML representations of our data.

Monday, January 21, 2008

How TaxonRank works

Fossilized organisms are important tools in the study of stratigraphy, past climates and ecologies. However, the taxonomic classifications of organisms, and thereby their names, change frequently, and finding correct synonymies for a given species is a considerable problem for non-taxonomists.
Computer-based knowledge management systems can help to make the existing wealth of taxonomic knowledge accessible and easier to interpret. TaxonRank is a ranking algorithm based on bibliometric analysis and Internet page ranking technologies. TaxonRank uses published synonymy list data stored in TaxonConcept, Snet's taxonomic information system.

Synonymy lists contain valuable taxonomic information. For decades, the Open Nomenclature notation has allowed authors to express uncertain classifications and to comment on other authors' identifications of a specimen. These lists contain occurrences of specimens in the literature that match the author's concept of a specific taxon, and they reflect the list author's taxonomic opinion or concept of that taxon. Their highly formalized nature makes them ideally suited to information systems that analyze and describe relations between taxonomic concepts. Since synonymy lists contain references to other synonymy lists or to taxonomic descriptions, they represent a typical ontology.

The idea behind TaxonRank is that some authors' species identifications might have a stronger impact on a 'common taxon concept' than others. This can be the result of many factors, e.g. the quality of species illustrations, the reputation of the author, or the availability of a publication. By analogy with PageRank, we state that the rank of a synonymy list is determined by the ranks of the synonymy lists cited in it.
The PageRank algorithm is based on the concepts and topology of the World Wide Web, so we first need to define 'pages' and 'links' between these pages.

To apply PageRank to synonymy lists, we define a synonymy list Si for a taxon t published by author i as the analogue of an Internet page. It contains an arbitrary number of pairs, each consisting of a synonymous name syn and the cited publication doc, listed by author i as l{syn, doc}. We refer to such pairs as synonymy list entries.
The order of a synonymy list entry, o({syn, doc}, Si), is in turn defined by the publication year of the document containing the synonymy list.
A link from Si to Sj is present when the synonymy list entry l{syn, doc} exists in both Si and Sj, o({syn, doc}, Sj) > o({syn, doc}, Si), and syn is an element of P, i.e. a pair of synonym name and publication has previously been used by the author of an older publication.
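The link definition can be sketched in a few lines of Python. This is only an illustration under several assumptions: the `SynonymyList` class and `find_links` function are hypothetical names, the order o is read as the publication year of the list, every syn is taken to be an element of P, and the entries are made-up placeholders rather than real data.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SynonymyList:
    """One published synonymy list S_i (hypothetical structure)."""
    author: str
    year: int             # publication year, used as the order o
    entries: frozenset    # set of (syn, doc) pairs


def find_links(lists):
    """Return (i, j) index pairs for every link S_i -> S_j.

    A link exists when both lists share at least one (syn, doc) entry
    and S_j was published later than S_i.
    """
    links = []
    for (i, s_i), (j, s_j) in permutations(enumerate(lists), 2):
        shares_entry = bool(s_i.entries & s_j.entries)
        if shares_entry and s_j.year > s_i.year:
            links.append((i, j))
    return links


# Placeholder data, invented for illustration only.
lists = [
    SynonymyList("author_1", 1957, frozenset({("syn_a", "doc_1")})),
    SynonymyList("author_2", 1979, frozenset({("syn_a", "doc_1"), ("syn_b", "doc_2")})),
    SynonymyList("author_3", 1999, frozenset({("syn_b", "doc_2")})),
]
links = find_links(lists)
print(links)
```

With this reading, links always run from the older list to the younger one that reuses an entry, so the resulting graph has no cycles.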

The set of all synonymy lists Sj of P is LSi, and the number of links from Sj is Nj. Further, any pair l{syn, doc} is defined as a synonymy list having itself as its only synonymy list entry. The distance distj between taxon tj and taxon ti is the distance between the nodes within the ontological graph network P and determines the strength, strength(Si, Sj) = 1/distj, of a link l leading from Si to Sj.
The SynonymyRank SR of a specific synonymy list Si is defined analogously to the PageRank algorithm and calculated recursively using equation (1):
(1)
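As a sketch of what equation (1) might look like, read from the definitions of LSi and Nj and from the analogy with PageRank, one plausible form is given here. It assumes no damping factor and leaves out the link strength; whether strength(Si, Sj) enters as a weight on each term is not certain from the description alone.

```latex
SR(S_i) = \sum_{S_j \in L_{S_i}} \frac{SR(S_j)}{N_j}
```

If the strength were applied as a weight, each term of the sum would simply be multiplied by strength(Si, Sj) = 1/distj.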

The rank of a synonymy list Si is thus defined as the sum of the ranks of all synonymy lists Sj pointing to Si, each divided by the number of links on Sj.
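The recursive definition can be approximated by simple iteration, as in PageRank. The sketch below is not our production code: the function name `synonymy_rank`, the damping factor of 0.85 (borrowed from PageRank, and needed here because ranks would otherwise drain to zero in a graph without cycles), and the optional `strength` weights are all assumptions made for illustration.

```python
def synonymy_rank(links, n, damping=0.85, strength=None, iterations=50):
    """Iteratively compute a rank for each of n synonymy lists.

    links    -- list of (source, target) index pairs, i.e. S_source -> S_target
    strength -- optional dict mapping a link to its weight (default 1.0);
                an assumption about how 1/dist could be applied
    """
    strength = strength or {}

    # N_j: number of links leaving each list.
    out_degree = [0] * n
    for src, _dst in links:
        out_degree[src] += 1

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Every list starts each round with the damping base value.
        new_rank = [(1.0 - damping) / n] * n
        for src, dst in links:
            weight = strength.get((src, dst), 1.0)
            new_rank[dst] += damping * weight * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Three lists in a chain: list 0 links to list 1, which links to list 2.
links = [(0, 1), (1, 2)]
ranks = synonymy_rank(links, n=3)
print(ranks)
```

In this small chain the last list collects the highest rank, because it receives rank from a list that has itself received rank.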

To calculate the rank of a specific taxon within a synonymy list, we included a pre-ranking derived from the Open Nomenclature notations used by both the synonymy list author and the cited author. In our ranking experiment we regard certain Open Nomenclature tags as indicators of confidence in a species identification and assign a scalar value to each tag.
This scalar value is used as a confidence factor for each species determination in a synonymy list and represents a measure of the taxonomic expert knowledge.
The rank of a taxon occurrence tik in a synonymy list is calculated as the product of this confidence factor and the synonymy list rank SR.
(2)

We can then calculate the rank of a taxon within a synonymy list, TR(ti, Sj), using equation (2) as the mean over all instances of that taxon, taking the confidence factor into account (e.g. 0.5 for Open Nomenclature tags such as cf, p, sp.).
As a first approach to determining the rank of a taxon as an element of P, we can calculate the total taxon rank as the sum of TR(ti) over all synonymy lists.
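The two steps, the mean within one list and the sum over all lists, can be sketched as follows. The function names and the sample data are hypothetical. The value 0.5 for the tags cf, p and sp comes from the text above; the default of 1.0 for untagged determinations is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Confidence factors per Open Nomenclature tag (0.5 as stated in the text).
CONFIDENCE = {"cf": 0.5, "p": 0.5, "sp": 0.5}
DEFAULT_CONFIDENCE = 1.0  # assumption: untagged determinations count fully


def taxon_rank_in_list(occurrences, list_rank):
    """Return {taxon: TR} for one synonymy list.

    occurrences -- list of (taxon, tag) pairs; tag is None if untagged
    list_rank   -- the SynonymyRank SR of this list
    """
    per_taxon = defaultdict(list)
    for taxon, tag in occurrences:
        confidence = CONFIDENCE.get(tag, DEFAULT_CONFIDENCE)
        # Rank of one occurrence: confidence factor times list rank.
        per_taxon[taxon].append(confidence * list_rank)
    # Rank within the list: mean over all occurrences of the taxon.
    return {taxon: mean(values) for taxon, values in per_taxon.items()}


def total_taxon_rank(ranked_lists):
    """Sum the per-list taxon ranks over all synonymy lists."""
    total = defaultdict(float)
    for occurrences, list_rank in ranked_lists:
        for taxon, tr in taxon_rank_in_list(occurrences, list_rank).items():
            total[taxon] += tr
    return dict(total)


# Placeholder data, invented for illustration only.
ranked_lists = [
    ([("taxon_a", None), ("taxon_a", "cf"), ("taxon_b", "sp")], 2.0),
    ([("taxon_a", "p")], 4.0),
]
totals = total_taxon_rank(ranked_lists)
print(totals)
```

Here taxon_a gets a mean of 1.5 in the first list and 2.0 in the second, so its total is 3.5, while taxon_b ends up at 1.0.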

As an example, we calculated TaxonRank for Subbotina triangularis. The sizes of the circles reflect the ranks of the synonym candidate taxa. All highly ranked taxa plot in the cluster around the target species, which indicates that TaxonRank correctly identifies the most important taxa. Testing the quality of TaxonRank is simplified by the fact that the most recent literature has little influence on the rank of a specific taxon. And in fact we find all taxa for which a rank > 30 was calculated in the synonymy list of the Paleocene foraminifer working group (Olsson, 1999).

Thursday, January 17, 2008

Graph layout experiments with Jsviz


Yesterday I found jsviz, a set of JavaScript-based tree and graph layout classes, at http://www.jsviz.org .
While I am quite happy to have the prefuse Java plug-in to display the complex relationships between taxa, I have always wished for something PHP- or JS-based because the prefuse applet loads quite slowly.
The first experiments with jsviz are quite promising! I began by simply modifying the examples from the jsviz blog pages and feeding them with XML files generated by TaxonConcept. The results are very impressive; for example, the snowflake graph for Archaeoglobigerina cretacea just looks beautiful: http://taxonconcept.stratigraphy.net/taxon_jsviz.php?taxid=1861