Monday, August 25, 2008

International Geo Sample Numbers (IGSN) in publications


I came across another interesting article by Rod Page. He reports on his attempt to use a regular expression to find GenBank identifiers in full texts. His regular expression worked for GenBank identifiers but, surprisingly, also matched UTM coordinates and therefore produced some false positives.

At first sight, the identifiers he found looked very similar to those used by the International Geo Sample Numbers (IGSN) project, which aims to resolve ambiguities in sample naming.

IGSNs are assigned to samples (on request) by a central registry (SESAR), which ensures that these identifiers are unambiguous. By definition, an IGSN is a nine-character identifier in which the first three characters stand for the institution responsible for the sample and the remaining six for the sample itself, for example HRV0002Y4.

So could mismatches like those Rod reported also occur if I searched for IGSNs in scientific articles?
To test whether other identifier systems use the same pattern, I turned to the cool exalead search engine, which allows regular expressions in web searches. A regular expression that matches such identifiers is [A-Z]{3}[0-9]{6}.
And indeed, the first match exalead returned leads to data from the InterPro protein database, which uses the pattern IPRxxxxxx for its accession numbers. Good that SESAR has not yet assigned the prefix IPR ;) However, the example shows that, in theory, IGSNs can be ambiguous.
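
To make the collision concrete, here is a minimal Python sketch that runs the pattern above over a made-up sentence. The sentence and the variable names are my own assumptions for illustration; ABC012345 is a placeholder IGSN and IPR000001 stands in for an InterPro accession number.

```python
import re

# Pattern from the post: three uppercase letters followed by six digits.
# The \b word boundaries are an addition so that longer tokens are not matched in part.
IGSN_LIKE = re.compile(r"\b[A-Z]{3}[0-9]{6}\b")

# Hypothetical sentence mixing a placeholder IGSN with an InterPro-style accession.
text = "Sample ABC012345 was compared with InterPro entry IPR000001."

matches = IGSN_LIKE.findall(text)
print(matches)  # both identifiers are reported; the pattern cannot tell them apart
```

Note that this pattern only covers IGSNs whose sample part is purely numeric; an identifier like HRV0002Y4 would not be found by it at all.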

Today IGSNs are mostly used in geoscientific sample (core) repositories, and there they truly are unambiguous. However, these identifiers will most probably also be used in scientific publications. As molecular biological methods are sometimes used in paleontology and paleoclimatology studies, it is not unlikely that such accession numbers will appear in publications together with IGSNs. Trouble for geoinformatics text mining applications ;)

A simple solution to this dilemma would be to recommend that authors cite IGSNs as IGSN:ABC012345, that is, the prefix IGSN: followed by the nine-character identifier. This is already how they are displayed on the bar code labels SESAR provides.
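
A text mining tool could then key on the prefix. The following is only a sketch under my own assumptions: the function name extract_igsns is hypothetical, and I allow letters as well as digits in the six-character sample part so that the example HRV0002Y4 is covered, which is looser than the regular expression used for the web search.

```python
import re

# Assumed pattern: literal "IGSN:" prefix, a three-letter institution code,
# then six characters (digits or uppercase letters) for the sample itself.
PREFIXED_IGSN = re.compile(r"\bIGSN:([A-Z]{3}[A-Z0-9]{6})\b")


def extract_igsns(text: str) -> list[str]:
    """Return the nine-character identifiers that are cited with an IGSN: prefix."""
    return PREFIXED_IGSN.findall(text)


# Hypothetical sentence: two prefixed IGSNs and one InterPro-style accession.
sentence = (
    "Core IGSN:HRV0002Y4 was analysed alongside InterPro entry IPR000001 "
    "and sample IGSN:ABC012345."
)
found = extract_igsns(sentence)
print(found)  # the unprefixed IPR accession is ignored
```

With the prefix in place, the InterPro-style accession no longer counts as a match, even though it has the same three-letters-plus-six-digits shape.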
