I came across another interesting article by Rod Page. He reports on his attempt to use a regular expression to find Genbank identifiers in full texts. His regular expression worked for Genbank identifiers but, surprisingly, also matched UTM coordinates and therefore gave some false positives.
At first sight, the identifiers he found looked very similar to those used by the International Geo Sample Numbers (IGSN) project, which aims to resolve ambiguities in sample naming.
IGSNs are assigned to samples (on request) by a central registry (SESAR), which ensures that these identifiers are unambiguous. By definition, an IGSN is a 9-character identifier in which the first 3 characters stand for the institution responsible for the sample and the remaining 6 characters for the sample itself, for example HRV0002Y4.
So could mismatches like those Rod reported also occur if I searched for IGSNs in scientific articles?
To test whether other identifier systems use the same pattern, I turned to the cool exalead search engine, which allows regular expressions in web searches. A regular expression that matches such identifiers (at least those with a purely numeric sample part) is [A-Z]{3}[0-9]{6}.
And indeed, the first match exalead returned leads to data from the Interpro protein database, which uses the pattern IPRxxxxxx for its accession numbers. Good that SESAR has not yet assigned the prefix IPR ;) However, the example shows that, in theory, IGSNs can be ambiguous.
Today IGSNs are mostly used in geoscientific sample (core) repositories, and there they truly are unambiguous. However, these identifiers will most probably also be used in scientific publications. As molecular biological methods are sometimes used in paleontology and paleoclimatology studies, it is quite possible that such accession numbers will appear in publications together with IGSNs. Trouble for geoinformatics text mining applications ;)
A simple solution for this dilemma would be to recommend that authors cite an IGSN as IGSN:ABC012345, that is, the prefix IGSN: followed by the 9-character identifier. This is already the way they are displayed on the bar code labels SESAR provides.
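The collision and the proposed fix can be tried out in a few lines of Python. This is only a minimal sketch: the sample sentence is made up, IPR000001 merely stands in for any accession number of the form IPRxxxxxx, and allowing letters in the six-character sample part of the prefixed pattern is my assumption, based on the example HRV0002Y4.

```python
import re

# Bare pattern from the post: three capital letters followed by six digits.
BARE = re.compile(r"\b[A-Z]{3}[0-9]{6}\b")

# Prefixed form suggested in the post. Assumption: the sample part may also
# contain capital letters, as in the example HRV0002Y4.
PREFIXED = re.compile(r"\bIGSN:([A-Z]{3}[A-Z0-9]{6})\b")

# Made-up sentence mixing two IGSNs with an InterPro-style accession number.
text = ("Sample IGSN:HRV0002Y4 was compared with InterPro entry IPR000001 "
        "and sample IGSN:ABC012345.")

bare_hits = BARE.findall(text)          # also picks up the InterPro accession
prefixed_hits = PREFIXED.findall(text)  # only identifiers cited with IGSN:

print(bare_hits)
print(prefixed_hits)
```

The bare pattern returns the InterPro-style accession along with one IGSN and misses HRV0002Y4, while the prefixed pattern returns only the two identifiers that were cited with IGSN: in front.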
Monday, August 25, 2008
International Geo Sample Numbers (IGSN) in publications
Posted by Robert Huber at 25.8.08
Labels: Genbank, IGSN, regular expression, SESAR, text mining, UTM
Sunday, August 24, 2008
Biodiversity informatics session at EGU 2009
I just discovered the provisional programme for the EGU 2009 ESSI (Earth and Space Science Informatics) sessions. The session topics sound really interesting and, surprise, for the first time there will be a biodiversity informatics session.
Seems as if things will grow together and, hey, is this the beginning of a biogeoinformatics community? ;)
Posted by Robert Huber at 24.8.08
Labels: geoinformatics
Friday, August 22, 2008
Disinforming Google Street View
Since the first Google cars appeared here in my home town Bremen, Germany, many people have been concerned about Google's Street View activities. But apparently the legal situation does not allow anyone to stop Google from taking pictures of every corner of the city.
I personally was very amused to see the Google car in my street just before I finished painting our house;)
So what can you do to protect your privacy at least a bit when Google comes? By camouflage and disinformation;)
Posted by Robert Huber at 22.8.08
Friday, August 15, 2008
Citation parsing
The next version of ageparser will make extensive use of regular expressions to identify stratigraphic terms. While working on this, I also played with some regular expressions which are useful for identifying citations within a scientific document and for parsing the authors and year of these citations. I assume this is a fairly common task for some of you, so maybe you will find some of the following expressions useful for your own code:
Pattern for common person names:
$personpat="(([Bb][Ee][Nn]\s|[Dd][Ee]\s[Ll][Aa]\s|P'|[Dd]'|[vV][Aa][Nn]\s|[vV][Oo][Nn]\s|[dD][eE][lL]?\s|[dD][iI]\s)?[ÄÖÜA-Z]{1}[A-ZÄÖÜÒÓÀÉÈóòäöüàéèâa-z-]{1,})";
Pattern for authors:
$citpersonpat=$personpat."{1}(,\s".$personpat.",?)?((\sand\s".$personpat.")|((,\s".$personpat.",?)?\set al(\.|[i]{2})))?";
And two patterns for citations:
$citpattern1="[\s\.]{1}(".$citpersonpat.")\s\([0-9]{4}[a-z]?\)";
$citpattern2="[\s\.]{1}[\(,;]?(".$citpersonpat.")[,;]?\s[0-9]{4}[a-z]?[\),;\.\s]";
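For anyone who wants to try the patterns outside of ageparser, here is a rough Python port. It is a sketch, not the ageparser code: the function name find_citations, the extra capture group around the year, the non-capturing inner groups and the leading space added before searching are all my own additions for illustration.

```python
import re

# Port of $personpat; inner groups are non-capturing so that only the
# author string and the year are captured further down.
PERSON = (
    r"(?:(?:[Bb][Ee][Nn]\s|[Dd][Ee]\s[Ll][Aa]\s|P'|[Dd]'|[vV][Aa][Nn]\s"
    r"|[vV][Oo][Nn]\s|[dD][eE][lL]?\s|[dD][iI]\s)?"
    r"[ÄÖÜA-Z][A-ZÄÖÜÒÓÀÉÈóòäöüàéèâa-z-]+)"
)

# Port of $citpersonpat: "A", "A, B", "A and B", "A et al." or "A et alii".
CIT_PERSON = (
    PERSON + r"(?:,\s" + PERSON + r",?)?"
    r"(?:(?:\sand\s" + PERSON + r")"
    r"|(?:(?:,\s" + PERSON + r",?)?\set al(?:\.|i{2})))?"
)

# Port of $citpattern1, e.g. "Smith et al. (2004)".
CIT1 = re.compile(r"[\s.](" + CIT_PERSON + r")\s\(([0-9]{4}[a-z]?)\)")
# Port of $citpattern2, e.g. "(Huber and Page, 2008)".
CIT2 = re.compile(
    r"[\s.][(,;]?(" + CIT_PERSON + r")[,;]?\s([0-9]{4}[a-z]?)[),;.\s]"
)


def find_citations(text):
    """Return (authors, year) tuples found by either citation pattern."""
    # Both patterns expect whitespace or a period before the first author,
    # so a space is prepended to catch a citation at the very start.
    padded = " " + text
    hits = []
    for pattern in (CIT1, CIT2):
        for match in pattern.finditer(padded):
            hits.append((match.group(1), match.group(2)))
    return hits


sample = ("Ages follow Smith et al. (2004) and were revised later "
          "(Huber and Page, 2008).")
citations = find_citations(sample)
print(citations)
```

Python 3 strings are Unicode, so the umlauts and accented letters in the character classes behave as single characters here. In other environments that may depend on how the regex engine handles the file encoding.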
Posted by Robert Huber at 15.8.08
Labels: agenames, ageparser, text mining