Stratigraphy.net internals: text mining


Monday, August 25, 2008

International Geo Sample Numbers (IGSN) in publications

I came across another interesting article by Rod Page. He reports on his attempt to use a regular expression to find GenBank identifiers in full texts. His regular expression worked for GenBank identifiers but, surprisingly, also matched UTM coordinates and thus gave some false positives.

At first sight, the identifiers he found looked very similar to those used by the International Geo Sample Numbers (IGSN) project, which aims to resolve ambiguities in sample naming.

IGSNs are assigned to samples (on request) by a central registry (SESAR), which ensures that these identifiers are unique. By definition, an IGSN is a 9-character identifier where the first 3 characters stand for the institution responsible for the sample and the remaining 6 characters for the sample itself, for example HRV0002Y4.

So can mismatches like those reported by Rod also occur if I search for IGSNs in scientific articles?
To test whether other identifier systems use the same pattern, I turned to the cool Exalead search engine, which allows the use of regular expressions in web searches. A regular expression which would match such identifiers, at least those with a purely numeric sample part, is [A-Z]{3}[0-9]{6}.
And indeed, the first match Exalead returned leads to data from the InterPro protein database, which uses the pattern IPRxxxxxx for its accession numbers. Good that SESAR has not yet assigned the prefix IPR ;) However, the example shows that, in theory, IGSNs can be ambiguous.
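
A quick way to see the collision is to run the pattern over a sentence that contains both kinds of identifier. The Python sketch below is only an illustration: the sample number HRV000214 is made up, IPR000001 simply stands in for an InterPro accession, and the function name is hypothetical.

```python
import re

# Pattern from the post: three capital letters followed by six digits.
# The word boundaries (\b) are an addition so that longer tokens are skipped.
IGSN_LIKE = re.compile(r"\b[A-Z]{3}[0-9]{6}\b")


def find_igsn_like(text):
    """Return every token that looks like an IGSN, whatever it really is."""
    return IGSN_LIKE.findall(text)


# HRV000214 is an invented sample number, IPR000001 an InterPro-style accession.
sentence = "Sample HRV000214 was compared with the InterPro entry IPR000001."
candidates = find_igsn_like(sentence)
print(candidates)
```

Both tokens come back as candidates, so the pattern alone cannot tell a sample number from a protein accession.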

Today IGSNs are mostly used in geoscientific sample (core) repositories, and there they truly are unambiguous. However, these identifiers will most probably also be used in scientific publications. As molecular biological methods are sometimes used in paleontology and paleoclimatology studies, it is not unlikely that protein or sequence accession numbers will appear in publications together with IGSNs. Trouble for geoinformatics text mining applications ;)

A simple solution for this dilemma would be to recommend that authors cite IGSNs in the form IGSN:ABC012345, that is, the prefix IGSN: followed by the 9-character identifier. This is already the way they are displayed on the bar code labels SESAR provides.
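
With the prefix in place, a text mining tool only has to look for the label. Here is a minimal Python sketch, assuming the prefix is written exactly as IGSN: with no space and that the sample part may contain capital letters as well as digits (as in HRV0002Y4); the function name is hypothetical.

```python
import re

# "IGSN:" followed by a 3-letter institution code and a 6-character sample code.
# Allowing letters in the sample code is an assumption based on HRV0002Y4.
PREFIXED_IGSN = re.compile(r"\bIGSN:([A-Z]{3}[0-9A-Z]{6})\b")


def find_prefixed_igsns(text):
    """Return only the identifiers that are explicitly labelled as IGSNs."""
    return PREFIXED_IGSN.findall(text)


sentence = "Core IGSN:HRV0002Y4 was compared with the InterPro entry IPR000001."
print(find_prefixed_igsns(sentence))
```

The InterPro-style accession is ignored because it does not carry the label.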

Friday, August 15, 2008

Citation parsing

The next version of ageparser will make extensive use of regular expressions to identify stratigraphic terms. While working on this, I also played with some regular expressions which are useful for identifying citations within a scientific document and for parsing the authors and year of these citations. I assume this is quite a common task for some of you, so maybe you will find some of the following expressions useful for your own code:

Pattern for common person names:

$personpat="(([Bb][Ee][Nn]\s|[Dd][Ee]\s[Ll][Aa]\s|P'|[Dd]'|[vV][Aa][Nn]\s|[vV][Oo][Nn]\s|[dD][eE][lL]?\s|[dD][iI]\s)?[ÄÖÜA-Z]{1}[A-ZÄÖÜÒÓÀÉÈóòäöüàéèâa-z-]{1,})";

Pattern for authors:

$citpersonpat=$personpat."{1}(,\s".$personpat.",?)?((\sand\s".$personpat.")|((,\s".$personpat.",?)?\set al(\.|[i]{2})))?";

And two patterns for citations:

$citpattern1="[\s\.]{1}(".$citpersonpat.")\s\([0-9]{4}[a-z]?\)";
$citpattern2="[\s\.]{1}[\(,;]?(".$citpersonpat.")[,;]?\s[0-9]{4}[a-z]?[\),;\.\s]";
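
To show how patterns of this kind can be put to work, here is a small Python sketch. It is a simplified rewrite, not the ageparser code: the name pattern covers only a few particles, the narrative form is assumed to put the year in parentheses, as in "Smith and Jones (2008a)", and all names in the code are made up for this example.

```python
import re

# Simplified surname: optional particle (van, von, de, del, di), then a capitalised name.
PERSON = r"(?:(?:[Vv]an|[Vv]on|[Dd]e|[Dd]el|[Dd]i)\s)?[A-ZÄÖÜ][A-Za-zÄÖÜäöüéèàóòâ-]+"

# One author, two authors joined by "and", or a first author followed by "et al."
AUTHORS = rf"{PERSON}(?:\sand\s{PERSON}|\set al\.)?"

YEAR = r"[0-9]{4}[a-z]?"

# Narrative form (assumed): authors followed by the year in parentheses.
NARRATIVE = re.compile(rf"(?P<authors>{AUTHORS})\s\((?P<year>{YEAR})\)")

# Parenthetical form: authors, optional comma, year, then a delimiter.
PARENTHETICAL = re.compile(rf"(?P<authors>{AUTHORS}),?\s(?P<year>{YEAR})(?=[),;.\s])")


def find_citations(text):
    """Return (authors, year) pairs in the order they appear in the text."""
    hits = []
    for pattern in (NARRATIVE, PARENTHETICAL):
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group("authors"), match.group("year")))
    return [(authors, year) for _, authors, year in sorted(hits)]


sentence = (
    "As shown by Smith and Jones (2008a), the boundary is older "
    "(van Hinte et al., 1976; Berggren, 1995)."
)
for authors, year in find_citations(sentence):
    print(authors, year)
```

Named groups keep the author string and the year apart, so there is no need to count brackets in the nested expression.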

Tuesday, July 8, 2008

Geoparser

Today I was scanning the web for tools which are able to scan documents and identify locations or coordinates (which we'll need to reach our ultimate goal of a 4D (space and time) index and search engine ;) ) and found Rod Page's interesting article: iPhylo: From PDFs to Google Earth.
He offers an online service, probably based on some regular expressions, which is able to extract coordinates from PDF files and returns KML or JSON files. A simple and pragmatic approach. Cool!
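
I don't know how Rod's service works internally, but the regular expression idea is easy to sketch. The Python example below is only a guess at the general approach: it handles a single notation (degrees and minutes with a hemisphere letter, such as 52°30'N, 13°15'E), and all function names are made up.

```python
import re

# Latitude and longitude as degrees and minutes followed by a hemisphere letter.
COORD = re.compile(
    r"(?P<lat_deg>\d{1,2})°\s?(?P<lat_min>\d{1,2})['′]\s?(?P<lat_hem>[NS])"
    r"[,\s]+"
    r"(?P<lon_deg>\d{1,3})°\s?(?P<lon_min>\d{1,2})['′]\s?(?P<lon_hem>[EW])"
)


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees."""
    value = int(degrees) + int(minutes) / 60.0
    return -value if hemisphere in ("S", "W") else value


def extract_coordinates(text):
    """Return a list of (latitude, longitude) pairs found in the text."""
    return [
        (
            to_decimal(m["lat_deg"], m["lat_min"], m["lat_hem"]),
            to_decimal(m["lon_deg"], m["lon_min"], m["lon_hem"]),
        )
        for m in COORD.finditer(text)
    ]


def to_kml_placemark(lat, lon):
    """Build a bare KML placemark; KML lists longitude before latitude."""
    return f"<Placemark><Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"


sentence = "The section is located at 52°30'N, 13°15'E near the river."
for lat, lon in extract_coordinates(sentence):
    print(to_kml_placemark(lat, lon))
```

A real tool would need more patterns (seconds, decimal degrees, UTM) and some filtering of false positives.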

I also found some geoparser tools which are able to identify location names in texts. The most interesting is MetaCarta's geoparser API, which seems to give good results. MetaCarta's web pages offer some impressive examples of how this API can be used.
Another geoparser is DIGMAP's text mining service, which returns an OGC-compliant XML file containing all features found (not only geographic ones).
And there is EDINA's geoxwalk, which seems to be restricted to the British Isles. However, I could not test this tool: the site mentioned only offers a screenshot and some PDF documents on this tool.

MetaCarta's geoparser seems to be the most advanced solution. However, it is a service offered by a commercial company, and unfortunately their 'terms of use' page returns a 404 error. Most probably they will not offer this service for free.

I wonder if it would be possible to create something similar to agenames which identifies location names and returns coordinate pairs. Maybe based on the geonames gazetteer?
