Tuesday, July 8, 2008

Geoparser

Today I was searching the web for tools that can scan documents and identify locations or coordinates (which we'll need to reach our ultimate goal of a 4D (space and time) index and search engine ;) ) and found Rod Page's interesting article: iPhylo: From PDFs to Google Earth.
He offers an online service, probably based on some regular expressions, which extracts coordinates from PDF files and returns KML or JSON files. A simple and pragmatic approach. Cool!
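
I don't know how Rod's service is actually implemented, but the general idea could look roughly like the Python sketch below. The pattern, the function names and the pairing rule are my own assumptions for illustration. It only handles degree/minute/second notation followed by a hemisphere letter, so real-world text would need many more formats.

```python
import json
import re

# Hypothetical pattern (not taken from any existing service):
# degrees, optional minutes, optional seconds, then a hemisphere letter.
DMS = re.compile(r"""
    (\d{1,3}(?:\.\d+)?) \s* ° \s*              # degrees, optionally decimal
    (?: (\d{1,2}(?:\.\d+)?) \s* ['′] \s* )?    # optional minutes
    (?: (\d{1,2}(?:\.\d+)?) \s* ["″] \s* )?    # optional seconds
    ([NSEW])                                   # hemisphere
""", re.VERBOSE)


def to_decimal(deg, minutes, seconds, hemi):
    """Convert one matched coordinate to signed decimal degrees."""
    value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    return -value if hemi in "SW" else value


def extract_points(text):
    """Pair each latitude (N/S) with the longitude (E/W) that directly follows it."""
    matches = [m.groups() for m in DMS.finditer(text)]
    points = []
    i = 0
    while i + 1 < len(matches):
        first, second = matches[i], matches[i + 1]
        if first[3] in "NS" and second[3] in "EW":
            points.append({"lat": to_decimal(*first), "lon": to_decimal(*second)})
            i += 2
        else:
            i += 1  # unpaired value, skip it
    return points


if __name__ == "__main__":
    sample = "Collected at 12°30'N, 45°15'W near the coast."
    print(json.dumps(extract_points(sample)))
```

Writing the result as KML instead of JSON would only change the last step. The hard part is the pattern itself.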

I also found some geoparser tools which are able to identify location names in texts. The most interesting is Metacarta's geoparser API, which seems to give good results. Metacarta's web pages offer some impressive examples of how this API can be used.
Another geoparser is DIGMAP's text mining service, which returns an OGC-compliant XML file containing all features it finds (not only geographic ones).
And there is EDINA's geoxwalk, which seems to be restricted to the British Isles. However, I could not test this tool: the site only offers a screenshot and some PDF documents about it.

Metacarta's geoparser seems to be the most advanced solution. However, it's a service offered by a commercial company, and unfortunately their 'terms of use' page returns a 404 error. Most probably they will not offer this service for free.

I wonder if it would be possible to create something similar to agenames which identifies location names and returns coordinate pairs. Maybe based on the geonames gazetteer?
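
Just to make the idea concrete, here is a minimal Python sketch of such a lookup. Everything in it is an assumption on my side: the tiny dictionary only stands in for a real gazetteer (which could be loaded from a GeoNames dump), the coordinates are rounded approximations, and the function name is made up. It prefers the longest matching name, so "New York" wins over "York".

```python
import re

# Toy stand-in for a gazetteer: lower-cased name -> approximate (lat, lon).
# A real version might load these entries from a GeoNames data dump.
GAZETTEER = {
    "bremen": (53.08, 8.80),
    "new york": (40.71, -74.01),
    "york": (53.96, -1.08),
}

# Longest name in the gazetteer, counted in words.
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def find_places(text):
    """Return (name as written, (lat, lon)) for every gazetteer name in the text."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            coords = GAZETTEER.get(candidate.lower())
            if coords is not None:
                found.append((candidate, coords))
                i += n
                break
        else:
            i += 1  # no name starts at this word
    return found


if __name__ == "__main__":
    for name, (lat, lon) in find_places("Samples from New York and Bremen."):
        print(name, lat, lon)
```

A plain dictionary lookup like this cannot tell a place from a person or a common word with the same spelling, which is probably where most of the real work would be.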

2 comments:

Roderic Page said...

Robert,

You are correct, my service uses regular expressions to pull out latitude and longitude pairs from text. It's a nightmare, with all sorts of formats out in the wild, coupled with horrible character encoding issues. But it works reasonably well.

I'm a big fan of the Metacarta service, which I use extensively in my iPhylo project (current demo here). I use it to find localities for specimen and sequence records. It's free, but you need a username and password. I can't remember the terms of use, but like most things I suspect they'll throttle you if you use it too much.

Unknown said...

geoCrosswalk is a service from EDINA, which makes various geo services available to anyone (students, lecturers, researchers etc.) in UK higher education.