The next version of ageparser will extensively use regular expressions to identify stratigraphic terms. While working on this, I also played with some regular expressions which are useful to identify citations within a scientific document and to parse authors and year of these citations. I assume this is a quite common task for some of you, so maybe you find some of the following expressions useful for your own code:
Pattern for common person names:
$personpat=(([Bb][Ee][Nn]\s|[Dd][Ee]\s[Ll][Aa]\s|P'|[Dd]'|[vV][Aa][Nn]\s|[vV][Oo][Nn]\s|[dD][eE][lL]?\s|[dD][iI]\s)?[ÄÖÜA-Z]{1}[A-ZÄÖÜÒÓÀÉÈóòäöüàéèâa-z-]{1,})
Pattern for authors:
$citpersonpat=$personpat."{1}(,\s".$personpat.",?)?((\sand\s".$personpat.")|((,\s".$personpat.",?)?\set al(\.|[i]{2})))?"
And two patterns for citations:
$citpattern1="[\s\.]{1}(".$citpersonpat.")\s\([0-9]{4}[a-z]?\)"
$citpattern2="[\s\.]{1}[\(,;]?(".$citpersonpat.")[,;]?\s[0-9]{4}[a-z]?[\),;\.\s]"
Friday, August 15, 2008
Citation parsing
Posted by Robert Huber at 15.8.08
Labels: agenames, ageparser, text mining
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment