#2 Tokenize before sentence segmentation

Status: open
Labels: Engine (5)
Priority: 5
Updated: 2013-02-08
Created: 2007-03-06
Creator: Kevin Scannell
Private: No

Advantages:
(1) Can kill the giorr stuff completely: giorr-xx.txt moves into the lexicon with its periods written as \.; token-xx.txt rules where a "." => <d>. Everything in giorr-xx.pre and the code in the giorr function proper gets encoded as token-xx.txt rules that "enclose" non-terminal punctuation in longer tokens. What's left will be <X>.</X> or whatever.

(2) localizes all uses of BDCHARS in tokenization, search
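
A rough sketch of the idea in (1), in Python rather than the project's own code: if the tokenizer already "encloses" non-terminal periods inside longer tokens, sentence segmentation only has to look for bare punctuation tokens. The names ABBREVIATIONS, tokenize and segment and the sample entries are hypothetical, standing in for lexicon entries and token-xx.txt rules; this is not the actual engine.

```python
import re

# Hypothetical stand-in for abbreviation entries moved into the lexicon.
ABBREVIATIONS = {"Dr.", "e.g.", "lch."}

# Try abbreviations first (longest first), so their periods stay inside
# the token; then plain words; then single punctuation characters.
_abbr = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)
TOKEN_RE = re.compile(rf"{_abbr}|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping known abbreviations whole."""
    return TOKEN_RE.findall(text)


def segment(tokens):
    """Group tokens into sentences, splitting after bare '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences


sentences = segment(tokenize("Dr. Smith arrived. He left."))
```

Here the period in "Dr." never reaches the segmenter as a separate token, so no abbreviation-specific logic is needed at the segmentation stage. An abbreviation that also ends a sentence remains ambiguous under this sketch.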

Discussion