#2 Tokenize before sentence segmentation

Status: open
Labels: Engine (5)
Priority: 5
Updated: 2013-02-08
Created: 2007-03-06
Private: No

Advantages:
(1) The giorr code can be removed completely: giorr-xx.txt moves to the lexicon, with its periods written as \.; token-xx.txt rules where a "." => <d>

Everything in giorr-xx.pre, and the code in the giorr function proper, gets encoded as token-xx.txt rules that "enclose" non-terminal punctuation in longer tokens. What's left will be <X>.</X> or similar.

(2) Localizes all uses of BDCHARS in tokenization and search.
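
A rough sketch of the idea in (1), to make the ordering concrete. This is not the engine's code: the patterns below are hypothetical stand-ins for lexicon entries and token-xx.txt rules, and the function names are made up for illustration. Each pattern "encloses" a non-terminal period inside a longer token, so any "." that survives as its own token can be treated as a sentence terminator.

```python
import re

# Hypothetical stand-ins for lexicon entries and token-xx.txt rules.
# Each pattern keeps a non-terminal period inside a longer token.
ENCLOSING_PATTERNS = [
    r"(?:Mr|Mrs|Dr|etc)\.",   # abbreviations (lexicon-style entries)
    r"(?:[A-Za-z]\.){2,}",    # initialisms such as U.S. or e.g.
    r"\d+(?:\.\d+)+",         # decimals and dotted numbers such as 3.50
]

# Enclosing patterns are tried first; plain words and single
# punctuation characters are the fallback.
TOKEN_RE = re.compile("|".join(ENCLOSING_PATTERNS) + r"|\w+|[^\w\s]")

TERMINATORS = {".", "!", "?"}


def tokenize(text):
    """Split text into tokens, enclosing non-terminal periods."""
    return TOKEN_RE.findall(text)


def segment(tokens):
    """Group tokens into sentences; only stand-alone terminators split."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in TERMINATORS:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no terminator
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    text = "Dr. Smith paid 3.50 dollars in the U.S. last year. He left."
    for sentence in segment(tokenize(text)):
        print(sentence)
```

With this ordering the segmenter needs no abbreviation logic of its own. Note that the sketch does not handle an abbreviation that also ends a sentence.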


