-
Neko 1.9.11 goes into an infinite loop on some documents, e.g.
http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm
http://cizel.co.kr/main.php
Reverting to 0.9.4 seems to fix the problem.
2009-02-20 11:32:14 UTC in CyberNeko HTML Parser
-
SinglePhaseTransducer uses SimpleSortedSets to sort annotations by offset.
SimpleSortedSet uses a map internally and puts Lists of Annotations as values. We can expect that in most cases there is only one annotation per offset (think about Tokens - they are the most frequent ones) in which case generating ArrayLists is clearly a waste of time and memory.
The patch attached fixes that by...
2008-08-01 11:00:40 UTC in GATE
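For illustration only, here is a minimal sketch of the general idea described above: store a bare value when an offset has a single entry, and allocate a list only when a second entry arrives at the same offset. This is not the attached patch and not GATE's SimpleSortedSet; the class name `OffsetIndex`, its methods, and the private `Bucket` marker type are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: a sorted offset index that avoids one list per key. */
public class OffsetIndex<T> {

    /** Private marker type, so a stored list can never be confused with a user value. */
    private static final class Bucket<E> extends ArrayList<E> {
        private static final long serialVersionUID = 1L;
        Bucket() { super(2); }
    }

    // Each value is either a single T or a Bucket<T> holding two or more items.
    private final TreeMap<Long, Object> byOffset = new TreeMap<>();

    @SuppressWarnings("unchecked")
    public void add(long offset, T item) {
        Objects.requireNonNull(item, "item");
        Object existing = byOffset.get(offset);
        if (existing == null) {
            // Common case: first item at this offset, no list allocated.
            byOffset.put(offset, item);
        } else if (existing instanceof Bucket) {
            ((Bucket<T>) existing).add(item);
        } else {
            // Second item at this offset: upgrade the bare value to a list.
            Bucket<T> bucket = new Bucket<>();
            bucket.add((T) existing);
            bucket.add(item);
            byOffset.put(offset, bucket);
        }
    }

    /** Returns the items at the given offset in insertion order (possibly empty). */
    @SuppressWarnings("unchecked")
    public List<T> get(long offset) {
        Object value = byOffset.get(offset);
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Bucket) {
            return Collections.unmodifiableList((Bucket<T>) value);
        }
        return Collections.singletonList((T) value);
    }

    /** Offsets in ascending order. */
    public Set<Long> offsets() {
        return Collections.unmodifiableSet(byOffset.keySet());
    }
}
```

With mostly one item per offset (as with Tokens), `add` allocates nothing beyond the map entry itself; the cost is an `instanceof` check on each access.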
-
GATE currently has an internal mechanism for parsing document formats which converts the markup into annotations (at least for XML/HTML documents) and does some detection of MIME types.
The TIKA project (incubator.apache.org/tika/) does exactly that. It also generates some markup for PDF documents and is good at detecting MIME types and encodings. TIKA's API is simple and could be easily...
2008-02-06 10:37:06 UTC in GATE
-
As you corrected yourselves - it exposes a single ANNOTATION as an AnnotationSet.
2007-12-06 12:47:40 UTC in GATE
-
I just found a change I'd made on my local copy of GATE ages ago. I just tested it against build number 2820 and it seems to work fine.
I've created a class SingletonAnnotationSet which - as its name indicates - exposes a single AS as an AnnotationSet and is immutable.
The original motivation was that I noticed that SinglePhaseTransducer (used by Jape) creates an awful lot of temporary...
2007-12-06 12:45:38 UTC in GATE
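As a rough illustration of the immutable single-element set idea, here is a generic sketch built on `java.util.AbstractSet`. It is not the SingletonAnnotationSet class mentioned above and does not implement GATE's AnnotationSet interface; the name `SingletonSet` is hypothetical. The JDK's own `Collections.singleton` offers the same behaviour for plain `Set`s.

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Objects;

/** Hypothetical sketch: an immutable Set view over exactly one element. */
public final class SingletonSet<T> extends AbstractSet<T> {

    private final T element;

    public SingletonSet(T element) {
        this.element = Objects.requireNonNull(element, "element");
    }

    @Override
    public int size() {
        return 1;
    }

    @Override
    public boolean contains(Object o) {
        return element.equals(o);
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private boolean consumed = false;

            @Override
            public boolean hasNext() {
                return !consumed;
            }

            @Override
            public T next() {
                if (consumed) {
                    throw new NoSuchElementException();
                }
                consumed = true;
                return element;
            }
            // remove() is not overridden, so the default implementation
            // throws UnsupportedOperationException.
        };
    }
    // add() is inherited from AbstractCollection and throws
    // UnsupportedOperationException, which keeps the set immutable.
}
```

No backing collection is allocated: the object holds one reference, which is the point when many such temporary sets are created.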