Got a StringIndexOutOfBoundsException in the tokenizer. The text actually had some Italian in it, and the tokenizer had trouble with the word "il".
The code in question is line 1001:
// check if the previous word starts with a capital letter,
// is at least 3 letters long, is an alphabet sequence,
// and has a comma.
boolean previousIsCity =
    (Character.isUpperCase(previous.charAt(0))
        && previous.length() > 2
        && matches(alphabetPattern, previous)
        && tokenItem.findFeature("p.punc").equals(","));
In this case previous is an empty string, so the charAt(0) call fails.
The fix is simply to move the previous.length() > 2 check before the Character.isUpperCase call, so that && short-circuits on an empty string and charAt(0) is never reached.
I think the text in question was "1512-17; Il Principe"
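For illustration, here is a minimal standalone sketch of the reordered check. It is not the actual tokenizer code: the class name CityCheck, the ALPHABET pattern, and the precedingPunctuation parameter are hypothetical stand-ins for alphabetPattern, matches(), and tokenItem.findFeature("p.punc"), which are not shown in full above.

```java
import java.util.regex.Pattern;

class CityCheck {
    // Hypothetical stand-in for the tokenizer's alphabetPattern.
    private static final Pattern ALPHABET = Pattern.compile("[A-Za-z]+");

    // precedingPunctuation stands in for tokenItem.findFeature("p.punc").
    static boolean previousIsCity(String previous, String precedingPunctuation) {
        // The length check comes first, so && short-circuits on an empty
        // string and charAt(0) is only called when a character exists.
        return previous.length() > 2
                && Character.isUpperCase(previous.charAt(0))
                && ALPHABET.matcher(previous).matches()
                && ",".equals(precedingPunctuation);
    }

    public static void main(String[] args) {
        System.out.println(previousIsCity("", ","));       // empty string: no exception
        System.out.println(previousIsCity("Boston", ","));
        System.out.println(previousIsCity("Il", ","));     // too short to qualify
    }
}
```

Writing ",".equals(...) with the literal on the left also avoids a NullPointerException if the punctuation feature were ever null; whether findFeature can return null is an assumption here, not something confirmed by the report.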