Motivation: several venues require character offset information (e.g. passage retrieval). Java's BufferedReader discards line terminators ('\r' and '\n') when readLine() is called, so all our character offsets end up wrong.
This is how to fix it:
(From communication with Laurent Mertens:)
You can read in files using something like the following bit of code:
StringBuilder sbFile = new StringBuilder();
{
    // Note: BufferedReader cannot wrap a FileInputStream directly;
    // an InputStreamReader is needed to apply the encoding.
    BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "UTF-8"));
    // char buffer
    char[] chars = new char[4096];
    int n;
    while ((n = bufferedReader.read(chars)) != -1) {
        // this preserves the original newlines!
        sbFile.append(chars, 0, n);
    }
    bufferedReader.close();
}
This should fix the problem. So yes, use a BufferedReader, but no, don't use the readLine() method.
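To make the off-by-one concrete, here is a small self-contained sketch (the class name and file contents are made up for illustration): an offset computed from readLine() plus an assumed '\n' undercounts by one character for every preceding Windows-style ("\r\n") line.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OffsetDrift {
    // Offset of the second line as computed via readLine(): the '\r' is
    // gone by the time we see the line, so we can only guess "+1" for '\n'.
    static int readLineOffsetOfSecondLine(File file) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String firstLine = br.readLine();
            return firstLine.length() + 1; // wrong for "\r\n" files
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("drift", ".txt");
        Files.write(tmp, "first\r\nsecond".getBytes(StandardCharsets.UTF_8));
        // "second" really starts at offset 7 ("first" + "\r\n"), but:
        System.out.println(readLineOffsetOfSecondLine(tmp.toFile())); // prints 6
        Files.delete(tmp);
    }
}
```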
He also had an implementation for double-checking the encoding (UTF-8 versus ISO), GetSafeBufferedReader, which can be used in place of "new BufferedReader":
That's just a class of my own (well, actually a colleague's) :) It throws an error
when trying to read a file encoded in X with a BufferedReader initialized to
work with encoding Y...
public static BufferedReader GetSafeBufferedReader(final FileInputStream fileInputStream,
                                                   final String encoding) {
    CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    return new BufferedReader(new InputStreamReader(fileInputStream, decoder));
}
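A quick sketch of how the safe reader behaves in practice (the class name SafeReaderDemo, the helper decodesCleanly, and the test file are mine for illustration): reading ISO-8859-1 bytes through a reader configured for UTF-8 fails fast with a MalformedInputException instead of silently substituting replacement characters.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SafeReaderDemo {
    public static BufferedReader GetSafeBufferedReader(final FileInputStream fileInputStream,
                                                       final String encoding) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        return new BufferedReader(new InputStreamReader(fileInputStream, decoder));
    }

    // Returns true if the whole file decodes cleanly under the given encoding.
    static boolean decodesCleanly(File file, String encoding) throws IOException {
        try (BufferedReader reader = GetSafeBufferedReader(new FileInputStream(file), encoding)) {
            while (reader.readLine() != null) { /* force a full decode */ }
            return true;
        } catch (MalformedInputException e) {
            return false; // the bytes are not valid in this encoding
        }
    }

    public static void main(String[] args) throws IOException {
        // "café" written as ISO-8859-1: the final byte 0xE9 is invalid UTF-8.
        Path tmp = Files.createTempFile("enc", ".txt");
        Files.write(tmp, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(decodesCleanly(tmp.toFile(), "UTF-8"));      // false
        System.out.println(decodesCleanly(tmp.toFile(), "ISO-8859-1")); // true
        Files.delete(tmp);
    }
}
```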
Here is an alternative way to read the file (I suspect this is how the TAC assessors do it). You can read a file into a String like this:
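A minimal sketch of such a whole-file read, assuming UTF-8 input and java.nio.file being available (whether the TAC assessors do exactly this is a guess):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileRead {
    // Decode all bytes at once: nothing is stripped, so character offsets
    // into the returned String line up with the file on disk.
    static String readFileToString(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole", ".txt");
        Files.write(tmp, "a\r\nb\r\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFileToString(tmp).length()); // prints 6
        Files.delete(tmp);
    }
}
```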
Diff: