
#119 preserve \r characters when indexing documents

Next_Release
open
None
2
2014-09-25
2014-09-16
lauradietz
No

Motivation: different venues require character offset information (e.g. passage retrieval). Java's BufferedReader strips line terminators, including '\r', when readLine() is called, so all our character offsets are wrong for documents that contain '\r'.

This is how to fix it:

(From communication with Laurent Mertens:)

You can read in files using something like the following bit of code:

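    // Needs java.io.BufferedReader, java.io.FileInputStream, java.io.InputStreamReader
    StringBuilder sbFile = new StringBuilder();
    {
        // BufferedReader wraps a Reader, so the byte stream is decoded by an InputStreamReader first
        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        // char buffer
        char[] chars = new char[4096];
        int n;
        while ((n = bufferedReader.read(chars)) != -1) { // this preserves the original new lines!
            sbFile.append(chars, 0, n); // append only the n chars that were actually read
        }
        bufferedReader.close();
    }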

This should fix the problem. So yes, use a BufferedReader, but no, don't use the "readLine()" method.
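To make the offset shift concrete, here is a minimal sketch (not code from the project) that reads the same text both ways. The class name `CrOffsetDemo`, the method names, and the sample string are made up for illustration, and a `StringReader` stands in for the file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CrOffsetDemo {

    // Rebuilds the text from readLine(), joining lines with '\n'.
    // readLine() drops the terminator, so any '\r' is lost here.
    public static String readByLine(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Reads through a char buffer, which keeps every character as-is.
    public static String readByBuffer(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            char[] chars = new char[4096];
            int n;
            while ((n = reader.read(chars)) != -1) {
                sb.append(chars, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical document with Windows-style line endings.
        String original = "first\r\nsecond\r\nthird";

        String byLine = readByLine(new StringReader(original));
        String byBuffer = readByBuffer(new StringReader(original));

        // Offset of "third" in the source text versus in each reconstruction.
        System.out.println("original:    " + original.indexOf("third"));
        System.out.println("readLine():  " + byLine.indexOf("third"));
        System.out.println("char buffer: " + byBuffer.indexOf("third"));
    }
}
```

With two "\r\n" terminators before it, the offset of "third" differs by two between the readLine() reconstruction and the original, while the char-buffer version matches the original exactly.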

He also had an implementation for double-checking the encoding (UTF-8 versus ISO), "GetSafeBufferedReader", which could be used in place of "new BufferedReader":

That's just a class of my own (well, actually a colleague's) :) It throws an error
when you try to read a file encoded in X with a BufferedReader initialized to
work with encoding Y...

    public static BufferedReader GetSafeBufferedReader(final FileInputStream fileInputStream,
                                                       final String encoding) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        return new BufferedReader(new InputStreamReader(fileInputStream, decoder));
    }
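A small sketch of how a strict decoder like this behaves, under some assumptions: the class `StrictDecodingDemo` and the helpers `strictReader`, `readAll`, and `decodesCleanly` are hypothetical names, and the reader takes a plain `InputStream` rather than a `FileInputStream` so that in-memory bytes can stand in for a file. It is an illustration of the REPORT error actions, not the original class.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodingDemo {

    // Same idea as GetSafeBufferedReader, but accepting any InputStream (an assumption
    // made here so the demo can run on in-memory bytes).
    public static BufferedReader strictReader(InputStream in, String encoding) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    // Reads everything through a char buffer, so line terminators are preserved.
    public static String readAll(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = reader) {
            char[] chars = new char[4096];
            int n;
            while ((n = r.read(chars)) != -1) {
                sb.append(chars, 0, n);
            }
        }
        return sb.toString();
    }

    // Returns false if the bytes cannot be decoded with the given encoding.
    public static boolean decodesCleanly(byte[] bytes, String encoding) throws IOException {
        try {
            readAll(strictReader(new ByteArrayInputStream(bytes), encoding));
            return true;
        } catch (CharacterCodingException e) {
            // Raised by the decoder because of CodingErrorAction.REPORT.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample text, encoded as ISO-8859-1.
        byte[] latin1 = "caf\u00e9 au lait\r\n".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("as UTF-8:      " + decodesCleanly(latin1, "UTF-8"));
        System.out.println("as ISO-8859-1: " + decodesCleanly(latin1, "ISO-8859-1"));
    }
}
```

Without the REPORT actions, `InputStreamReader` would silently substitute a replacement character for the bad byte, which hides the encoding mismatch instead of surfacing it.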

Discussion

  • lauradietz

    lauradietz - 2014-09-17

    Here is an alternative way to read the file (I suspect this is how the TAC assessors do it).

    You can read a file into a String like this:

        String filePath = "/path/of/file/to/read";
        try {
            String fileContent = IOUtils.toString(new FileInputStream(new File(filePath)), "UTF-8");
            // ...
        } catch (IOException e) {
            // ...
        }
    
     
  • David Fisher

    David Fisher - 2014-09-22
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
     Motivation: different venues require character off-set information (e.g. passage retrieval) Java's BufferedReader will dismiss '\r' characters when calling readLine, therefore all our character offsets are wrong.
    
     This is how to fix it:
    
    • assigned_to: lauradietz
     
