The Lemur Project / Feature Requests / #119 preserve \r characters when indexing documents

#119 preserve \r characters when indexing documents

Milestone: Next_Release

Status: open

Owner: lauradietz

Labels: None

Priority: 2

Updated: 2014-09-25

Created: 2014-09-16

Creator: lauradietz

Private: No

Motivation: different venues require character offset information (e.g. passage retrieval). Java's BufferedReader discards line terminators, including '\r', when calling readLine(), therefore all our character offsets are wrong.
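A minimal sketch of the offset drift, assuming the loader rebuilds the text by appending a single '\n' after every readLine() result (the class and method names here are hypothetical, not taken from the Lemur code base):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLineOffsetDemo {
    // Rebuilds the text the way a readLine()-based loader typically does:
    // each line is appended followed by a single '\n', so "\r\n" shrinks to "\n".
    public static String readWithReadLine(String raw) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "first line\r\nsecond line\r\n";
        String rebuilt = readWithReadLine(raw);
        System.out.println(raw.indexOf("second"));     // 12: offset in the original text
        System.out.println(rebuilt.indexOf("second")); // 11: offset after readLine()
    }
}
```

Every "\r\n" before a given position shifts the offset by one more character, so the error grows with the number of preceding lines.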

This is how to fix it:

(From communication with Laurent Mertens:)

You can read in files using something like the following bit of code:

    StringBuilder sbFile = new StringBuilder();
    {
        // Wrap the byte stream in an InputStreamReader so the charset is applied.
        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        // char buffer
        char[] chars = new char[4096];
        int n;
        while ((n = bufferedReader.read(chars)) != -1) { // this preserves the original new lines!
            sbFile.append(chars, 0, n); // append only the n chars actually read
        }
        bufferedReader.close();
    }

This should fix the problem. So yes, use a BufferedReader, but no, don't use the "readLine()" method.
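To check that chunked reads keep the offsets intact, here is a self-contained sketch that reads from an in-memory stream instead of a file. The names RawTextReader and readAll are made up for illustration; the loop is the same read(char[]) loop as in the snippet above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RawTextReader {
    // Reads the whole stream as UTF-8 text in fixed-size chunks,
    // so '\r' and '\n' arrive exactly as stored.
    public static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] chars = new char[4096];
            int n;
            while ((n = reader.read(chars)) != -1) {
                sb.append(chars, 0, n); // append only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "first line\r\nsecond line\r\n";
        byte[] bytes = raw.getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes));
        System.out.println(text.equals(raw));       // true
        System.out.println(text.indexOf("second")); // 12
    }
}
```

Note that offsets obtained this way count Java chars (UTF-16 code units), not bytes, which may or may not match what a given venue expects.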

He also had an implementation for double-checking the encoding (UTF-8 versus ISO), "GetSafeBufferedReader", which could go in place of "new BufferedReader":

That's just a class of my own (well, actually a colleague's) :) It throws an error
when trying to read a file encoded in X with a BufferedReader initialized to
work with encoding Y...

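    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public static BufferedReader GetSafeBufferedReader(final FileInputStream fileInputStream,
                                                       final String encoding) {
        // REPORT makes the decoder throw instead of silently substituting characters.
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        return new BufferedReader(new InputStreamReader(fileInputStream, decoder));
    }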
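As a rough illustration of what the strict decoder buys you, the sketch below feeds ISO-8859-1 bytes to a reader configured for UTF-8. The class and the decodesCleanly helper are hypothetical; strictReader mirrors GetSafeBufferedReader but accepts any InputStream so it can run on in-memory data.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodingDemo {
    // Same idea as GetSafeBufferedReader, generalized to any InputStream.
    public static BufferedReader strictReader(InputStream in, String encoding) {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    // Returns true if the bytes decode without error in the given encoding.
    public static boolean decodesCleanly(byte[] bytes, String encoding) {
        try (BufferedReader reader = strictReader(new ByteArrayInputStream(bytes), encoding)) {
            char[] chars = new char[4096];
            while (reader.read(chars) != -1) {
                // drain the stream; only decoding errors are of interest here
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable input was reported
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 au lait";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesCleanly(utf8, "UTF-8"));   // true
        System.out.println(decodesCleanly(latin1, "UTF-8")); // false
    }
}
```

The check only works in one direction: every byte sequence is valid ISO-8859-1, so UTF-8 data read as ISO-8859-1 decodes without an error and simply yields wrong characters.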

Discussion

lauradietz - 2014-09-17

Here is an alternative way to read the file (I suspect this is how the TAC assessors do it).

You can read a file into a String like this:

    String filePath = "/path/of/file/to/read";
    try {
        String fileContent = IOUtils.toString(new FileInputStream(new File(filePath)), "UTF-8");
        // ...
    } catch (IOException e) {
        // ...
    }

David Fisher - 2014-09-22

Description has changed:

Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    - Motivation: different venues require character off-set information (e.g. passage retrieval) Java's BufferedReader will dismiss '\r' characters when calling readLine, therefore all our character offsets are wrong. This is how to fix it:

assigned_to: lauradietz
