Silently return no rows on file with invalid UTF-8
If a file contains invalid UTF-8 (e.g. the byte sequence a8 33 0d 34 00 0a, where a8 is the Latin-1 diaeresis and is not valid UTF-8 on its own; see attached file), then the CSVReader will read the file, return no rows, and give no errors.
Looking into this: on this data, a BufferedReader throws a MalformedInputException, which is a subclass of IOException.
A plain Java reader will throw:
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read(BufferedReader.java:182)
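For anyone who wants to reproduce the exception without the attached file, here is a minimal sketch. The class and method names (MalformedUtf8Demo, firstRead) are made up for illustration. It assumes the reader is built with a decoder from newDecoder(), which reports malformed input; an InputStreamReader constructed from just a Charset replaces bad bytes instead and would not throw.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class MalformedUtf8Demo {
    // The bytes quoted in the report; 0xA8 cannot start a UTF-8 sequence.
    static final byte[] BAD = {(byte) 0xA8, 0x33, 0x0D, 0x34, 0x00, 0x0A};

    // Hypothetical helper: attempts one read and describes what happened.
    static String firstRead(byte[] data) {
        // A decoder obtained via newDecoder() reports malformed input
        // rather than substituting a replacement character.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8.newDecoder()))) {
            return "read " + br.read();
        } catch (MalformedInputException e) {
            return "malformed";
        } catch (IOException e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(firstRead(BAD)); // prints "malformed"
    }
}
```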
But the CSVReader swallows it: since MalformedInputException isn't on its list of PASSTHROUGH_EXCEPTIONS, isClosed() concludes that the stream is closed, and the reader acts as if it has reached the end of the stream.
protected boolean isClosed() throws IOException {
    if (!verifyReader) {
        return false;
    }
    try {
        br.mark(READ_AHEAD_LIMIT);
        int nextByte = br.read();
        br.reset(); // resets stream position, possible because it's buffered
        return nextByte == -1; // read() returns -1 at end of stream
    } catch (IOException e) {
        if (PASSTHROUGH_EXCEPTIONS.contains(e.getClass())) {
            throw e;
        }
        return true;
    }
}
The fix is to add java.nio.charset.MalformedInputException to the list of PASSTHROUGH_EXCEPTIONS.
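To illustrate the difference this makes, here is a rough standalone sketch, not the library's actual code. PassthroughSketch, looksClosed, strictReader and outcome are hypothetical names, and the passthrough collection is modelled as a Set purely for the demo. It mimics the isClosed() check with and without MalformedInputException on the list.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

public class PassthroughSketch {
    // The bytes from the report; 0xA8 is not a valid UTF-8 lead byte.
    static final byte[] BAD = {(byte) 0xA8, 0x33, 0x0D, 0x34, 0x00, 0x0A};

    // Builds a reader whose decoder reports malformed input (assumption for the demo).
    static BufferedReader strictReader(byte[] data) {
        return new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8.newDecoder()));
    }

    // Same shape as the isClosed() check quoted above, with the list passed in.
    static boolean looksClosed(BufferedReader br,
            Set<Class<? extends IOException>> passthrough) throws IOException {
        try {
            br.mark(2);
            int next = br.read();
            br.reset();
            return next == -1;
        } catch (IOException e) {
            if (passthrough.contains(e.getClass())) {
                throw e; // surfaced to the caller
            }
            return true; // swallowed: stream is treated as closed
        }
    }

    static String outcome(Set<Class<? extends IOException>> passthrough) {
        try (BufferedReader br = strictReader(BAD)) {
            return looksClosed(br, passthrough) ? "closed" : "open";
        } catch (IOException e) {
            return "threw " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Set<Class<? extends IOException>> before = Collections.emptySet();
        Set<Class<? extends IOException>> after =
                Collections.singleton(MalformedInputException.class);
        System.out.println(outcome(before)); // prints "closed"
        System.out.println(outcome(after));  // prints "threw MalformedInputException"
    }
}
```

With the empty set the bad input is indistinguishable from end of stream, which matches the "no rows, no errors" symptom; with the exception class listed, the caller sees the error.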
Added MalformedInputException to the passthrough list. Will be out in the next release.
@sconway Brilliant, thanks!