Improve handling of unmatched opening quotes

Brought to you by: aruckerjones, sconway

#54 Improve handling of unmatched opening quotes

Milestone: Unstable (example)

Status: closed

Owner: Andrew Rucker Jones

Labels: None

Priority: 5

Updated: 2017-08-18

Created: 2016-12-22

Creator: Richard Cyganiak

Private: No

This patch improves the handling of accidental stray quotes in OpenCSV.

Since OpenCSV supports multi-line fields in CSV files, a stray quote is interpreted as the beginning of a multi-line field. As there is no matching closing quote, OpenCSV will interpret the rest of the file as part of a single big multi-line field—except if it hits another accidental stray quote to terminate the field.

This patch changes a couple of small things:

If we are in a multi-line field and reach the end of the file, an IOException is thrown. The message indicates that there is an unterminated quote. Previously, in this case parsing would have succeeded, and the contents of the unterminated multi-line field would have been silently discarded.
If a multi-line field contains more than 10,000 lines, an IOException is thrown. The message indicates that there is a likely unterminated quote. This is to prevent running out of memory or exceeding Java's maximum String length (2GB) in case of a stray quote in a very large file.
Performance for handling large multi-line fields is improved by recycling the same StringBuilder from one line to the next, instead of copying the StringBuilder's contents to a string (this.pending), and appending it to a fresh StringBuilder when processing the next line. This gets us to the 10,000 line limit much quicker.
One unit test was adjusted to check for the changed behaviour described in 1.

The motivation for this patch is a case where OpenCSV would hang for many hours with 100% CPU load when parsing a particular CSV file with 4M rows. After extended investigation, the problem turned out to be a stray unmatched quote halfway into the file. The result was that OpenCSV would attempt to slurp hundreds of thousands of rows into a single field value. The repeated copying of the field value—two copies being taken for every processed line—slowed processing to a crawl as the field value grew to many megabytes. Eventually it would have run out of memory due to the field value becoming so large, but we were not that patient.

1 Attachments

OpenCSV-unterminated-quotes.patch

Discussion

Andrew Rucker Jones - 2017-06-14

status: open --> pending

assigned_to: Andrew Rucker Jones
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2017-06-14

Scott (the head developer) and I discussed your suggestions and your patch, and we thank you for your contribution and your ideas. I see things as follows:

I personally completely agree with you that an unclosed quote should throw an IOException if the end of the file is reached, but opencsv already has a test for this case and tests specifically that the end of the line is ignored. We consider the existing tests to be a contract with users of opencsv that define the behavior of opencsv. We are extremely reluctant to break backward compatibility, and don't see a pressing need to do so in this case.

I like the core of your idea to include an upper limit to the number of lines in a multiline record. This part I implemented and just committed. The default value is no upper limit to preserve backwards compatibility, but you are free to set any upper limit you want for your environment. I hope this meets your needs.

Thank you for the performance improvements, but patch #53 did a more thorough job of the same thing, so I committed that instead.

I expect this to be released in 3.10. No, I don't know when that will be. :)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Richard Cyganiak - 2017-06-15

Thanks for your considered response Andrew.

I'm satisfied with points 2 and 3.

On point 1, I do appreciate the concern for backward compatibility and not breaking the contract with users. So let me ask what exactly that contract is?

Imagine an input file with 1000 rows. It contains exactly one quote character (that is, an unmatched quote), in a field in the middle of row 500.

Is the contract to parse and return the first 499 rows, plus the first few fields of row 500, and then to silently discard the remaining 500 rows?

Is there any way for me to determine and report to my users that there was an issue with the input file, and information was discarded?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2017-06-15

That is precisely what the contract is, yes, and there has long been a test to prove that. I reiterate: I am in complete agreement with you that it makes no sense. And no, I know of no way of getting at the information that data were discarded.

I will forward your argument to our head developer, Scott Conway, and see if it sways his opinion about backward compatibility.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2017-08-01

I have had a chance to talk to the head developer. As it turns out, the test your patch broke was never meant to be a normative test. It was committed by accident. I just removed it.

I tried adapting your patch to the fact that the multiline limit is now implemented in a different way, and to the fact that we accepted another patch for the last version that conflicts with your patch. I wasn't able to do it quickly and easily, and I have to admit, I don't get into that part of the code much. I work on the bean code, mostly. Is it possible you could resubmit a patch for the current state of master?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2017-08-18

status: pending --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andrew Rucker Jones - 2017-08-18

Okay, I got around to getting the IOException for an unfinished multiline field into the code, and the beginning of the unfinished text is included in the IOException's message so finding errors in large inputs should be a little easier. Your patch was a big help. Thanks, Richard!

This will be out in 4.1, of course. Since we just release 4.0 a week ago, it might be a little while before 4.1 comes out.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.