This patch improves the handling of accidental stray quotes in OpenCSV.
Since OpenCSV supports multi-line fields in CSV files, a stray quote is interpreted as the beginning of a multi-line field. As there is no matching closing quote, OpenCSV will interpret the rest of the file as part of a single big multi-line field—except if it hits another accidental stray quote to terminate the field.
This patch changes a couple of small things:
The motivation for this patch is a case where OpenCSV would hang for many hours with 100% CPU load when parsing a particular CSV file with 4M rows. After extended investigation, the problem turned out to be a stray unmatched quote halfway into the file. The result was that OpenCSV would attempt to slurp hundreds of thousands of rows into a single field value. The repeated copying of the field value—two copies being taken for every processed line—slowed processing to a crawl as the field value grew to many megabytes. Eventually it would have run out of memory due to the field value becoming so large, but we were not that patient.
Scott (the head developer) and I discussed your suggestions and your patch, and we thank you for your contribution and your ideas. I see things as follows:
I expect this to be released in 3.10. No, I don't know when that will be. :)
Thanks for your considered response Andrew.
I'm satisfied with points 2 and 3.
On point 1, I do appreciate the concern for backward compatibility and not breaking the contract with users. So let me ask what exactly that contract is?
Imagine an input file with 1000 rows. It contains exactly one quote character (that is, an unmatched quote), in a field in the middle of row 500.
Is the contract to parse and return the first 499 rows, plus the first few fields of row 500, and then to silently discard the remaining 500 rows?
Is there any way for me to determine and report to my users that there was an issue with the input file, and information was discarded?
That is precisely what the contract is, yes, and there has long been a test to prove that. I reiterate: I am in complete agreement with you that it makes no sense. And no, I know of no way of getting at the information that data were discarded.
I will forward your argument to our head developer, Scott Conway, and see if it sways his opinion about backward compatibility.
I have had a chance to talk to the head developer. As it turns out, the test your patch broke was never meant to be a normative test. It was committed by accident. I just removed it.
I tried adapting your patch to the fact that the multiline limit is now implemented in a different way, and to the fact that we accepted another patch for the last version that conflicts with your patch. I wasn't able to do it quickly and easily, and I have to admit, I don't get into that part of the code much. I work on the bean code, mostly. Is it possible you could resubmit a patch for the current state of master?
Okay, I got around to getting the IOException for an unfinished multiline field into the code, and the beginning of the unfinished text is included in the IOException's message so finding errors in large inputs should be a little easier. Your patch was a big help. Thanks, Richard!
This will be out in 4.1, of course. Since we just release 4.0 a week ago, it might be a little while before 4.1 comes out.