Hi,
Apologies if this has been covered elsewhere or if this is not the right place to ask the question, but I've not been able to find a solution to the problem I'm encountering and was hoping you might be able to point me in the right direction.
I've been using opencsv for a while and it's a great solution - thanks for all your hard work. I'm parsing a very large CSV file, where each line is an individual record and none of the fields contain a line separator.
The problem I'm encountering is that occasionally I get a malformed entry where a field contains a double quote that isn't correctly escaped. In this case I just want to exclude the one line and continue parsing the remaining lines i.e. I want the line terminator to take precedence over the incorrectly quoted field. However I can't seem to find a way to achieve this.
My code currently looks like:
fis = new FileInputStream(theFile);
isr = new InputStreamReader(fis, encoding);
ICSVParser parser = new RFC4180Parser();
reader = new CSVReaderBuilder(isr)
        .withCSVParser(parser)
        .withMultilineLimit(1)
        .build();
I've tried using withMultilineLimit(1), and this causes opencsv to correctly highlight the problem and throw an exception, but how do I get opencsv to then continue parsing the file from the next line after the error?
Many thanks,
Tim
Are quote characters escaped elsewhere in the file? If not, simply redefine the quote character as null. Feel free to ask for clarification if I'm being too terse.
Hi Andrew,
I've attached a sample of the data (the data is all in the public domain). You'll see every field is surrounded in double quotes, which is handled correctly by opencsv i.e. they don't then appear in the field values.
The problem is when I get a field that should be:
"SK6 2DY"
but due to user error is entered as:
"SK6 "DY"
it then throws out subsequent fields.
If I redefine the quote character, presumably my fields would then come back with quotes around them?
Ideally I just want opencsv to process the file line by line. If it hits a line separator while in the middle of a double-quoted field, i.e. there is a quote mismatch, it should flag that line as an error and give me some way of continuing parsing from the line after the failure.
Hi Andrew,
Looking through the opencsv source, I've found a work around. If I use:
withMultilineLimit(1)
and catch the
CsvMultilineLimitBrokenException
, I can then carry on parsing the file. It does feel like a bit of a kludge, though. I guess the other approach is to write my own parser based on RFC4180Parser that intentionally ignores multi-line fields.
I did contemplate setting the quote character to null and manually removing the quotes around the fields but I suspect there may be cases where I've got commas in the fields and therefore need the quotes to be operational.
Just wondering whether there's a cleaner way of doing it? Any tips greatly appreciated.
Cheers,
Tim
Hi Andrew,
I've found a cleaner approach to this:
Quite inventive, effective, efficient. I like it. Since you are now fairly well served with a good workaround, I will leave the rest up to Scott, since this is really more his part of the code. I don't know if he will want to change anything in the official code base to accommodate your unusual situation or not. I'm sure, though, he will be proud that our code is extensible enough for you to do this, and that you thought to do it (and so well). :)
Hi Andrew,
Many thanks for all the effort you and Scott put in to the project. It's very much appreciated.
All the best,
Tim
It is an inventive solution. I like to see developers extend the code for their own ends.
My only concern is how this works when it hits the malformed section of a line.
If you want to go old school, the original CSVParser has an option, ignoreQuotes, that treats the quote as any other character. I tried it with the code below and the test passed.
Hi Scott,
With the extended parser, when it hits a malformed line it parses as far as the line break and then returns a string array with fewer entries in it than normal i.e. the field that is malformed and any subsequent fields are just ignored.
So in my case a well formed line has 55 fields. If I get any less than that I know the line is malformed and I can flag it as an error and then carry on processing. (I use the header row to determine the correct number of fields and just compare that to the size of the array returned, so if any additional fields are added at any point in the future, the code will work fine.)
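The field-count check described above can be sketched as follows (a minimal self-contained illustration; the names are hypothetical, and in the real code the expected count would come from the header row rather than being hard-coded):

```java
import java.util.ArrayList;
import java.util.List;

public class FieldCountCheck {
    // Returns true when a parsed row has the expected number of fields.
    // With the extended parser, a malformed line comes back short, so a
    // simple length comparison is enough to flag it.
    static boolean isWellFormed(String[] row, int expectedFields) {
        return row != null && row.length == expectedFields;
    }

    public static void main(String[] args) {
        int expected = 3; // in the file above this would be 55, taken from the header row
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"SK6 2DY", "a", "b"}); // well formed
        rows.add(new String[] {"SK6 "});              // truncated by the bad quote
        int lineNo = 0;
        for (String[] row : rows) {
            lineNo++;
            if (!isWellFormed(row, expected)) {
                System.err.println("Malformed record at line " + lineNo + ", skipping");
                continue; // carry on with the rest of the file
            }
            // ...process the valid record...
        }
    }
}
```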
The file I'm parsing has 5 million rows, so it's quite important for me that one malformed line doesn't abort the whole file. In my case a multi-line field is never valid.
If I used the ignoreQuotes option, I'm presuming I'd then hit an issue if any field then contained a comma? Fields containing commas would be valid for my file.
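That presumption is easy to demonstrate outside opencsv: once quotes carry no grouping meaning, any comma inside a field splits it. (This sketch uses a plain String.split, not opencsv's parser, and the sample values are made up purely for illustration.)

```java
public class CommaSplitDemo {
    public static void main(String[] args) {
        // Two logical fields, the second containing a comma.
        String line = "\"SK6 2DY\",\"High Lane, Stockport\"";
        // Treating the quote as an ordinary character and splitting on commas
        // yields three tokens instead of two: the quoted field is broken apart.
        String[] tokens = line.split(",");
        System.out.println(tokens.length); // 3
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```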
Ideally the incoming file should never have an unescaped quote, but unfortunately it looks like very occasionally this does happen, i.e. 1 line in 5 million, and I was trying to find a neat way of handling it.
Cheers,
Tim
And that you did! Like I said, I enjoy seeing people extending the code.
As far as your question about commas goes, I had to add a NULL escape character to basically tell the CSVParser that there was no escape character:
CSVParserBuilder parserBuilder = new CSVParserBuilder()
        .withIgnoreQuotations(true)
        .withEscapeChar(ICSVParser.NULL_CHARACTER);
And with that, the record I put the extra comma into came back with an extra field. So your extension turned out better there, especially since you know the number of columns (55) and can just print out an error log of which lines were malformed to be looked at later.
Good work!
Scott :)
This ticket looks resolved to me. Feel free to object if that's not the case.
Thanks for this solution. By the way, can we extend CSVParser and override this method like you did, while changing some fields like the escape char?
The fields are final and only the builder can set them.
Thanks!
Of course you can, though I would recommend extending the CSVParserBuilder too, so that you have a builder for your own extended class as well.
We went the builder route because, after several years and many, many modifications, we realized that our classes had 8-9 constructors, some with a dozen parameters, so we could maintain backwards compatibility while still allowing for the new features being requested. It just got to be too much, so we created a Factory/Builder class and never looked back. It makes adding new features soooooo much easier.
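The final-fields-plus-builder pattern described above, with an extended class and its own extended builder, can be sketched in miniature like this (all names here are hypothetical stand-ins, not opencsv's real classes):

```java
// Hypothetical mini version of a parser/builder pair, for illustration only.
class Parser {
    final char escapeChar; // final: only a builder can choose it

    Parser(char escapeChar) {
        this.escapeChar = escapeChar;
    }
}

class ParserBuilder {
    protected char escapeChar = '\\';

    ParserBuilder withEscapeChar(char c) {
        this.escapeChar = c;
        return this;
    }

    Parser build() {
        return new Parser(escapeChar);
    }
}

// The extension: a subclass with its own builder, as recommended above.
class LenientParser extends Parser {
    LenientParser(char escapeChar) {
        super(escapeChar);
    }
}

class LenientParserBuilder extends ParserBuilder {
    @Override
    LenientParser build() {
        // The extended builder reuses the inherited setters but
        // constructs the extended class (covariant return type).
        return new LenientParser(escapeChar);
    }
}

public class BuilderDemo {
    public static void main(String[] args) {
        Parser p = new LenientParserBuilder().withEscapeChar('\0').build();
        System.out.println(p.escapeChar == '\0'); // true
    }
}
```

The key point is that the subclass never touches the final fields directly; it only passes values up through the superclass constructor, which is why extending the builder alongside the class keeps everything settable.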