Hi,
Apologies if this has been covered elsewhere or if this is not the right place to ask the question, but I've not been able to find a solution to the problem I'm encountering and was hoping you might be able to point me in the right direction.
I've been using opencsv for a while and it's a great solution - thanks for all your hard work. I'm parsing a very large CSV file, where each line is an individual record and none of the fields contain a line separator.
The problem I'm encountering is that occasionally I get a malformed entry where a field contains a double quote that isn't correctly escaped. In this case I just want to exclude the one line and continue parsing the remaining lines i.e. I want the line terminator to take precedence over the incorrectly quoted field. However I can't seem to find a way to achieve this.
My code currently looks like:
fis = new FileInputStream(theFile);
isr = new InputStreamReader(fis, encoding);
ICSVParser parser = new RFC4180Parser();
reader = new CSVReaderBuilder(isr)
        .withCSVParser(parser)
        .withMultilineLimit(1)
        .build();
I've tried using withMultilineLimit(1), and this causes opencsv to correctly highlight the problem and throw an exception, but how do I get opencsv to then continue parsing the file from the next line after the error?
Many thanks,
Tim
Are quote characters escaped elsewhere in the file? If not, simply redefine the quote character as null. Feel free to ask for clarification if I'm being too terse.
Hi Andrew,
I've attached a sample of the data (the data is all in the public domain). You'll see every field is surrounded in double quotes, which is handled correctly by opencsv i.e. they don't then appear in the field values.
The problem is when I get a field that should be:
"SK6 2DY"
but due to user error is entered as:
"SK6 "DY"
it then throws out subsequent fields.
If I redefine the quote character, presumably my fields would then come back with quotes around them?
Ideally I just want opencsv to process the file line by line. If it hits a line separator while in the middle of a double-quoted field, i.e. there is a quote mismatch, it should flag that line as an error and give me some way of continuing parsing from the line after the failure.
Hi Andrew,
Looking through the opencsv source, I've found a work around. If I use:
withMultilineLimit(1)
and catch the
CsvMultilineLimitBrokenException
, I can then carry on parsing the file. It does feel like a bit of a kludge, though. I guess the other approach is to write my own parser based on RFC4180Parser that intentionally ignores multi-line fields.
I did contemplate setting the quote character to null and manually removing the quotes around the fields but I suspect there may be cases where I've got commas in the fields and therefore need the quotes to be operational.
Just wondering whether there's a cleaner way of doing it? Any tips greatly appreciated.
Cheers,
Tim
Hi Andrew,
I've found a cleaner approach to this:
Quite inventive, effective, efficient. I like it. Since you are now fairly well served with a good workaround, I will leave the rest up to Scott, since this is really more his part of the code. I don't know if he will want to change anything in the official code base to accommodate your unusual situation or not. I'm sure, though, he will be proud that our code is extensible enough for you to do this, and that you thought to do it (and so well). :)
Hi Andrew,
Many thanks for all the effort you and Scott put in to the project. It's very much appreciated.
All the best,
Tim
It is an inventive solution. I like to see developers extend the code for their own ends.
My only concern is how this works when it hits the malformed section of a line.
If you want to go old school, the original CSVParser has an option, ignoreQuotes, that treats the quote as any other character. I tried it with the code below and the test passed.
Hi Scott,
With the extended parser, when it hits a malformed line it parses as far as the line break and then returns a string array with fewer entries in it than normal i.e. the field that is malformed and any subsequent fields are just ignored.
So in my case a well formed line has 55 fields. If I get any less than that I know the line is malformed and I can flag it as an error and then carry on processing. (I use the header row to determine the correct number of fields and just compare that to the size of the array returned, so if any additional fields are added at any point in the future, the code will work fine.)
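The field-count check described above can be sketched as follows (a minimal self-contained illustration; the names are hypothetical, and in the real code the expected count would come from the header row rather than being hard-coded):

```java
import java.util.ArrayList;
import java.util.List;

public class FieldCountCheck {
    // Returns true when a parsed row has the expected number of fields.
    // With the extended parser, a malformed line comes back short, so a
    // simple length comparison is enough to flag it.
    static boolean isWellFormed(String[] row, int expectedFields) {
        return row != null && row.length == expectedFields;
    }

    public static void main(String[] args) {
        int expected = 3; // in the file above this would be 55, taken from the header row
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"SK6 2DY", "a", "b"}); // well formed
        rows.add(new String[] {"SK6 "});              // truncated by the bad quote
        int lineNo = 0;
        for (String[] row : rows) {
            lineNo++;
            if (!isWellFormed(row, expected)) {
                System.err.println("Malformed record at line " + lineNo + ", skipping");
                continue; // carry on with the rest of the file
            }
            // ...process the valid record...
        }
    }
}
```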
The file I'm parsing has 5 million rows, so it's quite important for me that one malformed line doesn't abort the whole file. In my case a multi-line field is never valid.
If I used the ignoreQuotes option, I'm presuming I'd then hit an issue if any field then contained a comma? Fields containing commas would be valid for my file.
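That presumption is easy to demonstrate outside opencsv: once quotes carry no grouping meaning, any comma inside a field splits it. (This sketch uses a plain String.split, not opencsv's parser, and the sample values are made up purely for illustration.)

```java
public class CommaSplitDemo {
    public static void main(String[] args) {
        // Two logical fields, the second containing a comma.
        String line = "\"SK6 2DY\",\"High Lane, Stockport\"";
        // Treating the quote as an ordinary character and splitting on commas
        // yields three tokens instead of two: the quoted field is broken apart.
        String[] tokens = line.split(",");
        System.out.println(tokens.length); // 3
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```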
Ideally the incoming file should never have an unescaped quote, but unfortunately it looks like very occasionally this does happen, i.e. 1 line in 5 million, and I was trying to find a neat way of handling it.
Cheers,
Tim
And that you did! Like I said, I enjoy seeing people extending the code.
As far as your question about commas goes, I had to add a NULL escape character to basically tell the CSVParser that there was no escape character:
CSVParserBuilder parserBuilder = new CSVParserBuilder()
        .withIgnoreQuotations(true)
        .withEscapeChar(ICSVParser.NULL_CHARACTER);
And with that, the record I put the extra comma into came back with an extra field. So your extension turned out better there, especially since you know the number of columns (55) and can just print out an error log of which lines were malformed to be looked at later.
Good work!
Scott :)
This ticket looks resolved to me. Feel free to object if that's not the case.
Thanks for this solution. By the way, can we extend CSVParser and override this method like you did, while changing some fields like the escape char?
The fields are final and only the builder can set them.
Thanks!
Of course you can, though I would recommend extending the CSVParserBuilder too, so that you have a builder for your own extended class as well.
We went the builder route because, after several years and many, many modifications, we realized that our classes had 8-9 constructors, some with a dozen parameters, so we could maintain backwards compatibility while still allowing for the new features being requested. It just got to be too much, so we created a Factory/Builder class and never looked back. It makes adding new features soooooo much easier.
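The final-fields-plus-builder pattern described above, with an extended class and its own extended builder, can be sketched in miniature like this (all names here are hypothetical stand-ins, not opencsv's real classes):

```java
// Hypothetical mini version of a parser/builder pair, for illustration only.
class Parser {
    final char escapeChar; // final: only a builder can choose it

    Parser(char escapeChar) {
        this.escapeChar = escapeChar;
    }
}

class ParserBuilder {
    protected char escapeChar = '\\';

    ParserBuilder withEscapeChar(char c) {
        this.escapeChar = c;
        return this;
    }

    Parser build() {
        return new Parser(escapeChar);
    }
}

// The extension: a subclass with its own builder, as recommended above.
class LenientParser extends Parser {
    LenientParser(char escapeChar) {
        super(escapeChar);
    }
}

class LenientParserBuilder extends ParserBuilder {
    @Override
    LenientParser build() {
        // The extended builder reuses the inherited setters but
        // constructs the extended class (covariant return type).
        return new LenientParser(escapeChar);
    }
}

public class BuilderDemo {
    public static void main(String[] args) {
        Parser p = new LenientParserBuilder().withEscapeChar('\0').build();
        System.out.println(p.escapeChar == '\0'); // true
    }
}
```

The key point is that the subclass never touches the final fields directly; it only passes values up through the superclass constructor, which is why extending the builder alongside the class keeps everything settable.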