Menu

#229 CSV input with comma between escaped quotes produces "Number of data fields does not match number of headers" exception

v1.0 (example)
closed-works-for-me
None
5
2021-11-30
2021-11-29
Tim Mousaw
No

I have an input CSV file which has escaped double quotes, but also contains a comma within the escaped double quotes. This causes OpenCSV to think there are more fields than headers. I believe it is valid CSV to have a comma in between escaped quotes (at least it passes online CSV parsers). I have attached the source files and an input file that reproduces generically how I am using the library.

1 Attachments

Discussion

  • Tim Mousaw

    Tim Mousaw - 2021-11-29

    Forgot to mention that I'm using the latest version of OpenCSV (5.5.2) as of the writing of this bug.

     
  • Scott Conway

    Scott Conway - 2021-11-30

    The problem is your last record has too many quotes. The CSVParser takes either two double quotes as an escaped quote or an escaped quote - but not both.

    When I change your last line from

    Example of error where there are \""Escaped quotes that has a comma, and then it blows up\""

    to

    Example of error where there are \"Escaped quotes that has a comma, and then it blows up\"

    it works perfectly.

    Hope that helps.

     
  • Tim Mousaw

    Tim Mousaw - 2021-11-30

    @sconway - Thanks for your reply. Interesting that a CSV validator wouldn't point this out. This is the format that is coming from an external tool (Prisma Cloud). Knowing this, I can fix it before running it through my tool. I'm still a bit curious why it would validate as valid CSV. I'll have to check the rules the online validator is using to evaluate the CSV.

    The other thing I am curious about is why Excel can open this document and interpret it how one would "expect" (I use that term loosely). It's possible Excel does this wrong as well, but it seems that if Excel can interpret this file correctly, there must be a way to get OpenCSV to do the same thing.

     

    Last edit: Tim Mousaw 2021-11-30
  • Scott Conway

    Scott Conway - 2021-11-30

    You can try setting up the parser with strictQuotes, flipping the value to see what you get. But for the typical CSVParser if you have quoted data then it expects the total number of close quotes to be a net of zero (even number - escaped quotes don't count) and once that is reached then the field is over with and it looks for the next separator. That was why the \"" was bad and should have been either "" or \" or even \"\" because once it hit the second, unescaped, double quote the CSVParser realized it was at the end of the field, even though it wasn't.

    There is no fix for this that would not break existing users. Thus I am not even willing to entertain THAT idea. That said opencsv has two parsers to choose from: the original CSVParser, and the RFC4180Parser which uses the RFC4180 standard (which came about AFTER opencsv was created). Though I am pretty sure neither would handle the \"". BUT if you can find out the logic used by excel to parse csv data you could create a ExcelCSVParser that implements the ICSVParser then create your CSVToBean with that.

    And actually you can forget the strict quotes. I was playing with that while typing the reply and its a no go. Here is the unit tests I have been playing with to prove things out.

    @Test
    public void bug229Test() throws IOException {
        CSVParserBuilder builder = new CSVParserBuilder();
        ICSVParser parser = builder.build();
        String testString = "2,\"Example of error where there are \\\"Escaped quotes that has a comma, and then it blows up\\\"\"";
        // "one","t""w"o,"three"
        String[] nextLine = parser.parseLineMulti(testString);
    
        assertEquals(2, nextLine.length);
    
        assertEquals("2", nextLine[0]);
        assertEquals("Example of error where there are \"Escaped quotes that has a comma, and then it blows up\"", nextLine[1]);
    }
    
    @Test
    public void bug229TestBadString() throws IOException {
        CSVParserBuilder builder = new CSVParserBuilder();
        ICSVParser parser = builder.withStrictQuotes(true).build();
        String testString = "\"2\",\"Example of error where there are \\\"\"Escaped quotes that has a comma, and then it blows up\\\"\"\"";
        // "one","t""w"o,"three"
        String[] nextLine = parser.parseLineMulti(testString);
    
        assertEquals(2, nextLine.length);
    
        System.out.println(nextLine[0]);
        System.out.println(nextLine[1]);
        assertEquals("2", nextLine[0]);
        assertEquals("Example of error where there are \"Escaped quotes that has a comma, and then it blows up\"", nextLine[1]);
    }
    

    In the strict quotes the data field stops hard when it comes to what it considers the last quote (ie even number has been reached).

    Actually what I think you want is the RFC4180Parser because it treats the slash as a normal character, not an escape character. So your output would be

    Example of error where there are \"Escaped quotes that has a comma, and then it blows up\"

     
  • Tim Mousaw

    Tim Mousaw - 2021-11-30

    @sconway That's perfect! RFC4180Parser is indeed what I want and worked without having to modify the output from Prisma Cloud! Hopefully, this output continues to conform to that standard. Thanks a bunch! I'm pretty sure you can close this bug as working to spec.

    Thanks again for your help!

     
  • Scott Conway

    Scott Conway - 2021-11-30

    My pleasure :)

     
  • Scott Conway

    Scott Conway - 2021-11-30
    • status: open --> closed-works-for-me
    • assigned_to: Scott Conway
     

Log in to post a comment.

MongoDB Logo MongoDB