
#270 RFC4180Parser is unpredictable with wrongly quoted csv

v1.0 (example)
open
None
5
2026-02-22
2026-02-18
Papegaaij
No

RFC 4180 is quite clear about how fields must be quoted: a field can either be quoted or unquoted. An unquoted field is not allowed to contain double quotes. A quoted field is allowed to contain double quotes, but only if each is escaped by another double quote.
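To illustrate with data from the attached tests: the line below without a leading space is fully compliant, while in the second line the space before the second field makes that field unquoted, so its embedded quotes violate the rules:

```
"val1","{x:""123"", y: ""456""}",""     compliant: every field quoted, inner quotes doubled
"val1", "{x:""123"", y: ""456""}",""    malformed: the space makes field 2 unquoted
```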

The current parsing tries to be lenient about quoting errors, which I do understand given that many CSV documents are incorrectly quoted. However, we would like the parser to be strict and reject any incorrectly quoted document. The reason is that incorrect quoting can lead to unpredictable behavior, such as fields being split where they should not be.

I've created a patch that introduces strict quotes in the RFC4180Parser (and Builder). This option defaults to disabled, keeping the old behavior. If switched on, the parser will reject any document containing quotes that do not conform to the specification: unquoted fields may not contain quotes, and inside a quoted field all quotes must be escaped.
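The rule itself is simple to state in code. Below is a minimal, self-contained sketch of the strict check (class and method names are hypothetical illustrations, not the actual patch):

```java
// Hypothetical sketch of a strict RFC 4180 field-quoting check.
// The real patch lives in RFC4180Parser; names here are illustrative.
public class StrictQuoteCheck {

    // Returns null if the raw field text is legally quoted per RFC 4180,
    // otherwise an error message in the style of the tests below.
    public static String validateField(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            String inner = raw.substring(1, raw.length() - 1);
            // Inside a quoted field, every quote must be doubled.
            for (int i = 0; i < inner.length(); i++) {
                if (inner.charAt(i) == '"') {
                    if (i + 1 < inner.length() && inner.charAt(i + 1) == '"') {
                        i++; // skip the escaping quote
                    } else {
                        return "The field [" + inner
                                + "] contains a single, unescaped double quote.";
                    }
                }
            }
            return null;
        }
        // An unquoted field may not contain quotes at all.
        if (raw.indexOf('"') >= 0) {
            return "The unquoted field [" + raw + "] contains quotes.";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(validateField("\"a\"\"b\""));     // null: correctly escaped
        System.out.println(validateField("quote at end\"")); // error: quote in unquoted field
    }
}
```

Note this checks a single already-split field; the actual parser has to apply the same rules while still deciding where fields begin and end.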

1 Attachments

Discussion

  • Scott Conway

    Scott Conway - 2026-02-20
    • assigned_to: Scott Conway
     
  • Scott Conway

    Scott Conway - 2026-02-20

OK - I will try to take a look at this, but it will take some time, as work and life are working together to take up my free time.

I want to thank you for sending tests - the RFC4180ParserSpec.groovy additions. I will be looking at those to evaluate your claim that the RFC4180Parser is not compliant with the RFC 4180 specification. Given the parser's age (nine years) and the amount of use it has had, my main concern is that this is only the second compliance "bug" I can remember... and they were both this year! The other was that the CRLF stated in RFC 4180 should also cover a bare CR, as old Mac files used that as a line ending. So forgive my initial skepticism while I work through your first test.

If you are in a time crunch, or we cannot come to an agreement about the specification, then I give you the same suggestion as in Bug #267: create your own parser that extends the RFC4180Parser and adds withStrictQuotes and the code that processes it, then create your own builder that extends the RFC4180ParserBuilder and adds withStrictQuotes. The only drawback is that you will have to hard-code the exception messages, but that will give you a parser that works the way you want.

     
  • Scott Conway

    Scott Conway - 2026-02-20

Okay - I took a quick copy of your first test and modified it to use the existing parser, so it always expects the lenient result, and I now see the issues you are talking about. I am shocked this has stayed in so long without someone bringing it up... well, maybe not - most likely all the data before was programmatically generated to the RFC 4180 specification, so no one ever tried a hand-written or partially malformed csv file. But it does break both rules 6 and 7, so I will need to do some soul searching on the best way to handle this, but I will probably go your route.

That said, it will take some time, as I do want to do some additional testing to feel good about why this took so long to find... plus the job and home-life demands.

    Thank you for the tests! A good test explains far more than pages of explanation and your first test nailed it!

    Scott :)

P.S. Just to show, here is the modified first test.

    @Unroll
    def 'no strict quote parsing of #input should yield error #expectedError'(String input, String expectedError, String[] lenientResult) {
        given:
        RFC4180ParserBuilder builder = new RFC4180ParserBuilder()

        when:
        CSVReader csvReader = new CSVReaderBuilder(new StringReader(input)).withCSVParser(builder.build()).build()

        then:
        csvReader.readAll().get(0) == lenientResult

        where:
        input                                      | expectedError                                                                        | lenientResult
        '''"val1", "{x:""123"", y: ""456""}",""''' | '''The unquoted field [ "{x:""123""] contains quotes.'''                             | ['''val1''', ''' "{x:"123"''', ''' y: "456"}"''', ""]
        '''quote at end",""'''                     | '''The unquoted field [quote at end"] contains quotes.'''                            | ['''quote at end"''', ""]
        '''""double quote at start"",""'''         | '''The field ["double quote at start"] contains a single, unescaped double quote.''' | ['''"double quote at start"''', ""]
        '''"single loose " quote"",""'''           | '''The field [single loose " quote"] contains a single, unescaped double quote.'''   | ['''single loose " quote"''', ""]
    }
    
     
  • Papegaaij

    Papegaaij - 2026-02-20

    Hi Scott,
Thanks for taking the time to look into this, and for all the time you've already spent on this library. I guess nobody tried to put a space before a quoted field until a colleague of mine did :) It took us quite some time to find out what was actually happening inside the parser, but once we found the cause, it was easy to construct some failing tests. RFC 4180 is actually quite strict about quotes, but I guess most CSV files are generated and either quote correctly or don't quote at all. There was already a bug about this before: #157. RFC4180ParserSpec contains a test case for it.

Tests are great for showing these kinds of issues, especially the Spock ones. It's a nice framework to work with; we use it in our application as well.

Take your time to get this properly resolved; for us it's not that time-critical. We do want this resolved eventually, but I expect most people using the CSV import will use generated CSV files, which will probably be RFC 4180 compliant. For non-compliant files it is still very likely you will get an error, but chances are the error is a bit vague or not entirely correct. For example, the row might end up with more columns than expected and be skipped for that reason.

    Best regards,
    Emond

     
  • Scott Conway

    Scott Conway - 2026-02-22

Sigh - I had a little free time and wanted to see if it was possible to use the validators we added a few years back in response to all the requests for custom processing and validation - https://opencsv.sourceforge.net/#processors_and_validators. I knew it would not work, but wanted to try the LineValidator.

The issue is that the LineValidator validates a single line read by Java - not a single record/row. So a string with a newline inside it requires two Java reads, and thus the validator is called twice, once per segment, rather than once with the entire record. That is because the reader builds the resulting array incrementally to reduce memory overhead.
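A tiny sketch of the underlying effect (illustrative only; it uses a plain BufferedReader rather than opencsv, and the class name is made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows why a line-based validator sees two segments
// for one CSV record whose quoted field contains a newline.
public class MultiLineRecord {

    // Collect the physical lines a reader would hand to a LineValidator.
    public static List<String> physicalLines(String csv) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(csv));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // One logical CSV record, but the embedded newline splits it into
        // two physical lines - and therefore two validator calls.
        String record = "\"first\npart\",second";
        System.out.println(physicalLines(record));
    }
}
```

Each element of the returned list is only a fragment of the record, so neither call to the validator ever sees a string that parses as valid CSV on its own.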

    Just giving a heads up.

    :)

     

