
#119 Incorrect parsing of last quoted element with trailing text

v1.0 (example)
closed-works-for-me
None
5
2015-11-13
2015-11-04
No

In CSV input where quotes are used and the last element on the line is followed by trailing text or whitespace, a quote character ('quotechar') incorrectly appears in the parsed result.

A test case addition to CSVParserTest that exhibits the behaviour:

@Test
public void testQuotedLastElementWithTrailingText() throws Exception {
    assertEquals(" ", csvParser.parseLine("a,\"\" ")[1]);                 // [a,"" ]
    assertEquals("b ", csvParser.parseLine("a,\"b\" ")[1]);               // [a,"b" ]
    assertEquals("b cde", csvParser.parseLine("a,\"b\" cde")[1]);         // [a,"b" cde]
    assertEquals("b\"c\" de", csvParser.parseLine("a,b\"c\" de")[1] );    // [a,b"c" de] - this currently works OK
}

Discussion

  • Scott Conway

    Scott Conway - 2015-11-13

    Sorry it took so long to respond - was busy with work, life, and the 3.6 release.

    I am making two assumptions here. The first is that you are using the latest opencsv (though it should not matter: the last bug fix for CSVParser was in 2011, and the rest were checkstyle and findbugs changes that did not modify tests). The second is that the left-hand side of each assertion (the expected value) is what you expected to see.

    Breaking it down, this is what I saw.

    assertEquals(" ", csvParser.parseLine("a,\"\" ")[1]);
    
    org.junit.ComparisonFailure: 
    Expected : 
    Actual   :"
    

    This is correct, because what you are parsing after the comma is two quotes back to back, and by the CSV standards: "Embedded double quote characters may then be represented by a pair of consecutive double quotes, or by prefixing an escape character such as a backslash."

    To get what you want, either remove the quotes

    assertEquals(" ", csvParser.parseLine("a, ")[1]);
    

    or put the space between the two quotes, which is what you are expecting.

    assertEquals(" ", csvParser.parseLine("a,\" \"")[1]);
    

    The next one was

    assertEquals("b ", csvParser.parseLine("a,\"b\" ")[1]);               // [a,"b" ]
    
    org.junit.ComparisonFailure: 
    Expected :b 
    Actual   :b"
    

    The actual value here is also correct: the field really is three characters (b, then a double quote, then a space), which I can show by adding the following (which passes):

    assertEquals(3, csvParser.parseLine("a,\"b\" ")[1].length());
    

    Now I am starting to understand part of your confusion about opencsv. You are expecting a double quote to be ignored when it is not the last character of the element, and that is not the case. For processing quotations, opencsv has an option called strictQuotes. If it is set to false (the default), a field runs from one delimiter to the next, and a quotation mark that is not the first or last quote character in the field is considered part of the string. Hence the result you are seeing. If strictQuotes is set to true, a field ends at the quote and everything up to the next delimiter is ignored. By creating a CSVParser with strictQuotes on:

        CSVParserBuilder builder = new CSVParserBuilder();
        CSVParser parser = builder.withStrictQuotes(true).build();
    

    I will see the following test pass:

    assertEquals(1, parser.parseLine("a,\"b\" ")[1].length());
    

    But there is no way to give you what you want, which is to ignore a single dangling quote in the middle of a field. It is either part of the data or the end of the data.
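To make the two modes concrete, here is a minimal, self-contained sketch of the behaviour described above. It is a toy parser written for illustration only, not opencsv's actual implementation: the class name QuoteModesSketch and its parseLine(String, boolean) helper are hypothetical, the separator and quote character are hard-coded to ',' and '"', and escape characters and multi-line fields are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative toy parser (an assumption for this example, NOT opencsv code).
final class QuoteModesSketch {

    static String[] parseLine(String line, boolean strictQuotes) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        int fieldStart = 0; // index where the current field begins

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote inside a quoted section is one literal quote.
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');
                    i++;
                    continue;
                }
                boolean atFieldStart = (i == fieldStart);
                boolean atFieldEnd = (i + 1 == line.length()) || line.charAt(i + 1) == ',';
                inQuotes = !inQuotes;
                // Lenient mode: a quote in the middle of a field is kept as data.
                if (!strictQuotes && !atFieldStart && !atFieldEnd) {
                    field.append('"');
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
                fieldStart = i + 1;
            } else if (!strictQuotes || inQuotes) {
                // Strict mode drops any text that lies outside the quotes.
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String line = "a,\"b\" ";
        System.out.println(parseLine(line, false)[1].length()); // 3: b, quote, space
        System.out.println(parseLine(line, true)[1].length());  // 1: b
    }
}
```

In this sketch the closing quote of "b" is followed by a space rather than a delimiter, so the lenient mode treats it as data, while the strict mode stops collecting at the quote and discards the trailing space.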

    Hope that helps.

     
    • Jan Gutvirth

      Jan Gutvirth - 2015-11-13

      Scott,
      Thank you for your analysis. Actually I did not know what to expect as the result of my examples - I either expected both double quotes to be there:
      assertEquals("\"b\" ", csvParser.parseLine("a,\"b\" ")[1]);
      or none to be there:
      assertEquals("b ", csvParser.parseLine("a,\"b\" ")[1]);
      but not the first to be taken as an escaping quote (and swallowed) and the second to be normal quote (and appended to the result).

      I then read the grammar in one of the CSV specs: https://tools.ietf.org/html/rfc4180
      There, an escaped field is defined as:
      escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
      and a non-escaped field is defined as:
      non-escaped = *TEXTDATA
      where TEXTDATA may not contain a double quote. So it seems that my examples are not valid CSV, and it is difficult to say what should happen.
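The grammar above can be checked mechanically. The following is a small sketch that translates the two RFC 4180 rules into regular expressions and tests a single field against them. The class name Rfc4180FieldCheck and the method isValidField are hypothetical names chosen for this example, and TEXTDATA is taken from the RFC as %x20-21 / %x23-2B / %x2D-7E (printable ASCII without the double quote and the comma).

```java
import java.util.regex.Pattern;

// Illustrative helper (hypothetical, not part of opencsv).
final class Rfc4180FieldCheck {

    // TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
    private static final String TEXTDATA = "[\\x20-\\x21\\x23-\\x2B\\x2D-\\x7E]";

    // escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
    private static final Pattern ESCAPED =
            Pattern.compile("\"(?:" + TEXTDATA + "|,|\\r|\\n|\"\")*\"");

    // non-escaped = *TEXTDATA
    private static final Pattern NON_ESCAPED = Pattern.compile(TEXTDATA + "*");

    // True if the whole field matches either rule of the grammar.
    static boolean isValidField(String field) {
        return ESCAPED.matcher(field).matches()
                || NON_ESCAPED.matcher(field).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidField("\"b\""));   // true: a properly quoted field
        System.out.println(isValidField("\"b\" "));  // false: text after the closing quote
        System.out.println(isValidField("b\"c\" de")); // false: quote inside an unquoted field
    }
}
```

Under this reading, the second fields of all four test inputs in the ticket fall outside the grammar, which is consistent with the conclusion that the expected result is not defined by the spec.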

      Again, thanks for your time.
      Regards
      Jan

       
  • Scott Conway

    Scott Conway - 2015-11-13
    • status: open --> closed-works-for-me
    • assigned_to: Scott Conway
     
