#76 Opencsv parsing error for double quotes

open
Scott Conway
None
9
2014-08-21
2011-09-01
kazem
No

I am using opencsv 2.3.
It handles parsing over 600,000 records in a csv file all fine except for the first record below:

"id","email","profile_id","first_name","last_name"
"475592","rbst218@yahoo.com","bc1er1163","\""CHia Sia Ta",""
"12345",''kazem.naderi@tt.com","bc234a","kazem","naderi"

It seems opencsv gets confused with the value of first_name (disymmetric double quotes at the beginnnig of the value)

You can save the above two lines in a file let's say badProfiles.csv and then using the code below try to parse the record. You will see that it cannot parse it correctly

public static List<UpakneeUserBean> parseCvSIntoBeans2(String filePath)
throws FileNotFoundException {
CSVReader reader = new CSVReader(new FileReader(filePath));
HeaderColumnNameMappingStrategy<UserBean> strat = new HeaderColumnNameMappingStrategy<UserBean>();
strat.setType(UserBean.class);
CsvToBean<UserBean> csv = new CsvToBean<UserBean>();
return csv.parse(strat, reader);
}

I have attached the UserBean.java file to this bug report.

Thanks for your help

Discussion

  • kazem
    kazem
    2011-09-01

     
    Attachments
  • kazem
    kazem
    2011-09-01

    I changed the priority because this error breaks our scheduled job to parse the CSV file and insert the result into DB. Currently we have to by pass records similar to the one reported in this bug report entry

     
  • kazem
    kazem
    2011-09-01

    • priority: 5 --> 8
     
  • Scott Conway
    Scott Conway
    2011-09-06

    Please check out the latest trunk version (106) and in the Integration test there is now an issue3402853. Please look at the data in user.csv and the unit test and tell me what should it be?

    There were two problems with the data you put in the bug. The first was a simple typo - for kazem email address you accidentally used two single quotes instead of the double quote.

    But for the first line you need to tell me what you are expecting the output to be because the string you provided is definitly an error.

    If you want the name to be "CHia Sia Ta then remove the extranous double quote "\"CHia Sia Ta" if you want it to be ""CHia Sia Ta then you need to escape both quotes "\"\"CHia Sia Ta"

     
  • kazem
    kazem
    2011-09-07

    Hello,

    Thanks for looking into this. Sorry that the second line had a typo. Before I continue I'd like to tell you that I am impresses with the way you stick to Test Driven Development. I am a fan.

    Getting back to the issue, I can see that you corrected it in user.csv.
    The error comes from the first line. The first name in our system is \"CHia Sia Ta (strange for being a first name). The backslash and the double quote are part of the first name.

    In the csv file the first name is specified as: ,"\""CHia Sia Ta",
    As you can see the double quote is escaped with another double quote and I believe this is correct.
    This is what we have as the user.csv:

    "id","email","profile_id","first_name","last_name"
    "475592","rbst218@yahoo.com","bc1er1163","\""CHia Sia Ta",""
    "12345","kazem.naderi@tt.com","bc234a","kazem","naderi"

    The Parser fails on parsing the first name and your unit test fails on:

    assertEquals(2, list.size());

    I appreciate your help on this
    Kazem

     
  • kazem
    kazem
    2011-09-07

    • assigned_to: nobody --> sconway
     
  • Scott Conway
    Scott Conway
    2011-09-08

    Ahh in that case the name should be "\\\"CHia Sia Ta". You can check out the updated file from subversion to see the test. By default the backslash is the escape character so you use a double backslash to create a single backslash for output then a \" to get the ". because your input had "" opencsv incorrectly assumed that you finished one filed and started another - this caused it to think you had one line (with a field containing a newline character) instead of two and that is why the assert on the number of lines failed.

    There is another defect in the system about allowing a double quote to escape another double quote regardless of escape character but I have not had a chance to look at it yet.

    BTW thanks for the compliment. TDD has saved me many times and I am a firm believer in it.

    Let me know if this works for you and I will close out the defect.

    :)

     
  • kazem
    kazem
    2011-09-08

    Hi,
    I do not think what you currently have in your system is a bug as a double quote used to escape another double quote. By default your parser allows using double quotes as text qualifier. In these cases usually double quotes within a field get escape with another double quote (As Excel does the same). Please see the references below:

    http://creativyst.com/Doc/Articles/CSV/CSV01.htm
    Fields that contain double quote characters must be surounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes.
    So, John "Da Man" Doe would convert to "John ""Da Man""",Doe, 120 any st.,...

    http://www.csvreader.com/csv_format.php
    In an Excel escaped CSV file, in fields containing a double quote, the double quote must be escaped by replacing the single double quote with two double quotes.
    unix escaping VS Excel Escaping

    http://en.wikipedia.org/wiki/Comma-separated_values
    More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines);

    Going back to the issue we are dealing with I think still the parser has a bug. The same way it accepts double quotes as an escape char for another double quote it should be able to parse the user.CSV file successfully with out any need to use a backslash.

    Thanks again for you help on this

    Best Regards
    Kazem

     
  • kazem
    kazem
    2011-09-28

    Hello,

    Could you please let me know when you can fix this issue?

    Thanks
    Kazem

     
  • kazem
    kazem
    2011-09-28

    • priority: 8 --> 9