Menu

#44 Multi-character separators and quote bugfixes

Unstable (example)
closed-rejected
5
2023-10-12
2013-01-04
FWennerdahl
No

While using OpenCSV in OpenRefine we found ourselves in need of support to parse CSV-files with multiple-character separators.

We created a patch for the latest trunk revision (112) where this is implemented. We also found some unexpected behaviours with quotation marks and have made improvements in how these are evaluated. For a full list of changes, please see the provided change log.

All previously existing tests passed after these changes with the exception of one (parseMultipleQuotes) where a stray newline character was expected in the test result which was not reflected in comments and which only passed due to what seems like buggy behaviour. Again, please see the change log for further info.

Discussion

  • FWennerdahl

    FWennerdahl - 2013-01-04

    Forgot to mention: we also added several new test cases for multi-character separators as well as tests displaying and validating the fixes for the bugs we experienced with quotes.

    Regards,
    Frank Wennerdahl
    Arcadelia AB

     
  • Tom Morris

    Tom Morris - 2013-01-08

    What is the related OpenRefine issue? I'm looking at the CSV support in OpenRefine now.

     
  • FWennerdahl

    FWennerdahl - 2013-01-09

    We haven't created an OpenRefine issue on github yet. Our hopes are to complete the changelog for OpenRefine this week after which we could create a pull request.

    Would you like us to include the patched version of OpenCSV in this pull request? Otherwise we could send you the OpenCSV patch so you can look it over and apply it yourself.

     
  • Tom Morris

    Tom Morris - 2013-01-10

    Thanks. I have a copy of the OpenCSV patch and I'm looking at integrating it into OpenRefine/opencsv on Github, but I'm not comfortable with the fact that it mixes changes in test cases, changes in behavior, and an API change all in a single patch (and I suspect the OpenCSV developers may feel the same). I think it's at least 3 patches: 1) test fixes/improvements, 2) behavior changes with associated test changes and 3) multi-character separator support with associated tests. There's also at least one incompatible change to the API - the type change for the separator constants.

    Ideally I'd like to see this stuff integrated here, but the project has been pretty quiet lately, so I'm willing to integrate on our fork if nothing happens here.

     
  • FWennerdahl

    FWennerdahl - 2013-01-18
     
  • FWennerdahl

    FWennerdahl - 2013-01-18

    Thanks for the pointers. We changed the CSVParser.DEFAULT_SEPARATOR back to a char to preserve the API.

    The fixes for quotation mark parsing were made after multi-char-separator support was implemented, so even if we separated the two features in two patches the fixes would still be dependent on the multi-char-support.

    Also, fixes for previously existing test cases depend on the fixes for quotation mark parsing, as the test cases passed only because quotes weren't handled correctly.

    Ideally the two features should have been developed independently, but as this is not the case we suggest treating it as one patch.

     
  • Tom Morris

    Tom Morris - 2013-01-27

    Very few projects will accept large patches which combine multiple things. It just makes them too hard to review and understand.

    This is mostly applied at https://github.com/OpenRefine/opencsv/tree/openrefine-2.6 but I changed some more stuff to preserve API compatibility and I split it into three pieces: 1) additional tests and improved tests which work with the existing code, 2) changes to the quote handling, and 3) multi-character separator support.

    One easy way to tell you've broken the API is if you have to modify a test to accommodate an API change. e.g. CSVParserBuilderTest

     
  • Andrew Rucker Jones

    • status: open --> closed-out-of-date
    • Group: --> Unstable (example)
     
  • Andrew Rucker Jones

    Cleaning up old issues. If this still needs attention, please open a new ticket.

     
  • Tom Morris

    Tom Morris - 2016-04-27

    There was never any feedback from any OpenCSV developer on this. Good idea? Bad idea? Good idea, but wrong implementation? Something else?

    Do you prefer the version at https://github.com/OpenRefine/opencsv/tree/openrefine-2.6

     
  • Andrew Rucker Jones

    • status: closed-out-of-date --> open
    • assigned_to: Andrew Rucker Jones
     
  • Andrew Rucker Jones

    Fair enough. I'll take a look at it.

     
  • Andrew Rucker Jones

    • status: open --> closed-rejected
     
  • Andrew Rucker Jones

    It took a mere 4.5 years, but I talked with the lead developer about your suggestion. I recognize and respect that you obviously have a use case, otherwise you wouldn't have gone to the trouble to create a patch. We feel, though, that very few if any of our other users would benefit from the feature. It would also be impossible to implement without breaking backward compatibility, which is something we try very hard to avoid. Additional minor arguments are that it would necessarily slow opencsv down at least a little, which users with 4 million records won't appreciate, and it's not really in keeping with the CSV format.

    We really do appreciate the time and effort you put into this, being volunteers ourselves, but we're politely declining.

     
  • Tom Morris

    Tom Morris - 2023-10-12

    For anyone else who stumbles across this, Apache Commons CSV implemented support for multi-character field delimiters a couple of years ago.

    https://issues.apache.org/jira/browse/CSV-206

     

Log in to post a comment.