There seems to be interest around multi-character separator support for OpenCSV:
https://stackoverflow.com/questions/8653797/java-csv-parser-with-string-separator-multi-character
A patch was submitted years ago for this, but was rejected as a breaking change which would potentially harm performance:
https://sourceforge.net/p/opencsv/patches/44/
The patch was accepted in OpenRefine, which uses a custom patched version of OpenCSV since then (which was forked off version 2.4):
https://github.com/OpenRefine/OpenRefine/issues/2268
I am trying to clean up OpenRefine's dependencies and as part of this process I would like to switch to a released version of OpenCSV. Intuitively, multi-character separator support seems sensible, so I would like to preserve this functionality in OpenRefine. Therefore I am opening this ticket to discuss some options:
I would ideally prefer the first option.
The problem is that changing CSVReader's separator parameter from char to String is a breaking change. The ICSVParser interface itself enforces a one-character separator: http://opencsv.sourceforge.net/apidocs/com/opencsv/ICSVParser.html
I don't see how to avoid switching from char to String in this interface. However, the performance overhead of the change can be avoided for existing users, by keeping CSVParser to only support a single character as separator (which would be converted from/to String in the accessors to comply with the new interface, but would internally be stored as char). Multi-character support would be added in a new class (potentially a subclass of CSVParser, or at least of AbstractCSVParser).
As an existing user who does not need multi-separator support, upgrading to a new major version where multi-separator support is implemented would just amount to adapting my interactions with CSVParser / RFC4180Parser to switch from char to String, with no performance change.
Would you be open to such a change? Or do you see any other way this could be implemented in OpenCSV?
If not, I would reluctantly look into maintaining a fork with multi-character support… but intuitively this is a feature that ought to be supported upstream.
Hmmmmm. I am at work now so I cannot give the problem a proper looking over but do remember the backwards compatibility concerns.
Because we added multiple parsersers around 2016 I got to thinking about it and my knee jerk reaction was pretty close to your subclass suggestion. I am thinking we could add a getQuoteCharString in ICSVParser and in AbstractCSVParser we have a quote string in getquotecharsring we just return the string value of quotechar if the string is null and then ignore it everywhere else.
You can then write your own MultiCharSeparatorParser that extends CSVParser or RFC4180Parser (your pick) that allows you to set the quote string and would call the getQuoteCharString instead of getQuoteChar. That way the existing code will be unaffected by this change and there should not be a performance hit on any existing code.
Will take a closer look tonight or tomorrow but do you think that would work for you?
Scott :)
If I understand your approach correctly, getQuoteChar would still exist in ICSVParser, so MultiCharSeparatorParser would also have to implement it?
In that case, it is even easier not to change anything at all in OpenCSV and just add getQuoteCharString on MultiCharSeparatorParser only - no need for this method to be part of the interface, I would say :) Or do you mean MultiCharSeparatorParser could be part of OpenCSV?
Anyway, the parsing code is not extremely modular, so I don't see any way to make MultiCharSeparatorParser reuse a significant part of the existing logic in other classes, no matter what we do with the interface… It will mostly be a cut and paste with just a few changes in some spots. Not great, but if you are happy with that, I could try writing a patch.
Sorry I said getQuoteChar but really meant getSeparator.
But yes I realize now that since the getSeparator and getQuote is only used in the CSVParser and RFC4180Parser you can create your own parser (potentially extending one of the existing parsers or just copy and modify) and use the string separator instead of the character separator and as long as it implements the ICSVParser, even if it has do nothing methods for the getSeparator and getQuoteChar as we have not used those methods outside the parser in a long time, it should work.
And yes I was thinking you would keep the MultiCharSeparatorParser in OpenRefine and pass it into the CSVReader you are using with the CSVReaderBuilder (Or CSVWriter with the CSVWriterBuilder). To me that would be a win/win: you get to update to the latest openCSV and have full control over the parser and let us worry about the rest. For us we keep opencsv simple and focus on the simple things and make changes that would make the most good while giving others the capability to modify it for their specific needs.
Scott :)
Is there a need for this ticket to remain open?
Sorry for the delay in responding to this. My understanding is that you recommend not making any changes to opencsv itself, so this ticket can likely be closed on your side.
On our side I don't see a way to fix this without effectively reimplementing everything in OpenRefine - the fact that our implementation uses your interface or not is pretty minor.
closing - sorry that this is too large of a task for you but if you cannot create an custom parser it is doubly hard for us. I prefer to focus on pure csv and have all other use cases handled by extension.
For anyone else with this need, it looks like Apache commons-csv has implemented support.
https://issues.apache.org/jira/browse/CSV-206