Escape Characters in CSVParser
Brought to you by:
aruckerjones,
sconway
OpenCSV version 5.1 - unable to use escape characters successfully.
Steps to recreate:
Sample code
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
...
private void handleEscapeDelims() throws URISyntaxException, IOException, Exception {
    final CSVParser csvParser = new CSVParserBuilder()
            .withSeparator(',')
            .withEscapeChar('\\')
            .build();
    String[] parts1 = csvParser.parseLine("field one,field\\, wait for it\\, two,field three");
    for (String part : parts1) {
        System.out.println(part);
    }
}
In the above example, the expected output is:
field one
field, wait for it, two
field three
But the actual output is:
field one
field
wait for it
two
field three
Just to add: I have also tried using a different escape character - for example ~ (tilde), so:
.withEscapeChar('~')
and:
csvParser.parseLine("field one,field~, wait for it~, two,field three")
The result was similar to the first example.
Question: Am I using this feature incorrectly, or is this a bug?
Thank you for your help.
I believe the escape character is used only to escape the quote character if the quote character is part of the data stream. To do what you want to do, you need to enclose the second field in quotes.
Thank you for your response.
In the javadoc, it is described as "escapeChar - The character to use for escaping a separator or quote" (emphasis mine).
(Enclosing the field in quotes is an alternative approach, I agree. And it's the typical approach, I think. But that is not an option for my specific scenario. I do not have control over the source data.)
Last edit: andrew james 2020-05-03
Sorry, Andrew James, but if there is a bug here, it is in the javadocs and asciidocs. We need to clarify that if escape characters are used, the data needs to be inside quotes. Alternatively, use the RFC4180Parser: you would still need quotes, but then the only time you need an escape character is for quote characters in the actual data.
The rule is: if you have any special characters in your data (quotes or separators for RFC4180Parser; quotes, separators, and escape characters for CSVParser), then the data must be within quotes.
Here is the actual code the CSVParser uses to determine if a character is escapable
I will take a look at beefing up the documentation, but if you really do not have control of your data, then you need to make your own parser by extending CSVParser so that it does not require quotes around data, and pass that into the CSVReaderBuilder or CSVReader directly.
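To make the requested behavior concrete, here is a minimal, dependency-free sketch of splitting a line where the escape character applies to separators outside quotes. It is not opencsv code and does not extend CSVParser; the class name EscapeAwareSplitter and its split method are hypothetical, and quote handling is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of opencsv.
final class EscapeAwareSplitter {

    // Splits a line on the separator, treating the character that follows
    // an escape character as literal data. Quotes are NOT handled here.
    static String[] split(String line, char separator, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // Skip the escape character and keep the next character as data.
                current.append(line.charAt(++i));
            } else if (c == separator) {
                // An unescaped separator ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String input = "field one,field\\, wait for it\\, two,field three";
        for (String part : split(input, ',', '\\')) {
            System.out.println(part);
        }
    }
}
```

Run against the line from the original report, this prints the three fields the reporter expected. A real replacement parser would also have to deal with quotes, multi-line fields, and the other CSVParser options.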
Here is the documentation for the RFC4180 specification:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
For the CSVParser it is the same except it is line breaks, quotes, separators, AND escape characters.
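For anyone who does control the producing side, the quoting rule above can be sketched as follows. This is an illustration of the RFC 4180 convention (enclose the field in double quotes and double any embedded quote), not opencsv's writer code; the class name Rfc4180Quoting and the method quoteIfNeeded are hypothetical.

```java
// Hypothetical helper for illustration only; not part of opencsv.
final class Rfc4180Quoting {

    // Encloses the field in double quotes when it contains a comma, a double
    // quote, or a line break; embedded double quotes are doubled.
    static String quoteIfNeeded(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String line = String.join(",",
                quoteIfNeeded("field one"),
                quoteIfNeeded("field, wait for it, two"),
                quoteIfNeeded("field three"));
        // prints: field one,"field, wait for it, two",field three
        System.out.println(line);
    }
}
```

For the CSVParser rule, the check would also have to include the configured escape character.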
Reopening for discussion.
I have just taken Andrew James's test from ticket 213, added it to CSVParserTest, and made the necessary modifications to CSVParser for it to work. All other tests pass without modification.
I am hesitant to change things in the parser, which has been stable and accepted for years. However, since all tests pass, our contract with our users would continue to be fulfilled even with this change.
The necessary code changes are trivial and easily reversed if we decide against it. Scott, please review and tell me what you think. It is the last commit.
As long as all existing tests pass, that means we are not reintroducing a previous defect and we are maintaining all current contracts. So I think it's awesome!
This change will be released with the upcoming version 5.2.
5.2 has been released.
Good that this bug was fixed but I'm wondering if there's any way of disabling this feature as it has changed the behavior of the parser? Or can I somehow extend a row processor to modify the line of data being parsed and escape the escape character so that the delimiter wouldn't be escaped?
brunnsbe: It would be helpful to see the exact problem you're having: input data, expected output, and actual output. Sometimes the code is helpful too.
Scott: I'm going to want you to weigh in on this one, I think.
Ahhh, such are the hazards of adding new features to a library with so many users: someone is bound to be negatively impacted by a change.
But I agree with Andrew in that it would be very helpful to know your exact problem. What were you doing before that cannot be done now in 5.2?
Either an input, expected output, and actual output like Andrew requested, or a JUnit test that passes when run against the 5.1 code base but fails against 5.2.
As for the row processor you can totally create one that will modify the processed data. But keep in mind that the file beforehand has to be a legal csv file.
I would also recommend trying the RFC4180Parser as it does not have escape characters - just quotes and separators.
Thank you Andrew and Scott for your quick replies!
Here's a unit test that passes with 5.1 but fails with 5.2:
So the problem is that the data I get from a customer and try to parse with OpenCSV is "faulty": it has the escape character before the delimiter, but because that was simply skipped in 5.1 and older versions, no one noticed it. Now with 5.2 and newer versions it breaks, as the bug fix makes the output contain an array with only the one value "abc;def".
The customer isn't too keen on fixing the data so any tips how I could handle this situation without downgrading to OpenCSV 5.1 would be great!
I also tried the RFC4180Parser; with both 5.1 and 5.2 the test below passes. But notice that the delimiter is added to the first parsed value, so the behavior is different from the normal CSVParser and therefore unfortunately not something I can use, as it would change the data:
That's very helpful. Thank you.
It's also a nice little mess. In the end, this original bug report was, in my opinion, truly a bug: the escape character should apply to the delimiter. As such, I'm not willing to roll that back, and I doubt Scott sees that differently.
Very little is impossible in programming, as you know, but the idea of adding yet another parsing option to CSVParser to toggle escape character parsing for delimiters does not appeal to me at all. The thing is already overloaded with so many options it's not funny anymore, and, although opencsv has always strived to provide users with ways to deal with messed up data, there are limits to the hoops I'm willing to jump through. (That's not meant to be an aggressive or unkind statement.)
I think your original idea of using a RowProcessor might be best. If you do as you did for the second unit test and use the RFC4180Parser or define a different escape character with the CSVParser, you could then write a RowProcessor that snips off a trailing backslash if one appears. Is that workable for you?
Agree with Andrew on this one. But you can totally do a RowProcessor - just set your escape character to NULL (so there are no escape characters) and then have a RowProcessor remove them for you. That or use the RFC4180Parser and RowProcessor.
I did both and both of these tests pass in 5.2
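The tests themselves are not reproduced here. As a rough, dependency-free sketch of the clean-up step such a RowProcessor could perform, the following removes a single trailing escape character from each column, in the spirit of the "snip off a trailing backslash" suggestion. The class name TrailingEscapeStripper and its methods are hypothetical; in a real setup the same logic would sit inside an implementation of opencsv's RowProcessor interface, and the sample row assumes the parser was configured so that the backslash survives parsing.

```java
// Hypothetical helper for illustration only; not part of opencsv.
final class TrailingEscapeStripper {

    // Returns the column without its last character if that character is the
    // escape character; otherwise returns the column unchanged.
    static String stripTrailingEscape(String column, char escape) {
        if (column != null && !column.isEmpty()
                && column.charAt(column.length() - 1) == escape) {
            return column.substring(0, column.length() - 1);
        }
        return column;
    }

    // Applies the clean-up to every column of an already parsed row, in place.
    static void processRow(String[] row, char escape) {
        for (int i = 0; i < row.length; i++) {
            row[i] = stripTrailingEscape(row[i], escape);
        }
    }

    public static void main(String[] args) {
        // Assumed parser output for a line like abc\;def when escape handling
        // is disabled: the backslash stays at the end of the first column.
        String[] row = {"abc\\", "def"};
        processRow(row, '\\');
        for (String column : row) {
            System.out.println(column);
        }
    }
}
```

This only touches a backslash at the very end of a column; backslashes elsewhere in the data are left alone.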
Last edit: Scott Conway 2021-08-25
Thanks for the example code and suggestions, I highly appreciate it! However, the problem with the suggested approach above is that we now remove all escape chars so e.g. these two tests don't work:
This one is tricky to solve as we don't know in the RowProcessor implementation if the string containing the backslashes is inside quotes or not and now we are removing all the backslashes. :(
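One way around the missing quote context is to clean the raw line before it reaches the parser, where quote state can still be tracked. The sketch below is only an idea under assumptions, not an opencsv feature: the class name UnquotedEscapeFilter and its method are hypothetical, and it assumes each record fits on one line. It drops an escape character only when it sits outside quotes directly before a separator, and leaves everything inside quotes untouched.

```java
// Hypothetical helper for illustration only; not part of opencsv.
final class UnquotedEscapeFilter {

    // Removes an escape character that directly precedes a separator, but
    // only outside quoted sections. Quoted content is copied unchanged.
    static String dropEscapesBeforeSeparator(String line, char separator,
                                             char quote, char escape) {
        StringBuilder out = new StringBuilder(line.length());
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean hasNext = i + 1 < line.length();
            if (inQuotes && c == escape && hasNext) {
                // Inside quotes: keep the escape and the escaped character,
                // so an escaped quote does not end the quoted section.
                out.append(c).append(line.charAt(++i));
            } else if (c == quote) {
                inQuotes = !inQuotes;
                out.append(c);
            } else if (!inQuotes && c == escape && hasNext
                    && line.charAt(i + 1) == separator) {
                // Outside quotes: drop the stray escape character. The
                // separator itself is copied on the next iteration.
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: abc;def
        System.out.println(dropEscapesBeforeSeparator("abc\\;def", ';', '"', '\\'));
        // prints: "abc\;def";ghi
        System.out.println(dropEscapesBeforeSeparator("\"abc\\;def\";ghi", ';', '"', '\\'));
    }
}
```

The cleaned line could then be handed to the parser as usual. Note that an escaped escape character outside quotes (two backslashes before a separator) is not treated specially here, so this would need refinement for data that relies on it.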