opencsv / Bugs / #212 Escape Characters in CSVParser

Andrew Rucker Jones - 2020-05-03

I believe the escape character is used only to escape the quote character if the quote character is part of the data stream. To do what you want to do, you need to enclose the second field in quotes.

andrew james - 2020-05-03

Thank you for your response.

In the javadoc, it is described as "escapeChar - The character to use for escaping a separator or quote" (emphasis mine).

(Enclosing the field in quotes is an alternative approach, I agree. And it's the typical approach, I think. But that is not an option for my specific scenario. I do not have control over the source data.)

Last edit: andrew james 2020-05-03

Scott Conway - 2020-05-05

Sorry, Andrew James, but if there is a bug here, it is in the Javadoc and AsciiDoc documentation. We need to clarify that if escape characters are used, the data needs to be inside quotes. Alternatively, use the RFC4180Parser: you would still need quotes, but then the only time you need an escape character is for quote characters in the actual data.

The rule is: if your data contains any special characters (quotes or separators for RFC4180Parser; quotes, separators, and escape characters for CSVParser), then the data must be within quotes.

Here is the actual code the CSVParser uses to determine if a character is escapable:

/**
 * Checks to see if the character after the current index in a String is an
 * escapable character.
 * Meaning the next character is either a quotation character or the escape
 * char and you are inside quotes.
 *
 * Precondition: the current character is an escape.
 *
 * @param nextLine The current line
 * @param inQuotes True if the current context is quoted
 * @param i        Current index in line
 * @return True if the following character is a quote
 */
protected boolean isNextCharacterEscapable(String nextLine, boolean inQuotes, int i) {
    return inQuotes // we are in quotes, therefore there can be escaped quotes in here.
            && nextLine.length() > (i + 1) // there is indeed another character to check.
            && isCharacterEscapable(nextLine.charAt(i + 1));
}

I will take a look at beefing up the documentation. But if you really do not have control of your data, then you need to make your own parser: extend CSVParser so that it does not require quotes around data, and pass it into the CSVReaderBuilder or CSVReader directly.
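
To make the effect of that check concrete, here is a small standalone model of it. This is only a sketch, not opencsv's class: EscapeCheckSketch and its constructor are hypothetical names, and isCharacterEscapable is assumed to accept only the quote and escape characters, as the Javadoc above describes.

```java
// Standalone model of the check quoted above; NOT opencsv's actual class.
final class EscapeCheckSketch {
    private final char quoteChar;
    private final char escapeChar;

    EscapeCheckSketch(char quoteChar, char escapeChar) {
        this.quoteChar = quoteChar;
        this.escapeChar = escapeChar;
    }

    // Assumption taken from the Javadoc: only quote and escape are escapable.
    boolean isCharacterEscapable(char c) {
        return c == quoteChar || c == escapeChar;
    }

    // Same shape as the quoted method: an escape only counts inside quotes.
    boolean isNextCharacterEscapable(String line, boolean inQuotes, int i) {
        return inQuotes
                && line.length() > (i + 1)
                && isCharacterEscapable(line.charAt(i + 1));
    }

    public static void main(String[] args) {
        EscapeCheckSketch check = new EscapeCheckSketch('"', '\\');
        String unquoted = "abc\\;def";       // raw text: abc\;def
        String quoted = "\"ab\\\"c\";def";   // raw text: "ab\"c";def
        // Index 3 is the backslash in both strings.
        System.out.println(check.isNextCharacterEscapable(unquoted, false, 3)); // false
        System.out.println(check.isNextCharacterEscapable(quoted, true, 3));    // true
    }
}
```

Under these assumptions the backslash in the unquoted input is never treated as an escape, which matches the behavior described in this ticket.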

Scott Conway - 2020-05-05

Here is the documentation for the RFC4180 specification:

Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:

"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx

For the CSVParser it is the same except it is line breaks, quotes, separators, AND escape characters.
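
As a rough illustration of that rule for the CSVParser case, the sketch below decides whether a field has to be enclosed in quotes. QuotingRuleSketch and needsQuotes are hypothetical helper names and not part of opencsv; for the RFC 4180 variant one would presumably drop the escape-character check.

```java
// Illustrative helper only; not an opencsv API.
final class QuotingRuleSketch {
    /**
     * Returns true if the field contains a line break, the separator,
     * the quote character, or the escape character.
     */
    static boolean needsQuotes(String field, char separator, char quoteChar, char escapeChar) {
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == '\n' || c == '\r' || c == separator || c == quoteChar || c == escapeChar) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(needsQuotes("zzz", ',', '"', '\\'));      // false
        System.out.println(needsQuotes("b\r\nbb", ',', '"', '\\'));  // true: line break
        System.out.println(needsQuotes("abc\\", ';', '"', '\\'));    // true: escape character
    }
}
```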

Scott Conway - 2020-05-05

status: open --> closed-works-for-me

Andrew Rucker Jones - 2020-05-06

status: closed-works-for-me --> open

assigned_to: Andrew Rucker Jones

Andrew Rucker Jones - 2020-05-06

Reopening for discussion.

Andrew Rucker Jones - 2020-05-06

I have just taken Andrew James's test from ticket 213, added it to CSVParserTest, and made the necessary modifications to CSVParser for it to work. All other tests pass without modification.

I am hesitant to change things in the parser, which has been stable and accepted for years. However, since all tests pass, our contract with our users would continue to be fulfilled even with this change.

The necessary code changes are trivial and easily reversed if we decide against it. Scott, please review and tell me what you think. It is the last commit.

Scott Conway - 2020-05-07

As long as all existing tests pass, that means we are not reintroducing a previous defect and we are maintaining all current contracts. So I think it's awesome!

Andrew Rucker Jones - 2020-05-07

status: open --> pending

Andrew Rucker Jones - 2020-05-07

This change will be released with the upcoming version 5.2.

Andrew Rucker Jones - 2020-05-17

status: pending --> closed-fixed

Andrew Rucker Jones - 2020-05-17

5.2 has been released.

brunnsbe - 2021-08-20

Good that this bug was fixed but I'm wondering if there's any way of disabling this feature as it has changed the behavior of the parser? Or can I somehow extend a row processor to modify the line of data being parsed and escape the escape character so that the delimiter wouldn't be escaped?

Andrew Rucker Jones - 2021-08-21

brunnsbe: It would be helpful to see the exact problem you're having: input data, expected output, and actual output. Sometimes the code is helpful too.

Scott: I'm going to want you to weigh in on this one, I think.

Scott Conway - 2021-08-22

Ahhh, such are the hazards of adding new features to a library with so many users: someone is bound to be negatively impacted by a change.

But I agree with Andrew in that it would be very helpful to know your exact problem. What were you doing before that cannot be done now in 5.2?

Either an input, expected output, and actual output like Andrew requested, or a JUnit test that passes if run in the 5.1 code base but fails in 5.2.

As for the row processor you can totally create one that will modify the processed data. But keep in mind that the file beforehand has to be a legal csv file.

I would also recommend trying the RFC4180Parser as it does not have escape characters - just quotes and separators.

brunnsbe - 2021-08-24

Thank you Andrew and Scott for your quick replies!
Here's a unit test that passes with 5.1 but fails with 5.2:

@Test
public void readLineWithEscapeBeforeDelimiter() throws Exception {
    try (
            Reader reader = new StringReader("abc\\;def");
            CSVReader csvReader = new CSVReaderBuilder(reader)
                    .withSkipLines(0)
                    .withCSVParser(new CSVParserBuilder().withSeparator(';').withEscapeChar('\\').build())
                    .build();
    ) {
        String[] line = csvReader.readNext();
        Assertions.assertArrayEquals(new String[]{"abc", "def"}, line);
    }
}

So the problem is that the data I get from a customer and try to parse with OpenCSV is "faulty": it has the escape character before the delimiter. Because the escape character was simply skipped in 5.1 and older versions, no one noticed. With 5.2 and newer versions it breaks, as the bug fix makes the output contain an array with the single value "abc;def".
The customer isn't too keen on fixing the data, so any tips on how I could handle this situation without downgrading to OpenCSV 5.1 would be great!

I also tried the RFC4180Parser; with both 5.1 and 5.2 the test below passes. But notice that the escape character (the backslash) is kept at the end of the first parsed value, so the behavior differs from the normal CSVParser. Unfortunately that means I cannot use it, as it would change the data:

@Test
public void readLineWithEscapeBeforeDelimiter() throws Exception {
    try (
            final Reader reader = new StringReader("abc\\;def");
            final CSVReader csvReader = new CSVReaderBuilder(reader)
                    .withSkipLines(0)
                    .withCSVParser(new RFC4180ParserBuilder().withSeparator(';').build())
                    .build();
    ) {
        String[] line = csvReader.readNext();
        Assertions.assertArrayEquals(new String[]{"abc\\", "def"}, line);
    }
}

Andrew Rucker Jones - 2021-08-24

That's very helpful. Thank you.

It's also a nice little mess. In the end, this original bug report was, in my opinion, truly a bug: the escape character should apply to the delimiter. As such, I'm not willing to roll that back, and I doubt Scott sees that differently.

Very little is impossible in programming, as you know, but the idea of adding yet another parsing option to CSVParser to toggle escape character parsing for delimiters does not appeal to me at all. The thing is already overloaded with so many options it's not funny anymore, and, although opencsv has always strived to provide users with ways to deal with messed up data, there are limits to the hoops I'm willing to jump through. (That's not meant to be an aggressive or unkind statement.)

I think your original idea of using a RowProcessor might be best. If you do as you did for the second unit test and use the RFC4180Parser or define a different escape character with the CSVParser, you could then write a RowProcessor that snips off a trailing backslash if one appears. Is that workable for you?
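
The "snip off a trailing backslash" step could look roughly like the following. It is a sketch only: TrailingEscapeSketch and stripTrailingEscape are hypothetical names, written as a plain static helper so it can be tried without opencsv on the classpath. The idea is that a RowProcessor's processColumnItem(String) would delegate to it.

```java
// Hypothetical helper; a RowProcessor's processColumnItem could call this.
final class TrailingEscapeSketch {
    /** Removes a single trailing escape character from a column value, if present. */
    static String stripTrailingEscape(String column, char escapeChar) {
        if (column != null && !column.isEmpty()
                && column.charAt(column.length() - 1) == escapeChar) {
            return column.substring(0, column.length() - 1);
        }
        return column;
    }

    public static void main(String[] args) {
        // The RFC4180Parser test in this thread yields "abc\" for the first column.
        System.out.println(stripTrailingEscape("abc\\", '\\'));  // abc
        // Backslashes elsewhere in the value are left alone.
        System.out.println(stripTrailingEscape("a\\bc", '\\'));  // a\bc
    }
}
```

One caveat with this approach: a quoted field that legitimately ends in a backslash would lose it as well, since the processor only sees the parsed value.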
  

Agree with Andrew on this one. But you can totally do a RowProcessor - just set your escape character to NULL (so there are no escape characters) and then have a RowProcessor remove them for you. That or use the RFC4180Parser and RowProcessor.

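I did both, and both of these tests pass in 5.2:

import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import com.opencsv.ICSVParser;
import com.opencsv.RFC4180ParserBuilder;
import com.opencsv.processor.RowProcessor;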
import org.apache.commons.lang3.StringUtils;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.io.Reader;
import java.io.StringReader;

public class Bug212Test {
    @Test
    public void readLineWithEscapeBeforeDelimiter() throws Exception {
        RowProcessor removeEscapeChars = new RemoveEscapeChars('\\');
        try (
                Reader reader = new StringReader("abc\\;def");
                CSVReader csvReader = new CSVReaderBuilder(reader)
                        .withSkipLines(0)
                        .withRowProcessor(removeEscapeChars)
                        .withCSVParser(new CSVParserBuilder()
                                .withSeparator(';')
                                .withEscapeChar(ICSVParser.NULL_CHARACTER).build())
                        .build();
        ) {
            String[] line = csvReader.readNext();
            Assertions.assertArrayEquals(new String[]{"abc", "def"}, line);
        }
    }

    @Test
    public void readLineWithEscapeBeforeDelimiterRFC4180() throws Exception {
        RowProcessor removeEscapeChars = new RemoveEscapeChars('\\');
        try (
                final Reader reader = new StringReader("abc\\;def");
                final CSVReader csvReader = new CSVReaderBuilder(reader)
                        .withSkipLines(0)
                        .withRowProcessor(removeEscapeChars)
                        .withCSVParser(new RFC4180ParserBuilder().withSeparator(';').build())
                        .build();
        ) {
            String[] line = csvReader.readNext();
            Assertions.assertArrayEquals(new String[]{"abc", "def"}, line);
        }
    }

    private class RemoveEscapeChars implements RowProcessor {
        private char escapeChar;

        public RemoveEscapeChars(char escapeChar) {
            this.escapeChar = escapeChar;
        }

        @Override
        public String processColumnItem(String column) {
            return StringUtils.remove(column, escapeChar);
        }

        @Override
        public void processRow(String[] row) {
            for (int i = 0; i < row.length; i++) {
                row[i] = processColumnItem(row[i]);
            }
        }
    }
}

Last edit: Scott Conway 2021-08-25

Thanks for the example code and suggestions, I highly appreciate it! However, the problem with the suggested approach above is that we now remove all escape chars so e.g. these two tests don't work:

    @Test
    public void readLineWithEscapeInsideQuotes() throws Exception {
        RowProcessor removeEscapeChars = new RemoveEscapeChars('\\');
        try (
            Reader reader = new StringReader("\"\\\\abc\";def");
            CSVReader csvReader = new CSVReaderBuilder(reader)
                .withSkipLines(0)
                .withRowProcessor(removeEscapeChars)
                .withCSVParser(new CSVParserBuilder()
                    .withSeparator(';')
                    .withQuoteChar('"')
                    .withEscapeChar(ICSVParser.NULL_CHARACTER).build())
                .build();
        ) {
            String[] line = csvReader.readNext();
            Assertions.assertArrayEquals(new String[]{"\\abc", "def"}, line);
        }
    }

    @Test
    public void readLineWithEscapeInsideQuotesRFC4180() throws Exception {
        RowProcessor removeEscapeChars = new RemoveEscapeChars('\\');
        try (
            Reader reader = new StringReader("\"\\\\abc\";def");
            CSVReader csvReader = new CSVReaderBuilder(reader)
                .withSkipLines(0)
                .withRowProcessor(removeEscapeChars)
                .withCSVParser(new RFC4180ParserBuilder()
                    .withSeparator(';')
                    .withQuoteChar('"').build())
                .build();
        ) {
            String[] line = csvReader.readNext();
            Assertions.assertArrayEquals(new String[]{"\\abc", "def"}, line);
        }
    }

This one is tricky to solve as we don't know in the RowProcessor implementation if the string containing the backslashes is inside quotes or not and now we are removing all the backslashes. :(
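
One possible direction, given that the RowProcessor cannot see quoting, is to clean up the raw line before it reaches the parser, where the quote state can still be tracked. The following is only a sketch under that assumption: UnquotedEscapeFilterSketch and dropEscapeBeforeSeparator are hypothetical names, nothing here is an opencsv API, and how the result is fed to the parser is left open.

```java
// Hypothetical pre-processing step; not part of opencsv.
final class UnquotedEscapeFilterSketch {
    /**
     * Drops an escape character that directly precedes the separator
     * outside of quotes. Everything else is copied unchanged.
     */
    static String dropEscapeBeforeSeparator(String line, char separator,
                                            char quoteChar, char escapeChar) {
        StringBuilder out = new StringBuilder(line.length());
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escapeChar && i + 1 < line.length()) {
                char next = line.charAt(i + 1);
                if (!inQuotes && next == separator) {
                    continue; // skip the stray escape; the separator is copied next
                }
                if (inQuotes && (next == quoteChar || next == escapeChar)) {
                    // Copy the escaped pair as-is so an escaped quote
                    // does not toggle the quote state.
                    out.append(c).append(next);
                    i++;
                    continue;
                }
            } else if (c == quoteChar) {
                inQuotes = !inQuotes;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw text abc\;def: the backslash before the separator is dropped.
        System.out.println(dropEscapeBeforeSeparator("abc\\;def", ';', '"', '\\'));
        // Raw text "\\abc";def: escapes inside quotes are left untouched.
        System.out.println(dropEscapeBeforeSeparator("\"\\\\abc\";def", ';', '"', '\\'));
    }
}
```

The sketch works on one line at a time, so quoted fields that span several lines would need the quote state carried over between calls. It also leaves open how an escaped escape character directly before an unquoted separator should be treated.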
