#121 Unexpected behavior in OpenCSV when parsing CSV file with backslash characters

Milestone: v1.0 (example)

Status: closed

Owner: Scott Conway

Labels: None

Priority: 5

Updated: 2023-12-10

Created: 2023-11-02

Creator: data-engineer

Private: No

I'm seeing some unexpected behavior with the opencsv library and its handling of the backslash character when not used as an escape character.

Steps to reproduce

Run the following snippet:

// Create a File object with the name of the text file to create.
File csvFilePath = new File("my_text_file_with_content.txt");

// Try to create the file. If the file already exists, it will be overwritten.
try (FileWriter writer = new FileWriter(csvFilePath)) {
    // Write the text to the file.
    writer.write("\"ID\"|\"TYPE\"|\"PROFILE\"");
    writer.write("\n");
    writer.write("\"123\"|\"MISC\"|\"Profile ABC\"");
    writer.write("\n");
    writer.write("\"456\"|\"MISC\"|\"Profile DEF\\n& Profile GHI\"");
    writer.write("\n");
    writer.write("\"789\"|\"MISC\"|\"Profile \\\"JKL\\\"\"");
    writer.write("\n");
} catch (IOException e) {
    System.out.println("An error occurred while trying to create the file: " + e.getMessage());
}

// Print a message indicating that the file was created successfully.
System.out.println("The file was created successfully!");

 CSVParser csvParser = new CSVParserBuilder()
        .withSeparator('|')
        .withQuoteChar('\"')
        .withEscapeChar(CSVParser.DEFAULT_ESCAPE_CHARACTER)
        .withIgnoreQuotations(  false)
        .build();

try (CSVReader csvReader = new CSVReaderBuilder(new FileReader(csvFilePath.getPath())).withCSVParser(csvParser).build()) {
    // Read all lines from the CSV file
    List<String[]> allLines = csvReader.readAll();

    // Iterate over the lines and print them to the console
    for (String[] line : allLines) {
        for (String column : line) {
            System.out.print(column + "\t");
        }
        System.out.println();
    }

    // Close the CSV reader
    csvReader.close();
} catch (IOException | CsvException e) {
    System.out.println("An error occurred while trying to open the file: " + e.getMessage());
}

Expected behavior

The output is:

ID      TYPE    PROFILE
123     MISC    Profile ABC
456     MISC    Profile DEF\n& Profile GHI
789     MISC    Profile "JKL"

Actual behavior

The output is:

ID      TYPE    PROFILE
123     MISC    Profile ABC
456     MISC    Profile DEFn& Profile GHI
789     MISC    Profile "JKL"

Additional notes

The documentation (https://opencsv.sourceforge.net/apidocs/com/opencsv/CSVParserBuilder.html#:~:text=The%20CSVParserBuilder-,withEscapeChar,-public%C2%A0CSVParserBuilder) clearly states that the escapeChar parameter is the character to use for escaping a separator or quote. In this case, the escape character (backslash) is followed by an the character 'n', so my expectation is that the sequence (both backslash and 'n') is outputted as-is.
I tried using '\0' as escapeChar parameter, but the the output looks like this:

ID      TYPE    PROFILE
123     MISC    Profile ABC
456     MISC    Profile DEF\n& Profile GHI
789     MISC    Profile \"JKL\"

which shows that when the sequence is not considered an escapeChar followed by a separator or quote is evaluated correctly. However, in this case it obviously (and correctly) also not considers the \ before the quotes for data line nr #3

Discussion

Scott Conway - 2023-11-03

I will try and look at it this weekend but my gut reaction is you need one more slash in the \n (\\n) to get it to work the way you want it to given that you needed three slashes to produce the \" in the following line.

My recommendation if you are dealing with a lot of escape characters and that is becoming too painful that you switch over to the RFC4180Parser instead The CSVParser predates the RFC4180 csv specifications (https://www.ietf.org/rfc/rfc4180.txt) and allows more customization, and sometimes more headaches. But we left it as the default due to backwards compatibility concerns.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

data-engineer - 2023-11-03

Hi @sconway. thanks for your reply. Unfortunatly, I don't have any control of the input; that's the file I receive from my client and I won't be able to request any changes to it.

I did try using the RFC4180Parser, but that gave me the same behavior

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Conway - 2023-11-04

So running your program the file I end up with is:

"ID"|"TYPE"|"PROFILE"
"123"|"MISC"|"Profile ABC"
"456"|"MISC"|"Profile DEF\n& Profile GHI"
"789"|"MISC"|"Profile \"JKL\""

The issue is you want the CSVParser to treat the escape character like an escape character depending on the character it is escaping. And sadly it does not work that way. If you have an escape character then if you want it in the actual output you have to escape it - hence my suggestion of adding an additional slash to the file. But since you said this came from a client gives you only two options in my opinion.

PREFERRED SOLUTION - Use sed to modify the file to legal CSV. If you have access to a *nix operating system then you have access to the sed command which can do very powerful search and replace inside a given file. Running the following command will convert \" to "":

sed -i '' "s/\\\"/\"\"/g" my_text_file_with_content.txt

yielded

cat my_text_file_with_content.txt
"ID"|"TYPE"|"PROFILE"
"123"|"MISC"|"Profile ABC"
"456"|"MISC"|"Profile DEF\n& Profile GHI"
"789"|"MISC"|"Profile ""JKL"""

And that would work with the RFC4180 parser because in that specification a double quote is escaped with a double quote. You can create a bash script to handle multiple files.

WAY LESS PREFERRED SOLUTION - copy the CSVParser to ClientXCSVPARSER and modify it to suit your needs. In this case go to the handleEscapeCharacter method and add the else statement in which will keep the escape character if the character after it is not an escapable character.

private void handleEscapeCharacter(String nextLine, StringFragmentCopier sfc, boolean inQuotes) {
if (isNextCharacterEscapable(nextLine, inQuotes(inQuotes), sfc.i - 1)) {
sfc.takeInput();
sfc.appendPrev();
} else {
sfc.appendPrev();
}
}

Now as to why I don't just put that into the CSVParser - its because when I did it caused two existing unit test to fail one of which was for a defect reported years ago so I do not want to bring back old defects.

As to why I think a new CSVParser is a less preferred solution to a sed script - that is because given choice I would rather maintain a shell script that just had a few lines than a custom parser just for a single customer. And it makes me think that if you are dealing with one customer with non standard csv then you may have another somewhere over the horizon and it would be easier to just have a couple of custom scripts that make the files legal csv than multiple parsers.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Conway - 2023-12-10

status: open --> closed

assigned_to: Scott Conway
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Conway - 2023-12-10

close for lack of response.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Unexpected behavior in OpenCSV when parsing CSV file with backslash characters

Group

Searches

Help