I'm seeing some unexpected behavior with the opencsv library and its handling of the backslash character when not used as an escape character.
Run the following snippet:
// Create a File object with the name of the text file to create.
File csvFilePath = new File("my_text_file_with_content.txt");
// Try to create the file. If the file already exists, it will be overwritten.
try (FileWriter writer = new FileWriter(csvFilePath)) {
// Write the text to the file.
writer.write("\"ID\"|\"TYPE\"|\"PROFILE\"");
writer.write("\n");
writer.write("\"123\"|\"MISC\"|\"Profile ABC\"");
writer.write("\n");
writer.write("\"456\"|\"MISC\"|\"Profile DEF\\n& Profile GHI\"");
writer.write("\n");
writer.write("\"789\"|\"MISC\"|\"Profile \\\"JKL\\\"\"");
writer.write("\n");
} catch (IOException e) {
System.out.println("An error occurred while trying to create the file: " + e.getMessage());
}
// Print a message indicating that the file was created successfully.
System.out.println("The file was created successfully!");
CSVParser csvParser = new CSVParserBuilder()
.withSeparator('|')
.withQuoteChar('\"')
.withEscapeChar(CSVParser.DEFAULT_ESCAPE_CHARACTER)
.withIgnoreQuotations( false)
.build();
try (CSVReader csvReader = new CSVReaderBuilder(new FileReader(csvFilePath.getPath())).withCSVParser(csvParser).build()) {
// Read all lines from the CSV file
List<String[]> allLines = csvReader.readAll();
// Iterate over the lines and print them to the console
for (String[] line : allLines) {
for (String column : line) {
System.out.print(column + "\t");
}
System.out.println();
}
// Close the CSV reader
csvReader.close();
} catch (IOException | CsvException e) {
System.out.println("An error occurred while trying to open the file: " + e.getMessage());
}
The output is:
ID TYPE PROFILE
123 MISC Profile ABC
456 MISC Profile DEF\n& Profile GHI
789 MISC Profile "JKL"
The output is:
ID TYPE PROFILE
123 MISC Profile ABC
456 MISC Profile DEFn& Profile GHI
789 MISC Profile "JKL"
The documentation (https://opencsv.sourceforge.net/apidocs/com/opencsv/CSVParserBuilder.html#:~:text=The%20CSVParserBuilder-,withEscapeChar,-public%C2%A0CSVParserBuilder) clearly states that the escapeChar parameter is the character to use for escaping a separator or quote. In this case, the escape character (backslash) is followed by an the character 'n', so my expectation is that the sequence (both backslash and 'n') is outputted as-is.
I tried using '\0' as escapeChar parameter, but the the output looks like this:
ID TYPE PROFILE
123 MISC Profile ABC
456 MISC Profile DEF\n& Profile GHI
789 MISC Profile \"JKL\"
which shows that when the sequence is not considered an escapeChar followed by a separator or quote is evaluated correctly. However, in this case it obviously (and correctly) also not considers the \ before the quotes for data line nr #3
I will try and look at it this weekend but my gut reaction is you need one more slash in the \n (\\n) to get it to work the way you want it to given that you needed three slashes to produce the \" in the following line.
My recommendation if you are dealing with a lot of escape characters and that is becoming too painful that you switch over to the RFC4180Parser instead The CSVParser predates the RFC4180 csv specifications (https://www.ietf.org/rfc/rfc4180.txt) and allows more customization, and sometimes more headaches. But we left it as the default due to backwards compatibility concerns.
Hi @sconway. thanks for your reply. Unfortunatly, I don't have any control of the input; that's the file I receive from my client and I won't be able to request any changes to it.
I did try using the RFC4180Parser, but that gave me the same behavior
So running your program the file I end up with is:
"ID"|"TYPE"|"PROFILE"
"123"|"MISC"|"Profile ABC"
"456"|"MISC"|"Profile DEF\n& Profile GHI"
"789"|"MISC"|"Profile \"JKL\""
The issue is you want the CSVParser to treat the escape character like an escape character depending on the character it is escaping. And sadly it does not work that way. If you have an escape character then if you want it in the actual output you have to escape it - hence my suggestion of adding an additional slash to the file. But since you said this came from a client gives you only two options in my opinion.
sed -i '' "s/\\\"/\"\"/g" my_text_file_with_content.txt
yielded
cat my_text_file_with_content.txt
"ID"|"TYPE"|"PROFILE"
"123"|"MISC"|"Profile ABC"
"456"|"MISC"|"Profile DEF\n& Profile GHI"
"789"|"MISC"|"Profile ""JKL"""
And that would work with the RFC4180 parser because in that specification a double quote is escaped with a double quote. You can create a bash script to handle multiple files.
WAY LESS PREFERRED SOLUTION - copy the CSVParser to ClientXCSVPARSER and modify it to suit your needs. In this case go to the handleEscapeCharacter method and add the else statement in which will keep the escape character if the character after it is not an escapable character.
private void handleEscapeCharacter(String nextLine, StringFragmentCopier sfc, boolean inQuotes) {
if (isNextCharacterEscapable(nextLine, inQuotes(inQuotes), sfc.i - 1)) {
sfc.takeInput();
sfc.appendPrev();
} else {
sfc.appendPrev();
}
}
Now as to why I don't just put that into the CSVParser - its because when I did it caused two existing unit test to fail one of which was for a defect reported years ago so I do not want to bring back old defects.
As to why I think a new CSVParser is a less preferred solution to a sed script - that is because given choice I would rather maintain a shell script that just had a few lines than a custom parser just for a single customer. And it makes me think that if you are dealing with one customer with non standard csv then you may have another somewhere over the horizon and it would be easier to just have a couple of custom scripts that make the files legal csv than multiple parsers.
close for lack of response.