Here is the code snippet which I am using:
StringWriter writer = new StringWriter();
CSVWriter csvwriter = new CSVWriter(writer);
String[] originalValues = new String[2];
originalValues[0] = "t\\est";
originalValues[1] = "t\\est";
System.out.println("Original values: " + originalValues[0] + "," + originalValues[1]);
csvwriter.writeNext(originalValues);
csvwriter.close();
CSVReader csvReader = new CSVReader(new StringReader(writer.toString()));
String[] resultingValues = csvReader.readNext();
System.out.println("Resulting values: " + resultingValues[0] + "," + resultingValues[1]);
The output of the above snippet is:
Original values: t\est,t\est
Resulting values: test,test
The backslash ('\') character is gone after the conversion!
From some basic analysis I figured out that this happens because CSVReader uses the backslash ('\') as its default escape character, whereas CSVWriter uses the double quote (") as its default escape character.
What is the reason behind this inconsistency in the default behavior?
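A quick way to see the mismatch is to print the intermediate CSV that the writer produced (reusing the writer from the snippet above; the line shown in the comment is what I would expect with the default settings):

System.out.println("Intermediate CSV: " + writer.toString());
// Expected with default settings: "t\est","t\est"
// CSVWriter only escapes its quote character, so the backslash is written as-is,
// but CSVReader's default parser then treats that '\' as an escape and drops it.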
To fix the above problem I found the following two solutions (a quick round-trip check for both follows the second snippet):
1) Overriding the default escape character of CSVReader with the null character:
CSVParser csvParser = new CSVParserBuilder().withEscapeChar('\0').build();
CSVReader csvReader = new CSVReaderBuilder(new StringReader(writer.toString()))
        .withCSVParser(csvParser)
        .build();
2) Using RFC4180Parser, which strictly follows the RFC 4180 standard:
RFC4180Parser rfc4180Parser = new RFC4180ParserBuilder().build();
CSVReader csvReader = new CSVReaderBuilder(new StringReader(writer.toString()))
        .withCSVParser(rfc4180Parser)
        .build();
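With either of these readers in place, the round trip should preserve the backslash. Here is the minimal check I mentioned above (my addition, reusing writer and csvReader from the snippets; the expected output assumes neither parser treats '\' as an escape character):

String[] resultingValues = csvReader.readNext();
// Expected with either parser: t\est,t\est
System.out.println("Resulting values: " + resultingValues[0] + "," + resultingValues[1]);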
Can using either of the above approaches cause any side effects on other characters?
Also, why is RFC4180Parser not the default parser? Is it only to maintain backward compatibility, since RFC4180Parser was introduced in later versions?
Hello Vatsal.
I'll try as best I can to answer your questions in order.
Why is there an inconsistency between the escape character in the CSVParser and the CSVWriter? Honestly, I do not know, as I started helping out with the project after it had been out for five years. So I don't know what the reasoning was, but since it was out and had an established base of users, I cannot break backward compatibility.
Can using either approach cause side effects on other characters? If that is your concern, I would suggest you use the RFC4180Parser, because if there is a quote or comma in your data it just expects the field to be wrapped in quotes (or to have the quotes doubled internally), whereas the CSVParser wants it escaped. So if you really wanted the backslash in the data, it would have to be escaped in the CSV itself (the raw value would need to be t\\est).
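To illustrate that difference, here is a small sketch parsing raw CSV text directly with both parsers (my own example, not from the opencsv documentation; the exact behavior may vary slightly between versions):

ICSVParser defaultParser = new CSVParserBuilder().build();  // default escape character is '\'
ICSVParser rfcParser = new RFC4180ParserBuilder().build();  // RFC 4180 defines no escape character

// With the default CSVParser, a literal backslash must itself be escaped in the raw CSV:
String[] a = defaultParser.parseLine("\"t\\\\est\"");  // raw CSV text: "t\\est"
System.out.println(a[0]);                              // expected: t\est

// With the RFC4180Parser, the raw CSV carries the value literally:
String[] b = rfcParser.parseLine("\"t\\est\"");        // raw CSV text: "t\est"
System.out.println(b[0]);                              // expected: t\est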
Why is RFC4180Parser not the default parser? You actually hit the nail on the head with that one. As you can tell from my first answer, I am very big on backwards compatibility. With all the people using OpenCSV, quite a number of existing programs would break when they upgraded, because the expected behavior would change.
That said, the one area where I will willingly break backwards compatibility is the RFC4180Parser, if you can show me that it is not complying with the standard (other than the fact that I allow the quote and separator characters to be user defined, whereas the RFC 4180 standard fixes them). Because, as the name implies, it is supposed to be compliant with the standard. But the CSVParser and CSVWriter, having existed for over a decade now, are a standard unto themselves.
Hope that helps.
Scott Conway :)