
#47 Parsing issue (file including content with backslash, double quotes and new lines)

v1.0 (example)
closed
None
5
2017-05-31
2017-05-22
No

Hi,
First, many thanks for this project!
I am facing issues (using OpenCSV 3.9) when reading CSV files whose content includes backslashes, double quotes and new lines (see attachment).
Neither the standard reader nor the RFC reader is able to process those files correctly: the former chokes on the backslash and the latter on the new line.
Is there something I am doing wrong?

Here is some code that reproduces the issue:

import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import com.opencsv.CSVWriter;
import com.opencsv.ICSVParser;
import com.opencsv.RFC4180Parser;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class CsvTest {

    public static void main(String[] args) {
        String file = "test.csv";
        try {
            CSVWriter writer = new CSVWriter(new FileWriter(file));
            String[][] data = {
                {"a1", "\"\"", "b1"},
                {"a2", "\n", "b2"},
                {"a3", "\\", "b4"},
                {"a4", "", "b5"}
            };
            for (String[] o : data) {
                writer.writeNext(o);
            }
            writer.close();

            CSVReader reader;

            System.out.println("==== standard reader ====");
            reader = new CSVReader(new FileReader(file));
            readAndCompare(reader, data);
            reader.close();

            System.out.println("==== RFC 4180 reader ====");
            ICSVParser rfc4180Parser = new RFC4180Parser();
            CSVReaderBuilder builder = new CSVReaderBuilder(new FileReader(file));
            reader = builder.withCSVParser(rfc4180Parser).build();
            readAndCompare(reader, data);
            reader.close();

        } catch (IOException ex) {
            System.out.println(ex);
        }
    }

    private static void readAndCompare(CSVReader reader, String[][] data) throws IOException {
        String[] dataRead;

        int row = 0;
        while ((dataRead = reader.readNext()) != null) {
            boolean issue = false;
            System.out.println("Row: " + row);
            if (row < data.length) {
                if (dataRead.length != data[row].length) {
                    issue = true;
                    System.out.println("-- row #" + row + " length are different - data: " + data[row].length + ", readData: " + dataRead.length);
                }
                for (int i = 0; i < dataRead.length; i++) {
                    if (i < data[row].length) {
                        // The bounds check is already done by the enclosing if.
                        if (!dataRead[i].equals(data[row][i])) {
                            issue = true;
                            System.out.println("-- cell content different: data: " + data[row][i] + ", readData: " + dataRead[i]);
                        }
                    } else {
                        issue = true;
                        System.out.println("-- cell (" + i + ") not present in the source data: " + dataRead[i]);
                    }
                }
            } else {
                issue = true;
                System.out.println("-- row (" + row + ") not present in the source data");
            }
            if (!issue) {
                System.out.println("  read row OK");
            } else {
                System.out.println(">>>>>>> issue");
            }
            row++;
        }
        if (data.length != row) {
            System.out.println("\nRead " + row + " rows, expecting " + data.length);
        }
    }
}
1 Attachments

Discussion

  • Etienne Giraudy

    Etienne Giraudy - 2017-05-22

    Note: I dug a little deeper and the issue comes from the fact that the OpenCSV parser does not accept the same character as both quote and escape character, whereas the writer uses the same character for both by default.

    In my real-life case, I am processing a file that has the issue outlined above; I only added the file creation to the test case to help reproduce the issue. In short, I still need a solution!
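
    As a rough illustration of the mismatch described above, the sketch below models how a field might be quoted under the two conventions: escape character equal to the quote character, and backslash as the escape character. The class `EscapeDemo` and the method `quoteField` are hypothetical names for this example; this is a simplified model under those assumptions, not OpenCSV's actual implementation.

```java
// Simplified model of quote/escape handling; NOT OpenCSV's actual code.
// EscapeDemo and quoteField are hypothetical names used only for illustration.
final class EscapeDemo {

    /**
     * Wraps a field in quote characters and prefixes every occurrence of the
     * quote or escape character inside the field with the escape character.
     */
    static String quoteField(String field, char quote, char escape) {
        StringBuilder sb = new StringBuilder().append(quote);
        for (char c : field.toCharArray()) {
            if (c == quote || c == escape) {
                sb.append(escape);
            }
            sb.append(c);
        }
        return sb.append(quote).toString();
    }

    public static void main(String[] args) {
        // Two of the problematic cells from the test case: two double quotes, and one backslash.
        String[] fields = {"\"\"", "\\"};
        for (String f : fields) {
            System.out.println("escape = quote     : " + quoteField(f, '"', '"'));
            System.out.println("escape = backslash : " + quoteField(f, '"', '\\'));
        }
    }
}
```

    Under this model, a lone backslash written with the double quote as escape character is left unescaped inside the quotes. A reader that treats the backslash as its escape character would then likely take the closing quote for an escaped one, which matches the standard reader choking on the backslash row.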

     
  • Scott Conway

    Scott Conway - 2017-05-28

    Hello Etienne.

    The default settings of CSVWriter and CSVParser are different (sorry, this has been brought up many times, but I kept it this way to maintain backwards compatibility), yet underneath it is fundamentally the same parser under the covers. So while it is impossible to configure a parser to understand the output of the default writer, a simple tweak to the construction of the CSVWriter got it to produce output that a default CSVParser will handle:

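    // Additional imports needed: com.opencsv.CSVParser, java.io.StringReader, java.io.StringWriter.
    // readAndCompare(...) is unchanged from the original test case.
    public static void main(String[] args) {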
        StringWriter sw = new StringWriter();
        try {
            CSVWriter writer = new CSVWriter(sw, CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, CSVParser.DEFAULT_ESCAPE_CHARACTER);
            String[][] data = {
                    {"a1", "\"\"", "b1"},
                    {"a2", "\n", "b2"},
                    {"a3", "\\", "b4"},
                    {"a4", "", "b5"}
            };
            for (String[] o : data) {
                writer.writeNext(o);
            }
            writer.close();
    
            CSVReader reader;
    
            System.out.println("==== standard reader ====");
            reader = new CSVReader(new StringReader(sw.toString()));
            readAndCompare(reader, data);
            reader.close();
    
            System.out.println("==== RFC 4180 reader ====");
            ICSVParser rfc4180Parser = new RFC4180Parser();
            CSVReaderBuilder builder = new CSVReaderBuilder(new StringReader(sw.toString()));
            reader = builder.withCSVParser(rfc4180Parser).build();
            readAndCompare(reader, data);
            reader.close();
    
        } catch (IOException ex) {
            System.out.println(ex);
        }
    }
    

    Simply passing the CSVParser defaults into the CSVWriter makes it produce output that the CSVParser can read back into its original form.

    ==== standard reader ====
    Row: 0
    read row OK
    Row: 1
    read row OK
    Row: 2
    read row OK
    Row: 3
    read row OK

    Note that I did not even try to do anything with the RFC4180Parser. It is a different parser altogether, so I do not think it is possible to configure a CSVWriter to produce output that is 100% compatible with it. This is the complete opposite of the view I held until last year: I maintained that the CSVReader was so configurable that producing RFC4180 output was just a matter of configuration, and I successfully defended that belief for several years by providing configurations to anyone who sent me a string they claimed was RFC4180 compatible but could not get to work in CSVParser. Last year, however, someone sent me not only multiple examples but also the Excel spreadsheet that created them, forcing me to swallow my pride and create the RFC4180Parser.

    I am saying all this because, after creating the RFC4180Parser, I realized we had a very real issue: one writer with a single output style, but one reader with multiple parsers. My intention last year was to give both parsers a reverseParse method that takes a string array and produces the CSV string according to the parser's settings, and then to add a constructor to CSVWriter that takes an ICSVParser. That is still a stretch goal for this year, but with everything going on I doubt it will make it into a release this year.

    :)
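
    The reverseParse idea described above could look roughly like the sketch below: each dialect knows how to turn a string array back into a CSV line, and the writer simply delegates to whichever dialect it is given. All names here (`LineDialect`, `Rfc4180Dialect`, `BackslashDialect`, `DialectWriter`) are hypothetical and exist only for this illustration; none of them is an OpenCSV type, and the sketch makes no claim about how the library would implement the feature.

```java
import java.util.StringJoiner;

// Hypothetical sketch of the "reverseParse" idea; none of these types exist in OpenCSV.
interface LineDialect {
    /** Turns one row of values into a single CSV line (without line terminator). */
    String reverseParse(String[] values);
}

/** Quote every field and double embedded quotes; backslashes are plain data. */
final class Rfc4180Dialect implements LineDialect {
    @Override
    public String reverseParse(String[] values) {
        StringJoiner line = new StringJoiner(",");
        for (String v : values) {
            line.add('"' + v.replace("\"", "\"\"") + '"');
        }
        return line.toString();
    }
}

/** Quote every field and escape embedded backslashes and quotes with a backslash. */
final class BackslashDialect implements LineDialect {
    @Override
    public String reverseParse(String[] values) {
        StringJoiner line = new StringJoiner(",");
        for (String v : values) {
            // Escape backslashes first so the backslashes added for quotes are not doubled.
            line.add('"' + v.replace("\\", "\\\\").replace("\"", "\\\"") + '"');
        }
        return line.toString();
    }
}

/** A writer that is configured with a dialect instead of separate quote/escape settings. */
final class DialectWriter {
    private final StringBuilder out;
    private final LineDialect dialect;

    DialectWriter(StringBuilder out, LineDialect dialect) {
        this.out = out;
        this.dialect = dialect;
    }

    void writeNext(String[] values) {
        out.append(dialect.reverseParse(values)).append('\n');
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        DialectWriter writer = new DialectWriter(sink, new Rfc4180Dialect());
        writer.writeNext(new String[] {"a3", "\\", "b4"});
        System.out.print(sink);
    }
}
```

    The point of the design is that the writer and the reader would share one description of the format, so their defaults could not drift apart. Null values and unquoted fields are deliberately left out of this sketch.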

     
  • Scott Conway

    Scott Conway - 2017-05-28
    • status: open --> closed
    • assigned_to: Scott Conway
     
  • Etienne Giraudy

    Etienne Giraudy - 2017-05-31

    Hi Scott,

    Thanks for taking the time to respond to my ticket.
    Unfortunately this does not help me solve my issue, as my primary problem is reading files where the double quote is used as both the escape and the quote character: I am processing such files coming out of Salesforce, internal systems and customers...
    For these files I ended up using a custom version of the basic parser that relies on a few assumptions (the main one being that a non-empty field is always enclosed in double quotes). So far so good!
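
    The custom parser itself is not shown in the ticket. As a minimal sketch of the general approach, assuming a comma separator, the double quote as both quote and escape character, and backslashes treated as ordinary data, a parser along these lines could read the four test rows. The class name `DoubledQuoteCsv` is hypothetical, and this is not the poster's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the custom parser mentioned in the ticket.
// Assumptions: ',' separates fields, '"' both quotes and escapes (doubled),
// backslash is ordinary data, and new lines may appear inside quoted fields.
final class DoubledQuoteCsv {

    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current row has consumed any input
        int n = text.length();

        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < n && text.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote stands for one literal quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // includes backslashes and new lines
                }
            } else if (c == '"') {
                inQuotes = true;
                pending = true;
            } else if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                rows.add(row.toArray(new String[0]));
                row.clear();
                pending = false;
            } else if (c != '\r') {    // drop the CR of a CRLF line ending
                field.append(c);
                pending = true;
            }
        }
        if (pending) {                 // last row without a trailing new line
            row.add(field.toString());
            rows.add(row.toArray(new String[0]));
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "\"a1\",\"\"\"\"\"\",\"b1\"\n\"a2\",\"\n\",\"b2\"\n"
                   + "\"a3\",\"\\\",\"b4\"\n\"a4\",\"\",\"b5\"\n";
        System.out.println("rows read: " + parse(csv).size());
    }
}
```

    Reading the whole input as one string keeps the sketch short; a streaming version would need to carry the inQuotes state across line reads so that new lines inside quoted fields are not treated as row ends.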

     
