
#34 CSVReader stops parsing when it encounters a 'NUL' ('^@') character

v1.0 (example)
closed
None
3
2015-03-08
2015-03-05
ThulasiRam
No

We are using OpenCSV's CSVReader to parse a large message into an array of strings. Initially we used the default constructor. With it, '\' characters are missing from the parsed strings, because the default constructor uses '\' as the default escape character.

We read an earlier support request (http://sourceforge.net/p/opencsv/support-requests/5/) and, following the solution given there, passed '\0' as the escape character, hoping the reader would then accept and parse all characters.

But with '\0' as the escape character we ran into a bigger problem. Our input string contains a NUL character (Notepad++ displays it as 'NUL'; the logs on a Unix box show it as '^@'). Whenever this character appears, CSVReader stops reading the content that follows it.

So the problem is now worse. Before the change we were only losing backslash characters. After it, the part of the message following the NUL character is missing entirely.

Can someone help us parse all characters using CSVReader?

Discussion

  • Scott Conway

    Scott Conway - 2015-03-06

    I created the following unit test and it passes.

    @Test
    public void readerCanHandleNullInString() throws IOException {
        StringBuilder sb = new StringBuilder(CSVParser.INITIAL_READ_SIZE);
        sb.append("a,\0b,c");
    
        StringReader reader = new StringReader(sb.toString());
    
        CSVReaderBuilder builder = new CSVReaderBuilder(reader);
        CSVReader defaultReader = builder.build();
    
        String[] nextLine = defaultReader.readNext();
        assertEquals(3, nextLine.length);
        assertEquals("a", nextLine[0]);
        assertEquals("\0b", nextLine[1]);
        assertEquals(0, nextLine[1].charAt(0));
        assertEquals("c", nextLine[2]);
    }
    

    I am leaning towards the problem being in the reader you are using.

    Please send a sample file with maybe two fields and two lines: the first with your ^@ character and the other with a null (the actual alphabetic characters can be random so no security concerns are raised). Then send a sample program that tries to parse it, so I can see the reader and the settings you are using.

    Thanks

     
  • Scott Conway

    Scott Conway - 2015-03-06

    Another thing I recommend, to rule out the reader: once you get the test above working, comment out the CSVReader. Just have a simple program that calls the reader and writes the output, so you can see what your reader is doing.

    If that works, wrap your reader in a BufferedReader (which is what CSVReader does), call the readLine method, print that output, and see if that reproduces the issue.
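
A minimal sketch of that isolation step, using only the JDK: it pushes text containing a NUL through a plain BufferedReader, with no CSVReader involved, so you can check whether the NUL survives readLine. The class name ReadLineNulCheck and the sample record are made up for illustration; a real check would read the actual log file instead of a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadLineNulCheck {

    /** Reads the first line through a BufferedReader; no CSV parsing is involved. */
    static String firstLine(String input) {
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: one record with a NUL (U+0000) inside the last field.
        String input = "10,\"IBM\",\"A\0B\"\nsecond line\n";
        String line = firstLine(input);

        // Print facts about the line rather than the line itself,
        // because consoles often render NUL as a blank or cut the text there.
        System.out.println("length       = " + line.length());
        System.out.println("index of NUL = " + line.indexOf('\0'));
    }
}
```

If the reported index is not -1, the reader delivered the NUL intact and the loss happens later, either in the parser settings or in how the output is displayed.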

     
  • ThulasiRam

    ThulasiRam - 2015-03-06

    package com.test.csvreader;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import au.com.bytecode.opencsv.CSVParser;
    import au.com.bytecode.opencsv.CSVReader;

    public class TestCSVReader {

    public static void main(String[] args)
    {
    
        try {
        List<List<String>> publishedMessageTokenList = new ArrayList<List<String>>();
    
        // The next two lines only print the raw message to the console.
        // In the console, the 'NUL' in the file is displayed as " ".
        // If you copy the console output and paste it into another file, only 10,"IBM","2015063","064230733910"," is pasted.
        // The characters after the 'NUL' character are missing.
        BufferedReader bufferedStringReaderMessage = new BufferedReader(new  InputStreamReader(new FileInputStream("C:\\logs\\NULSpecialChar2.log"),"UTF-8"));
        System.out.println("Actual Input String "+ bufferedStringReaderMessage.readLine());
    
        BufferedReader bufferedStringReader = new BufferedReader(new  InputStreamReader(new FileInputStream("C:\\logs\\NULSpecialChar2.log"),"UTF-8"));
        CSVReader reader = new CSVReader(bufferedStringReader,CSVParser.DEFAULT_SEPARATOR,CSVParser.DEFAULT_QUOTE_CHARACTER,'\\');
    
        List<String[]> rawTokens;
            rawTokens = reader.readAll();
            for(String[] tokenRow : rawTokens){
                publishedMessageTokenList.add(new ArrayList<String>(Arrays.asList(tokenRow)));
            }
    
            System.out.println(">>Output of parser is when used Default Escape character "+ publishedMessageTokenList.toString());
    
    
    
            List<List<String>> publishedMessageTokenList1 = new ArrayList<List<String>>();
            BufferedReader bufferedStringReader1 = new BufferedReader(new  InputStreamReader(new FileInputStream("C:\\logs\\NULSpecialChar2.log"),"UTF-8"));
            CSVReader reader1 = new CSVReader(bufferedStringReader1,CSVParser.DEFAULT_SEPARATOR,CSVParser.DEFAULT_QUOTE_CHARACTER,'\0');
    
            List<String[]> rawTokens1;
            rawTokens1 = reader1.readAll();
                for(String[] tokenRow1 : rawTokens1){
                    publishedMessageTokenList1.add(new ArrayList<String>(Arrays.asList(tokenRow1)));
                }
    
                System.out.println(">>Output of parser is when used '\0' '\\0 Escape character "+ publishedMessageTokenList1.toString());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    
    
    
    
    
    }
    

    }

    Hi

    Please find the sample program I used to test the string with the 'NUL' character. It contains three print statements:

    1) Prints the actual message in the file.

    2) Prints the parsed output when the default escape character '\' is used. In this case we lose any backslash characters present.

    3) Prints the parsed output when \0 is used as the escape character. In this output you can see that CSVReader stopped parsing after the 'NUL' character in the file.

    Note: When the file is opened in Notepad++ we can see 'NUL'. But when the actual message is printed to the console, it appears as ' ' (blank).

        If we copy the output of the first print statement and paste it into a separate Notepad file, only the content up to the NUL character is pasted. The content after NUL is lost.
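
Because the console and the clipboard both hide or truncate at NUL, it can help to print control characters in an explicit form before judging what the reader or parser returned. The helper below is a small sketch for that purpose; ControlCharDump and visible are hypothetical names, and the sample line is invented.

```java
final class ControlCharDump {

    /** Returns the text with every ISO control character replaced by a backslash-u hex escape. */
    static String visible(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c)) {
                // Write the code unit as four hex digits so NUL, tab, etc. cannot be missed.
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical line with a NUL as the only content of the second field.
        String line = "\"IBM\",\"\0\",\"end\"";
        System.out.println(visible(line));
    }
}
```

Passing each parsed field through visible before printing makes it possible to tell a real truncation apart from a display problem.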
    

    Please let me know if you need more details .

    Thanks much for your help in advance .

     
  • Scott Conway

    Scott Conway - 2015-03-07

    Now I see what you are doing. Whatever character you use as the escape character, if it appears in your original input and you want it kept, then you need to escape it. So in the case of the null, you need two nulls for a single null to show up in the output. What your code was doing was escaping the second set of quotes; this confused the parser and caused it to lose the rest of the data.
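
The effect described above can be illustrated with a deliberately simplified model of escape handling. This is not opencsv's parser, whose actual rules for which characters may be escaped are more specific; it only assumes the general convention that the escape character is dropped and the character after it is taken literally. EscapeModel and unescape are hypothetical names.

```java
final class EscapeModel {

    /**
     * Simplified model, NOT opencsv's implementation: the escape character
     * is dropped and the character that follows it is kept as plain data.
     */
    static String unescape(String field, char escape) {
        StringBuilder out = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == escape && i + 1 < field.length()) {
                out.append(field.charAt(++i)); // keep the escaped character literally
            } else if (c != escape) {
                out.append(c);
            }
            // A trailing escape with nothing after it is silently dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled NUL: one NUL survives as data.
        System.out.println(unescape("a\0\0b", '\0').length());
        // Single NUL before a quote: the quote becomes data instead of closing the field.
        System.out.println(unescape("\0\"", '\0'));
    }
}
```

Under this model a lone NUL in front of a closing quote turns that quote into ordinary data, so the field never ends where the author of the file intended, which matches the lost data reported here.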

    I attached a copy of your file with the null doubled and wrote the following tests to show what I was seeing.

    private static final String TEST_FILE = "src/test/java/integrationTest/SR34/NULSpecialChar2.log";
    private static final String DOUBLE_NULL_FILE = "src/test/java/integrationTest/SR34/NULSpecialChar3.log";

    @Test
    public void usingNullAsDelimeterWillFailBecauseYouAreEscapingTheQuote() throws IOException {
        BufferedReader bufferedStringReader1 = new BufferedReader(new  InputStreamReader(new FileInputStream(TEST_FILE),"UTF-8"));
        CSVReader reader1 = new CSVReader(bufferedStringReader1, CSVParser.DEFAULT_SEPARATOR,CSVParser.DEFAULT_QUOTE_CHARACTER, '\0');
    
        List<String[]> rawTokens1;
        rawTokens1 = reader1.readAll();
    
        assertEquals(1, rawTokens1.size());
    
        String[] line = rawTokens1.get(0);
        assertEquals(4, line.length);
        assertEquals("10", line[0]);
        assertEquals("IBM", line[1]);
        assertEquals("2015063", line[2]);
        assertEquals("064230733910", line[3]);
    }
    
    @Test
    public void youNeedToEscapeTheNullCharactersIfUsingNullAsEscape() throws IOException {
        BufferedReader bufferedStringReader1 = new BufferedReader(new  InputStreamReader(new FileInputStream(DOUBLE_NULL_FILE),"UTF-8"));
        CSVReader reader1 = new CSVReader(bufferedStringReader1, CSVParser.DEFAULT_SEPARATOR,CSVParser.DEFAULT_QUOTE_CHARACTER, '\0');
    
        List<String[]> rawTokens1;
        rawTokens1 = reader1.readAll();
    
        assertEquals(1, rawTokens1.size());
    
        String[] line = rawTokens1.get(0);
        assertEquals(6, line.length);
        assertEquals("10", line[0]);
        assertEquals("IBM", line[1]);
        assertEquals("2015063", line[2]);
        assertEquals("064230733910", line[3]);
        assertEquals("\0", line[4]);
        assertEquals("01 ", line[5]);
    }
    

    I did not see the ^@ in the file you sent. Was that on purpose, or did it get translated into a space at the end of the last field during upload/download?

     
    • ThulasiRam

      ThulasiRam - 2015-03-07

      Hi Scott ,

      According to Wikipedia, the '^@' symbol is the same as the NUL character, which is also the same as '\0'. Please see the wiki page for details.
      

      http://en.wikipedia.org/wiki/Control_character

      As per your recent reply, if we use '\0' as the escape character, the file must contain two 'NUL' characters in order for one to appear in the parser output.

      As I already mentioned, if I use the default escape character '\', the 'NUL' characters are retained in the output but '\' is trimmed, and CSVReader is at least able to parse the whole message. With 'NUL' as the escape character, however, CSVReader stops parsing after it encounters a NUL character.
      So whichever character I pass as the parameter is either trimmed or causes parsing to stop after it.

      Is there any way to tell CSVReader to accept all kinds of characters?

      Or, as you suggest, do I need to modify the file to contain two NULs wherever it has one, so that one is consumed as the escape and the other appears in the output?

      Please let me know. Thank you very much for all the assistance you are providing.

       
  • Scott Conway

    Scott Conway - 2015-03-08

    I understand a little better what you are asking. I am sorry, but opencsv requires four things: a Reader object, a separator character (default , ) so we can tell that a new field has started, a quote character (default " ) which tells us that everything between the quotes is one field, and an escape character for dealing with quote or unprintable characters.

    There is really no way to make any of them optional - sorry.

    For what you are doing, if you cannot easily get the source of the file to add the escape characters you need, I would consider writing a preprocessor program that adds them where needed.
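
A preprocessor of this kind might look like the sketch below. It assumes that every occurrence of the escape character in the input is meant as literal data, so each one is simply written twice; if some occurrences are already deliberate escapes, this blind doubling would be wrong. EscapeDoubler and doubleEscapes are hypothetical names, and only JDK classes are used.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class EscapeDoubler {

    /**
     * Copies in to out, writing every occurrence of the escape character twice.
     * For files, wrap the Reader in a BufferedReader to avoid slow single-char reads.
     */
    static void doubleEscapes(Reader in, Writer out, char escape) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c == escape) {
                out.write(c); // extra copy so the parser sees an escaped escape
            }
            out.write(c);
        }
    }

    /** Convenience wrapper for in-memory text. */
    static String doubleEscapes(String text, char escape) {
        StringWriter out = new StringWriter();
        try {
            doubleEscapes(new StringReader(text), out, escape);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The preprocessed text would then be handed to a CSVReader configured with the same escape character, for example '\\' to keep backslashes or '\0' to keep NULs.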

     
  • ThulasiRam

    ThulasiRam - 2015-03-08

    Thanks Scott for your quick response. It seems we have to write a preprocessor program, or have the product itself publish the message with two backslashes wherever it has one (if we keep '\' as the default escape character). The same seems to be happening for double quotes: wherever the source system has one double quote ("), the product publishes it as two double quotes (""), so that with '"' as the default quote character it is read as a single double quote.

     
  • Scott Conway

    Scott Conway - 2015-03-08
    • status: open --> closed
     
