#84 parser

open
nobody
None
5
2011-11-18
2011-11-18
hanbo
No

input: String line = "hi, \"hello, hanbo\", 30";
output:
token_1:hi
token_2: "hello, hanbo
token_3: 30

CSVParser csvParser = new CSVParser(
        CSVParser.DEFAULT_SEPARATOR,
        CSVParser.DEFAULT_QUOTE_CHARACTER,
        CSVParser.DEFAULT_ESCAPE_CHARACTER,
        CSVParser.DEFAULT_STRICT_QUOTES,
        false); // ignore leading white space

The input is not well-formed for the parser, but I think the output is also bad.
It should either throw an exception, or the output should be: token_2: hello, hanbo

How can this be explained?
Thank you.

Discussion

  • hanbo
    2011-11-18

    code file

     
  • Scott Conway
    2011-11-18

    Good catch. This is a bug, but not in the way you think it is.

    The problem is that the last parameter sets ignore leading white space to false. In your input line there is a space between the separator and the quote (\"), so if you look at the second token in your result, it is actually <space><quote>hello, hanbo

    What it should be is <space><quote>hello, hanbo<quote>. Because a quote did not start the token, a quote does not end the token (it's the comma afterwards that does that). So the closing quote should be taken as a literal part of the token, the same as the first quote.

    If you turn on ignore leading white space (which is the default), you get the string you expected.

    Time permitting, I will try to look into this before the holidays.
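
    To make the rule described above concrete, here is a minimal, self-contained sketch. It is not opencsv's CSVParser; the class QuoteRuleSketch and its split method are hypothetical names used only for illustration. It assumes a comma separator and a double-quote character, and it ignores escape characters, doubled quotes, and multi-line records. A quote that starts a token (optionally after ignored leading white space) is dropped together with its closing quote; a quote that appears later in the token is kept literally, along with its partner, while still protecting the separators between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not opencsv's CSVParser: no escape characters,
// doubled quotes, or multi-line records are handled.
class QuoteRuleSketch {
    private static final char SEPARATOR = ',';
    private static final char QUOTE = '"';

    static List<String> split(String line, boolean ignoreLeadingWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean inQuotes = false;   // separators are protected while true
        boolean keepQuotes = false; // the current quote pair is literal text
        boolean quoteSeen = false;  // a quote already occurred in this token

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == QUOTE && !inQuotes) {
                // The quote starts the token only if nothing, or only
                // ignorable white space, precedes it.
                boolean startsToken = !quoteSeen
                        && (sb.length() == 0
                            || (ignoreLeadingWhiteSpace && isAllWhiteSpace(sb)));
                if (startsToken) {
                    sb.setLength(0); // discard white space leading up to the quote
                } else {
                    sb.append(c);    // the quote is part of the token
                }
                keepQuotes = !startsToken;
                inQuotes = true;
                quoteSeen = true;
            } else if (c == QUOTE) {
                if (keepQuotes) {
                    sb.append(c);    // closing partner of a literal quote
                }
                inQuotes = false;
            } else if (c == SEPARATOR && !inQuotes) {
                tokens.add(sb.toString());
                sb.setLength(0);
                quoteSeen = false;
            } else {
                sb.append(c);
            }
        }
        tokens.add(sb.toString());
        return tokens;
    }

    private static boolean isAllWhiteSpace(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String line = "hi, \"hello, hanbo\", 30";
        for (boolean ignore : new boolean[] {false, true}) {
            System.out.println("ignoreLeadingWhiteSpace=" + ignore);
            for (String token : split(line, ignore)) {
                System.out.println("  [" + token + "]");
            }
        }
    }
}
```

    With ignoreLeadingWhiteSpace set to false, the second token in this sketch comes out as <space><quote>hello, hanbo<quote>; with it set to true, the second token is hello, hanbo without the quotes or the leading space.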

     
  • hanbo
    2011-11-21

    Lines 228-244 in CSVParser.java

    I updated them as follows:
    // the tricky case of an embedded quote in the middle: a,bc"d"ef,g
    if (!strictQuotes) {
        if (i > 2 // not on the beginning of the line
                && nextLine.charAt(i - 1) != this.separator // not at the beginning of an escape sequence
                && nextLine.length() > (i + 1)
                && (nextLine.charAt(i + 1) != this.separator || !inQuotes) // modified
                // not at the end of an escape sequence
                ) {
            if (ignoreLeadingWhiteSpace && sb.length() > 0 && isAllWhiteSpace(sb)) {
                sb.setLength(0); // discard white space leading up to quote
            } else {
                sb.append(c);
                continue; // added
            }
        }
    }

    With this change it works:
    input: 1, \"2\",3
    output:
    token_1:1
    token_2: "2"
    token_3:3

    But I think the format of this input string is invalid for CSVParser when ignore leading white space is set to false.