#47 Error by reading line with end of line symbol ';\n'

Outstanding
closed
None
1
2015-01-23
2013-10-10
Tiran1984
No

I get the error 'The number of columns to be processed (9) must match the number of CellProcessors (8)' when reading the line '1;1;aaa1;a;01.01.2011 00:00:05;20110201;;1.12;\n'. The line ends with ';\n'. In my preferences the end-of-line symbol is defined as ';\n'. How can I set up the reader without writing my own Tokenizer?

Discussion

  • James Bassett

    James Bassett - 2013-10-10

    Hi,

    Super CSV uses LineNumberReader which can only handle standard newline characters (newline, carriage return). That's why the javadoc for CsvPreference and CsvPreferenceBuilder both state that

    the end of line symbols are only used for writing

    Writing your own Tokenizer would work but might be overkill. It looks to me like your data always ends with a ';' (i.e. an empty column). If so, you can simply add an extra element to your cell processor array - using null is probably best. You'd also need to add a null element to your header array (if you're using one) - this will ensure the empty column is read, but ignored.
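To illustrate where the ninth column comes from, here is a minimal sketch that splits the reported line on ';' using only the Java standard library. It is not Super CSV's tokenizer (there is no quote handling), and the class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

final class TrailingDelimiterDemo {

    /**
     * Splits a line on ';' and keeps trailing empty fields (limit -1).
     * Assumption: no quoted fields, so a plain split is good enough here.
     */
    static List<String> splitKeepingEmpty(String line) {
        return Arrays.asList(line.split(";", -1));
    }

    public static void main(String[] args) {
        String line = "1;1;aaa1;a;01.01.2011 00:00:05;20110201;;1.12;";
        List<String> columns = splitKeepingEmpty(line);

        // 8 delimiters produce 9 columns; the last one is the empty
        // column created by the trailing ';'
        System.out.println(columns.size());                            // 9
        System.out.println(columns.get(columns.size() - 1).isEmpty()); // true
    }
}
```

With eight cell processors defined, that extra empty column is the mismatch the error message reports, which is why a ninth (null) processor makes the counts agree.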

     
  • Tiran1984

    Tiran1984 - 2013-10-14

    Adding a null cell processor at the end of the array looks like a workaround. Why don't you use the end-of-line symbol defined in the CsvPreference to determine the end of a line?

     
  • Kasper B. Graversen

    • status: open --> closed
     
  • Kasper B. Graversen

    • assigned_to: Kasper B. Graversen
     
  • Pat

    Pat - 2013-10-15

    Hi,

    we're facing similar issues while processing CSV files from federal authorities whose formats we can't change.
    What they all have in common is that the last cell is followed by an additional ";" and then the line terminator (often "\n"), e.g.:

    VALUE1;VALUE2;VALUE3;\n

    SuperCSV 2.1.0 still has issues with these files, which already showed up in release 1.x.
    Please note that such files do not violate the CSV RFC spec which SuperCSV claims to implement.

    AFAIK and from what I learned within this bug report, two workarounds exist:

    1: Write a custom CellProcessor which delegates its actions to the default one but which first preprocesses the raw line in order to remove the problematic ";" at the end of the line

    2: Fake the CSV format by adding a null element at the end to the cell processor definitions to make SuperCSV happy

    I agree with the bug reporter Tiran that option #2 is a hack, so we went for option #1.
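One way the preprocessing idea from option #1 could look is sketched below, using only the Java standard library. This is not the code described in this post: the class `TrailingDelimiterStripper` and its methods are hypothetical. It assumes that every record sits on a single line (no quoted fields containing line breaks) and that the whole input fits in memory, since it is read eagerly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class TrailingDelimiterStripper {

    /** Removes a single trailing delimiter from one line, if present. */
    static String stripLine(String line, char delimiter) {
        if (!line.isEmpty() && line.charAt(line.length() - 1) == delimiter) {
            return line.substring(0, line.length() - 1);
        }
        return line;
    }

    /**
     * Eagerly reads the input, strips one trailing delimiter from each line,
     * and returns a new Reader over the cleaned text (lines end with '\n').
     */
    static Reader strip(Reader input, char delimiter) throws IOException {
        StringBuilder cleaned = new StringBuilder();
        try (BufferedReader lines = new BufferedReader(input)) {
            String line;
            while ((line = lines.readLine()) != null) {
                cleaned.append(stripLine(line, delimiter)).append('\n');
            }
        }
        return new StringReader(cleaned.toString());
    }
}
```

The Reader returned by `strip(...)` could then be handed to whichever CSV reader is in use in place of the original one. Note that a genuinely empty last column (two delimiters at the end of the line) still yields one empty field, because only a single delimiter is removed.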

    However, for future releases, we're really looking forward to getting support for this file format out-of-the-box (for reading and writing). I don't think this will be a big deal for the SuperCSV developers.

    The best would be if we could set some preference, e.g. something like
    CsvPreference.setCellDelimiterExpectedAtEndOfLine(true) which will be respected
    both for reading and writing CSV files to be able to deal with this format.
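For the writing side of such a preference, the expected output format can be sketched in plain Java as follows. This only illustrates the format itself; the class `TrailingDelimiterWriter` and its method are hypothetical, and the sketch assumes field values never contain the delimiter, quotes, or line breaks, so no escaping is done.

```java
import java.util.Arrays;
import java.util.List;

final class TrailingDelimiterWriter {

    /**
     * Joins the fields with the delimiter and appends one extra delimiter
     * before the line terminator. Assumption: fields need no quoting.
     */
    static String formatRow(List<String> fields, char delimiter, String lineTerminator) {
        return String.join(String.valueOf(delimiter), fields) + delimiter + lineTerminator;
    }

    public static void main(String[] args) {
        String row = formatRow(Arrays.asList("VALUE1", "VALUE2", "VALUE3"), ';', "\n");
        System.out.print(row); // VALUE1;VALUE2;VALUE3;
    }
}
```

Since the javadoc quoted earlier in this thread says the end-of-line symbols are used for writing, defining them as ";\n" may already produce this output when writing with Super CSV; that has not been verified here.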

    => Please reopen bug.

    Cheers,
    -Patric

     
  • James Bassett

    James Bassett - 2013-10-15

    Hi Patric/Tiran,

    If you read rule 4 of RFC4180 (replacing all references to ',' with ';' as that's the delimiter in your case) you'll see it explicitly says

    The last field in the record must not be followed by a comma

    Of course there's heaps of CSV files out there that don't conform to any common standard. Having a trailing ';' is one of the least bizarre ones - a lot of them have a different number of columns in each row...

    If you think option 2 is a hack, you can write your own tokenizer in a couple of lines that will fix this.

        import java.io.IOException;
        import java.io.Reader;
        import java.io.StringReader;
        import java.util.ArrayList;
        import java.util.List;

        import org.junit.Test;
        import org.supercsv.io.Tokenizer;
        import org.supercsv.prefs.CsvPreference;

        public class TrailingDelimiterTokenizerTest {

            @Test
            public void testTokenizerRemovingTrailingDelimiter() throws IOException {

                // Tokenizer that drops the empty column created by a trailing delimiter
                class TrailingDelimiterTokenizer extends Tokenizer {

                    private final CsvPreference prefs;

                    public TrailingDelimiterTokenizer(Reader reader, CsvPreference preferences) {
                        super(reader, preferences);
                        this.prefs = preferences;
                    }

                    @Override
                    public boolean readColumns(List<String> columns) throws IOException {
                        boolean columnsRead = super.readColumns(columns);

                        // remove columns created by trailing delimiters
                        String delimiter = String.valueOf((char) prefs.getDelimiterChar());
                        if (columnsRead && getUntokenizedRow().endsWith(delimiter)) {
                            columns.remove(columns.size() - 1);
                        }
                        return columnsRead;
                    }
                }

                // STANDARD_PREFERENCE uses ',' as the delimiter
                String input = "a,b,c,";
                StringReader reader = new StringReader(input);
                Tokenizer t = new TrailingDelimiterTokenizer(reader, CsvPreference.STANDARD_PREFERENCE);
                List<String> output = new ArrayList<String>();
                t.readColumns(output);
                System.out.println(output);
            }
        }
    

    This prints

    [a, b, c]

    This is the reason Super CSV was written with extensibility in mind - if it doesn't do something you need, you can just plug in your own functionality.

    I'll reopen the issue, but I'm not promising anything :P

     
  • James Bassett

    James Bassett - 2013-10-15
    • status: closed --> open
     
  • Pat

    Pat - 2013-10-15

    James,

    thank you for your helpful answer.

    You're right - the RFC indeed forbids the use of a trailing comma/semicolon after the last value - I totally missed that point.
    However, an RFC lawyer might argue that the ;\n is the CRLF symbol which can be defined arbitrarily according to the RFC ;)

    But even that (really hackish) option will not work for SuperCSV, as it seems that it does not allow cell delimiters in the CRLF symbol definition (I would expect an exception in this case, BTW).

    Don't get me wrong, I like SuperCSV and its extension APIs; however, I personally think it should cover the most common use cases (even such flawed CSV formats) out-of-the-box so that it can be used in more scenarios.

    Many thanks for sharing the code - however, I have to stick with option #1 for now, because it's tested and in production.

    And secondly, thanks for reopening this report - it would be really cool (I'm sure that both the bug reporter and myself are not the only persons dealing with such CSV) to see that feature within upcoming releases - even without a promise that this will ever happen ;)

    Cheers,
    -Patric

     

    Last edit: Pat 2013-10-15
  • James Bassett

    James Bassett - 2014-04-24
    • assigned_to: Kasper B. Graversen --> James Bassett
    • Group: 2.1.0 --> Outstanding
     
  • James Bassett

    James Bassett - 2015-01-23
     
  • James Bassett

    James Bassett - 2015-01-23
    • status: open --> closed
     
