#31 Read with an unknown number of columns, get OutOfMemory error

Milestone: Outstanding
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-04-09
Created: 2012-04-30
Creator: Byu
Private: No

Read() with an unknown number of columns cannot handle a mismatched " (double quote): it keeps reading until the heap runs out of memory. It should stop at the next line break if the line break is not escaped.

Discussion

  • James Bassett

    James Bassett - 2012-04-30

    Hi Byu,

    You've mentioned read() with an unknown number of columns, so I assume you are using CsvListReader, but as all of the readers use the same underlying Tokenizer, I'm expecting the same issue will occur with CsvMapReader and CsvBeanReader.

    The reason it keeps on reading is that while in double quotes, a CSV field can span multiple lines (so you can't stop reading at the end of the line). Ideally this situation shouldn't occur - your CSV should be valid - but I'll look into this and see if it's really a bug.
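To illustrate the behaviour described above, here is a minimal sketch of a quote-aware row reader. It is a toy written for this thread, not Super CSV's actual Tokenizer; the class name `ToyCsvRows` and the method `readRow` are hypothetical. It shows why an opening quote that is never closed makes the reader treat every following line as part of the same field.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy quote-aware row reader for illustration only; NOT Super CSV's Tokenizer.
class ToyCsvRows {

    // Reads one logical row, pulling in further physical lines while a quote is open.
    // Returns null when there are no more lines.
    static List<String> readRow(Iterator<String> lines) {
        if (!lines.hasNext()) {
            return null;
        }
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        String line = lines.next();
        while (true) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '"') {
                    if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = !inQuotes;
                    }
                } else if (c == ',' && !inQuotes) {
                    fields.add(field.toString());
                    field.setLength(0);
                } else {
                    field.append(c);
                }
            }
            if (!inQuotes || !lines.hasNext()) {
                break; // row complete, or input exhausted with the quote still open
            }
            field.append('\n'); // the line break belongs to the quoted field
            line = lines.next();
        }
        fields.add(field.toString());
        return fields;
    }
}
```

With the lines `a,"b`, `c,d` and `e,f`, a single call to `readRow` consumes all three lines and returns two fields, the second holding everything after the stray quote. On a large file that second field grows until memory runs out.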

    Can you give some example code and CSV so I can replicate this problem?

    James

     
  • Byu

    Byu - 2012-05-02

    Thank you James for the quick response!!!
    You are right, it is a badly formatted CSV field. As you said, "a CSV field can span multiple lines (so you can't stop reading at the end of the line)".

    But according to the standard, a field that contains embedded line breaks must be surrounded by double quotes. If it is not escaped, can you assume it is a real line break and stop reading? Or give the user an option to stop at the real line break?
    To reproduce it: I use CsvListReader on a CSV file that can be large, in which one field contains only a single " (double quote).

     
  • James Bassett

    James Bassett - 2012-05-02

    Line breaks are not escaped, even if they are inside double quotes (the only character that is escaped is a double quote - when it appears inside double quotes).

    The only solutions I can think of are:
    1) to add an option (CsvPreference) that disables multi-line fields
    2) to add a multiLineMaxLines option (CsvPreference) that enforces a maximum number of lines that can be in a multi-line field
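Both options can be sketched with a single line limit, where a limit of 1 amounts to disabling multi-line fields. This is only an illustration of the idea, not Super CSV code: the class `LineLimitedRows`, the parameter `maxLines`, and the choice of `IllegalStateException` are all assumptions made for the example.

```java
import java.util.Iterator;

// Hypothetical sketch of a per-row line limit; not part of Super CSV.
class LineLimitedRows {

    // True if the line holds an odd number of double quotes.
    // Escaped quotes ("") come in pairs, so they do not change the parity.
    static boolean togglesQuoteState(String line) {
        int quotes = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    // Returns the raw text of one logical row, or null at end of input.
    // Fails fast once an open quote has spanned maxLines physical lines.
    static String readRow(Iterator<String> lines, int maxLines) {
        if (!lines.hasNext()) {
            return null;
        }
        String line = lines.next();
        StringBuilder row = new StringBuilder(line);
        boolean inQuotes = togglesQuoteState(line);
        int lineCount = 1;
        while (inQuotes) {
            if (lineCount >= maxLines) {
                throw new IllegalStateException(
                        "quoted field not closed within " + maxLines + " line(s)");
            }
            if (!lines.hasNext()) {
                throw new IllegalStateException("input ended inside a quoted field");
            }
            line = lines.next();
            row.append('\n').append(line);
            lineCount++;
            if (togglesQuoteState(line)) {
                inQuotes = false; // an odd number of quotes closes the open field
            }
        }
        return row.toString();
    }
}
```

Calling `readRow(lines, 1)` rejects any row whose quoted field continues past the first line, while a larger value still allows legitimate multi-line fields and stops a stray quote from consuming the rest of the file.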

    I'm not really happy with either solution though - this is such an obvious violation of the CSV format that I'm not sure it's worth catering for.

    I'll keep thinking about this, but if you have any suggestions let me know!

     
  • Byu

    Byu - 2012-05-02

    I see your point now. In my case, we disable multi-line feeds.
    In another open source tool (BeanIO), there is a setting called "multilineEnabled" (default value: false) to help users handle this case. I know it is not perfect, but it can help users a lot...

    Thanks for your consideration!!!

    -Bo Yu

     
  • James Bassett

    James Bassett - 2012-09-16
    • milestone: --> Outstanding
     
  • David Roberts

    David Roberts - 2014-07-06

    Hello,

    I agree that mismatched quotes are a violation of the CSV format, but I'd still like an option for some form of defence against it. At the moment a large CSV file with a mismatched quote (leading to a huge value) simply exhausts the JVM heap. This means that any program using Super CSV is vulnerable to intentional or unintentional denial of service attacks.

    Another option that could be added to defend against this would be a maximum allowed row length. Setting this to something large but not big enough to use all the JVM heap, say 100MB, wouldn't make much difference for valid CSV files. But it would mean that a better exception than out-of-memory would get thrown for invalid input with a particularly badly placed mismatched quote.
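A maximum row length could look roughly like the sketch below. It is an assumption-laden illustration rather than anything Super CSV provides: the class `RowLengthGuard`, the exception `RowTooLongException`, and the parameter `maxRowChars` are hypothetical names, and the limit is counted in characters rather than bytes. It reads character by character so that an oversized row is rejected before it is fully buffered.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch of a maximum row length check; not part of Super CSV.
class RowLengthGuard {

    // Hypothetical exception type, thrown instead of letting the heap fill up.
    static class RowTooLongException extends RuntimeException {
        RowTooLongException(String message) {
            super(message);
        }
    }

    // Returns the raw text of one logical row (without its terminating '\n'),
    // or null at end of input. A '\n' inside an open quote is kept in the row.
    // Only '\n' line endings are handled, to keep the sketch short.
    static String readRow(Reader in, int maxRowChars) {
        StringBuilder row = new StringBuilder();
        boolean inQuotes = false;
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                if (ch == '"') {
                    inQuotes = !inQuotes; // an escaped "" toggles twice, so the state is unchanged
                }
                if (ch == '\n' && !inQuotes) {
                    return row.toString();
                }
                if (row.length() >= maxRowChars) {
                    throw new RowTooLongException(
                            "row exceeds " + maxRowChars + " characters");
                }
                row.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: return what was read, even if a quote is still open.
        return row.length() == 0 ? null : row.toString();
    }
}
```

Valid files are unaffected as long as `maxRowChars` is larger than any real row. Input with a mismatched quote fails with a descriptive exception once the limit is reached, rather than with an out-of-memory error.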

     
  • James Bassett

    James Bassett - 2015-01-23
    • status: open --> closed
     
  • James Bassett

    James Bassett - 2015-01-23
     
