read() with an unknown number of columns cannot handle a mismatched " (double quote): it keeps reading until the heap runs out of memory. It should stop at the next line break if the line break is not escaped.
You've mentioned read() with an unknown number of columns, so I assume you are using CsvListReader, but as all of the readers use the same underlying Tokenizer, I'm expecting the same issue will occur with CsvMapReader and CsvBeanReader.
The reason it keeps on reading is that while in double quotes, a CSV field can span multiple lines (so you can't stop reading at the end of the line). Ideally this situation shouldn't occur - your CSV should be valid - but I'll look into this and see if it's really a bug.
Can you give some example code and CSV so I can replicate this problem?
James
Thank you James for the quick response!!!
You are right, it is a badly formatted CSV field. Like you said, "a CSV field can span multiple lines (so you can't stop reading at the end of the line)".
But according to the standard, a field that contains embedded line breaks must be surrounded by double quotes. If a line break is not escaped, can you assume it is a real line break and stop reading? Or give the user an option to stop at the real line break?
To reproduce it: I use CsvListReader with a large CSV file in which one field contains a single " (double quote).
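To illustrate why this happens, here is a minimal, self-contained sketch of an RFC 4180-style field reader (not Super CSV's actual Tokenizer; readField and QuoteDemo are made-up names) showing that a single unmatched quote makes the parser consume the rest of the input as one field:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Toy sketch, not Super CSV's Tokenizer: demonstrates why a
 *  mismatched quote swallows the rest of the input as one field. */
public class QuoteDemo {

    /** Reads a single CSV field starting at the current position. */
    static String readField(Reader in) throws IOException {
        StringBuilder field = new StringBuilder();
        int c = in.read();
        if (c == '"') {                        // quoted field
            while (true) {
                c = in.read();
                if (c == -1) {
                    break;                     // EOF inside quotes: mismatched quote
                }
                if (c == '"') {
                    int next = in.read();
                    if (next != '"') {
                        break;                 // closing quote ("" is an escaped quote)
                    }
                    field.append('"');
                } else {
                    field.append((char) c);    // newlines kept: the field spans lines
                }
            }
        } else {                               // unquoted field
            while (c != -1 && c != ',' && c != '\n') {
                field.append((char) c);
                c = in.read();
            }
        }
        return field.toString();
    }

    public static void main(String[] args) throws IOException {
        // The opening quote is never closed, so every following line
        // (in the large-file case, the whole remaining file) becomes
        // part of this one field.
        String csv = "\"unclosed\nrow2,a,b\nrow3,c,d\n";
        System.out.println(readField(new StringReader(csv)).length()); // prints 27
    }
}
```

With a real file instead of the short string, that one field grows until the heap is exhausted.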
Line breaks are not escaped, even if they are inside double quotes (the only character that is escaped is a double quote - when it appears inside double quotes).
The only solutions I can think of are:
1) add an option (CsvPreference) that disables multi-line fields
2) add a multiLineMaxLines option (CsvPreference) that caps the number of lines a multi-line field can span
I'm not really happy with either solution though - this is such an obvious violation of the CSV format that I'm not sure it's worth catering for.
I'll keep thinking about this, but if you have any suggestions let me know!
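For what it's worth, option 2 could be enforced directly in the quoted-field loop. The sketch below is hypothetical (multiLineMaxLines is only the name proposed above, not an existing Super CSV preference; the method assumes the opening quote has already been consumed):

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch of option 2: cap how many lines a quoted field
 *  may span, so a mismatched quote fails fast instead of consuming the
 *  whole file. Simplified logic, not Super CSV's Tokenizer. */
public class MaxLinesDemo {

    /** Reads the body of a quoted field (opening quote already consumed). */
    static String readQuotedField(Reader in, int maxLines) throws IOException {
        StringBuilder field = new StringBuilder();
        int lines = 1;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '"') {
                int next = in.read();
                if (next != '"') {
                    return field.toString();   // closing quote found
                }
                field.append('"');             // "" is an escaped quote
            } else {
                if (c == '\n' && ++lines > maxLines) {
                    throw new IOException("quoted field spans more than "
                        + maxLines + " lines - possible mismatched quote");
                }
                field.append((char) c);
            }
        }
        throw new IOException("EOF inside quoted field");
    }
}
```

With maxLines set to 1 this behaves like option 1 (multi-line fields disabled); a larger value still allows legitimate multi-line fields.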
I see your point now. In my case, we disable multi-line fields.
In another open source tool (BeanIO), there is a setting called "multilineEnabled" (default value: false) to help users handle this case. I know it is not perfect, but it can help users a lot...
Thanks for your consideration!!!
-Bo Yu
I agree that mismatched quotes are a violation of the CSV format, but I'd still like an option for some form of defence against it. At the moment a large CSV file with a mismatched quote (leading to a huge value) simply exhausts the JVM heap. This means that any program using Super CSV is vulnerable to intentional or unintentional denial of service attacks.
Another option that could be added to defend against this would be a maximum allowed row length. Setting this to something large but not big enough to use all the JVM heap, say 100MB, wouldn't make much difference for valid CSV files. But it would mean that a better exception than out-of-memory would get thrown for invalid input with a particularly badly placed mismatched quote.
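A rough sketch of that row-length cap, in a simplified quote-aware row reader (hypothetical code, not Super CSV's Tokenizer; maxRowChars is a made-up name):

```java
import java.io.IOException;
import java.io.Reader;

/** Sketch of the defence suggested above: a quote-aware row reader that
 *  throws once a single row exceeds maxRowChars, instead of letting a
 *  mismatched quote exhaust the heap. Simplified, illustrative logic. */
public class BoundedRowReader {

    /** Reads one CSV row (which may span lines inside quotes); null at EOF. */
    static String readRow(Reader in, int maxRowChars) throws IOException {
        StringBuilder row = new StringBuilder();
        boolean inQuotes = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '"') {
                inQuotes = !inQuotes;     // an escaped "" toggles twice, net no change
            } else if (c == '\n' && !inQuotes) {
                return row.toString();    // unquoted newline ends the row
            }
            row.append((char) c);
            if (row.length() > maxRowChars) {
                throw new IOException("row exceeds " + maxRowChars
                    + " chars - likely a mismatched quote");
            }
        }
        return row.length() == 0 ? null : row.toString();
    }
}
```

A generous cap (100MB, as above) never triggers for valid rows, but turns the out-of-memory failure into a clear exception pointing at the malformed row.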
Migrated to GitHub issues: https://github.com/super-csv/super-csv/issues/3