A Vote for Row Processing Modifying Field Structure.
Brought to you by:
aruckerjones,
sconway
I use opencsv to handle CSVs delivered by a third party that I cannot influence.
Unfortunately, since a recent release, one field of their files occasionally contains a comma, i.e. the separator character. Handling this flaw after a line has been broken into parts would be trivial: if there are 8 parts (the number of fields my bean expects), just use them; if there are 9 parts, merge fields 5 and 6 and return 1, 2, 3, 4, 5&6, 7, 8 as the new row to the bean handling.
However, I did not find a way to solve this with the existing interceptors. A RowProcessor that is allowed to return a new String[] array as the row would easily allow for this.
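For illustration, the merge step described above could be sketched roughly like this (a minimal, hypothetical helper, not part of opencsv; the class name and field positions are assumptions based on the description above):

```java
public class RowRepair {

    /**
     * Hypothetical repair step: if a parsed row has nine parts instead of
     * the expected eight, re-join fields 5 and 6 (split apart by the stray
     * comma) and return an eight-element row; otherwise pass it through.
     */
    public static String[] repairRow(String[] row) {
        if (row.length != 9) {
            return row; // expected shape, nothing to fix
        }
        String[] fixed = new String[8];
        System.arraycopy(row, 0, fixed, 0, 4);   // fields 1-4 unchanged
        fixed[4] = row[4] + "," + row[5];        // merge fields 5 and 6, restoring the comma
        System.arraycopy(row, 6, fixed, 5, 3);   // fields 7-9 become fields 6-8
        return fixed;
    }
}
```

A hook of exactly this shape, applied between parsing and bean mapping, is what the request asks a RowProcessor to support.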
I'm curious: how are you parsing/using your data? Because if you're just taking the String[] results directly from CSVParser, you could already do this. If you're doing something else, maybe we can find a different solution for you.
Usually I am the low-tech guy who does it all in my own code, but this time I tried to be smart and climb on the shoulders of giants: I wrote a useful bean and applied opencsv. It worked nicely for a while, until that third party let this bug slip into their CSV export. If only they had used opencsv for the export. Now I am here with an opencsv solution that only works on old files. Do you want to tell me I missed the API for simply pushing a String[] myself into the opencsv bean machinery? So far I just let opencsv work on the whole file, and I wonder if I can fix things without losing the bean support. For company reasons I have no access to my code right now; I can post more detail tomorrow.
That will give you an overview: http://opencsv.sourceforge.net/#reading_into_beans
If you actually pull the source code or download it from the Files section, there are several unit tests you can look at for working examples.
Hello Scott, hello Andrew, thank you both for your answers.
The problem is not reading into beans, the problem is reading broken CSVs.
My code is as simple as it gets, since most of the work is done in the annotated bean.
I understand exactly how and where these additional commas slip into these CSVs, so modifying the parsing result as described above would be the solution. Writing filters to fix the CSVs, on the other hand, amounts to doing all the dirty work in my own code, and there would be little to no benefit left in using opencsv as a second step after filtering.
I agree with you. You're doing things the right way. I've given it some thought, trying to come up with a clever bean trick (e.g. performing additional steps in the bean's setters), but I'm not coming up with anything obvious.
The essential problem is that the data are corrupt. There's only so much one can do with corrupt data.
I will then leave this as the initial request: RowProcessors should be able to alter the line structure.
Scott, that puts it back in your court, since the Row Processors are yours. It doesn't look like this request can easily be fulfilled without breaking backward compatibility.
Sorry for taking so long to respond. I was going through all the open tickets and found this. I am closing this as won't fix. The RowProcessor is designed to modify individual row elements, not to take an array of size 8 and turn it into an array of size 7.
We are NOT here to interpret corrupt data! There are just too many things to cover in that case. Users adding commas to their data without encapsulating the data in double quotes is one of them.
Personally, since you seem to know which field carries the corruption, I would say your specific case calls for a pre-parser program: read the input file and spit out a "cleansed" file in which, on rows that have eight elements instead of seven, you combine the two broken fields back into one, and then in the CSVWriter enable strictQuotes to force quotes around all fields. Or use the RFC4180Parser in your writer; I believe the effect will be the same. Then you can run the resulting output file through your current application without worry and let the first program handle all the corrupt possibilities.
A nice separation of concerns.
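The core of such a pre-parser could look roughly like this, using only the standard library (the naive split is adequate only under the assumption, consistent with this thread, that the broken input contains no quoted fields; quoting every output field by hand here stands in for what strictQuotes or the RFC4180Parser would do in CSVWriter):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvCleanser {

    /**
     * Hypothetical cleansing step: repairs a row that has one field too
     * many (eight instead of seven, per the suggestion above) and re-emits
     * every field wrapped in double quotes, with embedded quotes doubled.
     */
    public static String cleanseLine(String line) {
        String[] parts = line.split(",", -1); // naive split; assumes no quoted fields in the input
        if (parts.length == 8) {              // one field too many: re-join fields 5 and 6
            String[] fixed = new String[7];
            System.arraycopy(parts, 0, fixed, 0, 4);
            fixed[4] = parts[4] + "," + parts[5];
            System.arraycopy(parts, 6, fixed, 5, 2);
            parts = fixed;
        }
        List<String> quoted = new ArrayList<>();
        for (String p : parts) {
            quoted.add("\"" + p.replace("\"", "\"\"") + "\"");
        }
        return String.join(",", quoted);
    }
}
```

The cleansed, fully quoted file can then be fed to the existing bean-based application unchanged.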
Well, Scott, thank you for your answer. Unfortunately it just misses the point. This is not about creating JSONs; I know how to create valid JSONs. As explained above, I get JSONs from an external company, and I now know CSVWriter is not and will never be the tool to handle these files. A somewhat bad investment of my time. But maybe you are right; just don't expect me to be convinced.
JSON? Who said anything about JSON? Frank, are you posting feature requests in the wrong projects again???? <bg> Just kidding. No, your request was that you have a client sending you a corrupt CSV file and you are trying to parse it, but in this case the corruption is always consistent... if I am reading your original description correctly. The client data parses out to eight fields when it should be seven, because one of their columns has a comma in it but they did not put quotes around that field. Bad on them!!! </bg>
Sorry, too many things in parallel; in this case it was all about broken CSVs, the broken JSONs were a different issue.