Parse files starting with UTF-8 byte order mark
Brought to you by:
aruckerjones,
sconway
When parsing a file that starts with the three-byte UTF-8 BOM (0xEF 0xBB 0xBF), the first column header is not picked up correctly. Adding -Dfile.encoding=UTF-8 does not fix it. After parsing bad_file.csv, the description column remains null:
$ xxd bad_file.csv
00000000: efbb bf64 6573 6372 6970 7469 6f6e 2c49  ...description,I
00000010: 5349 4e0a 7878 782c 7979 790a            SIN.xxx,yyy.

$ xxd good_file.csv
00000000: 6465 7363 7269 7074 696f 6e2c 6973 696e  description,isin
00000010: 0a61 6161 2c62 6262 0a                   .aaa,bbb.
Both dos2unix and sed '1s/^\xEF\xBB\xBF//' will strip that BOM from the file.
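If you'd rather not preprocess the file, the same thing can be done in Java before the stream ever reaches the CSV parser. This is a minimal sketch (BomStripper and skipUtf8Bom are hypothetical names, not part of any library): it peeks at the first three bytes with a PushbackInputStream and pushes them back if they are not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned past a leading UTF-8 BOM, if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = pb.read(head, 0, head.length);
        boolean isBom = n == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!isBom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: push the bytes back untouched
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) clean.read()); // prints 'a'
    }
}
```

You could then wrap the returned stream in an InputStreamReader and hand that to the parser as usual.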
I see you're answering all of your own questions today. :)
Our project takes the view that the input must be a well-formed Unicode string. How the user gets it there is outside the scope of our project. This question has been raised at least once before in another ticket.
It would have been really nice to get some warning like "HEY!!!! we found a UTF-8 BOM at the start of this CSV file and we're including it in the XXX column heading. You probably don't want this. Go fix your input file."
It took me ages to figure out why a daily file parser suddenly broke. The file still looked fine as plain ASCII. Finally I opened it in a hex editor to look for unprintable characters, and there they were.
I could add a warning like that. To suppress it, maybe add something like this to CsvToBeanBuilder:
.withBomWarningEnabled(false)
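Note that when the bytes are decoded as UTF-8, the BOM shows up as the single character U+FEFF glued to the first header name, which is why the column lookup fails. A warning like the one proposed could be implemented by checking the first header cell after it is read. This is only a sketch of the idea (HeaderBomCheck and stripAndWarn are hypothetical names, and .withBomWarningEnabled is the proposed option above, not an existing one):

```java
public class HeaderBomCheck {
    /** Strips a decoded UTF-8 BOM (U+FEFF) from the first header cell, warning if one is found. */
    public static String stripAndWarn(String firstHeader) {
        if (firstHeader != null && !firstHeader.isEmpty() && firstHeader.charAt(0) == '\uFEFF') {
            System.err.println("Warning: UTF-8 BOM found at the start of CSV header '"
                    + firstHeader.substring(1) + "'. You probably want to fix your input file.");
            return firstHeader.substring(1);
        }
        return firstHeader;
    }

    public static void main(String[] args) {
        System.out.println(stripAndWarn("\uFEFFdescription")); // prints "description"
        System.out.println(stripAndWarn("isin"));              // prints "isin"
    }
}
```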
Last edit: Andrew M 2020-08-13