Parse files starting with UTF-8 byte order mark
Brought to you by:
aruckerjones,
sconway
When parsing a file that starts with the three-byte UTF-8 BOM (0xEF 0xBB 0xBF), the first column header is not picked up correctly. Adding -Dfile.encoding=UTF-8 does not fix it. After parsing bad_file.csv, the description column remains null:
$ xxd bad_file.csv
00000000: efbb bf64 6573 6372 6970 7469 6f6e 2c49  ...description,I
00000010: 5349 4e0a 7878 782c 7979 790a            SIN.xxx,yyy.

$ xxd good_file.csv
00000000: 6465 7363 7269 7074 696f 6e2c 6973 696e  description,isin
00000010: 0a61 6161 2c62 6262 0a                   .aaa,bbb.
Both dos2unix and sed '1s/^\xEF\xBB\xBF//' will strip that BOM from the file.
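If you'd rather not preprocess the file, the same thing can be done in Java before the stream ever reaches the CSV parser. This is a minimal sketch (BomStripper and skipUtf8Bom are hypothetical names, not part of any library): it peeks at the first three bytes with a PushbackInputStream and pushes them back if they are not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned past a leading UTF-8 BOM, if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = pb.read(head, 0, head.length);
        boolean isBom = n == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!isBom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: push the bytes back untouched
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) clean.read()); // prints 'a'
    }
}
```

You could then wrap the returned stream in an InputStreamReader and hand that to the parser as usual.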
I see you're answering all of your own questions today. :)
Our project takes the view that the input must be a well-formed Unicode string. How the user gets it there is outside the scope of our project. This question has been raised at least once before in another ticket.
It would have been really nice to get some warning like "HEY!!!! we found a UTF-8 BOM at the start of this CSV file and we're including it in the XXX column heading. You probably don't want this. Go fix your input file."
It took me ages to figure out why a daily file parser suddenly broke. The file still looked fine as plain ASCII. Finally I opened it in a hex editor to look for unprintable characters, and there they were.
I could add a warning like that. To suppress it, maybe add something like this to CsvToBeanBuilder:
.withBomWarningEnabled(false)
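Note that when the bytes are decoded as UTF-8, the BOM shows up as the single character U+FEFF glued to the first header name, which is why the column lookup fails. A warning like the one proposed could be implemented by checking the first header cell after it is read. This is only a sketch of the idea (HeaderBomCheck and stripAndWarn are hypothetical names, and .withBomWarningEnabled is the proposed option above, not an existing one):

```java
public class HeaderBomCheck {
    /** Strips a decoded UTF-8 BOM (U+FEFF) from the first header cell, warning if one is found. */
    public static String stripAndWarn(String firstHeader) {
        if (firstHeader != null && !firstHeader.isEmpty() && firstHeader.charAt(0) == '\uFEFF') {
            System.err.println("Warning: UTF-8 BOM found at the start of CSV header '"
                    + firstHeader.substring(1) + "'. You probably want to fix your input file.");
            return firstHeader.substring(1);
        }
        return firstHeader;
    }

    public static void main(String[] args) {
        System.out.println(stripAndWarn("\uFEFFdescription")); // prints "description"
        System.out.println(stripAndWarn("isin"));              // prints "isin"
    }
}
```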
Last edit: Andrew M 2020-08-13