CsvJdbc - CSV file JDBC driver / Feature Requests / #103 Read UTF-8 files with first three bytes containing Byte Order Mark (BOM)


Milestone: None

Status: closed

Owner: Simon Chenery

Labels: None

Priority: 5

Updated: 2019-04-30

Created: 2019-01-31

Creator: Simon Chenery

Private: No

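Java does not correctly read UTF-8 files whose first three bytes are 0xEF, 0xBB, 0xBF (Byte Order Mark). The UTF-8 decoder does not strip the BOM but passes it through as the character U+FEFF at the start of the text.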

See https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker

Writing a BOM at the start of a UTF-8 file is permitted, but neither required nor recommended, by the Unicode Standard. Some programs read and write it correctly, some do not. Java does not strip it, so CsvJdbc reads the first column name on the first line of the CSV file incorrectly: the name has the BOM character prepended.
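
To illustrate the problem, here is a minimal sketch that decodes a header line the way plain Java does. The class `Utf8BomDemo` and the method `firstColumnName` are hypothetical names used only for this example and are not part of CsvJdbc.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of CsvJdbc.
class Utf8BomDemo {
    // Decodes the raw bytes of a header line with Java's standard UTF-8
    // decoder and returns the first column name, splitting on commas.
    static String firstColumnName(byte[] headerLine) {
        String line = new String(headerLine, StandardCharsets.UTF_8);
        return line.split(",")[0];
    }

    public static void main(String[] args) {
        // UTF-8 BOM (0xEF 0xBB 0xBF) followed by the header "id,name".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                          'i', 'd', ',', 'n', 'a', 'm', 'e'};
        String column = firstColumnName(withBom);
        System.out.println(column.length());         // 3, not 2
        System.out.println((int) column.charAt(0));  // 65279, i.e. U+FEFF
        System.out.println(column.equals("id"));     // false
    }
}
```

Because the decoded name is U+FEFF followed by `id`, a query that refers to the column as `id` would not match it.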

Extend CsvJdbc to skip any BOM at the start of a UTF-8 file.

An example CSV file with this problem is at https://github.com/hadley/readr/files/407456/utf8-bom.zip

Originally reported by Hutchenson in the Help Discussion forum.

Discussion

Simon Chenery - 2019-04-29

Skip any three-byte Byte Order Mark 0xEF 0xBB 0xBF at the start of a UTF-8 CSV file.
Java's UTF-8 decoder does not strip a BOM, so CsvJdbc has to check for a BOM and skip it if found.
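
One way such a check could look is sketched below. This is an assumption about the general technique, not the actual change made in CsvStatement.java; the class `Utf8BomSkipper` and the method `skipBom` are hypothetical names. It reads up to three bytes from the start of the stream and pushes them back if they are not a BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper; CsvStatement.java may implement the check differently.
class Utf8BomSkipper {
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    // Returns a stream positioned after a leading UTF-8 BOM, if one is present.
    // Bytes that are not a BOM are pushed back so that no data is lost.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int count = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (count < head.length) {
            int n = pushback.read(head, count, head.length - count);
            if (n < 0) {
                break; // end of stream before three bytes were read
            }
            count += n;
        }
        boolean isBom = count == UTF8_BOM.length;
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!isBom && count > 0) {
            pushback.unread(head, 0, count);
        }
        return pushback;
    }
}
```

The returned stream can then be wrapped in an `InputStreamReader` with the UTF-8 charset as usual. Files shorter than three bytes and files without a BOM pass through unchanged.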

Added unit test TestCsvDriver.testSkippingUtf8ByteOrderMark.

Files changed:
src/main/java/org/relique/jdbc/csv/CsvStatement.java
src/test/java/org/relique/jdbc/csv/TestCsvDriver.java
src/testdata/utf8_bom.csv


Simon Chenery - 2019-04-29

status: open --> pending

Simon Chenery - 2019-04-30

status: pending --> closed

Simon Chenery - 2019-04-30

Included in CsvJdbc version 1.0-35.
