
#103 Read UTF-8 files with first three bytes containing Byte Order Mark (BOM)

None
closed
None
5
2019-04-30
2019-01-31
No

Java does not correctly read UTF-8 files whose first three bytes are 0xEF, 0xBB, 0xBF (the Byte Order Mark).

See https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker

Writing a BOM at the start of a UTF-8 file is not required by the Unicode standard. Some programs read and write it correctly, some do not. Java's UTF-8 decoder does not strip the BOM, so CsvJdbc reads the first column name on the first line of the CSV file incorrectly.
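The effect on the first column name can be reproduced without CsvJdbc. The following is a minimal sketch, assuming a hypothetical header line `id,name`; the class and method names are illustrative and are not part of CsvJdbc.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Decodes bytes as UTF-8 with the plain JDK decoder, which does no BOM handling.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical header "id,name" preceded by the UTF-8 BOM bytes 0xEF 0xBB 0xBF.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                        'i', 'd', ',', 'n', 'a', 'm', 'e'};
        String firstColumn = decode(bytes).split(",")[0];

        // The BOM survives decoding as the character U+FEFF in front of "id".
        System.out.println(firstColumn.length());               // 3, not 2
        System.out.println(firstColumn.equals("id"));           // false
        System.out.println(firstColumn.charAt(0) == '\uFEFF');  // true
    }
}
```

Because U+FEFF is invisible in most consoles and editors, the column name looks correct when printed but does not match `id` in a comparison or a SQL query.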

Extend CsvJdbc to skip any BOM at the start of a UTF-8 file.
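One way to skip a leading BOM is to peek at the first three bytes and push them back if they are not 0xEF 0xBB 0xBF. The following is only a sketch of that idea, not the actual change made in CsvStatement.java (which is not shown in this ticket); `Utf8BomSkipper`, `skipBom` and `readAll` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    // Returns a stream positioned after a leading UTF-8 BOM, if one is present.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = (n == UTF8_BOM.length);
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!isBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the file unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    // Helper for the example: reads the whole stream and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int r;
        while ((r = in.read(buf)) >= 0) {
            out.write(buf, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                          'i', 'd', ',', 'n', 'a', 'm', 'e'};
        System.out.println(readAll(skipBom(new ByteArrayInputStream(withBom)))); // id,name
    }
}
```

Checking raw bytes before decoding keeps the check independent of the charset decoder. Files without a BOM, including files shorter than three bytes, pass through unchanged.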

An example CSV file with this problem is at https://github.com/hadley/readr/files/407456/utf8-bom.zip

Originally reported by Hutchenson in Help Discussion forum.

Discussion

  • Simon Chenery

    Simon Chenery - 2019-04-29

    Skip any three byte Byte Order Mark 0xEF 0xBB 0xBF at the start of a UTF-8 CSV file.
    Java's UTF-8 decoder does not strip a BOM, so CsvJdbc has to check for a BOM and skip it if found.

    Added unit test TestCsvDriver.testSkippingUtf8ByteOrderMark.

    Files changed:
    src/main/java/org/relique/jdbc/csv/CsvStatement.java
    src/test/java/org/relique/jdbc/csv/TestCsvDriver.java
    src/testdata/utf8_bom.csv

     
  • Simon Chenery

    Simon Chenery - 2019-04-29
    • status: open --> pending
    • Group: -->
     
  • Simon Chenery

    Simon Chenery - 2019-04-30
    • status: pending --> closed
     
  • Simon Chenery

    Simon Chenery - 2019-04-30

    Included in CsvJdbc version 1.0-35.

     