#7 utf-8 byte-order mark in gedcom

closed
elsapo
None
5
2007-04-29
2004-04-26
Robert Simms
No

If it's not considered to always be desirable, perhaps a user option
could be added (set through user/options) to control whether
the byte order mark gets inserted at the beginning of a Gedcom
export.

The byte order mark also serves to identify the data as UTF-8 to
programs that import it, if the exported data is in UTF-8. In that
case the byte order mark is $ef$bb$bf (not sensitive to byte order).
Keeping it optional would be good, since some programs may still
look for "0 " as the first characters of a Gedcom file. In that
case, those programs should allow the imported data to be specified
as being in the UTF-8 encoding.
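
As an illustration of the request above, here is a minimal sketch of an optional BOM write at the start of an export. The helper name write_bom_if_wanted, its parameters, and the exact-match codeset test are assumptions made for this example; they are not taken from the LifeLines sources.

```c
#include <stdio.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF; the same three bytes on every platform. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper: write the BOM only when the output codeset is
 * UTF-8 and the (hypothetical) user option want_bom is set.
 * The codeset name is compared exactly; a real implementation would
 * probably want a case-insensitive match.
 * Returns 0 on success (including when nothing is written), -1 on error.
 */
static int write_bom_if_wanted(FILE *fp, const char *codeset, int want_bom)
{
    if (!want_bom || codeset == NULL)
        return 0;
    if (strcmp(codeset, "UTF-8") != 0)
        return 0;
    return fwrite(UTF8_BOM, 1, sizeof UTF8_BOM, fp) == sizeof UTF8_BOM ? 0 : -1;
}
```

Called once before the first "0 HEAD" line is written, this leaves non-UTF-8 exports and BOM-averse importers unaffected.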

This option would have to be sensitive to the GedcomCodeset
option.

If UTF-16 or UTF-32 were an option, then the byte order mark
should probably be handled as a character so that a system's byte
order would apply to it as well.
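
To illustrate the UTF-16 point: if the mark is written as the character U+FEFF rather than as fixed bytes, the host's byte order shows up in the output automatically. The function name below is hypothetical and the sketch covers UTF-16 only.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Write U+FEFF as one 16-bit code unit in the host's byte order.
 * A little-endian host emits FF FE; a big-endian host emits FE FF,
 * so a reader can infer the byte order of the rest of the file.
 * Returns 0 on success, -1 on write error.
 */
static int write_utf16_bom_native(FILE *fp)
{
    const uint16_t bom = 0xFEFF;
    return fwrite(&bom, sizeof bom, 1, fp) == 1 ? 0 : -1;
}
```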

Discussion

  • elsapo
    elsapo
    2005-06-23

    Logged In: YES
    user_id=1195173

    Relevant code:

    src/liflines/loadsave.c: save_gedcom()

    calls

    src/liflines/export.c: archive_in_file()

    I'm not sure how to tell if we're writing UTF-8, but it
    probably can be figured out from archive_in_file's local
    variable xlat_gedout and possibly the global uu8.

     
  • elsapo
    elsapo
    2007-04-14

    Logged In: YES
    user_id=1195173
    Originator: NO

    The load code now checks for a BOM: do_import in import.c calls check_file_for_unicode.

    I don't think the export code is writing a BOM however, so that is still outstanding.

     
  • elsapo
    elsapo
    2007-04-14

    • assigned_to: nobody --> elsapo
     
  • elsapo
    elsapo
    2007-04-14

    Logged In: YES
    user_id=1195173
    Originator: NO

    BOMs in edit files are currently written on win32 and nowhere else. (They're only written if the editor codeset is UTF-8, of course.)

    This is controlled by the existing function should_write_bom (I just renamed it to that to clarify its purpose), in
    src/gedlib/nodeio.c.

    I'll leave the logic like that for now; that function could always be extended to read a user option, if it is desirable to change that default behavior.

    I'll use that same function to control writing BOMs to GEDCOM exports.

     
  • elsapo
    elsapo
    2007-04-14

    Logged In: YES
    user_id=1195173
    Originator: NO

    Fixed cvs to write BOM to GEDCOM where appropriate.

     
  • elsapo
    elsapo
    2007-04-14

    Logged In: YES
    user_id=1195173
    Originator: NO

    I've revised the cvs to believe the input GEDCOM BOM first; if there isn't one, then the input GEDCOM CHAR declaration; and if there isn't that either, to fall back to the option variable GedcomCodeset.

    Therefore I believe this is reasonably implemented (with the caveat that only UTF-8 BOMs are handled).

     
  • elsapo
    elsapo
    2007-04-14

    • status: open --> pending
     
  • Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
    • status: pending --> closed
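
The import precedence described in the discussion (BOM first, then the CHAR declaration, then the GedcomCodeset option) can be sketched as follows. This is an illustration only: the function names and parameters are hypothetical and do not reflect the actual code in import.c, and, as in the thread, only the UTF-8 BOM is recognized.

```c
#include <stddef.h>

/* Returns nonzero if the buffer starts with the UTF-8 BOM (EF BB BF). */
static int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}

/*
 * Hypothetical resolver for the input codeset.
 *   head, len      : first bytes of the GEDCOM file
 *   char_decl      : value of the file's CHAR line, or NULL if absent
 *   option_codeset : value of the GedcomCodeset option (the fallback)
 */
static const char *resolve_input_codeset(const unsigned char *head, size_t len,
                                         const char *char_decl,
                                         const char *option_codeset)
{
    if (has_utf8_bom(head, len))
        return "UTF-8";                 /* 1. trust the BOM */
    if (char_decl != NULL && char_decl[0] != '\0')
        return char_decl;               /* 2. then the CHAR declaration */
    return option_codeset;              /* 3. finally the user option */
}
```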