Announcing GEDCOM Lexer Plugin

  • smitchell

    smitchell - 2014-01-25


    I have authored a Notepad++ lexer plugin (GedcomLexer.dll) for GEDCOM files, the standard file format used by genealogy applications to exchange data.

    User-defined languages for GEDCOM already exist; they work by treating tags as keywords. However, syntax checking and level-based folding require a real lexer.

    The lexer follows the data representation grammar of GEDCOM specification version 5.5.1. It recognizes the possible tokens in a line: level, xref_id, tag, user tag, pointer, value, and escape. Each of these tokens has a default style supplied by GedcomLexer.xml and can be customized by the Style Configurator. When an invalid character in a token is detected, the lexer enters the Invalid state and outputs the remainder of the line in the Invalid style (default red). The Invalid state is reset when the end of line is reached.
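To make the token list above concrete, here is a minimal sketch in Python of splitting one GEDCOM line into those tokens. It is not the plugin's code (the plugin is a DLL). The regular expressions and the `tokenize` helper are assumptions written for illustration, and they simplify the 5.5.1 grammar. Unlike the plugin, which styles only the remainder of a line as Invalid, this sketch marks the whole line invalid when it does not match.

```python
import re

# Hypothetical, simplified patterns; the real 5.5.1 grammar is stricter.
XREF = r"@[A-Za-z0-9][^@]*@"
LINE_RE = re.compile(
    r"^(?P<level>\d{1,2})"          # level: one or two digits
    rf"(?: (?P<xref_id>{XREF}))?"   # optional cross-reference id
    r" (?P<tag>[A-Za-z0-9_]+)"      # tag, or user tag if it starts with "_"
    r"(?: (?P<value>.*))?$"         # optional line value
)
POINTER_RE = re.compile(XREF)
ESCAPE_RE = re.compile(r"(?P<escape>@#[^@]*@) ?(?P<rest>.*)")


def tokenize(line):
    """Return a list of (token_type, text) pairs for one GEDCOM line."""
    text = line.rstrip("\r\n")
    m = LINE_RE.match(text)
    if m is None:
        # Simplification: the whole line is reported as invalid.
        return [("invalid", text)]

    tokens = [("level", m.group("level"))]
    if m.group("xref_id"):
        tokens.append(("xref_id", m.group("xref_id")))

    tag = m.group("tag")
    tokens.append(("user_tag" if tag.startswith("_") else "tag", tag))

    value = m.group("value")
    if value:
        if POINTER_RE.fullmatch(value):
            tokens.append(("pointer", value))
        else:
            esc = ESCAPE_RE.match(value)
            if esc:
                tokens.append(("escape", esc.group("escape")))
                if esc.group("rest"):
                    tokens.append(("value", esc.group("rest")))
            else:
                tokens.append(("value", value))
    return tokens


if __name__ == "__main__":
    for sample in ["0 @I1@ INDI", "1 NAME John /Smith/", "1 FAMC @F1@"]:
        print(tokenize(sample))
```

Each token type returned here corresponds to one of the styles named above; a lexer would apply the matching style to that span of the line.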

    In the current release, folding is based on the line level. In GEDCOM files, logical records begin at line level 0. Subordinate lines with levels 1 or higher belong to the logical record defined by the level 0 line that precedes them. Folding therefore lets a user see only level 0 lines (logical record starts), or level 0 lines plus selected additional levels, giving some control over the amount of detail displayed.
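The level-based folding described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the plugin talks to the editor; the helper names `line_level`, `visible_lines`, and `fold_headers` are hypothetical.

```python
def line_level(line):
    """Return the GEDCOM level at the start of a line, or None if absent."""
    head = line.strip().split(" ", 1)[0]
    return int(head) if head.isascii() and head.isdigit() else None


def visible_lines(lines, max_level):
    """Keep only lines whose level is at most max_level (deeper ones are folded)."""
    kept = []
    for line in lines:
        level = line_level(line)
        if level is not None and level <= max_level:
            kept.append(line)
    return kept


def fold_headers(lines):
    """Return indices of lines that are immediately followed by a deeper line."""
    levels = [line_level(line) for line in lines]
    return [
        i
        for i in range(len(lines) - 1)
        if levels[i] is not None
        and levels[i + 1] is not None
        and levels[i + 1] > levels[i]
    ]


if __name__ == "__main__":
    record = [
        "0 @I1@ INDI",
        "1 NAME John /Smith/",
        "2 GIVN John",
        "1 BIRT",
        "2 DATE 1 JAN 1900",
        "0 TRLR",
    ]
    print(visible_lines(record, 0))
    print(fold_headers(record))
```

Calling `visible_lines(record, 0)` corresponds to folding everything down to record starts, while a larger `max_level` reveals more detail.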

    The plugin has been tested with Notepad++ 6.5.2 on Windows XP, 7, and 8.
    It has been tested with a variety of GEDCOM files (*.ged) in UTF-8, UTF-16, ANSI, and ASCII encodings.
    The ANSEL character set is not supported in this release.

    This plugin project is hosted at SourceForge, where the DLL and source files can be found.

    To find some GEDCOM files for testing, perform a Google search for:

    "0 head" filetype:ged

    I welcome any feedback.


  • Mark Baines

    Mark Baines - 2014-01-26

    Thank you for this; it is far better than the user-defined attempt I made.
    I fear there is too much folding: when completely folded, the file means nothing to anyone, as all you're left with is a list of meaningless IDs.
    One thing I would like is some way to make the NAME tag field stand out. At the moment I've made the ID stand out, as that is unique and related to an individual, but having the surnames highlighted would be great so I can scan down the page looking for people more easily. Could the NAME tag and its contents be highlighted somehow?
    all the best

  • smitchell

    smitchell - 2014-01-27

    Thanks for your comments, Mark.
    I will look into ways of exposing NAMEs to make navigation through the file easier. Highlighting the NAME tag and its contents is certainly a possibility.

  • Mark Baines

    Mark Baines - 2014-01-27

    Brilliant - thanks Stan.

    • smitchell

      smitchell - 2014-02-12

      GedcomLexer v0.2 was released a couple of days ago.

      This release adds support for the Function List, which can display NAMEs from INDI records and makes it easier to locate persons.

      I have a detailed description of how to set it up at the project website.

  • Mark Baines

    Mark Baines - 2014-02-13

    Great idea, very clever, well done!
    Unfortunately, sorting is on the first name in the field, which obviously isn't the surname. Is that the fault of Function List, or can you tweak it in the parser element?
    all the best

    • smitchell

      smitchell - 2014-02-13

      Yes, the sorting on given name bothered me too.
      I believe that is a limitation of a parser based purely on a pair of regular expressions: you get a match string and cannot rearrange its parts to make the surname come first. Perl regular expressions can do a lot of things that surprised me, so someone with more expertise might find a way!
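For comparison, here is a minimal sketch in Python of the rearrangement that a pure match string cannot do. It works outside Notepad++ and the Function List, so it does not solve the sorting issue there; it only illustrates the transformation being discussed. The pattern and the helpers `surname_first` and `sorted_names` are assumptions, and the sketch relies on the GEDCOM convention of enclosing the surname in slashes. Any text after the closing slash (such as a suffix) is ignored.

```python
import re

# Hypothetical pattern: level 1 NAME line with the surname between slashes.
NAME_RE = re.compile(r"^1 NAME (?P<given>[^/]*)/(?P<surname>[^/]*)/(?P<suffix>.*)$")


def surname_first(line):
    """Turn '1 NAME John /Smith/' into 'Smith, John'; return None for other lines."""
    m = NAME_RE.match(line.strip())
    if m is None:
        return None
    given = m.group("given").strip()
    surname = m.group("surname").strip()
    return f"{surname}, {given}" if given else surname


def sorted_names(lines):
    """Collect all NAME lines and sort them by surname, ignoring case."""
    names = [name for name in map(surname_first, lines) if name]
    return sorted(names, key=str.casefold)


if __name__ == "__main__":
    sample = [
        "0 @I1@ INDI",
        "1 NAME John /Smith/",
        "0 @I2@ INDI",
        "1 NAME Ann /Baker/",
    ]
    print(sorted_names(sample))
```

The key difference from a Function List parser is that the named groups can be reassembled in any order before sorting.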

