issue with parsing multi-part family names in bibutils 4.17

JeffH
2013-02-28
2013-03-02
  • JeffH

    JeffH - 2013-02-28

    Given a bibtex input such as:

    @ARTICLE{amoeba-os,
      author = {Sape J. Mullender and Guido van Rossum and Andrew Tannenbaum and
        Robbert van Renesse and Hans van Staveren},
      title = {{Amoeba: A Distributed Operating System for the 1990s}},
      journal = {Computer},
      year = {1990},
      volume = {23},
      pages = {44-53},
      number = {5},
      address = {Washington, DC, USA},
      publisher = {IEEE}
    }
    

    bib2xml will map it to MODS such that the "van" portions of the family names are typed as part of "given" names:

        <name type="personal">
            <namePart type="given">Guido</namePart>
            <namePart type="given">van</namePart>
            <namePart type="family">Rossum</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">author</roleTerm>
            </role>
        </name>
    

    This is suboptimal in various ways. For example, when the MODS entry is then converted to another format (such as xml2rfc [RFC2629]), one ends up with the initials for "Guido van Rossum" being "G. v.", which is incorrect:

        <author fullname="Guido van Rossum" initials="G. v." surname="Rossum"/>
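To make the failure mode concrete, here is a minimal Python sketch of how a downstream converter might derive initials from the MODS record above. It is not xml2rfc or bibutils code; the function name `initials_and_surname` and the "first letter of every given part" rule are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# The MODS fragment produced by bib2xml in the report above (role omitted).
MODS_NAME = """
<name type="personal">
    <namePart type="given">Guido</namePart>
    <namePart type="given">van</namePart>
    <namePart type="family">Rossum</namePart>
</name>
"""

def initials_and_surname(name_xml):
    """Build initials from every 'given' namePart and the surname from 'family' parts."""
    name = ET.fromstring(name_xml)
    parts = name.findall("namePart")
    given = [p.text for p in parts if p.get("type") == "given"]
    family = [p.text for p in parts if p.get("type") == "family"]
    # Each given part contributes one initial, so a mistyped "van" leaks in here.
    initials = " ".join(g[0] + "." for g in given)
    return initials, " ".join(family)

print(initials_and_surname(MODS_NAME))
```

Because "van" is typed as "given", it contributes an initial and is missing from the surname, which matches the xml2rfc output shown above.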
    

    There are likely other similar multi-part family names that are so affected.

    Is using the --asis switch a workaround? What is the format of the asis file? In Guido's case, do I simply put "van Rossum" on a separate line in the asis file?

    thanks.

     
  • Nick Bart

    Nick Bart - 2013-03-01

    bibutils does already distinguish name suffixes by mapping them to, e.g.,

    <namePart type="suffix">Jr.</namePart>
    

    It would probably be a good idea if bibutils did the same thing for name prefixes, e.g.,

    <namePart type="prefix">van</namePart>
    

    As to parsing bibtex/biblatex entries, AFAIK bibtex and biblatex simply assume that all lower-case words at the beginning of the last/family name are part of the name prefix.

    The biblatex manual mentions a few examples of name prefixes (“Name lists are parsed and split up into the individual items at the "and" delimiter. Each item in the list is then dissected into four name components: the first name, the name prefix (von, van, of, da, de, della, ...), the last name, and the name suffix (junior, senior, ...).”) – but judging from the source files, it does not specifically search for strings such as "von, van, of, da, de, della, ...", so I’m assuming a search for lower-case words at the start of the family name would do for bibutils. If anyone knows better, please correct me.
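The lower-case heuristic described above could be sketched roughly as follows. This is only an illustration in Python, not bibtex, biblatex, or bibutils code; the function name `split_name` and the returned keys are hypothetical, and it only handles the plain "First von Last" form (no commas, braces, or suffixes).

```python
def split_name(name):
    """Split a 'First von Last' name using letter case only.

    Lower-case words found before the final token are treated as the prefix;
    the final token is never taken as prefix, so the family name is not empty.
    """
    tokens = name.split()
    if len(tokens) < 2:
        return {"given": "", "prefix": "", "family": name.strip()}

    # Position of the first lower-case word, ignoring the final token.
    first_lower = next(
        (i for i, tok in enumerate(tokens[:-1]) if tok[0].islower()), None
    )
    if first_lower is None:
        # No prefix candidate: fall back to "last token is the family name".
        return {"given": " ".join(tokens[:-1]), "prefix": "", "family": tokens[-1]}

    # The prefix is the run of consecutive lower-case words starting there.
    end = first_lower
    while end < len(tokens) - 1 and tokens[end][0].islower():
        end += 1

    return {
        "given": " ".join(tokens[:first_lower]),
        "prefix": " ".join(tokens[first_lower:end]),
        "family": " ".join(tokens[end:]),
    }

print(split_name("Guido van Rossum"))
print(split_name("Sape J. Mullender"))
```

A name with no lower-case words, such as a two-part Spanish surname, falls through to the "last token" branch, so this heuristic does nothing for those cases.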

     
  • JeffH

    JeffH - 2013-03-01

    > bibutils does already distinguish name suffixes by mapping them to, e.g.,
    >
    > <namePart type="suffix">Jr.</namePart>

    Hm, so it does. Well, this is incorrect according to the MODS 3.4 schema:

    http://www.loc.gov/standards/mods/userguide/name.html#namepart

    It should be:

    <namePart type="termsOfAddress">Jr.</namePart>
    

    from http://www.loc.gov/standards/mods/v3/mods-3-4.xsd:

    <xs:simpleType name="namePartTypeAttributeDefinition">
      <xs:restriction base="xs:string">
        <xs:enumeration value="date"/>
        <xs:enumeration value="family"/>
        <xs:enumeration value="given"/>
        <xs:enumeration value="termsOfAddress"/>
      </xs:restriction>
    </xs:simpleType>
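As a quick way to spot such values, the enumeration quoted above can be checked against a record with a few lines of Python. This is only a sketch and not real schema validation; the function name `invalid_name_part_types` and the sample record are made up for illustration, and the input is assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Values copied from namePartTypeAttributeDefinition in the MODS 3.4 schema above.
ALLOWED_NAME_PART_TYPES = {"date", "family", "given", "termsOfAddress"}

def invalid_name_part_types(name_xml):
    """Return the namePart type values that are not in the MODS 3.4 enumeration."""
    root = ET.fromstring(name_xml)
    return [
        part.get("type")
        for part in root.iter("namePart")
        if part.get("type") is not None
        and part.get("type") not in ALLOWED_NAME_PART_TYPES
    ]

# Hypothetical record using the "suffix" type discussed in this thread.
SAMPLE = """
<name type="personal">
    <namePart type="given">John</namePart>
    <namePart type="family">Smith</namePart>
    <namePart type="suffix">Jr.</namePart>
</name>
"""

print(invalid_name_part_types(SAMPLE))
```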
    

    > It would probably be a good idea if bibutils did the same thing for name prefixes, e.g.,
    >
    > <namePart type="prefix">van</namePart>

    Well, since there's no "prefix" value in namePartTypeAttributeDefinition, bibutils probably shouldn't do that.

    Rather, it would be good if it could recognize multi-token family names, such that for, e.g., "Guido van Rossum" it produced:

    <name type="personal">
        <namePart type="given">Guido</namePart>
        <namePart type="family">van Rossum</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    

    > As to parsing bibtex/biblatex entries, AFAIK bibtex and biblatex simply assume that all lower-case words at the beginning of the last/family name are part of the name prefix.
    >
    > The biblatex manual mentions a few examples of name prefixes (“Name lists are parsed and split up into the individual items at the "and" delimiter. Each item in the list is then dissected into four name components: the first name, the name prefix (von, van, of, da, de, della, ...), the last name, and the name suffix (junior, senior, ...).”) – but as judged by checking the source files, it does not specifically search for strings such as "von, van, of, da, de, della, ...", so I’m assuming a search for lower-case words at the start of the family name would do for bibutils. If anyone knows better, please correct me.

    Well, first, in looking at the bibutils code, it simply takes the last whitespace-bounded token that occurs before any common American-style name suffixes, if any. See name_multielement_nocomma() in name.c. So yeah, that logic could be enhanced to figure out multi-token family names.
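For readers who don't want to dig through name.c, the behaviour described above amounts to something like the following Python approximation. It is not a port of name_multielement_nocomma(); the function name `split_last_token` and the suffix list are assumptions made for illustration only.

```python
# Assumed list of American-style suffixes; the real list in bibutils may differ.
SUFFIXES = {"Jr.", "Jr", "Sr.", "Sr", "II", "III", "IV"}

def split_last_token(name):
    """Take the last whitespace-bounded token before any suffix as the family name."""
    tokens = name.split()
    suffix = ""
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        suffix = tokens.pop()
    if not tokens:
        return {"given": [], "family": "", "suffix": suffix}
    # Everything before the final token is treated as a given name,
    # which is how "van" ends up typed as "given".
    return {"given": tokens[:-1], "family": tokens[-1], "suffix": suffix}

print(split_last_token("Guido van Rossum"))
print(split_last_token("John Smith Jr."))
```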

    However, performing the latter is pretty hairy except for relatively simple cases. See for example Spanish personal naming traditions/customs https://en.wikipedia.org/wiki/Spanish_naming_customs -- an ostensible multi-token surname may not have any "de" in it.

    Dutch customs appear to be somewhat simpler: https://en.wikipedia.org/wiki/Dutch_name

    American/English hyphenated surnames ought to be recognized as well. Plus there are likely particular customs for other countries/languages (e.g., Spanish also uses hyphenated surnames).

    In nosing around I see various packages for "parsing human names" in PHP, Python, Ruby, and JavaScript, but nothing obvious popped up for C, which is sort of odd because someone is likely to have tackled this in the past. Or maybe something can be adapted/ported.

    In any case, in the meantime, yeah, maybe a few simplistic fixes to bibutils would be a reasonable start.

    hope this helps.

     
  • Chris Putnam

    Chris Putnam - 2013-03-01

    Bibutils currently does the right thing if the name is:

    van Rossum, Guido

    or (for bib2xml and biblatex2xml)

    Guido {van Rossum}

    or even

    {van Rossum}, Guido
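To show why these three spellings are unambiguous while plain "Guido van Rossum" is not, here is a small Python sketch of a splitter that understands only the comma and brace-grouped forms. It is not how bibutils parses names; `split_bibtex_name` is a hypothetical helper, and it assumes there are no commas inside braces and no nested braces.

```python
import re

def split_bibtex_name(name):
    """Split 'Last, First' and 'First {Last}' style names into given/family."""
    name = name.strip()
    if "," in name:
        # 'van Rossum, Guido' or '{van Rossum}, Guido': the comma marks the split.
        family, given = (part.strip() for part in name.split(",", 1))
    else:
        # 'Guido {van Rossum}': a trailing brace group is the family name.
        match = re.search(r"\{([^{}]*)\}\s*$", name)
        if match:
            family = match.group(1)
            given = name[: match.start()].strip()
        else:
            # No grouping hint at all: only the last token can be taken.
            given, _, family = name.rpartition(" ")
    return {"given": given, "family": family.strip("{} ")}

for example in ("van Rossum, Guido", "Guido {van Rossum}", "{van Rossum}, Guido"):
    print(split_bibtex_name(example))
```

All three inputs give the family name "van Rossum"; the ungrouped "Guido van Rossum" falls into the last branch and loses the "van".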

    The proposal you're making, however, will also improve parsing for other formats that don't have a natural way to group items à la bibtex/biblatex. So it seems like a net win, even if it breaks the occasional name, and it doesn't require me to try and keep a comprehensive list of all humanly-given prefixes to given names.

    The --asis/--corp files don't help here. They only apply to a full name and tell the software not to touch the name at all. (This is also redundant with the use of brackets in bib2xml and biblatex2xml, but again is useful for other bibliography formats.)

    The format for these, which I guess isn't documented here (I'll try to add it), is just a text file with one name per file. The strings must match exactly, though (including spaces and capitalization).

     
  • JeffH

    JeffH - 2013-03-01

    > or (for bib2xml and biblatex2xml)
    >
    > Guido {van Rossum}
    >
    > or even
    >
    > {van Rossum}, Guido

    Oh yeah huh

    thanks Chris.

    I'll edit my bibtex

    meanwhile, mentioning tricks like that somewhere hereabouts would be good :)

     
    • Chris Putnam

      Chris Putnam - 2013-03-01

      I'm surprised to find out that "Guido van Rossum" works. My recollection from (now quite old) bibtex days was that you had no choice but to use "Guido {van Rossum}" or "van Rossum, Guido" to get the reference properly parsed.

      As I said, I do like the biblatex way of recognizing lower case prefixes to given names. bibutils should be doing that.

       
      • JeffH

        JeffH - 2013-03-02

        > As I said, I do like the biblatex way of recognizing lower case prefixes to given names. bibutils should be doing that.

        I sorta replied on this down below, but to add to it, I'd suggest taking a real close look at the biblatex code to figure out what they are doing for family name parsing and how, because if they are doing something really simplistic, e.g. looking for lower case family name "prefixes", it may not work all that well in actuality.

         
  • JeffH

    JeffH - 2013-03-01

    > The format for [ --asis/--corp files ] ... is just a text file with one name per file

    you mean "one name per line" I presume?

    I tried it to see if it would help with the handling of, say, "Anne van Kesteren", and it didn't appear to work: the name was still parsed into different nameParts and "van" was typed as "given". So I guess I still don't understand what --asis is supposed to do in the context of bib2xml.

    thanks again.

     
    • Chris Putnam

      Chris Putnam - 2013-03-01

      Yes, one name per line.

      Hmm. Looks like --asis comparisons are broken for bib2xml and biblatex2xml (but not the other converters). I'm guessing the latex processing is being done prior to asis list comparisons. I'll have to fix that.

      But I'd argue that --asis and --corp are really redundant for these input formats since use of brackets will quickly get you what you want.
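Based only on the description in this thread (one name per line, exact match including spaces and capitalization), the intended --asis behaviour could be sketched in Python as below. This is not bibutils code; `load_asis`, `is_asis`, and the sample file contents are hypothetical, and per the post above the comparison was reported broken for bib2xml and biblatex2xml at the time.

```python
def load_asis(text):
    """Read an asis list: one name per line, kept exactly as written."""
    # Blank lines are skipped; nothing else is trimmed or case-folded.
    return {line for line in text.splitlines() if line.strip()}

def is_asis(name, asis_names):
    """A name is left untouched only if it matches a list entry exactly."""
    return name in asis_names

# Hypothetical asis file contents.
asis_names = load_asis("Anne van Kesteren\nWorld Wide Web Consortium\n")

print(is_asis("Anne van Kesteren", asis_names))
print(is_asis("anne van kesteren", asis_names))
```

The second lookup fails because of capitalization alone, which is the "must match exactly" caveat mentioned earlier in the thread.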

       
  • JeffH

    JeffH - 2013-03-01

    > But I'd argue that --asis and --corp are really redundant for these input formats since use of brackets will quickly get you what you want.

    agreed.

     
  • JeffH

    JeffH - 2013-03-02

    > The proposal you're making, however, will also improve parsing for other formats that don't have a natural way to group items ala bibtex/biblatex. So it seems like a net win, even if breaks the occasional name and doesn't require me to try and keep a comprehensive list of all humanly-given prefixes to given names.

    If you're referring to Nick's proposal (above) to add "...a search for lower-case words at the start of the family name ..." and use that heuristic in parsing for family name -- I don't think it would be a terribly good idea because even though it might work ok overall for say Dutch, it would hardly work at all for Spanish (for example).

    So that's why I was suggesting if you really wish to do more thorough parsing of human personal names, then perhaps leveraging some package that already does that would be the way to go.

     
  • Nick Bart

    Nick Bart - 2013-03-02

    First, bibtex’s parsing of prefix - or "von" - name parts is indeed based on checking for upper/lower case. It also looks for a few other things; this is described rather fully in
    http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf. I assume – but it’s no more than an assumption – that biblatex does the same.

    Next, my first reaction was to propose introducing a "prefix" element for name prefixes.

    (And, yes, I know neither "suffix" nor "prefix" are in the official MODS specification, but "termsOfAddress" strikes me as utterly silly and useless since it may be used indiscriminately for both prefixes and suffixes: http://www.loc.gov/standards/mods/userguide/name.html, e.g., has "Dr.", "Jr.", "II", "Pope", etc. Someone should complain at MODS; and for the sake of clarity, I’d strongly vote for bibutils sticking to "suffix".)

    On second thoughts, however, just introducing "prefix" seems like a rather incomplete solution.

    I’d say, now, that there are two options. One is to not change bibutils’ behaviour concerning name prefixes at all. The main supporting argument here is that most database formats do not offer any more than two name fields anyway, one for given name(s) and one for family name(s), and that the processors I am aware of that make use of these data (bibtex, biblatex, the various citeprocs) all have their own more or less well-functioning heuristics to parse two-field names for prefixes (and, sometimes, suffixes).

    The other, more ambitious solution would be to try and use all information available in the source data (in particular, biblatex has additional info in an option useprefix=true/false) and output, at least in MODS, data structured like in the CSL model, which looks as follows (from http://citationstyles.org/downloads/specification.html):

    Personal names require a "family" name-part, and may also contain "given", "suffix", "non-dropping-particle" and "dropping-particle" name-parts. These name-parts are defined as:

    • "family" - surname minus any particles and suffixes
    • "given" - given names, either full ("John Edward") or initialized ("J. E.")
    • "suffix" - name suffix, e.g. "Jr." in "John Smith Jr." and "III" in "Bill Gates III"
    • "non-dropping-particle" - name particles that are not dropped when only the surname is shown ("de" in the Dutch surname "de Koning") but which may be treated separately from the family name, e.g. for sorting
    • "dropping-particle" - name particles that are dropped when only the surname is shown ("van" in "Ludwig van Beethoven", which becomes "Beethoven")

    In bibutils, this would probably have to be reassembled in different patterns according to the expectations of the various target formats, but citeproc-hs, e.g., could be updated to use such five-part names directly.

    For biblatex2xml this means that if a biblatex entry contains options={useprefix=false}, the "von part" is a "dropping-particle"; if it contains options={useprefix=true}, the "von part" is a "non-dropping-particle".
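The useprefix mapping just described could look roughly like this in Python. It is a sketch only, not biblatex2xml code; `csl_name` is a hypothetical helper, the name is assumed to be already split into first/von/last, and treating a missing useprefix option as false (and a bare "useprefix" as true) is an assumption about biblatex defaults.

```python
def csl_name(given, von, family, options=""):
    """Build a CSL-style name dict, placing the von part according to useprefix."""
    opts = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        if key.strip():
            # Assumption: an option given without a value counts as "true".
            opts[key.strip()] = value.strip().lower() or "true"

    # Assumption: useprefix defaults to false when it is not given.
    use_prefix = opts.get("useprefix", "false") == "true"

    name = {"given": given, "family": family}
    if von:
        key = "non-dropping-particle" if use_prefix else "dropping-particle"
        name[key] = von
    return name

print(csl_name("Ludwig", "van", "Beethoven", "useprefix=false"))
print(csl_name("Guido", "van", "Rossum", "useprefix=true"))
```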

    In addition, biblatex’s sortname field might carry relevant information as well:

    E.g., author={Jean de La Fontaine} would be parsed into

    • First = "Jean"
    • von = "de"
    • Last = "La Fontaine"

    If this entry in addition contained sortname={Fontaine}, the reasonable assumption would be that the "La" from "La Fontaine" was a "non-dropping-particle".

    This is probably not a complete solution yet, but I’d like to put it up for discussion.

     

    Last edit: Nick Bart 2013-03-02
