bibutils / Discussion / General Discussion: issue with parsing multi-part family names in bibutils 4.17

Given a bibtex input such as:

@ARTICLE{amoeba-os,
  author = {Sape J. Mullender and Guido van Rossum and Andrew Tannenbaum and
    Robbert van Renesse and Hans van Staveren},
  title = {{Amoeba: A Distributed Operating System for the 1990s}},
  journal = {Computer},
  year = {1990},
  volume = {23},
  pages = {44-53},
  number = {5},
  address = {Washington, DC, USA},
  publisher = {IEEE}
}

bib2xml will map it to MODS such that the "van" portions of the family names are typed as part of "given" names:

    <name type="personal">
        <namePart type="given">Guido</namePart>
        <namePart type="given">van</namePart>
        <namePart type="family">Rossum</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>

This is suboptimal in several ways. For one, when the MODS entry is then converted to another format (such as xml2rfc [RFC2629]), the initials for "Guido van Rossum" come out as "G. v.", which is incorrect:

    <author fullname="Guido van Rossum" initials="G. v." surname="Rossum"/>
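To make the failure mode concrete, here is a minimal sketch of how a downstream converter might derive initials from the MODS above. The helper name `initials_from_mods` is hypothetical, and the rule "one initial per namePart typed as given" is an assumption about the converter, not something taken from xml2rfc tooling or bibutils.

```python
import xml.etree.ElementTree as ET

# MODS fragment as reported above (the <role> element is omitted for brevity).
MODS_AS_EMITTED = """
<name type="personal">
    <namePart type="given">Guido</namePart>
    <namePart type="given">van</namePart>
    <namePart type="family">Rossum</namePart>
</name>
"""


def initials_from_mods(name_xml: str) -> str:
    """Hypothetical helper: emit one initial per namePart typed as 'given'."""
    name = ET.fromstring(name_xml)
    given = [p.text for p in name.findall("namePart") if p.get("type") == "given"]
    return " ".join(part[0] + "." for part in given if part)


print(initials_from_mods(MODS_AS_EMITTED))  # G. v.
```

Because "van" is typed as a given name, any converter following a rule like this has no way to know it should be left out of the initials.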

There are likely other similar multi-part family names that are so affected.

Is using the --asis switch a workaround? What is the format of the asis file? In Guido's case, do I simply put "van Rossum" on a separate line in the asis file?

thanks.

Nick Bart - 2013-03-01

bibutils does already distinguish name suffixes by mapping them to, e.g.,

<namePart type="suffix">Jr.</namePart>

It would probably be a good idea if bibutils did the same thing for name prefixes, e.g.,

<namePart type="prefix">van</namePart>

As to parsing bibtex/biblatex entries, AFAIK bibtex and biblatex simply assume that all lower-case words at the beginning of the last/family name are part of the name prefix.

The biblatex manual mentions a few examples of name prefixes (“Name lists are parsed and split up into the individual items at the and delimiter. Each item in the list is then dissected into four name components: the first name, the name prefix (von, van, of, da, de, della, ...), the last name, and the name suffix (junior, senior, ...). ”) – but as judged by checking the source files, it does not specifically search for strings such as "von, van, of, da, de, della, ...", so I’m assuming a search for lower-case words at the start of the family name would do for bibutils. If anyone knows better, please correct me.
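A rough sketch of the lower-case heuristic described above, for names written in "First von Last" order. The function name `split_first_von_last` is made up for illustration; the sketch ignores braces, commas, suffixes and LaTeX markup, and it is not taken from bibtex, biblatex or bibutils.

```python
def split_first_von_last(name: str) -> dict:
    """Split a 'First von Last' name on lower-case tokens (illustrative only)."""
    tokens = name.split()
    head = tokens[:-1]  # the final token always belongs to the family name
    # Positions of tokens that start with a lower-case letter.
    lower = [i for i, tok in enumerate(head) if tok[0].islower()]
    if not lower:
        # No lower-case token: fall back to "last token is the family name".
        return {"given": head, "prefix": [], "family": tokens[-1:]}
    first, last = lower[0], lower[-1]
    return {
        "given": tokens[:first],
        "prefix": tokens[first:last + 1],  # first through last lower-case token
        "family": tokens[last + 1:],
    }


print(split_first_von_last("Guido van Rossum"))
print(split_first_von_last("Jean de La Fontaine"))
```

With this rule "Guido van Rossum" yields the prefix "van" and the family name "Rossum", while a name with no lower-case token, such as "Andrew Tannenbaum", is split exactly as before.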

JeffH - 2013-03-01

bibutils does already distinguish name suffixes by mapping them to, e.g.,

<namePart type="suffix">Jr.</namePart>

Hm, so it does. Well, this is incorrect according to the MODS 3.4 schema:

http://www.loc.gov/standards/mods/userguide/name.html#namepart

It should be:

<namePart type="termsOfAddress">Jr.</namePart>

from http://www.loc.gov/standards/mods/v3/mods-3-4.xsd:

    <xs:simpleType name="namePartTypeAttributeDefinition">
        <xs:restriction base="xs:string">
            <xs:enumeration value="date"/>
            <xs:enumeration value="family"/>
            <xs:enumeration value="given"/>
            <xs:enumeration value="termsOfAddress"/>
        </xs:restriction>
    </xs:simpleType>

It would probably be a good idea if bibutils did the same thing for name
prefixes, e.g.,

<namePart type="prefix">van</namePart>

well, since there's no "prefix" value in namePartTypeAttributeDefinition, it (bibutils) probably shouldn't do that.

rather, it would be good if it could recognize multi-token family names such that, e.g., for "Guido van Rossum" it produced:

    <name type="personal">
        <namePart type="given">Guido</namePart>
        <namePart type="family">van Rossum</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>

As to parsing bibtex/biblatex entries, AFAIK bibtex and biblatex simply
assume that all lower-case words at the beginning of the last/family name are
part of the name prefix.

The biblatex manual mentions a few examples of name prefixes (“Name lists are
parsed and split up into the individual items at the and delimiter. Each item
in the list is then dissected into four name components: the first name, the
name prefix (von, van, of, da, de, della, ...), the last name, and the name
suffix (junior, senior, ...). ”) – but as judged by checking the source
files, it does not specifically search for strings such as "von, van, of, da,
de, della, ...", so I’m assuming a search for lower-case words at the start
of the family name would do for bibutils. If anyone knows better, please
correct me.

Well, first, in looking at the bibutils code, it simply takes as the family name the last whitespace-bounded token that occurs before any common American-style name suffix, if one is present. See name_multielement_nocomma() in name.c. So yeah, that logic could be enhanced to figure out multi-token family names.
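The behaviour described above can be paraphrased in a few lines. This is only a sketch of the description in this post, not a port of name_multielement_nocomma(); the suffix list and the function name `split_last_token` are assumptions made for illustration.

```python
# Illustrative suffix list; the set bibutils actually checks is not shown in this thread.
SUFFIXES = {"Jr.", "Jr", "Sr.", "Sr", "II", "III", "IV"}


def split_last_token(name: str) -> dict:
    """Family name = last whitespace-bounded token before an optional suffix."""
    tokens = name.split()
    suffix = None
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        suffix = tokens.pop()  # strip a trailing suffix first
    if not tokens:
        return {"given": [], "family": None, "suffix": suffix}
    return {"given": tokens[:-1], "family": tokens[-1], "suffix": suffix}


print(split_last_token("Guido van Rossum"))
print(split_last_token("Martin Luther King Jr."))
```

Under this rule "van" necessarily lands among the given names, which matches the output reported at the top of the thread.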

However, performing the latter is pretty hairy except for relatively simple cases. See for example Spanish personal naming traditions/customs https://en.wikipedia.org/wiki/Spanish_naming_customs -- an ostensible multi-token surname may not have any "de" in it.

Dutch customs appear to be somewhat simpler: https://en.wikipedia.org/wiki/Dutch_name

American/English hyphenated surnames ought to be recognized as well. Plus there are likely particular customs for other countries/languages (e.g. Spanish also uses hyphenated surnames).

In nosing around I see various packages for "parsing human names" in PHP, Python, Ruby, Javascript, but nothing obvious popped up for C, which is sort of odd because someone is likely to have tackled this in the past. Or maybe something can be adapted/ported.

In any case, in the meantime, yeah, maybe a few simplistic fixes to bibutils would be a reasonable start.

hope this helps.

Chris Putnam - 2013-03-01

Bibutils currently will do the right thing if the name is:

van Rossum, Guido

or (for bib2xml and biblatex2xml)

Guido {van Rossum}

or even

{van Rossum}, Guido

The proposal you're making, however, will also improve parsing for other formats that don't have a natural way to group items ala bibtex/biblatex. So it seems like a net win, even if it breaks the occasional name, and it doesn't require me to try to keep a comprehensive list of all humanly-given prefixes to given names.

The --asis/--corp files don't help here. They only apply to a full name and tell the software not to touch the name at all. (This is also redundant with the use of brackets in bib2xml and biblatex2xml, but again is useful for other bibliography formats.)

The format for these, which I guess isn't documented here (I'll try to add it), is just a text file with one name per file. The strings must match exactly, though (including spaces and capitalization).
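As an illustration of the exact-match behaviour described here, a minimal sketch follows. It assumes the list holds one name per line (as confirmed later in this thread) and that only the line terminator is stripped; the function names are hypothetical and this is not bibutils code.

```python
def load_name_list(lines) -> set:
    """Collect one full name per line, dropping only the line ending and blank lines."""
    names = set()
    for line in lines:
        name = line.rstrip("\r\n")
        if name.strip():
            names.add(name)
    return names


def is_asis(full_name: str, asis_names: set) -> bool:
    """A name is left untouched only if it matches an entry character for character."""
    return full_name in asis_names


asis = load_name_list(["Anne van Kesteren\n", "World Wide Web Consortium\n"])
print(is_asis("Anne van Kesteren", asis))   # True
print(is_asis("anne van kesteren", asis))   # False: capitalization differs
print(is_asis("van Kesteren", asis))        # False: not the full name
```

Note that an entry such as "van Rossum" alone would never match under this reading, since the comparison is made against the full name.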


JeffH - 2013-03-01

or (for bib2xml and biblatex2xml)

Guido {van Rossum}

or even

{van Rossum}, Guido

Oh yeah huh

thanks Chris.

I'll edit my bibtex

meanwhile, mentioning tricks like that somewhere hereabouts would be good :)

- Chris Putnam - 2013-03-01
  
  I'm surprised to find out that "Guido van Rossum" works. My recollection from (now quite old) bibtex days was that you had no choice but to use "Guido {van Rossum}" or "van Rossum, Guido" to get the reference properly parsed.
  
  As I said, I do like the biblatex way of recognizing lower case prefixes to given names. bibutils should be doing that.
  
  - JeffH - 2013-03-02
    
    As I said, I do like the biblatex way of recognizing lower case prefixes to
    given names. bibutils should be doing that.
    
    I sorta replied on this down below, but to add to it, I'd suggest taking a real close look at the biblatex code to figure out what they are doing for family name parsing and how, because if they are doing something really simplistic, e.g. looking for lower-case family name "prefixes", it may not work all that well in actuality.
    

JeffH - 2013-03-01

The format for [ --asis/--corp files ] ... is just a text file
with one name per file

you mean "one name per line" I presume?

I tried it to see if it would help with the handling of, say, "Anne van Kesteren", and it didn't appear to work: the name was still parsed into different nameParts and "van" was typed as "given", so I guess I still don't understand what --asis is supposed to do in the context of bib2xml.

thanks again.

- Chris Putnam - 2013-03-01
  
  Yes, one name per line.
  
  Hmm. Looks like --asis comparisons are broken for bib2xml and biblatex2xml (but not the other converters). I'm guessing the latex processing is being done prior to asis list comparisons. I'll have to fix that.
  
  But I'd argue that --asis and --corp are really redundant for these input formats since use of brackets will quickly get you what you want.
  

JeffH - 2013-03-01

But I'd argue that --asis and --corp are really redundant for these input formats since use of brackets will quickly get you what you want.

agreed.


JeffH - 2013-03-02

The proposal you're making, however, will also improve parsing for other
formats that don't have a natural way to group items ala bibtex/biblatex. So
it seems like a net win, even if breaks the occasional name and doesn't
require me to try and keep a comprehensive list of all humanly-given prefixes
to given names.

If you're referring to Nick's proposal (above) to add "...a search for lower-case words at the start of the family name ..." and use that heuristic in parsing for family name -- I don't think it would be a terribly good idea because even though it might work ok overall for say Dutch, it would hardly work at all for Spanish (for example).

So that's why I was suggesting if you really wish to do more thorough parsing of human personal names, then perhaps leveraging some package that already does that would be the way to go.


Nick Bart - 2013-03-02

First, bibtex’s parsing of prefix - or "von" - name parts is indeed based on checking for upper/lower case. It also looks for a few other things; this is described rather fully in
http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf. I assume – but it’s no more than an assumption – that biblatex does the same.

Next, my first reaction was to propose introducing a "prefix" element for name prefixes.

(And, yes, I know neither "suffix" nor "prefix" are in the official MODS specification, but "termsOfAddress" strikes me as utterly silly and useless since it may be used indiscriminately for both prefixes and suffixes: http://www.loc.gov/standards/mods/userguide/name.html, e.g., has "Dr.", "Jr.", "II", "Pope", etc. Someone should complain at MODS; and for the sake of clarity, I’d strongly vote for bibutils sticking to "suffix".)

On second thoughts, however, just introducing "prefix" seems like a rather incomplete solution.

I’d say, now, that there are two options. One is to not change bibutils’ behaviour concerning name prefixes at all, the main supporting argument being that most database formats offer no more than two name fields anyway, one for given name(s) and one for family name(s), and that the processors I am aware of that make use of these data (bibtex, biblatex, the various citeprocs) all have their own more or less well-functioning heuristics to parse two-field names for prefixes (and, sometimes, suffixes).

The other, more ambitious solution would be to try and use all information available in the source data (in particular, biblatex has additional info in an option useprefix=true/false) and output, at least in MODS, data structured like in the CSL model, which looks as follows (from http://citationstyles.org/downloads/specification.html):

Personal names require a "family" name-part, and may also contain "given", "suffix", "non-dropping-particle" and "dropping-particle" name-parts. These name-parts are defined as:

"family" - surname minus any particles and suffixes

"given" - given names, either full ("John Edward") or initialized ("J. E.")

"suffix" - name suffix, e.g. "Jr." in "John Smith Jr." and "III" in "Bill Gates III"

"non-dropping-particle" - name particles that are not dropped when only the surname is shown ("de" in the Dutch surname "de Koning") but which may be treated separately from the family name, e.g. for sorting

"dropping-particle" - name particles that are dropped when only the surname is shown ("van" in "Ludwig van Beethoven", which becomes "Beethoven")

In bibutils, this would probably have to be reassembled in different patterns according to the expectations of the various target formats, but citeproc-hs, e.g., could be updated to use such five-part names directly.

For biblatex2xml this means that if a biblatex entry contains options={useprefix=false}, the "von part" is a "dropping-particle"; if it contains options={useprefix=true}, the "von part" is a "non-dropping-particle".
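The useprefix rule just stated could be sketched as follows. The function `to_csl_name` and its parameters are hypothetical; the dictionary keys are the CSL name-part names quoted above, and the input is assumed to be already split into First / von / Last parts.

```python
def to_csl_name(first: str, von: str, last: str, useprefix: bool, suffix: str = "") -> dict:
    """Map a First / von / Last split onto CSL name-parts, driven by useprefix."""
    name = {"family": last}  # "family" is the only required name-part
    if first:
        name["given"] = first
    if von:
        # useprefix=true keeps the particle with the surname; useprefix=false drops it.
        key = "non-dropping-particle" if useprefix else "dropping-particle"
        name[key] = von
    if suffix:
        name["suffix"] = suffix
    return name


print(to_csl_name("Ludwig", "van", "Beethoven", useprefix=False))
print(to_csl_name("Guido", "van", "Rossum", useprefix=True))
```

How bibutils would then reassemble these parts for each target format is a separate question that this sketch does not address.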

In addition, biblatex’s sortname field might carry relevant information as well:

E.g., author={Jean de La Fontaine} would be parsed into

First = "Jean"

von = "de"

Last = "La Fontaine"

If this entry in addition contained sortname={Fontaine}, the reasonable assumption would be that the "La" from "La Fontaine" was a "non-dropping-particle".

This is probably not a complete solution yet, but I’d like to put it up for discussion.

Last edit: Nick Bart 2013-03-02