
#1005 Handling of prefixes in 2nd and later surnames

phpGedView
open
nobody
5
2008-07-02
2008-07-02
No

PhpGedView has started providing support for the multiple surname system used in Spain and other areas, using the comma notation described in GEDCOM 5.5.

Helpful as it is, the provision in GEDCOM does not really help in handling later surnames, only in giving an unambiguous delimiter for the first surname. The problem is that second and later surnames may contain prefixes too. Moreover, it is customary (I think it is actually the law in Spain, though it is not followed consistently) to include the copulative conjunction 'y' (which for phonetic reasons becomes 'e' in some contexts and is usually 'i' in Catalan-speaking areas).

The attached patch shows a one-line change that greatly improves the result by keeping later surnames from being indexed under 'D' or 'Y'. You can see the effect of this patch at:

http://www.enredo.es

Possibly something more flexible or elaborate than this is desirable.

I am filing this under Feature Requests because GEDCOM does not require this; it is a proposal for how to handle these situations.
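
As a rough illustration of the idea (not the attached patch itself, which is not reproduced here), the following Python sketch drops leading conjunctions and prefixes from each comma-separated surname before indexing. The word list and the function names are assumptions made for this example.

```python
# Hypothetical skip list: conjunctions ('y', 'e', 'i') plus a few common prefixes.
SKIP_WORDS = {"y", "e", "i", "de", "del", "la", "las", "los"}


def index_key(surname):
    """Strip leading skip words from one surname, always keeping the last word."""
    words = surname.split()
    while len(words) > 1 and words[0].lower() in SKIP_WORDS:
        words.pop(0)
    return " ".join(words)


def index_keys(surn_value):
    """Return one index key per comma-separated surname."""
    return [index_key(part) for part in surn_value.split(",") if part.strip()]


print(index_keys("Sánchez, y de la Torre"))  # ['Sánchez', 'Torre']
```

A fixed word list like this cannot tell a real prefix from a word that belongs to the surname, so it should be read as a starting point only.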

Discussion

  • Julio Sánchez Fernández

     
  • Greg Roach

    Greg Roach - 2008-07-04

    Logged In: YES
    user_id=1466942
    Originator: NO

    The NAME field should contain the name *as written*. e.g.

    1 NAME AAA /BBB y CCC/

    The SURN should just contain a list of surnames to be used for indexing/sorting/grouping (but not for display). e.g.

    2 SURN BBB,CCC

    It should not simply contain the text between the // characters. It should not contain the conjunction, prefix, etc.

    Here is an example from my Gedcom (a daughter of "BELLÊME" and "de ALENCON").

    1 NAME Mabel /BELLÊME de ALENCON/
    2 GIVN Mabel
    2 SURN BELLÊME,ALENCON

    It has to work like this, because sometimes there are surnames like "La Marche", where the "La" is *not* a prefix. e.g. the difference between

    1 NAME Jean /La Marche/
    2 GIVN Jean
    2 SURN La Marche

    1 NAME Jean /La Marche/
    2 GIVN Jean
    2 SURN Marche

    In one case, Jean will be sorted under "L" and in the other he will be listed under "M".

    Your patch would stop this working. You need to fix the data, not fix the code. This might mean fixing the GUI to better edit the data.
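
The indexing rule described above can be sketched in a few lines of Python. This is only an illustration under the assumption that SURN is a plain comma-separated list; the function name is hypothetical and this is not PhpGedView code.

```python
def index_letters(surn_value):
    """Return the initial letter of each comma-separated surname in a SURN value."""
    surnames = [s.strip() for s in surn_value.split(",") if s.strip()]
    return [s[0].upper() for s in surnames]


print(index_letters("La Marche"))        # ['L']
print(index_letters("Marche"))           # ['M']
print(index_letters("BELLÊME,ALENCON"))  # ['B', 'A']
```

The NAME line plays no part here: the same "Jean /La Marche/" is listed under "L" or "M" depending only on what the SURN line says.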

     
  • Julio Sánchez Fernández

    Logged In: YES
    user_id=45958
    Originator: YES

    I find it very distressing that the NAME line contains information that is not present in any of the pieces. This is guaranteed to be hostile to round-trip editing with applications that understand the pieces (GRAMPS, for instance). Only applications understanding the pieces have a fighting chance of returning something remotely resembling the original. Your approach breaks this for them. The others were broken anyway.

    I could live with this approach, I can implement the necessary export logic in my copy of GRAMPS if I become convinced it is the best course, but I don't really like it. Isn't there anything better we could do?

    In fact, when I experimented with something like this previously, I did not try to index later surnames. After a few minutes of consideration, I decided it was a can of worms that cannot really be done without special syntax trickery, for many reasons, including those you mention. That's why I left it for later, if ever.

    BTW, a certain usage has developed in Spain of using the third part of the NAME line as the container for the second surname. This is typical of PAF users and the GDS program (that has separate fields for two surnames) exports them this way. It would be nice to have an option to interpret this format while importing. It's of no use to me, but others I'm sure would find it handy.

     
  • Greg Roach

    Greg Roach - 2008-07-05

    Logged In: YES
    user_id=1466942
    Originator: NO

    This approach is not just for names with 2 surnames.

    It is also for names in the Baltic countries. Here, there is a "base" surname, and people have different suffixes for male/female, married/single, etc. So, the "family name" might be "Duda", but individuals might write their name as "Dudas", "Dudaite", "Dudaiene", etc.

    We would write "Dudaiene" in the NAME field, but DUDA in the SURN field. This way all male/female/married/single members of the DUDA family would get grouped/listed/sorted together.

    Similarly, my girlfriend's family have the surnames Froggitt, Froggatt, Froggett, Froggott, etc. These are all the same family. We can put FROGGITT in the SURN field, and the various spellings in the NAME field.

    This will group all the Froggitts together and sort them correctly, even though they are spelt "wrongly".

    This solution lets us differentiate between my cousin "/Hall Palmer/" (one surname, index only under "H") from a Spaniard (two surnames, index under "H" and "P").

    This use of the SURN field (as per the gedcom specification!) is a solution to *many* problems.

    <<Isn't there anything better we could do?>>

    Rewrite the gedcom specification ;-)

    I have looked at this problem for a year. This is the best solution I can find (for everyone's problems, not just Spanish/Portuguese).

    If you have other ideas, I am happy to see them.
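
A minimal Python sketch of the grouping described above, assuming each record carries the NAME as written plus a comma-separated SURN. The given names are placeholders and `build_index` is a hypothetical helper, not part of PhpGedView.

```python
from collections import defaultdict

# (NAME as written, SURN value); the given names are placeholders.
records = [
    ("Given1 /Dudas/", "DUDA"),
    ("Given2 /Dudaite/", "DUDA"),
    ("Given3 /Dudaiene/", "DUDA"),
    ("Given4 /Hall Palmer/", "Hall Palmer"),  # one surname
    ("Given5 /Hall Palmer/", "Hall,Palmer"),  # two surnames
]


def build_index(records):
    """Map each upper-cased SURN entry to the names that should be listed under it."""
    index = defaultdict(list)
    for name, surn in records:
        for key in surn.split(","):
            key = key.strip().upper()
            if key:
                index[key].append(name)
    return dict(index)


index = build_index(records)
print(sorted(index))  # ['DUDA', 'HALL', 'HALL PALMER', 'PALMER']
```

All three Duda spellings end up under one key, while the two "Hall Palmer" records are indexed differently because their SURN lines differ.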

     
  • Julio Sánchez Fernández

    Logged In: YES
    user_id=45958
    Originator: YES

    I am glad to hear it is not only for people with two surnames, many people in my database have three or four. And patronymics are hard too, they were the system before 1200 and survived for centuries after that in very random forms.

    There are things that cannot be understood in GEDCOM. For instance, using commas in the surname prefix field is very weird and I cannot imagine what it is for, though no doubt someone will find it useful. The suggestion that a future system be devised to do this automatically based on some sort of knowledge is laughable: we (humans) cannot do it correctly without a lot of context. The recent 120 years or so is alright, but before that all bets are off.

    But I still don't think your arrangement is mandated by the specification. It may be allowed, but I don't think anyone would have expected NAME_PERSONAL to hold information that is not in the pieces. I don't think any other application will manage to preserve this. And this is important for me. In this form, I cannot use clippings of data produced by my users and just merge them with my offline database; I'll have to do it by hand.

    I'll think about this. This is not the only problem, I am still in search of something I can point people at and say, yes sure, you can migrate to that and keep your data. It is a mess, no good genealogy program supports this and when they do they do it incompatibly with everyone else. It is no problem for me because I use Open Source and I change anything I don't like or to create some compatibility, but I cannot recommend this to others without creating a fork, and I won't do that. So I keep saying, yeah, it's cool but you can't have it. I want a way out.

     
  • Julio Sánchez Fernández

    Logged In: YES
    user_id=45958
    Originator: YES

    I think that I need to give a longer explanation of the topic, which exceeds the scope of this tracker item. I'll prepare something for General Discussion.

     
  • Greg Roach

    Greg Roach - 2008-12-04

    Latest SVN lets you write

    1 NAME given /surname1/ y /surname2/

    or

    1 NAME given /surname1 y surname2/
    2 GIVN given
    2 SURN surname1, surname2

    The name should then be indexed under both surname1 and surname2.
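
As an illustration only, the first notation can be read by collecting every slash-delimited segment of the NAME value. The sketch below is in Python and the function name is an assumption; it does not show how PhpGedView actually parses names.

```python
import re


def surnames_from_name(name_value):
    """Return the text of every /.../ pair found in a NAME value."""
    return [s.strip() for s in re.findall(r"/([^/]*)/", name_value)]


print(surnames_from_name("given /surname1/ y /surname2/"))  # ['surname1', 'surname2']
print(surnames_from_name("given /surname1 y surname2/"))    # ['surname1 y surname2']
```

In the second notation the NAME line yields a single segment, so the split into surname1 and surname2 has to come from the SURN line.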

     
  • Julio Sánchez Fernández

    I know you can do that, but it does not interoperate usefully with any genealogy program more advanced than NOTEPAD.EXE

    Every program you feed that to will either drop the prefixes or drop the surname split.

    Programs that do not understand the parts will always be at a disadvantage, but programs that understand the parts think that they can create the NAME line from the parts, that it contains no information that cannot be deduced from the parts. So they will drop information that, on export, will not be there anymore. And they will drop it because they never found anything like this.

    If your interpretation of the standard is correct then every part-handling program besides PGV is at fault.

    I don't think that's the case. I think your interpretation of the standard is only one among many, and one that does not interoperate usefully with other programs.

    That's why it is not useful to me, and that's one reason why my PGV copy is locally patched.

     
  • Greg Roach

    Greg Roach - 2008-12-05

    <<If your interpretation of the standard is correct then every part-handling
    program besides PGV is at fault.>>

    This is possible :-)

    Can you give some examples of the gedcom format used by these other programs?

     
  • Julio Sánchez Fernández

    Well, one of my users adds a person, say:

    0 @I0001@ INDI
    1 NAME Vicente Pedro /Sánchez de la Torre/
    2 GIVN Vicente Pedro
    2 SURN Sánchez, Torre
    1 SEX M

    I approve it, add it to my clippings cart and download it to add to my offline GRAMPS database. GRAMPS understands the GIVN, etc. parts and, if they are present, ignores the NAME line. If the parts were not there, it would guess them from the NAME line, but in this case it discards the line because, with the parts available, the NAME line is unnecessary. Or used to be, that is. I guess most, if not all, genealogy programs that have separate fields for the different parts will do exactly the same.

    Days go by, the offline database evolves (I do most bulk additions offline), while incorporating the odd online addition. When I think it's necessary, I upload a replacement GEDCOM to PGV. The abovementioned person is exported like this:

    0 @I0001@ INDI
    1 NAME Vicente Pedro /Sánchez, Torre/
    2 GIVN Vicente Pedro
    2 SURN Sánchez, Torre
    1 SEX M

    Notice first that there is a comma in the NAME line. GRAMPS will not delete it, since it is perfectly legal and has been used for centuries for, guess what, separating given names or surnames. The GEDCOM spec did not invent this.

    The big problem, however, is that the prefixes for the second surname have been lost completely.

    The root cause is the assumption that the NAME line can contain information not present in any of the parts. Only PGV, as far as I know, handles it this way, and only because it is closely tied to GEDCOM. Other programs translate on import and export, mapping the information into internal fields. Since they have fields for each of the parts, the extra information in NAME is lost.

    So this arrangement works for PGV but does not interoperate well with any other program, important information WILL be lost.

    A possible solution would be to have all surname prefixes in the SPFX field, but they botched up the spec this way:

    (quoting from page 56 of 5.5.1):
    Surname prefix or article used in a family name. Different surname articles are separated by a comma, for example in the name "de la Cruz", this value would be "de, la".

    That is plainly absurd. It is apparent that they did not polish this part very much. Then, on page 38 you find this pearl of wishful thinking:

    "Future GEDCOM releases (6.0 or later) will likely apply a very different strategy to resolve this problem, possibly using a sophisticated parser and a name-knowledge database."

    That is plainly impossible. No one who understands the evolution of names and surnames in Spain from before the year 1000 to our days can think it can be done without comparing how the person is named in different documents, telling apart the cases where the person is named differently from the cases when the person's name is just abbreviated from a longer form, etc. Humans make mistakes all the time with this. I have just been made aware of one such mistake about the surname of a certain family, which was propagated by author after author and was only resolved because one descendant contacted me. A markup system at least has the advantage that once the truth (or a good approximation to it) is known, no further mistakes need be made.

    For completeness, notice anyway that the only genealogy programs claiming to support two surnames for Spanish (GenealHispana, that is an orphaned fork of PhpGedView itself, and GDS) will produce just:

    1 NAME Vicente Pedro /Sánchez/ de la Torre

    That is a different abuse of GEDCOM, but one that works well enough in many cases. However, it loses the place where the suffix would usually go, only manages two surnames and does not try to do anything whatsoever with surname prefixes, on either the first surname or the second. I don't like this de facto standard, but the PGV format, a format that cannot be produced by any other program, is a complete nonstarter except for people willing to use only PGV.
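
The loss described above can be reproduced with a small Python sketch. It assumes an importer that keeps only the name pieces when they are present and an exporter that rebuilds NAME from them; the function names are hypothetical and this is not GRAMPS code.

```python
def import_person(name_value, pieces):
    """Keep only GIVN/SURN when both are present; parse NAME only as a fallback."""
    if "GIVN" in pieces and "SURN" in pieces:
        return {"given": pieces["GIVN"], "surname": pieces["SURN"]}
    given, _, rest = name_value.partition("/")
    surname = rest.partition("/")[0]
    return {"given": given.strip(), "surname": surname.strip()}


def export_name(person):
    """Rebuild the NAME value from the stored fields alone."""
    return f"{person['given']} /{person['surname']}/"


original = "Vicente Pedro /Sánchez de la Torre/"
person = import_person(original, {"GIVN": "Vicente Pedro", "SURN": "Sánchez, Torre"})
exported = export_name(person)
print(exported)  # Vicente Pedro /Sánchez, Torre/
```

The "de la" exists only in the original NAME line, so a program that stores the pieces and discards NAME has nothing to rebuild it from.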

     
  • Greg Roach

    Greg Roach - 2008-12-09

    Page 38 of the spec says:

    <<The NPFX, GIVN, NICK, SPFX, SURN, and NSFX tags are provided optionally for systems that cannot operate effectively with less structured information. For current future compatibility, all systems must construct their names based on the <NAME_PERSONAL> structure. Those using the optional name pieces should assume that few systems will process them, and most will not provide the name pieces.>>

    Note that it says <<All systems must>>.

    When you say that Gramps will ignore NAME if GIVN/SURN are present, then Gramps is clearly at fault. If it tries to recreate NAME from GIVN/SURN, then it will presumably also break names that have the surname first. e.g. from

    1 NAME /Surname/ Given
    2 GIVN Given
    2 SURN Surname

    to

    1 NAME Given /Surname/
    2 GIVN Given
    2 SURN Surname

     
  • Julio Sánchez Fernández

    I read the specification and I still don't think "construct" means that. I think it means that all systems must fill and parse the NAME because the pieces need not be there.

    The deficiency you find in Gramps is real and runs very deep, and it shows especially if the pieces are not there. It first splits the NAME line into three parts and assigns them to given name, surname and surname suffix, unconditionally. When the pieces, if any, come later, they just override those values. So it is not prepared for Eastern names, for instance, which I suppose are the main users of that piece ordering.

    But that is beside the point I was making, i.e. that programs that use the pieces will all have problems. This morning I started running a survey on some Spanish-language lists, where I have asked people to import a test file into a clean database, export it to GEDCOM and send it to me. The only answer I have received so far for a program that handles the pieces (Genopro 2.0.1.4) shows a result similar to that of Gramps.

    Let's wait a few days for more responses, but I expect the problem to be widespread.
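
The unconditional three-way split described above might look roughly like the following Python sketch. It is an illustration written for this thread, not the actual Gramps import code, and the function name is hypothetical.

```python
def split_name(name_value):
    """Split a NAME value into the text before, between and after the slashes."""
    before, _, rest = name_value.partition("/")
    surname, _, after = rest.partition("/")
    return {
        "given": before.strip(),
        "surname": surname.strip(),
        "suffix": after.strip(),
    }


print(split_name("/Surname/ Given"))
print(split_name("Vicente Pedro /Sánchez/ de la Torre"))
```

With a surname-first NAME line the given name lands in the suffix slot, and with the "second surname after the second slash" convention the second surname does too.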

     
  • Julio Sánchez Fernández

    Results are still coming in; I have collected what I have at:

    http://www.enredo.es/surname-handling-survey/

    I will add more results there as they arrive.

    The conclusion is probably that the best interoperable way to handle Spanish surnames is the way everyone has been doing it so far and that I hate, i.e. put the second surname after the second slash in NAME_PERSONAL and ignore the name pieces and the comma thing. It even sort of works in PGV, though a few minor patches might make it more palatable. You lose a lot of things coding names this way, but not valuable data. So I guess it wins hands down. The method in PGV 4.1.6 is only good if you don't plan to leave it or export data to other programs. I suppose people who omit surname prefixes anyway (they are not unheard of) might not be harmed either. The same goes for people who don't care that their second surnames are wrongly indexed. It depends on how much inconvenience you are willing to bear.

    As for me, I will keep doing it as now for the time being, but I will have to start thinking about a different approach.

     
