#277 invalid xml:lang= values

GREEN
closed-accepted
6
2014-01-23
2011-05-19
Syd Bauman
No

I noticed that in HD11 there is an example that includes xml:lang="fra" and xml:lang="eng". These are not registered IANA language codes, thus not valid BCP 47 tags, and thus not valid values for xml:lang=.

I then looked at all of the xml:lang= values in the Guidelines (r8902):
3412 fr
2112 zh-tw
2102 ja
2080 es
2014 it
1965 kr
635 en
221 und
202 de
85 mul
79 la
21 pt
16 is
11 fm
8 pl
7 zh
7 en-US
5 grc
4 ru
4 non
4 gd
4 da
3 gmh
3 en-x-Scots
3 el
3 LA
2 zh-cn
2 lt
2 lat
2 gr
2 fre
2 fra
2 cy
2 ang
2 FR
1 zh-archaic
1 zh-Hant-tw
1 zh-Hans-cn
1 sl
1 ja-Hani
1 fro
1 es
1 enm
1 eng
1 af

I know that all those that occur > 15 times are valid tags, and I'm pretty sure that "fm" is not. I did not look at the rest.
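
A tally like the one above can be reproduced with a short script. The following is only a minimal sketch, not the method used to produce the list: it assumes each file is well-formed XML and counts only real `xml:lang` attributes, and the function name `tally_xml_lang` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tally_xml_lang(xml_text: str) -> Counter:
    """Count each distinct xml:lang value on any element in the document."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.get(XML_LANG) for el in root.iter() if el.get(XML_LANG) is not None
    )

if __name__ == "__main__":
    sample = '<div xml:lang="fr"><p xml:lang="fra"/><p xml:lang="fr"/></div>'
    # Print counts in descending order, in the style of the list above.
    for value, count in tally_xml_lang(sample).most_common():
        print(count, value)
```

Because the document is parsed rather than searched as text, mentions of xml:lang= inside comments or prose are not counted.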

Discussion

1 2 > >> (Page 1 of 2)
  • Laurent Romary

    Laurent Romary - 2011-05-19

    The question is whether we should accept these codes in the TEI, knowing that they are valid ISO 639-2 (and a fortiori -3) codes and that 3-letter codes allow a much wider coverage of the existing languages of the world (cf. the Ethnologue registry).

     
  • BODARD Gabriel

    BODARD Gabriel - 2011-05-19

    My understanding is that usual and recommended practice is to use the 2-character codes where they exist, and 3-character codes where they don't (since as Laurent points out 639-3 is a much more comprehensive list): so for example @xml:lang="el" is the correct 639-1 code for Modern Greek, but since there is no 2-character code for Classical Greek, we should use the 3-character @xml:lang="grc".

    How does TEI propose to constrain / validate this list anyway? The IANA tags are almost infinitely extensible, aren't they, with regional variants, writing systems, user-defined suffixes, etc...?

     
  • Syd Bauman

    Syd Bauman - 2011-05-19

    > The question is whether we should accept these codes in the TEI,

    Sorry, Laurent, but no, this is not the question. The semantics of xml:lang= are not controlled by TEI, but by W3C, and they are very explicit: “The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages” [http://www.w3.org/TR/xml/#sec-lang-tag]

    So the question may be whether or not TEI likes the old ISO 639-2 codes enough that we should create another lang= attribute.

     
  • Lou Burnard

    Lou Burnard - 2011-05-19

    The three letter codes are OK, if they are taken from ISO-639-3.
    I quote
    "Languages are identified by a language subtag, which may be a two letter code taken from ISO 639-1 or a three letter code taken from ISO 639-2. "
    See further...
    http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CH.html#CHSH

    So I don't think this is a bug.

     
  • Lou Burnard

    Lou Burnard - 2011-05-19
    • status: open --> closed-rejected
     
  • Laurent Romary

    Laurent Romary - 2011-05-19

    That would not be optimal (euphemism)... As a matter of fact the XML rec is clearly outdated, since the link there (ftp://ftp.isi.edu/in-notes/bcp/bcp47.txt) is broken.
    But if you have a look at: http://tools.ietf.org/rfc/bcp/bcp47.txt you can see that it clearly refers to the additional 639 parts. I guess we can close the ticket, can't we? ;-)

     
  • Syd Bauman

    Syd Bauman - 2011-05-19

    Addressing Gabriel's comment:

    The rule for xml:lang= is to use a BCP 47 tag. (See the tagdoc for data.language.) "grc" is the BCP 47 tag for ancient Greek.

    Values of xml:lang= are validated as BCP 47 codes the same way values of target= are validated as URIs: the schema says that they are of datatype xsd:language just as target= is of datatype xsd:anyURI. Whether or not your software knows how to validate these things is a different story. (Jing does not do a particularly good job — I filed a bug report years ago, but I don’t think it has been addressed.)
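
To illustrate why datatype validation alone does not catch a value like "fra", here is a minimal sketch of a purely syntactic check. It assumes the validator applies only the lexical pattern that XML Schema Part 2 gives for xsd:language; the function name `is_xsd_language` is hypothetical, and no lookup against the IANA registry is attempted.

```python
import re

# Lexical pattern for xsd:language as given in XML Schema Part 2.
# It constrains the shape of the tag only, not whether subtags are registered.
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

def is_xsd_language(value: str) -> bool:
    """Return True if value matches the xsd:language lexical pattern."""
    return XSD_LANGUAGE.fullmatch(value) is not None

if __name__ == "__main__":
    for candidate in ["grc", "fra", "zh-archaic", "en_US"]:
        print(candidate, is_xsd_language(candidate))
```

Under this check "fra" and "zh-archaic" pass because they have the right shape, while "en_US" fails on the underscore. Deciding whether a well-formed tag is actually registered needs a separate comparison against the registry.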

     
  • Syd Bauman

    Syd Bauman - 2011-05-19

    Addressing Laurent’s comment of 10:35:18 EDT:

    Optimal or not, I don’t think it’s a choice. If people really feel it is important to be able to use codes other than BCP 47, then we get W3C to update semantics of xml:lang= or TEI should invent another mechanism. I think it is very bad practice for TEI to recommend that users violate W3C specs.

    And although BCP 47 does refer to ISO 639 parts, it does not endorse their blanket use. Rather, the discussion of ISO 639 is about how the primary language tags in the IANA registry were designed to match the ISO 639 code where reasonable.

    So no, I don't think this ticket can be closed. At least, not until the GLs are fixed. :-)

     
  • Lou Burnard

    Lou Burnard - 2011-05-19

    If there is a bug to be fixed here, it is that the rules given in the tagdoc for data.language and the rules given in the discussion at #CHSH are not identical. That should be rectified. Also, if any of the codes actually deployed in the Guidelines are not valid (thanks for the list syd!) that should be corrected.

    Have reopened the ticket accordingly.

     
  • Lou Burnard

    Lou Burnard - 2011-05-19
    • status: closed-rejected --> open-rejected
     
  • Syd Bauman

    Syd Bauman - 2011-05-19

    Addressing Lou’s of 10:29:19 EDT … (wow, you guys are way too fast for me! :-)

    “Languages are identified by a language subtag, which may be a two letter code taken from ISO 639-1 or a three letter code taken from ISO 639-2.”

    Indeed, that's what CH says, but the “which may be” there doesn’t mean “which you the user may choose” but rather “which may have been chosen by IANA”. I think CH is a bit ambiguous here (probably my fault), and some word-smithing is in order.

     
  • Piotr Banski

    Piotr Banski - 2011-05-19

    I think at some point it says that you can use the three-letter codes IF 639-1 codes are missing. I don't think it is possible to have variation between "pl" and "pol", "de" and "deu", "en" and "eng", etc. I believe only the former of each pair is valid in this context. Can't support this with a link right now though, maybe at night.

     
  • Piotr Banski

    Piotr Banski - 2011-05-19

    "When languages have both an ISO 639-1 two-character code and a three-
    character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
    the ISO 639-1 two-character code is defined in the IANA registry."

    http://www.ietf.org/rfc/bcp/bcp47.txt sect. 2.2.1
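
The rule quoted above can be sketched as a lookup that replaces a three-letter code with its two-letter equivalent where one exists. This is only an illustration: the table `THREE_TO_TWO` is a small hand-picked subset limited to codes mentioned in this ticket, and the function name `preferred_subtag` is hypothetical. A real check would consult the IANA Language Subtag Registry.

```python
# Hypothetical, deliberately incomplete table: three-letter ISO 639 codes
# that have an ISO 639-1 two-letter equivalent, which is the registered form.
THREE_TO_TWO = {
    "eng": "en",
    "fra": "fr", "fre": "fr",   # both three-letter variants of French
    "deu": "de", "ger": "de",
    "ell": "el", "gre": "el",
    "lat": "la",
    "pol": "pl",
}

def preferred_subtag(code: str) -> str:
    """Return the two-letter code when the table has one, else the input lowercased."""
    lowered = code.lower()
    return THREE_TO_TWO.get(lowered, lowered)

if __name__ == "__main__":
    for code in ["eng", "fre", "lat", "grc"]:
        print(code, "->", preferred_subtag(code))
```

Codes such as "grc", "enm" and "fro" have no two-letter form, so they pass through unchanged.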

     
  • Laurent Romary

    Laurent Romary - 2011-05-19

    OK. I don't have a clue on how to change this. But the bcp47 default rule is not visionary. Someone coding a variety of languages would probably want to stick to one ISO pool of codes rather than switching all the time.

     
  • BODARD Gabriel

    BODARD Gabriel - 2011-05-19

    Laurent, I don't know if that's really the case. We encode both modern and classical Greek, for example, and are quite happy to use "el" and "grc" (rather than "gre" and "grc"). My feeling is we should just record (and stick to) the rule below. There are a few things in Syd's list that I can see at a glance should presumably be changed, such as:

    eng -> en
    fre -> fr
    lat -> la

    Are "ang" and "fra" French-language tags for anglais and français respectively?

    (While we're at it, my understanding is that while language codes are not case sensitive, it is conventional to write the language part in lower case, the region part in uppercase, and the script part in title case, so perhaps we should change:

    zh-cn -> zh-CN
    zh-tw -> zh-TW
    FR -> fr
    LA -> la
    It -> it

    etc.?)
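
The casing convention described in this comment can be applied mechanically. The sketch below assumes the usual shortcut that works without consulting the registry: everything is lowercased, except that two-letter subtags after the first are uppercased (regions) and four-letter subtags after the first are capitalized (scripts), unless they follow a single-character subtag such as "x". The function name `normalize_case` is hypothetical.

```python
def normalize_case(tag: str) -> str:
    """Apply conventional casing to a hyphen-separated language tag."""
    result = []
    after_singleton = False  # True once a one-character subtag (e.g. "x") is seen
    for position, subtag in enumerate(tag.split("-")):
        if position > 0 and not after_singleton and len(subtag) == 2:
            result.append(subtag.upper())        # region, e.g. TW
        elif position > 0 and not after_singleton and len(subtag) == 4:
            result.append(subtag.capitalize())   # script, e.g. Hant
        else:
            result.append(subtag.lower())        # language and everything else
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)

if __name__ == "__main__":
    for tag in ["zh-tw", "zh-Hant-tw", "FR", "en-x-Scots"]:
        print(tag, "->", normalize_case(tag))
```

This only adjusts case. It treats a lone "FR" or "LA" as a language subtag and lowercases it, which matches the corrections proposed above but says nothing about whether the resulting tag is the one the encoder intended.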

     
  • Laurent Romary

    Laurent Romary - 2011-05-19

    Well both ARE case sensitive. fr is unambiguously for French (in 639-1) and FR for France (in 3166). Besides, there is nothing like a French code for a language. The (not so strict) rule is that the code is derived from the word used to express the language in itself (en for English, fr for Français, de for Deutsch), with variations, of course.

     
  • BODARD Gabriel

    BODARD Gabriel - 2011-05-19

    The 2-letter codes are based on the native language (en, fr, de, es, el) but the 3-letter codes were based on English, weren't they? (eng, fre, ger, spa, gre.)

    We're getting off the point now. I think we're in agreement that language codes ought to be treated as case-sensitive, anyway, so there are a few fixes to be made to the codes Syd has identified.

     
  • Laurent Romary

    Laurent Romary - 2011-05-19

    I did not want to touch this rather painful issue, but the library community (LoC, I think) imposed a series of codes parallel to the basic ones. So you have a whole series for which there are two possible codes.... something ISO should not be proud of. So fre and fra do exist.

     
  • BODARD Gabriel

    BODARD Gabriel - 2011-05-19

    I didn't know that, but it was what I was guessing at 18:16:45 BST.

    In that case, both fre and fra (and presumably FR) should be corrected to "fr", right?

     
  • Laurent Romary

    Laurent Romary - 2011-05-19

    Well, if you really want to be compliant with the BCP and use alpha-2 codes perforce, yes. Sigh... (sometimes I dream I would have a magic stick)

     
  • Lou Burnard

    Lou Burnard - 2011-05-19

    I don't think we have any choice about conforming to the BCP. As Syd points out, we can't change the rules about what is valid for xml:lang because we don't own it. We *might* consider adding an "isolang" attribute I suppose, as an alternative, but I really don't think that would be a good idea.

     
  • Lou Burnard

    Lou Burnard - 2011-05-19

    p.s. the "fm" is indeed erroneous -- all occurrences are in the tagdoc for label, which I have just fixed

     
  • Piotr Banski

    Piotr Banski - 2011-07-24

    Hi, I've managed to completely miss the development of this discussion. I understand from the last comment that Lou has fixed the codes to conform to BCP-47, so shouldn't this be closed+fixed rather than open+rejected?

     
  • Syd Bauman

    Syd Bauman - 2011-09-17

    partly annotated list of //@xml:lang in r9324

     
  • Syd Bauman

    Syd Bauman - 2011-09-18

    Not quite, Piotr. I looked again (using xmlstarlet on r9324, so looking at all real xml:lang= attrs, regardless of namespace of element they're on, but ignoring mentions of xml:lang= in content or comment), and found there are still some errors:

    3426 fr
    2112 zh-tw (probably should be zh-TW)
    2102 ja
    2080 es
    2014 it
    1965 kr = Kanuri (Korean is 'ko')
    636 en
    221 und
    202 de
    85 mul
    82 la
    21 pt
    16 is
    8 pl
    7 zh
    7 en-US
    5 grc
    4 ru
    4 non
    4 gd
    4 da
    3 gmh
    3 en-x-Scots
    3 el
    2 zh-cn (probably should be zh-CN)
    2 lt
    2 lat = ERROR
    2 gr = ERROR
    2 fro
    2 fre = ERROR
    2 fra
    2 cy
    2 ang
    1 zh-archaic = ERROR AFAIK
    1 zh-Hant-tw (probably should be zh-Hant-TW)
    1 zh-Hans-cn (probably should be zh-Hans-CN)
    1 sl
    1 ja-Hani
    1 frm
    1 es
    1 enm
    1 eng = ERROR
    1 LA = region( Lao People's Democratic Republic )

    Since Sourceforge will likely mess up the whitespace that makes that list line up and thus easier to read, I've uploaded that info as a file, too.

     