#148 Support localized linktrails

Milestone: PFE
Status: pfe
Owner: nobody
Updated: 2013-07-17
Created: 2013-06-24
Creator: Anonymous
Private: No

In de.wikipedia a [[Biskuit]]gebäck should be equivalent to [[Biskuit|Biskuitgebäck]]. Currently the link stops before the "ä". In MessagesDe.php there is a $linkTrail = '/^([äöüßa-z]+)(.*)$/sDu';. Icelandic (is) and some other languages also have a link prefix ($linkPrefixExtension = true;), but I can't tell you which characters should be included in the link (perhaps the same as in the link trail). --Schnark
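
A minimal sketch of the behavior described above, written in Python rather than PHP. It assumes the link trail regex is applied to the text that immediately follows the closing `]]`, with the first group joining the link label and the second group continuing as plain text. `split_link_trail` and the two pattern constants are hypothetical names for illustration, not MediaWiki or XOWA identifiers.

```python
import re

# PHP flags mapped to Python: "s" -> re.DOTALL, "D" -> \Z instead of $,
# "u" -> already the default for str patterns in Python 3.
LINK_TRAIL_EN = re.compile(r"^([a-z]+)(.*)\Z", re.DOTALL)
LINK_TRAIL_DE = re.compile(r"^([äöüßa-z]+)(.*)\Z", re.DOTALL)


def split_link_trail(text_after_link, pattern):
    """Return (trail, rest): the part absorbed into the link and the remainder."""
    m = pattern.match(text_after_link)
    if m is None:
        # No trail character directly after the link.
        return "", text_after_link
    return m.group(1), m.group(2)


# Text that follows "[[Biskuit]]" in the wikitext.
for name, pattern in (("en", LINK_TRAIL_EN), ("de", LINK_TRAIL_DE)):
    trail, rest = split_link_trail("gebäck ist lecker", pattern)
    print(name, "Biskuit" + trail, repr(rest))
```

With the a-z-only pattern the trail ends at "geb", which matches the reported symptom; with the German pattern the whole word "gebäck" joins the link.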

Discussion

  • gnosygnu

    gnosygnu - 2013-06-25
    • labels: --> parser, lnki, lang
    • status: new --> investigating
     
  • gnosygnu

    gnosygnu - 2013-06-25

    Thanks for the thorough detail.

    Well, XOWA currently does support linktrails, but only for English. Yeah, sorry :(...

    The limited support is because each language defines a linkTrail with a regular expression, and:

    • XOWA doesn't use regular expressions for the parser. It uses characters / byte-sequences.

    • XOWA would have to parse the regular expression to a character / byte-sequence. I really didn't want to get into the business of parsing regexes -- even simple ones.

    So, I hardcoded a set of characters for English. Something like this: "a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z". There's other logic that assumes the ".*".

    I could do something similar for German (since it's the same + 4 other characters), but I'd have to keep doing this on a language-by-language basis. Or come up with some shoddy regex parser (which I really didn't want to do).

    I'll review this and put something in for a v0.7 release. Let me know if you're running across a lot of these link trails, and I'll hard-code something for German just like I did for English.

    Icelandic (is) and some other languages also have a link prefix ($linkPrefixExtension = true;)

    This is harder, as I'd have to change the parser. It may not be difficult, but this will probably be in an even later version.

     

    Last edit: gnosygnu 2013-06-25
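
The character / byte-sequence approach described in the post above can be sketched without any regex. This is an illustration only, not XOWA's actual parser code: `build_trail_set` and `scan_link_trail` are hypothetical names, and the sketch assumes the source is UTF-8 bytes and that the allowed characters are known up front.

```python
import string


def build_trail_set(chars):
    """Map each allowed character to its UTF-8 byte sequence."""
    return {ch.encode("utf-8") for ch in chars}


def scan_link_trail(src, pos, trail_set):
    """Return the index just past the link trail that starts at pos."""
    max_len = max((len(seq) for seq in trail_set), default=0)
    while pos < len(src):
        for n in range(1, max_len + 1):
            if src[pos:pos + n] in trail_set:
                pos += n          # consume one allowed byte sequence
                break
        else:
            break                 # nothing matched at pos: the trail ends here
    return pos


english = build_trail_set(string.ascii_lowercase)
german = build_trail_set(string.ascii_lowercase + "äöüß")

src = "gebäck ist lecker".encode("utf-8")
print(src[:scan_link_trail(src, 0, english)].decode("utf-8"))
print(src[:scan_link_trail(src, 0, german)].decode("utf-8"))
```

Because "ä" is two bytes in UTF-8, the German set has to hold multi-byte sequences, which is why the scan tries lengths up to the longest entry.
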
  • Anonymous

    Anonymous - 2013-06-25

    Let me know if you're running across a lot of these link trails

    It took me half a year to find a page with an umlaut in the link trail.

    So, I hardcoded a set of characters for English. Something like this: "a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z".

    Capital letters shouldn't be in the link trail, for either English or German.

    I don't think there are many languages (if any) that really need a regular expression. Is it possible to add support to read the additional characters from XOWA language files? Then you could just add the additional characters there when a user requests them (like the interface localisation). --Schnark

     
  • gnosygnu

    gnosygnu - 2013-06-26
    • status: investigating --> queued
    • Milestone: PFE --> v0.8.0
     
  • gnosygnu

    gnosygnu - 2013-06-26

    Capital letters shouldn't be in the link trail

    Oops. Typed the string from memory. I checked the code now, and it does use only lowercase letters.

    I don't think there are many languages (if any) that really need a regular expression.

    Agreed. They generally use byte sequences. However, getting the byte sequence means parsing the regex, and they all specify slightly different formats:

    // MessagesBr.php
    $linkTrail = "/^((?:c\'h|C\'H|C\'h|c’h|C’H|C’h|[a-zA-ZàâçéèêîôûäëïöüùñÇÉÂÊÎÔÛÄËÏÖÜÀÈÙÑ])+)(.*)$/sDu";
    
    // MessagesOs.php
    $linkTrail = '/^((?:[a-z]|а|æ|б|в|г|д|е|ё|ж|з|и|й|к|л|м|н|о|п|р|с|т|у|ф|х|ц|ч|ш|щ|ъ|ы|ь|э|ю|я|“|»)+)(.*)$/sDu';
    
    // MessagesTa.php
    $linkTrail = "/^([\xE0\xAE\x80-\xE0\xAF\xBF]+)(.*)$/sDu";
    
    // MessagesZh.php
    $linkTrail = '/^()(.*)$/sD';
    

    I just didn't want to open up a can of worms, and start trying to parse regex.

    That said, parsing these is absolutely simple compared to parsing wikitext. I'll make an attempt for a v0.7.* release.

    Is it possible to add support to read the additional characters from XOWA language files?

    Ok. I'll try to put this in for v0.7.0. If worst comes to worst, I'll hard-code something for German until I complete the above.
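
A rough sketch of what "parsing the regex" could look like for the simpler formats quoted in the post above. It is not the approach XOWA ended up using. `expand_char_class` and `sequences_from_link_trail` are hypothetical names, and the sketch assumes every pattern has the shape `/^(GROUP)(.*)$/flags`, that no `|` appears inside a character class, and that PHP `\x..` byte escapes (as in MessagesTa.php) are out of scope.

```python
import string


def expand_char_class(body):
    """Expand the inside of a [...] class, e.g. 'äöüßa-z', into single characters."""
    out = set()
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # Range such as a-z: add every code point from first to last.
            out.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            out.add(body[i])
            i += 1
    return out


def sequences_from_link_trail(regex):
    """Return the character sequences that a simple $linkTrail regex allows."""
    prefix, suffix = "/^(", ")(.*)$/"
    group = regex[regex.index(prefix) + len(prefix):regex.rindex(suffix)]
    if group.endswith("+"):
        group = group[:-1]                 # drop the repetition operator
    if group.startswith("(?:") and group.endswith(")"):
        group = group[3:-1]                # drop the non-capturing wrapper
    sequences = set()
    for alt in group.split("|"):
        if alt.startswith("[") and alt.endswith("]"):
            sequences |= expand_char_class(alt[1:-1])
        elif alt:
            sequences.add(alt.replace("\\'", "'"))   # \' is a literal apostrophe
    return sequences


print(sorted(sequences_from_link_trail("/^([äöüßa-z]+)(.*)$/sDu")))
print(sequences_from_link_trail("/^()(.*)$/sD"))
```

The Breton entries such as `c\'h` come out as multi-character sequences, so a consumer of this set would need to match sequences rather than single characters.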

     
  • gnosygnu

    gnosygnu - 2013-06-29
    • status: queued --> in-progress
     
  • gnosygnu

    gnosygnu - 2013-06-29

    I put in support for de.gfs and fr.gfs tonight. It will be part of v0.7.0.

    Specifically, de.gfs now has the following:

    this.link_trail.add_range('a', 'z');
    this.link_trail.add_bulk('äöüß');
    

    I'll try to parse the regex for the other languages in a v0.7.* release. I may create another PFE ticket to handle link prefixes as they are slightly difficult.
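
To illustrate what the two de.gfs calls above might build up, here is a small Python sketch. The `LinkTrail` class is hypothetical: only the method names `add_range` and `add_bulk` come from the script quoted above, and the semantics shown (an inclusive character range, and a string of individual characters) are an assumption about what those calls do.

```python
class LinkTrail:
    """Hypothetical stand-in for the per-language link trail table."""

    def __init__(self):
        self._chars = set()

    def add_range(self, first, last):
        # Assumed semantics: every code point from first to last, inclusive.
        self._chars.update(chr(c) for c in range(ord(first), ord(last) + 1))

    def add_bulk(self, chars):
        # Assumed semantics: each character of the string is added individually.
        self._chars.update(chars)

    def split(self, text):
        """Return (trail, rest) for the text that follows a link."""
        end = 0
        while end < len(text) and text[end] in self._chars:
            end += 1
        return text[:end], text[end:]


# Mirrors the de.gfs lines quoted above.
de = LinkTrail()
de.add_range("a", "z")
de.add_bulk("äöüß")

print(de.split("gebäck, frisch"))
print(de.split("Gebäck"))
```

Since only lowercase letters are registered, a capital letter right after the link stays outside it, which is consistent with the earlier point that capitals do not belong in the trail.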

     
  • gnosygnu

    gnosygnu - 2013-07-11
    • status: in-progress --> pfe
    • Milestone: v0.7.* --> PFE
     
  • gnosygnu

    gnosygnu - 2013-07-11

    I'm going to push this to PFE. I've added support to control this behavior through the language .gfs files. I still plan to do a mass parse, but I probably won't have time for several weeks. If any requests come in before then, I'll manually add the script for the requested language. Otherwise, it should be done sometime in v0.8 or v0.9.

     
  • gnosygnu

    gnosygnu - 2013-07-17
    • Milestone: --> PFE
     
