#426 Searching Notes for accented characters gets incorrect results

v1.0 (example)
closed-wont-fix
None
5
2022-06-15
2013-12-29
No

If I search Publication Notes for the word chunk "übers" (which generally indicates someone has listed a translator, but entered the note in German) it gives a substantial number of false hits on "Cyber". (After some cleanup, as of the submission of this bug, those were the only hits it gave.)

Discussion

  • Ahasuerus

    Ahasuerus - 2022-06-15
    • status: open --> closed-wont-fix
    • assigned_to: Ahasuerus
     
  • Ahasuerus

    Ahasuerus - 2022-06-15

    "übers" finds "cyber" in all ISFDB searches regardless of the field being searched. The reason is that our database uses latin1_swedish_ci to determine which characters are identical for searching purposes. Some of it is obvious: a search on "asimov" finds "Asimov" because "a" and "A" are treated as the same character. Some is less obvious, but makes sense when you think about it, e.g. a search on "u" finds not only "U", but also "Ù", "Ú", "Û", "ù", "ú" and "û". However, latin1_swedish_ci also has a few surprising equivalences, including treating "y"/"Y" as the same as "Ü", "ü", "Ý" and "ý" for searching purposes. "[", "\" and "]" have even more unexpected equivalences -- see the chart for details.

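To make the effect concrete, here is a minimal Python sketch of the folding described above. It is an illustration only: the table contains just the equivalences named in this comment, not the full latin1_swedish_ci chart, and the names `FOLD`, `fold` and `matches` are hypothetical helpers, not ISFDB code.

```python
# Partial fold table built only from the equivalences named in the comment
# above. This is an assumption-laden illustration, not the real collation.
FOLD = {}
for ch in "ÙÚÛùúû":
    FOLD[ch] = "u"
for ch in "ÜüÝý":
    FOLD[ch] = "y"


def fold(text: str) -> str:
    """Map each character to its comparison class (hypothetical helper)."""
    return "".join(FOLD.get(ch, ch.lower()) for ch in text)


def matches(needle: str, haystack: str) -> bool:
    """Substring search after folding, loosely like a LIKE '%needle%' search."""
    return fold(needle) in fold(haystack)


print(matches("übers", "Cyberspace"))     # the false hit from this bug
print(matches("übers", "Übersetzt von"))  # the hit the reporter wanted
print(matches("asimov", "Isaac Asimov"))  # ordinary case-insensitive match
```

Because "ü" and "y" fold to the same class, "übers" becomes indistinguishable from "ybers", which is a substring of "cyberspace".
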
    Eventually, we will upgrade the database to Unicode and change the collation accordingly. Until then, we have to choose between different Latin-1-based collations, all of which have their pluses and minuses. For example, the German collation (https://collation-charts.org/mysql60/mysql604.latin1_german1_ci.html) doesn't have most of the problems that our Swedish collation has, but it still treats "÷" and "ÿ" as identical for search purposes.
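
As a rough illustration of what an accent-insensitive Unicode comparison would do with this search, the Python sketch below strips combining marks after NFD decomposition. This is an assumption about the general behaviour of such comparisons, not the collation ISFDB will actually choose, and `unicode_fold` and `unicode_matches` are hypothetical names.

```python
import unicodedata


def unicode_fold(text: str) -> str:
    """Decompose, drop combining marks, then case-fold (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def unicode_matches(needle: str, haystack: str) -> bool:
    """Substring search on the folded forms."""
    return unicode_fold(needle) in unicode_fold(haystack)


print(unicode_matches("übers", "Cyberspace"))     # "ü" folds to "u", not "y"
print(unicode_matches("übers", "Übersetzt von"))  # still finds the German note
```

Under this kind of folding "ü" is grouped with "u", so "übers" no longer collides with "cyber".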

    For now, I will close this Bug report and wait until we revisit this issue during the Unicode migration.

     
