ISFDB Bibliographic Tools / Bugs / #426 Searching Notes for accented characters gets incorrect results

#426 Searching Notes for accented characters gets incorrect results

Milestone: v1.0 (example)

Status: closed-wont-fix

Owner: Ahasuerus

Labels: None

Priority: 5

Updated: 2022-06-15

Created: 2013-12-29

Creator: Darrah Chavey

Private: No

If I search Publication Notes for the word chunk "übers" (which generally indicates someone has listed a translator, but entered the note in German) it gives a substantial number of false hits on "Cyber". (After some cleanup, as of the submission of this bug, those were the only hits it gave.)

Discussion

Ahasuerus - 2022-06-15

status: open --> closed-wont-fix

assigned_to: Ahasuerus

Ahasuerus - 2022-06-15

"übers" finds "cyber" in all ISFDB searches regardless of the field being searched. The reason is that our database uses latin1_swedish_ci to determine which characters are identical for searching purposes. Some of it is obvious: a search on "asimov" finds "Asimov" because "a" and "A" are treated as the same character. Some is less obvious, but makes sense when you think about it, e.g. a search on "u" finds not only "U", but also "Ù", "Ú", "Û", "ù", "ú" and "û". However, latin1_swedish_ci also has a few surprising equivalences, including treating "y"/"Y" as the same as "Ü", "ü", "Ý" and "ý" for searching purposes. "[", "\" and "]" have even more unexpected equivalences -- see the chart for details.
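
To make the effect concrete, here is a minimal Python sketch of the equivalences described above. It is an illustration only: the `FOLD` table and the `fold` and `matches` helpers are hypothetical names, the table covers just the characters named in this comment rather than the full latin1_swedish_ci weight table, and it does not reflect how MySQL implements collations internally.

```python
# Hypothetical folding table: only the equivalences named in the comment above.
# This is NOT the complete latin1_swedish_ci weight table.
FOLD = {}
for ch in "ÙÚÛùúû":
    FOLD[ch] = "U"   # accented u/U compare equal to plain "U"
for ch in "ÜüÝý":
    FOLD[ch] = "Y"   # the surprising part: u-umlaut compares equal to "Y"


def fold(text):
    """Map each character to a comparison key (a stand-in for a collation weight)."""
    # Characters not in the table are simply upper-cased (case-insensitive compare).
    return "".join(FOLD.get(ch, ch.upper()) for ch in text)


def matches(needle, haystack):
    """Substring search on folded keys, roughly what a LIKE '%needle%' search does."""
    return fold(needle) in fold(haystack)


print(matches("asimov", "Isaac Asimov"))         # case-insensitive hit
print(matches("übers", "Übersetzt von ..."))     # the intended hit
print(matches("übers", "Cyberspace anthology"))  # the false hit from this report
```

Under these assumptions "übers" folds to the same key as "ybers", which is a substring of "cyberspace", so all three calls report a match.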

Eventually, we will upgrade the database to Unicode and change the collation accordingly. Until then, we have to choose between different Latin-1-based collations, all of which have their pluses and minuses. For example, the German collation (https://collation-charts.org/mysql60/mysql604.latin1_german1_ci.html) doesn't have most of the problems that our Swedish collation has, but it still treats "÷" and "ÿ" as identical for search purposes.

For now, I will close this Bug report and wait until we revisit this issue during the Unicode migration.

