
#423 Pasting in accented characters fails

v1.0 (example)
closed-fixed
None
4
2015-01-13
2013-12-29
No

When working with WorldCat records, a natural tactic is to copy & paste names from WorldCat, including authors, publishers, and publication series. At least for the last two, this results in records that look identical to other records in our system but are stored as distinct publishers or distinct series. For example, if you do a publisher search for "Calmann", you will get two different publishers named "Calmann-Lévy". One appears to have the accented é as a single Unicode character, and the other as the HTML character sequence e+#769; (with "&" replaced by "+" so it doesn't auto-convert in the display here). Similarly, if you search for the publication series "du Futur", you will get two different series whose names both appear to be "Présence du Futur", but which are internally different (for the same reason). Such copy & paste errors should be corrected at data entry. They're not too hard to correct individually, but I have left these examples as is so you have something to work with.
I suspect that this same error would apply for other editors copying & pasting from other web pages as well, but have not checked this.

Discussion

  • Ahasuerus

    Ahasuerus - 2014-12-29

    After reviewing the data in the database I see that "́" is a separate case. It's a special "Combining Acute Accent" character, a part of the "Combining Diacritical Marks" family of characters. These characters are used to modify the preceding character, so what's actually stored in the database is strings like "Sala de Autopsias Número 4", "Harry Potter agus an órchloch" and "Exilé". As reported above, this can happen during copy-and-paste operations when using third-party Web sites as a source.

    I think we need to do two things to address this issue. First, we need to modify the data entry logic to convert "known culprits" like "ó" and "é" to their Latin-1 equivalents. Second, we need to create one or more cleanup reports to find "&#769;" in existing records.
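
The data entry conversion described above could look roughly like the following. This is a minimal sketch, not the code that went into common/library.py: it assumes the stored text holds the combining mark as a decimal character reference such as "e&#769;", and the function name `compose_combining_entities` is hypothetical.

```python
import re
import unicodedata

# Any single character followed by a decimal numeric character reference,
# e.g. "e&#769;". (Assumption: this is how the pasted mark ends up stored.)
_BASE_PLUS_NCR = re.compile(r"(.)&#(\d+);")


def compose_combining_entities(text):
    """Replace base + combining-mark reference with one precomposed character."""

    def _replace(match):
        base = match.group(1)
        code_point = int(match.group(2))
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point, leave untouched
        mark = chr(code_point)
        if unicodedata.combining(mark) == 0:
            return match.group(0)  # an ordinary reference such as "&#38;"
        composed = unicodedata.normalize("NFC", base + mark)
        # Only substitute when a single precomposed character exists.
        return composed if len(composed) == 1 else match.group(0)

    return _BASE_PLUS_NCR.sub(_replace, text)


print(compose_combining_entities("Calmann-Le&#769;vy"))
```

Pairs with no precomposed form are left unchanged, so they would still show up on a cleanup report instead of being silently altered.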

     
  • Ahasuerus

    Ahasuerus - 2015-01-12

    It turns out that the problem is not limited to the "Combining Acute Accent" character. Many other combining diacritics can cause similar issues and need to be converted to standard Unicode characters at data entry time.
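
A cleanup check that covers the whole family rather than just the acute accent might be sketched as below. It assumes the Combining Diacritical Marks block (U+0300 to U+036F) is the range of interest and that records arrive as (id, name) pairs. Both function names are hypothetical and are not taken from mod/cleanup_report.py.

```python
import re

# Unicode "Combining Diacritical Marks" block.
COMBINING_START, COMBINING_END = 0x0300, 0x036F

# Decimal numeric character references such as "&#769;".
_NCR = re.compile(r"&#(\d+);")


def has_combining_diacritic(name):
    """True if name holds a combining mark, raw or as a character reference."""
    if any(COMBINING_START <= ord(ch) <= COMBINING_END for ch in name):
        return True
    return any(
        COMBINING_START <= int(m.group(1)) <= COMBINING_END
        for m in _NCR.finditer(name)
    )


def cleanup_candidates(records):
    """Filter (record_id, name) pairs down to those needing manual review."""
    return [(rid, name) for rid, name in records if has_combining_diacritic(name)]


sample = [
    (1, "Calmann-L\u00e9vy"),    # precomposed, fine
    (2, "Calmann-Le&#769;vy"),   # combining acute stored as a reference
    (3, "Exile\u0301"),          # raw combining acute
    (4, "Tor &#38; Co"),         # unrelated reference
]
print(cleanup_candidates(sample))
```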

     
  • Ahasuerus

    Ahasuerus - 2015-01-12
    • summary: Pasting in accented characters fails (publishers & pub series) --> Pasting in accented characters fails
    • assigned_to: Ahasuerus
     
  • Ahasuerus

    Ahasuerus - 2015-01-12

    Changed the title to more accurately reflect the nature of the problem.

     
  • Ahasuerus

    Ahasuerus - 2015-01-12

    Part 1 - Fixed the data entry algorithm; converted title and publication records -- implemented in:

    common/library.py 1.72
    scripts/change_combining_diacritics.py 1.1
    

    Installed in r2015-01-05 on 2015-01-1. Keeping open since we still need to create cleanup reports for Author, Series, Publisher and Publication Series records.

     
  • Ahasuerus

    Ahasuerus - 2015-01-13

    Part 2 - Create a cleanup report for publishers:

    common/library.py 1.73
    mod/cleanup.py 1.81
    mod/cleanup_report.py 1.1
    mod/common.py 1.26
    mod/TARGETS 1.70
    nightly/nightly_update.py 1.80
    

    Installed in r2015-006 on 2015-01-12. Keeping the Bug open since we need 3 more cleanup reports for authors, series and publication series.

     
  • Ahasuerus

    Ahasuerus - 2015-01-13
    • status: open --> closed-fixed
     
  • Ahasuerus

    Ahasuerus - 2015-01-13

    Part 3 - Create cleanup reports for Authors, Series and Publication Series:

    mod/cleanup.py 1.82
    mod/cleanup_report.py 1.2
    mod/common.py 1.27
    nightly/nightly_update.py 1.81
    

    Installed in r2015-007 on 2015-01-13. Closing the Bug report. Any subsequent changes in this area will be handled as new Bugs/FRs.

     
