Pasting in accented characters fails

Brought to you by: ahasuerus_isfdb, alvonruff, mkupper

#423 Pasting in accented characters fails

Milestone: v1.0 (example)

Status: closed-fixed

Owner: Ahasuerus

Labels: None

Priority: 4

Updated: 2015-01-13

Created: 2013-12-29

Creator: Darrah Chavey

Private: No

When working with WorldCat records, a natural tactic is to copy & paste names from WorldCat, including authors, publishers, and publication series. At least for the last two, this results in records that look identical to other records in our system, but are kept as distinct publishers or distinct series. For example, if you do a publisher search for "Calmann", you will get two different publishers named "Calmann-Lévy". One appears to have the accented é as a Unicode character, and the other as the html character e+#769; (with "&" replaced by "+" so it doesn't auto convert the display here). Similarly, if you search for the publication series "du Futur", you will get two different series whose names both appear to be "Présence du Futur", but which are internally different (for the same reason). Such copy & paste errors should be corrected on data entry. They're not too hard to correct individually, but I have left these examples as is so you have something to work with.
I suspect that this same error would apply for other editors copying & pasting from other web pages as well, but have not checked this.
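
The mismatch described above can be reproduced outside the database. The following is a minimal Python sketch, not the project's code: the two string values are hypothetical stand-ins for the two "Calmann-Lévy" records, and `same_after_normalizing` is an illustrative helper name.

```python
import html
import unicodedata

# Hypothetical stored values: one record holds a precomposed e-acute (U+00E9),
# the other holds a plain "e" followed by the entity for U+0301.
precomposed = "Calmann-L\u00e9vy"
pasted = "Calmann-Le&#769;vy"

# A browser decodes the entity, so both values look the same on screen.
rendered = html.unescape(pasted)


def same_after_normalizing(a: str, b: str) -> bool:
    """Compare two names after decoding entities and composing accents (NFC)."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", html.unescape(s))
    return norm(a) == norm(b)


print(precomposed == pasted)                        # False: raw values differ
print(precomposed == rendered)                      # False: still base letter + combining mark
print(same_after_normalizing(precomposed, pasted))  # True: same name once composed
```

An exact string match treats the two values as different names, which is consistent with the duplicate publisher and publication series records reported here.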

Discussion

Ahasuerus - 2014-12-29

After reviewing the data in the database, I see that "&#769;" is a separate case. It is the "Combining Acute Accent" character (U+0301), part of the "Combining Diacritical Marks" family of characters. These characters modify the preceding character, so what is actually stored in the database is strings like "Sala de Autopsias Nu&#769;mero 4", "Harry Potter agus an o&#769;rchloch" and "Exile&#769;". As reported above, this can happen during copy-and-paste operations when a third-party Web site is the source.

I think we need to do two things to address this issue. First, we need to modify the data entry logic to convert "known culprits" like "o&#769;" and "e&#769;" to their Latin-1 equivalents ("ó" and "é"). Second, we need to create one or more cleanup reports to find existing records that contain "&#769;".
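
The two steps proposed above might look roughly like this in Python. This is only a sketch under stated assumptions: the mapping table is illustrative (the ticket does not list the actual "known culprits"), and `convert_known_culprits`, `needs_cleanup` and `cleanup_report` are hypothetical names, not functions from common/library.py.

```python
# Illustrative table only: the real list of "known culprits" is not given here.
KNOWN_CULPRITS = {
    "a&#769;": "\u00e1",  # a-acute
    "e&#769;": "\u00e9",  # e-acute
    "i&#769;": "\u00ed",  # i-acute
    "o&#769;": "\u00f3",  # o-acute
    "u&#769;": "\u00fa",  # u-acute
    "E&#769;": "\u00c9",  # capital E-acute
}


def convert_known_culprits(value: str) -> str:
    """Step 1 (data entry): replace each known pair with its Latin-1 character."""
    for bad, good in KNOWN_CULPRITS.items():
        value = value.replace(bad, good)
    return value


def needs_cleanup(value: str) -> bool:
    """Step 2 (cleanup report): flag values that still contain the entity."""
    return "&#769;" in value


def cleanup_report(names):
    """Return the names that a moderator would need to review."""
    return [name for name in names if needs_cleanup(name)]


print(convert_known_culprits("Exile&#769;"))
print(cleanup_report(["Calmann-Le&#769;vy", "Calmann-L\u00e9vy"]))
```

A fixed table only catches the pairs it lists, so anything it misses still has to surface in the cleanup report.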


Ahasuerus - 2015-01-12

It turns out that the problem is not limited to the "Combining Acute Accent" character. Many other combining diacritics can cause similar issues and need to be converted to standard Unicode characters at data entry time.
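
One way to cover the whole family rather than a fixed list is to look for any decimal entity in the Combining Diacritical Marks block (U+0300 to U+036F, decimal 768 to 879) and let Unicode NFC composition pick the replacement. The Python sketch below assumes that approach; it is not the algorithm installed for this bug, and `compose_combining_entities` is a hypothetical name.

```python
import re
import unicodedata

# A base character followed by a decimal character reference, e.g. "e&#769;".
BASE_PLUS_ENTITY = re.compile(r"(.)&#(\d+);")


def compose_combining_entities(value: str) -> str:
    """Replace base + combining-mark entity with one precomposed character."""
    def repl(match):
        base, code = match.group(1), int(match.group(2))
        # Leave entities outside the Combining Diacritical Marks block alone.
        if not 0x0300 <= code <= 0x036F:
            return match.group(0)
        composed = unicodedata.normalize("NFC", base + chr(code))
        # Substitute only when a single precomposed character exists.
        return composed if len(composed) == 1 else match.group(0)
    return BASE_PLUS_ENTITY.sub(repl, value)


print(compose_combining_entities("Pre&#769;sence du Futur"))
print(compose_combining_entities("Man&#771;ana"))  # combining tilde, U+0303
```

The sketch handles one mark per base letter in a single pass and leaves a pair unchanged when no precomposed form exists. It also does not address how a composed character outside Latin-1 would be stored.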


Ahasuerus - 2015-01-12

summary: Pasting in accented characters fails (publishers & pub series) --> Pasting in accented characters fails

assigned_to: Ahasuerus

Ahasuerus - 2015-01-12

Changed the title to more accurately reflect the nature of the problem.


Ahasuerus - 2015-01-12

Part 1 - Fixed the data entry algorithm; converted title and publication records -- implemented in:

common/library.py 1.72
scripts/change_combining_diacritics.py 1.1

Installed in r2015-01-05 on 2015-01-1. Keeping open since we still need to create cleanup reports for Author, Series, Publisher and Publication Series records.

Ahasuerus - 2015-01-13

Part 2 - Create a cleanup report for publishers:

common/library.py 1.73
mod/cleanup.py 1.81
mod/cleanup_report.py 1.1
mod/common.py 1.26
mod/TARGETS 1.70
nightly/nightly_update.py 1.80

Installed in r2015-006 on 2015-01-12. Keeping the Bug open since we need 3 more cleanup reports: for authors, series and publication series.

Ahasuerus - 2015-01-13

status: open --> closed-fixed

Ahasuerus - 2015-01-13

Part 3 - Create cleanup reports for Authors, Series and Publication Series:

mod/cleanup.py 1.82
mod/cleanup_report.py 1.2
mod/common.py 1.27
nightly/nightly_update.py 1.81

Installed in r2015-007 on 2015-01-13. Closing the Bug report. Any subsequent changes in this area will be handled as new Bugs/FRs.