I'm seeing strange characters like “ †’ – instead of " or ', often inside notes.
How can I fix this?
PhpGedView 4.3.1
PHP version: 5.6.30
Cheers Paul.
You have a browser configuration problem. PhpGedView uses UTF-8 to produce all of its displays. Your browser isn't configured to support this. It's probably set to a single-byte encoding such as Windows-1252 (often loosely called ANSI or ASCII) rather than UTF-8.
It's also possible that some of your Notes text was double-encoded. This can happen when you use an external text editor such as Notepad: you open a UTF-8 encoded file in that single-byte mode and then save it in UTF-8 mode. Unless you tell it otherwise, recent versions of Notepad save in UTF-8 mode.
When a UTF-8 file is read in single-byte mode, each special character shows up as two or more consecutive odd-looking characters, like the ones in your original post. When you then save that file in UTF-8 mode, each of those odd-looking characters is individually encoded into UTF-8, so the Notes text no longer contains the original special characters and they are no longer displayed properly.
This Wikipedia article discusses UTF-8:
https://en.wikipedia.org/wiki/UTF-8
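The double-encoding described above can be reproduced in a few lines. This is only a minimal Python sketch, and it assumes the single-byte misreading was Windows-1252 (`cp1252`), which nobody in this thread has confirmed. The sample text and variable names are made up for illustration.

```python
# Reproduce the double-encoding: UTF-8 bytes misread as Windows-1252.
# Assumption: the misreading used cp1252; the thread does not confirm this.

original = "It\u2019s"  # "It's" written with a curly apostrophe (U+2019)

# Step 1: the text is stored correctly as UTF-8 (the apostrophe takes 3 bytes).
utf8_bytes = original.encode("utf-8")

# Step 2: an editor reads those bytes as cp1252, so each byte becomes one character.
misread = utf8_bytes.decode("cp1252")

# Step 3: the editor saves what it displays as UTF-8, baking the damage in.
double_encoded = misread.encode("utf-8")

print(repr(misread))
print(len(utf8_bytes), "bytes before,", len(double_encoded), "bytes after")
```

With this input the apostrophe turns into the three characters ’, which is one of the sequences shown in the original post.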
G'Day & Thanks Gerry,
It's happening in all browsers & on different computers & for other users viewing my pages too.
I guess they must have been encoded into ASCII mode at some point.
Is there some way of correcting this in the database? Or am I stuck manually fixing each note?
Cheers, Paul
The trouble is, not all of your Notes that contain special characters have this problem.
You can export your database to a GEDCOM file and then open that file in Notepad in UTF-8 mode. Inspect the GEDCOM file until you find a Note whose text exhibits this problem. Do a Replace All, replacing that particular sequence (it should be 2 characters) with the correct special character. Repeat for the other weird sequences, always replacing just the first 2 characters of each sequence, until there are no more errors.
Note: You may encounter the same erroneous coding in other places, such as place names. These can be corrected in the same way.
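If there are too many sequences to fix by hand, the same Replace All idea could be scripted. The following is a rough Python sketch, not a tested PhpGedView tool. It assumes the damage came from UTF-8 bytes being misread as Windows-1252 (`cp1252`). The names `CANDIDATES`, `build_repair_table` and `repair_text` and the sample NOTE line are all hypothetical.

```python
# Sketch of the "replace all" repair, assuming the damage came from UTF-8
# bytes being misread as Windows-1252 (cp1252). All names are hypothetical.

# Characters expected in the notes; extend this list to suit your own data.
CANDIDATES = "\u2018\u2019\u201c\u201d\u2013\u2014\u00e9\u00e8\u00fc\u00f6\u00e4"


def build_repair_table(candidates):
    """Map each garbled sequence to the character it originally stood for."""
    table = {}
    for ch in candidates:
        try:
            garbled = ch.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # Some bytes have no cp1252 character (U+201D contains one such
            # byte); those sequences still need to be fixed by hand.
            continue
        table[garbled] = ch
    return table


def repair_text(text, table):
    """Replace every known garbled sequence, longest sequences first."""
    for garbled in sorted(table, key=len, reverse=True):
        text = text.replace(garbled, table[garbled])
    return text


table = build_repair_table(CANDIDATES)
sample = "1 NOTE It\u00e2\u20ac\u2122s a \u00e2\u20ac\u0153test\u00e2\u20ac\u201c"
print(repair_text(sample, table))
```

Because it only replaces known garbled sequences, text that was never damaged is left alone, which matters here since not all of the Notes have the problem. Keep a backup of the exported GEDCOM before running anything like this.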
When there are no more weird sequences in your GEDCOM, look at the first few lines of the GEDCOM. There should be a 1 CHAR UTF-8 line in there. If not, change the 1 CHAR line that you find to say "UTF-8".
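The header check can also be scripted. A small Python sketch, assuming the encoding line always has the form "1 CHAR something" as described above; the function name `force_utf8_char_line` and the sample header lines are invented for illustration:

```python
# Sketch: make sure the GEDCOM header declares UTF-8.
# Assumption: the encoding line has the form "1 CHAR <name>".


def force_utf8_char_line(lines):
    """Return a copy of lines with any '1 CHAR ...' line set to '1 CHAR UTF-8'."""
    fixed = []
    for line in lines:
        parts = line.split()
        if parts[:2] == ["1", "CHAR"]:
            fixed.append("1 CHAR UTF-8")
        else:
            fixed.append(line)
    return fixed


# Made-up header for demonstration only.
header = ["0 HEAD", "1 SOUR PhpGedView", "1 CHAR ANSI", "0 @I1@ INDI"]
print(force_utf8_char_line(header))
```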
When you're completely done, save the resultant file in UTF-8 mode, preferably UTF-8 without BOM. The Byte Order Mark is a 3-byte sequence at the very beginning of the file that self-identifies the file as being in UTF-8. If the file has one, PhpGedView will tell you, but this just causes a warning message.
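If your editor insists on writing a BOM, it can be removed afterwards. A minimal Python sketch that works on the raw bytes of the file; `strip_utf8_bom` is a made-up helper name, and `codecs.BOM_UTF8` is the standard library constant for the three bytes EF BB BF:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Demonstration on in-memory bytes; a real file would be read with open(path, "rb").
with_bom = codecs.BOM_UTF8 + b"0 HEAD\n1 CHAR UTF-8\n"
print(strip_utf8_bom(with_bom))
```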
You can zip the resultant file and then upload and import it into your database. Be sure to tell PhpGedView to replace the existing database contents. If you're asked whether you want to keep media links (shouldn't happen), say "no". The GEDCOM you previously downloaded has all the required media information in it.
Addendum:
I said, "should be 2 characters". That's not true. Depending on which character was originally represented, you can have a 3-byte sequence. You'll just have to play it by ear.
The UTF-8 specification allows sequences of up to 4 bytes. 4-byte sequences are very rarely used, and I doubt that you'll run into those.
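To see how many bytes a given character takes in UTF-8, you can check it directly. A short Python sketch; the four sample characters are arbitrary picks, one for each possible length:

```python
# Show how many bytes UTF-8 needs for characters from different ranges.
# The sample characters are arbitrary examples, one per sequence length.
samples = ["A", "\u00e9", "\u2019", "\U0001F600"]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```

When such a character is double-encoded, each of its bytes shows up as one odd-looking character, which is why the garbled sequences can be 2 or 3 characters long.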
Hmm... Ok I'll give it a go ... watch this space ....
Cool, all done. Thank you Gerry, not as painful as I thought it might be. Cheers.
Good. Glad to have been able to help.
UTF-8 encoding is quite tricky. I had to become intimately familiar with it when I cooked up some UTF-8 functions and sorting algorithms that are used extensively in PhpGedView.