[Refdb-users] character encoding stuff
From: <Mar...@en...> - 2003-03-24 19:07:17
On Fri, 21 Mar 2003, ref...@li... wrote:

> Message: 6
> Date: Fri, 21 Mar 2003 12:30:11 -0500
> From: "Bruce D'Arcus" <bd...@fa...>
> To: ref...@li...
> Subject: [Refdb-users] character encoding stuff
>
> More setting up issues:
>
> I have a variety of characters that -- for whatever reason -- are not
> making it through the translation from Endnote to RIS. The most common
> problem is that curly single and double quotes get replaced by ?. So I
> have notes with ?quotes like this?. I also have words like "don?t."
>
> Beyond figuring out how to fix this in a huge file (am not sure the
> regular expression code to use to find this in jEdit, because the ?
> character has a special meaning), will I need to worry about this in
> the future? Ideally I'd like a clean database where I can move the
> data in and out without worrying about these encoding issues. I
> understand MySQL supports Latin-1 encoding by default, so I assume
> there's no problem there. Is that right?
>

In case you do not already know, be aware that Microsoft software
(Word etc.) frequently uses non-Latin-1 characters like the (in-)famous
"smart quote", which infest web pages for instance.

Extract from: <http://www.fourmilab.ch/webtools/demoroniser/>

  You see, "state of the art" Microsoft Office applications sport a
  nifty feature called "smart quotes." (Rule of thumb--every time
  Microsoft use the word "smart," be on the lookout for something
  dumb). This feature is on by default in both Word and PowerPoint,
  and can be disabled only by finding the little box buried among the
  dozens of bewildering option panels these products contain. If
  enabled, and you type the string,

    "Halt," he cried, "this is the police!"

  "smart quotes" transforms the ASCII quote characters automatically
  into the incompatible Microsoft opening and closing quotes.
Other useful links:

<http://home.earthlink.net/~bobbau/platforms/specialchars/#windows>
<http://www.cs.tut.fi/~jkorpela/www/windows-chars.html>
<http://czyborra.com/charsets/iso8859.html#ISO-8859-1>

Hopefully doing some search/replace on these characters in your
documents is feasible for you? Or you can let the "demoroniser" do it
for you :-)

You can also try GNU recode, with something like this:

  recode -f windows-1252..latin1 myfile

Cheers,
Marc.
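
For anyone who would rather script the search/replace, here is a minimal Python sketch of the same idea. It is not taken from the demoroniser or from recode; the function names and the choice of ASCII substitutes are assumptions made for illustration. The first function assumes the input bytes really are Windows-1252. The second is a rough repair for text where the quotes have already been turned into a literal "?", and it shows the backslash escape needed to match "?" in a regular expression (the same escape should apply in most editors' regex search, though that is untested for jEdit here).

```python
import re

# Windows-1252 punctuation with no Latin-1 equivalent, mapped to plain ASCII.
# The substitutes are an illustrative choice, not a standard.
CP1252_TO_ASCII = {
    "\u2018": "'",    # left single quote  (byte 0x91)
    "\u2019": "'",    # right single quote (byte 0x92)
    "\u201c": '"',    # left double quote  (byte 0x93)
    "\u201d": '"',    # right double quote (byte 0x94)
    "\u2013": "-",    # en dash            (byte 0x96)
    "\u2026": "...",  # ellipsis           (byte 0x85)
}


def cp1252_to_latin1(raw: bytes) -> bytes:
    """Decode Windows-1252 bytes, swap smart punctuation, encode as Latin-1."""
    text = raw.decode("cp1252", errors="replace")
    for smart, plain in CP1252_TO_ASCII.items():
        text = text.replace(smart, plain)
    # Anything still outside Latin-1 (the euro sign, for example) becomes "?".
    return text.encode("latin-1", errors="replace")


def repair_apostrophes(text: str) -> str:
    """Turn a literal "?" between two word characters into an apostrophe.

    "\\?" matches a literal question mark; unescaped, "?" is a quantifier.
    Only the unambiguous case (don?t) is handled; a "?" at the edge of a
    word could be a lost quote or a real question mark, so it is left alone.
    """
    return re.sub(r"(?<=\w)\?(?=\w)", "'", text)


if __name__ == "__main__":
    print(cp1252_to_latin1(b"\x93Halt,\x94 he cried"))
    print(repair_apostrophes("I don?t know?"))
```

Note that once a quote has been replaced by "?", the original character is gone, so cases like ?quotes like this? still need a manual look.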