From: William P. <wil...@ya...> - 2009-05-29 16:20:19
On May 29, 2009, at 11:12 AM, Jon Auman wrote:

> If the SQL inserts all contain UTF8 characters, then there should be
> no problem with the import into a UTF8 postgresql database. If there
> are non-UTF8 characters in the SQL file, they can be stripped out with
> iconv or converted with a shell program called "recode"

Given the history of the legacy TreeBASE data, I believe that the vast
majority of diacriticals will be properly formed in utf8, but there
will be some malformed ones (1) dating from when we were entering data
through a Mac application (Apple8 characters) and (2) as a result of
people submitting data via web browsers that don't comply with our
meta tags regarding character codings.

I think it's fine to leave these malformed ones in (rather than
auto-stripping them out) because we will want to fix them by hand
later on, and they help alert us to where things need fixing.

bp
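For illustration only, one way to find the malformed sequences without stripping them is to report which lines of the SQL dump fail to decode as UTF-8. The sketch below is a minimal example under stated assumptions: the function name `find_malformed_utf8`, the table name `taxon`, and the sample data are hypothetical and not taken from the TreeBASE dump.

```python
def find_malformed_utf8(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8.

    Nothing is modified or stripped; the caller only learns where to look.
    """
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad


# Hypothetical sample: line 1 holds a well-formed UTF-8 "ü" (0xC3 0xBC),
# line 2 holds a lone 0x9F byte, which is "ü" in Mac Roman but invalid UTF-8.
sample = (
    b"INSERT INTO taxon VALUES ('M\xc3\xbcller');\n"
    b"INSERT INTO taxon VALUES ('M\x9fller');\n"
)

for lineno, raw in find_malformed_utf8(sample):
    print(lineno, raw)
```

A report like this leaves the data untouched, so the flagged lines can still be corrected by hand later.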