From: Jean-Marc L. <jea...@gm...> - 2011-10-05 09:11:18
I think there is no need for dumping the data. It's just a matter of ALTERing the tables to change the collation (the actual encoding is not changed). We could detect it in the installer and use a modified version of the existing function in the installer. As lph said, the only way it could fail is for people who have conflicting unique key values. And their Tiki will keep running just as well as now (that is, very difficult to distinguish from utf8_unicode_ci).

Cheers,
J-M

On Tue, Oct 4, 2011 at 4:47 PM, Jonny Bradley <jo...@ti...> wrote:
>
> +1 for 8.x - gives more testing before the LTS
>
> I can't see a (simple/safe) way to automatically fix legacy data, but maybe a "how to" describing doing a dump of the data, editing that and reimporting it might help?
>
> jb
>
>
> On 4 Oct 2011, at 15:01, Jean-Marc Libs wrote:
>
> > Hello,
> >
> > Due to my ongoing contract, I have had the occasion to dive into the subject of utf8_unicode_ci vs utf8_general_ci (vs utf8_bin).
> >
> > One of the best references is here: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
> >
> > And it says that while utf8_general_ci is "slightly faster" for collation, it is less correct. Both use the same encoding and are largely similar. The difference only appears when one does alphabetic sorts or searches. Since multilingualism is one of the strong points of Tiki, and a lot of Tiki users come to Tiki because of it, we should get it as right as we can.
> >
> > Therefore, for tiki sites in general, I would suggest that we switch the install routine in newer versions, maybe 8.x or at least starting with trunk (future 9.x which will be next LTS version) to convert to utf8_unicode_ci not to utf8_general_ci. On the reference site it explains that one of the key errors of utf8_general_ci is that it fails to correctly collate the German letter "ß" (scharfes S).
> > With German being the 2nd or 3rd most frequently used language for tiki, and meanwhile one into which tiki is even more fully translated than into French, German collation should work. Also, it states that, overall, the utf8_general_ci collation cannot handle so-called character combinations. This may even apply to French, when text is copied from some input locales in which accents and base characters are separately encoded, stating that ...
> >
> > "For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters."
> >
> > The change would be relatively straightforward: change all occurrences of utf8_general_ci to utf8_unicode_ci in the installer.
> > The only tricky part would be to decide how to handle people already in utf8_general_ci
> >
> > Any strong opinions?
> >
> > Would it be for Tiki8 or trunk?
> >
> > Cheers,
> > Jean-Marc "Jyhem" Libs
> >
> > On Sun, Jun 19, 2011 at 7:29 PM, Louis-Philippe Huberdeau <lph...@lp...> wrote:
> > Hello,
> >
> > I can't really say what happened here. I had tested the conversion script in Berlin with you last summer and it all worked fine as far as I can remember. The script does depend on MySQL being a fairly recent version to work and there are adjustments that need to be done in db/local.php.
> > I have not run the script myself for a long time, but this should all be documented.
> >
> > The script does not do a manual selection of which fields will be converted. It simply converts any field of the varchar or text family. Did you get any error while processing the conversion? Did you perform both steps of the conversion?
> >
> > As for the general vs unicode decision, it's quite arbitrary. I personally do not know all of the differences between the collations and 'general' seemed suitable for general purposes. I may be wrong.
> >
> >
> > --
> > LP
> >
> >
> > On Sat, Jun 18, 2011 at 3:21 PM, Olaf-Michael Stefanov <oms...@gm...> wrote:
> > Dear Louis-Philippe, Dear tiki-devel colleagues,
> >
> > This is definitely NOT a case of a rename not having taken place because of a conflict.
> > In fact, while all references, everywhere in all wiki pages are being converted, the column pageName in table tiki_pages is NOT being converted.
> > In this table, tiki_pages, although the column "pageName" [ varchar(160) ] is now utf8_general_ci the contents were not converted during the utf8 fix, unlike the whole rest of the database (as far as I've reviewed it), including the column "data" [longtext] in the same table.
> >
> > At jiamcatt we have several pages in Arabic, Chinese, or Russian. Some have latin page names, but others have pagenames which contain the full utf8 char set now permitted in URLs. Here are a few examples of pagenames NOT converted, with the accompanying data column, taken from PHPMyAdmin Version: 3.4.0, taken from a database which during install said:
> > Tiki version: 7.0
> > PHP version: 5.2.17
> > Server:
> > www.jiamcatt.ch
> >
> > Sent: Tue, 14 Jun 11 23:36:12 +0200
> >
> >
> > <moz-screenshot-27.png>
> >
> > By manually renaming page 1081 to <首页> links to this page now work, since the rest of the tiki's contents were converted. And since the contents were also correctly converted it shows correctly.
> > Similarly, renaming page 1080 to <الصفحة الرئيسية>, and page 1082 to <Главная> has the same positive effects. The same applies to pages such as page 1198 which should read <DomeñoBlais-11> but, in table tiki_pages continues to have the uncorrected name <DomeñoBlais-11> while references to this page have been corrected.
> > This also applies to page 1206 <González-Martínez-11> with uncorrected name <González-MartÃnez-11>, and to page 1114 <Guillén-11> with the uncorrected name <Guillén-11>.
> >
> > NONE OF THESE PAGES has a "counterpart" page with the same name but without an accent.
> >
> > The only exception, at least in this wiki, is page 1313 with the uncorrected pageName <JIAMCATT 2011 Présentations> which should read <JIAMCATT 2011 Présentations> but cannot be renamed as such because there IS a page with the English equivalent <JIAMCATT 2011 Presentations>.
> >
> > What I'm saying, in short, is that - in actual support of your premise, Louis-Philippe - values in the pageName column of table tiki_pages SHOULD be converted to unicode, along with the values of all other fields. In the rare cases where this results in a duplicate page name, let it fail then.
> >
> > The current situation, in version 7.0 is that all references to pages with pageName values in any language other than English FAIL, because the references have been converted but the pageNames themselves have not.
> >
> > Or, at least make it an option, to either convert pageName values or not. The latter may be useful in French/English bilingual tikis with lots of présentation/presentation potential duplicates. In all other cases, and in all other languages conversion (utf8 fixing) should apply to the pageName values in table tiki_pages, the same as it applies to all other data in the tiki sql database.
> >
> > Finally, does anyone know why the utf8 fix goes to collation utf8_general_ci and not utf8_unicode_ci?
> >
> > Kind regards,
> > olaf-michael (omstefanov)
> >
> >
> > On 2011-06-13 14:08, Louis-Philippe Huberdeau wrote:
> >> The rename most likely failed for that page because of a conflict. When using UTF-8, MySQL will consider accented characters to be equivalent to the non-accented ones. This mostly helps in sorting the content properly, but it does have this annoying side-effect that differs from the behavior with the previously mangled data. If you don't have too many of those issues, it's probably best to just rename those pages manually.
> >>
> >> --
> >> LP
> >>
> >> On Mon, Jun 13, 2011 at 7:54 AM, Olaf-Michael Stefanov <oms...@gm...> wrote:
> >> I've just done a major upgrade of a database from tikiwiki 4.2 to both 6.3-LTS and tiki 7.0.
> >>
> >> In both cases I needed to run the "Convert database and tables to UTF-8" part of step 5 Install/Upgrade of the tiki-installer.
> >> In both cases afterwards it appeared that all was converted. No further message appeared.
> >>
> >> Now I've found at least one case where the conversion DID NOT TAKE PLACE.
> >> We have a pair of wiki pages, in English and French, where the French page (page 1313) is called <JIAMCATT+2011+Présentations>.
> >> Yet, when the page name is displayed at the top, it shows, in both 6.3-LTS and 7.0 as:
> >> JIAMCATT 2011 Présentations
> >> The screen shot of the upper part of the screen, attached below, shows that the large number of other French accented characters are showing correctly.
> >> It also shows that the file name itself appears to have been properly converted, as it appears correctly in the URL.
> >>
> >> Could someone have a look why this does not apply to the page name in the wiki-page body header.
> >>
> >> Here the screen capture:
> >> <Mail Attachment.png>
> >>
> >> The problem appears to exist in the case of all pages with accents in the page name.
> >> We don't have many, but here's another example, and it perhaps also provides the solution.
> >> The VISIBLE URL is: http://www.jiamcatt.ch/tiki-7.0/PinarGarcía (with the second last character being a lower-case "i" with an acute accent in place of the dot).
> >> However, when one highlights the URL and copies it, it shows as:
> >> URL: http://www.jiamcatt.ch/tiki-7.0/PinarGarc%C3%ADa
> >> where the string "%C3%AD" certainly isn't the usual version of "í", which shows up in the body header, where it is:
> >> PinarGarcÃa
> >> mix-up
> >> How does one prevent this mixup of names?
> >> In the case of this site,
> >> the English page name is "JIAMCATT+2011+Presentations", while
> >> the French page name is "JIAMCATT+2011+Présentations", i.e., identical, except for the accent.
> >> In the second case the page is the result of a pretty-tracker which creates the page name based on the surname of someone registering for a meeting.
> >> In either case, stripping off the accents isn't really a solution.
> >>
> >> Kind regards,
> >> olaf-michael (omstefanov)
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> EditLive Enterprise is the world's most technically advanced content
> >> authoring tool. Experience the power of Track Changes, Inline Image
> >> Editing and ensure content is compliant with Accessibility Checking.
> >> http://p.sf.net/sfu/ephox-dev2dev
> >> _______________________________________________
> >> TikiWiki-devel mailing list
> >> Tik...@li...
> >> https://lists.sourceforge.net/lists/listinfo/tikiwiki-devel
> >
> > ------------------------------------------------------------------------------
> > All the data continuously generated in your IT infrastructure contains a
> > definitive record of customers, application performance, security
> > threats, fraudulent activity and more. Splunk takes this data and makes
> > sense of it. Business sense. IT sense. Common sense.
> > http://p.sf.net/sfu/splunk-d2dcopy1
> > _______________________________________________
> > TikiWiki-devel mailing list
> > Tik...@li...
> > https://lists.sourceforge.net/lists/listinfo/tikiwiki-devel
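
Two effects discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tiki's conversion script: the helper names (mangle, repair, collation_key, find_conflicts) are hypothetical, the mangled page names are assumed to come from UTF-8 bytes being read as Latin-1, and collation_key is only a rough stand-in for an accent-insensitive MySQL collation (it does not reproduce the real utf8_general_ci or utf8_unicode_ci rules).

```python
import unicodedata
from collections import defaultdict
from urllib.parse import quote


def mangle(name: str) -> str:
    """Simulate the damage: UTF-8 bytes misread as Latin-1 (assumed cause)."""
    return name.encode("utf-8").decode("latin-1")


def repair(name: str) -> str:
    """Undo the misreading; leave names that do not round-trip untouched."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def collation_key(name: str) -> str:
    """Rough approximation of an accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def find_conflicts(names):
    """Group names that would collide on a unique key under collation_key."""
    groups = defaultdict(list)
    for name in names:
        groups[collation_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(quote("PinarGarcía"))          # PinarGarc%C3%ADa
    print(repair(mangle("PinarGarcía")))  # PinarGarcía
    print(find_conflicts([
        "JIAMCATT 2011 Presentations",
        "JIAMCATT 2011 Présentations",
        "Guillén-11",
    ]))
```

The first part shows why "%C3%AD" in the copied URL is in fact the normal UTF-8 form of "í", and why the same two bytes show up as "Ã" plus an invisible soft hyphen when read as Latin-1. The second part lists page names that would likely clash on a unique key once accents are ignored, which is the one case mentioned above where a rename or an ALTER of the collation could fail.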