Got it! Thanks heaps to Kai and Christiaan for their help.

 

Here’s what I had to do on my Windows PC to get diacritics and other special characters working nicely for R2.1rc3:

 

Added to my.ini under [mysqld]:

default-character-set=utf8
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake

 

Added to the JAVA_OPTS environment variable:

 

-Dfile.encoding=UTF-8

 

Added to the beginning of Tomcat’s startup.bat (positioning is important! See also http://www.mail-archive.com/solr-user@lucene.apache.org/msg05556.html):

set JAVA_OPTS="-Dfile.encoding=UTF-8"  (may not be necessary for unix/linux)

 

Added to the Solr schema.xml, in both the “index” and “query” sections of the fieldtype “text” (this lets users search in plain text, without the diacritic character):

<filter class="solr.ISOLatin1AccentFilterFactory"/>
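
To illustrate why the filter has to sit in both the “index” and “query” sections, here is a minimal Python sketch of accent folding. It is not Solr's implementation: fold_accents and matches are hypothetical helper names, and the sketch only strips combining marks, so it does not cover every mapping the Solr filter performs (for example ß or æ).

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after canonical decomposition, e.g. 'ä' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, indexed_value: str) -> bool:
    """Fold and lowercase both sides, as an index-time plus query-time filter would."""
    return fold_accents(query).lower() == fold_accents(indexed_value).lower()


# The plain-text query and the accented query both find the accented author name.
print(matches("hafliger", "Häfliger"))
print(matches("Häfliger", "Häfliger"))
```

If only the index side were folded, an accented query would no longer match the folded index term, which is why both sides get the same treatment.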

 

Changed the exec line in function convertFile of class.fulltext_tools to the following (required to encode PDF fulltext etc. correctly):

exec(APP_PDFTOTEXT_EXEC.' -enc UTF-8 -q -nopgbrk '.$filename.' '.$textfilename);

 

bern

 

 

From: Bernadette Houghton [mailto:bernadette.houghton@deakin.edu.au]
Sent: Tuesday, 1 September 2009 11:02 AM
To: 'Jauslin Kai'
Cc: 'fez-users@lists.sourceforge.net'
Subject: Re: [Fez-users] Diacritics

 

Hi Kai, thanks for your email. Still having issues, though. I suspect we have an underlying problem somewhere which is preventing the ISOLatin1AccentFilterFactory from working.

 

Tables in SQLyog (e.g. frsk_author) display diacritics correctly, e.g. Coté, J. But in Solr admin the same value displays with garbled characters in place of the é. In the Fez editing form, the display is correct (Coté, J). In record view, all is well. In list view, it is not. If I switch Solr off, all is OK everywhere.
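
One common cause of this kind of symptom is UTF-8 bytes being decoded as a single-byte encoding somewhere between the database and the display. The Python sketch below shows how that garbling arises. It assumes ISO-8859-1 as the wrong encoding, which is a guess for illustration and not a diagnosis of this particular setup.

```python
original = "Coté, J"

# Encode as UTF-8, as a correctly configured database would hand the value over.
utf8_bytes = original.encode("utf-8")

# Hypothetical fault: a later stage decodes the same bytes as ISO-8859-1.
garbled = utf8_bytes.decode("iso-8859-1")

# In UTF-8 the "é" is two bytes (0xC3 0xA9), so it becomes two characters
# (U+00C3 followed by U+00A9) and the string grows by one.
print(len(garbled) - len(original))
print(garbled == "Cot\u00c3\u00a9, J")
```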

 

I’ve set JAVA_OPTS="-Dfile.encoding=UTF-8" in my environment variables but it had no impact.

 

Can you make any further suggestions?

 

Regards

bern

 

From: Jauslin Kai [mailto:kai.jauslin@library.ethz.ch]
Sent: Monday, 31 August 2009 6:54 PM
To: Bernadette Houghton
Cc: fez-users@lists.sourceforge.net
Subject: Re: Diacritics

 

Hi Bern,

 

We use a special filter for text fields in Solr: ISOLatin1AccentFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac). You need to add this to the “index” and “query” section of the fieldtype “text” in the Solr schema.xml. The “all” field should have this type.

 

The second thing you need to make sure of is that you have a perfect UTF-8 workflow. I have encountered several issues (unfortunately I did not have time to report them all back). Possibly, they are already corrected in the newest trunk version. Let’s see:

- MySQL connection (as you mentioned), plus: if you are upgrading the database, check that the fez_record_search_key* tables are really on utf8_unicode_ci (if not, you can change it manually with the MySQL query browser for each table).

- Fulltext indexing: in file class.fulltext_tools, check that it has the “-enc UTF-8” flag in the line exec(APP_PDFTOTEXT_EXEC." -enc UTF-8 -nopgbrk $filename $textfilename");

- class.fulltext_index: check that updateFulltextCache contains the line $fulltext = utf8_encode($fulltext);

 

You can do checks at several levels:

1. MySQL table fez_fulltext_cache: should contain correct UTF-8, i.e. when viewing in MySQL query browser or in SQLyog. This is the source for Solr. If it’s wrong here (e.g. double characters for diacritics), it will be wrong in Solr.

2. Solr Admin Backend: search for pid, e.g. “eth:12345” and check the XML. This should be correct when viewing in the web browser.

3. Fez Editing Form (if correct here, it should also be correct when viewing). Make sure that your Smarty templates all have UTF-8 file encoding and UTF-8 character set.
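
For the checks above, a rough way to spot the “double characters” case in a value copied from any of these levels is to test whether the text turns into different, valid UTF-8 when its characters are treated as ISO-8859-1 bytes. The sketch below is a heuristic only, and looks_double_encoded is a hypothetical helper name. It will miss garbling that went through other encodings such as Windows-1252.

```python
def looks_double_encoded(text: str) -> bool:
    """Return True if the text, re-read as ISO-8859-1 bytes, decodes to
    different valid UTF-8. That suggests it was encoded to UTF-8 twice."""
    try:
        candidate = text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or not valid UTF-8 afterwards:
        # treat the text as not double-encoded.
        return False
    return candidate != text


# A correctly stored value, and the same value after one extra
# ISO-8859-1 to UTF-8 conversion (which is what PHP's utf8_encode does).
good = "Coté, J"
bad = good.encode("utf-8").decode("iso-8859-1")

print(looks_double_encoded(good))
print(looks_double_encoded(bad))
```

Running the same check at each level (database, Solr XML, editing form) can help narrow down where the extra conversion happens.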

 

Cheers, Kai

 

 

--

ETH Zürich, Kai Jauslin, ETH-Bibliothek, Prozesse und IT, Integration und Entwicklung, Rämistrasse 101, CH-8092 Zürich, Tel +41-44-6324972, Büro HG H29.5, kai.jauslin@library.ethz.ch, www.ethbib.ethz.ch

 

From: Bernadette Houghton [mailto:bernadette.houghton@deakin.edu.au]
Sent: Monday, 31 August 2009 03:19
To: Jauslin Kai
Subject: Diacritics

 

Hi Kai, I note that you seem to have diacritics set up nicely at ETH: you can search with and without the diacritic character, e.g. either “hafliger” or “Häfliger” will retrieve this author. This isn’t happening for us, though. We can only retrieve by searching with the diacritic.

 

I’ve added the following to my.ini, as per a previous message from you on fez-users –

 

default-character-set=utf8

collation_server=utf8_unicode_ci

character_set_server=utf8

skip-character-set-client-handshake

 

(We also have a bit of an issue with diacritics displaying with strange characters in List view, with SOLR turned on, but this seems to be another story).

 

Any suggestions you can offer will be much appreciated.


Regards

bern

 

 

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: bern_houghton@hotmail.com
Email: bernadette.houghton@deakin.edu.au
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B (Vic)
