The original instructions I provided (2 Sept) work beautifully on my Windows XP box. But we had to make some slight amendments on our DEV/UAT/PROD Linux boxes to get fulltext indexing working correctly with diacritics. No guarantees that you’ll need to make these same changes at your end – much seems to depend on local configuration factors. The Linux amendments we had to make –


·         Don’t apply this change from the original instructions – the class.fulltext_tools, function convertFile line (which was needed on Windows to encode PDF fulltext etc. correctly):

exec(APP_PDFTOTEXT_EXEC.' -enc UTF-8 -q -nopgbrk '.$filename.' '.$textfilename);

·         Comment out this line in updateFulltextCache (class.fulltext_index):

$fulltext = utf8_encode($fulltext);


Without the above changes, there seems to be a “double-encoding” problem that results in garbage being stored in fulltext_cache and solr.
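For anyone curious what the “double-encoding” actually does, here is a minimal Python sketch (illustrative only – Fez itself is PHP; PHP’s utf8_encode() treats its input as Latin-1, so applying it to text that is already UTF-8 mangles every multi-byte character):

```python
# Demonstrate the "double-encoding" problem: taking text that is already
# UTF-8 and re-encoding it as if it were Latin-1 (which is what PHP's
# utf8_encode() does) turns each accented character into two garbage chars.
text = "Coté"                       # correctly stored UTF-8 text
utf8_bytes = text.encode("utf-8")   # b'Cot\xc3\xa9'

# Misinterpreting those bytes as Latin-1 and re-encoding = double encoding
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

print(utf8_bytes.decode("utf-8"))       # Coté
print(double_encoded.decode("utf-8"))   # CotÃ© – the garbage seen in Solr
```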


We do still have a minor issue on our Linux boxes, whereby some PDFs that are classed as “usable” on Windows are “unusable” on Linux. This mainly affects non-English PDFs.



From: Bernadette Houghton
Sent: Wednesday, 2 September 2009 3:31 PM
To: ''; 'Jauslin Kai'
Subject: RE: Diacritics


Got it! Thanks heaps to Kai and Christiaan for their help.


Here’s what I had to do on my Windows PC to get diacritics and other special characters working nicely for R2.1rc3 –


Added to my.ini under [mysqld] –
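(The exact lines didn’t survive the mail archive. The settings usually recommended for a UTF-8 MySQL setup are along these lines – treat the fragment below as an assumption, not a copy of my actual config, and check it against your server version:)

```ini
# Assumed my.ini fragment - the usual UTF-8 settings under [mysqld];
# the utf8_unicode_ci collation matches what Kai recommends for the
# fez_record_search_key* tables later in this thread.
[mysqld]
character-set-server = utf8
collation-server     = utf8_unicode_ci
```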






Added to the JAVA_OPTS environment variable –
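(Again, the exact value was lost in archiving. Judging from the startup.bat line in this same message, it was presumably the same file-encoding switch – an assumption, not a verbatim copy:)

```
JAVA_OPTS=-Dfile.encoding=UTF-8
```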




Added to beginning of tomcat startup.bat (positioning is important!) –

set JAVA_OPTS="-Dfile.encoding=UTF-8"  (may not be necessary for unix/linux)


Added to solr schema.xml for the “index” and “query” analyzers of fieldtype=”text” (this lets users search in plain text, without the diacritic character) –

<filter class="solr.ISOLatin1AccentFilterFactory"/>
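In context, that filter sits inside both analyzers of the “text” fieldType. A sketch of the relevant part of schema.xml (element and class names are from Solr 1.x; your surrounding tokenizer/filter chain will differ):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- folds accented characters at index time: é -> e, ü -> u, ... -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- same folding at query time, so "hafliger" matches "Häfliger" -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```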


Changed class.fulltext_tools, function convertFile line to the following (required to encode PDF fulltext etc. correctly) –

exec(APP_PDFTOTEXT_EXEC.' -enc UTF-8 -q -nopgbrk '.$filename.' '.$textfilename);





From: Bernadette Houghton []
Sent: Tuesday, 1 September 2009 11:02 AM
To: 'Jauslin Kai'
Cc: ''
Subject: Re: [Fez-users] Diacritics


Hi Kai, thanks for your email. Still having issues, though. I suspect we have an underlying problem somewhere which is preventing the ISOLatin1AccentFilterFactory from working.


Tables in SQLyog (e.g. frsk_author) are displaying diacritics correctly, e.g. Coté, J. But in the solr admin they display as CotÃ©, J. In the fez editing form, display is correct (Coté, J). In record view, all is well; in list view, it is not. If I switch solr off, all is OK everywhere.


I’ve set JAVA_OPTS="-Dfile.encoding=UTF-8" in my environment variables, but it had no impact.


Can you make any further suggestions?





From: Jauslin Kai []
Sent: Monday, 31 August 2009 6:54 PM
To: Bernadette Houghton
Subject: RE: Diacritics


Hi Bern,


We use a special filter for text fields in Solr: ISOLatin1AccentFilter. You need to add this to the “index” and “query” sections of the fieldtype “text” in the Solr schema.xml. The “all” field should have this type.


The second thing you need to make sure of is that you have a perfect UTF-8 workflow. I have encountered several issues (unfortunately I did not have time to report them all back). Possibly they are already corrected in the newest trunk version. Let’s see:

-          MySQL connection (as you mentioned); plus, if you are upgrading the database, check that the fez_record_search_key* tables are really on utf8_unicode_ci (if not, you can change each table manually with the MySQL Query Browser).

-          Fulltext indexing: in file class.fulltext_tools, check that the exec call has the “-enc UTF-8” flag: exec(APP_PDFTOTEXT_EXEC." -enc UTF-8 -nopgbrk $filename $textfilename");

-          class.fulltext_index: check that updateFulltextCache contains the line $fulltext = utf8_encode($fulltext);
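For the collation check in the first point, a sketch of the SQL involved (the table name below is one assumed member of the fez_record_search_key* family – adjust to your own schema, and back up before converting):

```sql
-- Check which collation the search-key tables are actually on
SHOW TABLE STATUS LIKE 'fez_record_search_key%';

-- Convert one table that is not yet on utf8_unicode_ci (repeat per table)
ALTER TABLE fez_record_search_key_author
  CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```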


You can do checks at several levels:

1.       MySQL table fez_fulltext_cache: should contain correct UTF-8 when viewed in the MySQL Query Browser or in SQLyog. This is the source for Solr; if it’s wrong here (e.g. double characters for diacritics), it will be wrong in Solr.

2.       Solr admin backend: search for a pid, e.g. “eth:12345”, and check the XML. This should display correctly in the web browser.

3.       Fez editing form (if correct here, it should also be correct when viewing records). Make sure that your Smarty templates all have UTF-8 file encoding and use the UTF-8 character set.
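A quick way to automate check 1 is to test whether a cached string survives a reverse round-trip: if it can be pushed back through Latin-1 and still decode as valid UTF-8, it is very likely double-encoded. A heuristic sketch in Python (not part of Fez – just a diagnostic you could run over fez_fulltext_cache values):

```python
def looks_double_encoded(s: str) -> bool:
    """Heuristic: a correctly stored string cannot normally be re-encoded
    as Latin-1 and still decode as valid, different UTF-8 text."""
    try:
        fixed = s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False          # round-trip fails -> probably stored correctly
    return fixed != s         # round-trip changed it -> mojibake

print(looks_double_encoded("Coté"))    # healthy string  -> False
print(looks_double_encoded("CotÃ©"))   # double-encoded  -> True
```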


Cheers, Kai




ETH Zürich, Kai Jauslin, ETH-Bibliothek, Prozesse und IT, Integration und Entwicklung, Rämistrasse 101, CH-8092 Zürich, Tel +41-44-6324972, Büro HG H29.5


From: Bernadette Houghton []
Sent: Monday, 31 August 2009 03:19
To: Jauslin Kai
Subject: Diacritics


Hi Kai, I note that you seem to have diacritics set up nicely at ETH – you can search with or without the diacritic character, e.g. either “hafliger” or “Häfliger” will retrieve this author. This isn’t happening for us, though – we can only retrieve by searching with the diacritic.


I’ve added the following to my.ini, as per a previous message from you on fez-users –







(We also have a bit of an issue with diacritics displaying as strange characters in List view with SOLR turned on, but this seems to be another story.)


Any suggestions you can offer will be much appreciated.





Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
Deakin University CRICOS Provider Code 00113B (Vic)

Important Notice: The contents of this email are intended solely for the named addressee and are confidential; any unauthorised use, reproduction or storage of the contents is expressly prohibited. If you have received this email in error, please delete it and any attachments immediately and advise the sender by return email or telephone.
Deakin University does not warrant that this email and any attachments are error or virus free.