The database has been created like this:
CREATE DATABASE test
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci ;
SET NAMES 'utf8' ;
CREATE TABLE my_table (
content BLOB NOT NULL
) ENGINE innodb ;
The OSS crawler retrieves the MySQL content using SQL like this:
SELECT content FROM my_table ;
But when running queries in the OSS admin interface (and from my own PHP app), OSS is returning garbled content (if the content is non-English).
I've been trying to use CONVERT() and CAST() in the crawler SQL but nothing seems to work.
What do I need to do to get OSS to index non-English content in a MySQL database correctly?
Which language is the content you are trying to index written in?
Does MySQL return the content as UTF-8 characters?
If you are indexing content in a specific language, you can try setting the "Language" option in the database crawler settings and then indexing again.
For example, if your content is in French, you can set "Language" to "French".
The content is definitely stored correctly as UTF8 in the database.
It doesn't matter what language the content is, nor what the Language option setting is - the problem is not about indexing, it's almost certainly about how OSS is interpreting the content it retrieves from the database.
Running a query for an English word returns the correct documents, but the snippets are garbled. This happens both in the OSS admin interface and my own web app:
This suggests to me that OSS has incorrectly retrieved the content from the database.
You're right. It is probably related to the use of the BLOB field type. I will run some tests and give you feedback later today.
Sorry, I just realized I pasted the wrong DDL - the table column being indexed by OSS is a TEXT column. But the problem remains.
In general, how does OSS know what encoding to use when interpreting the data being indexed?
Regarding encoding, OSS depends on the setup of your system environment (locales) and on the setup of the database.
OSS uses the JDBC driver. With OSS 1.4 we provide the MySQL JDBC driver (5.1.22).
To avoid encoding issues, the best approach is to make sure UTF-8 is used everywhere. In most cases, if the database is already encoded in UTF-8, nothing has to be done.
Can you check the state of the JVM? In the OpenSearchServer user interface, look at the Runtime/System/Properties tab and check the variables related to encoding:
You may also check this page to learn how to explicitly force the charset when connecting to MySQL:
Current settings are:
MySQL is "Ver 14.14 Distrib 5.6.10, for Linux (x86_64)"
Running "OpenSearchServer v1.3.1 - stable - rev 1974 - build 550" on CentOS x86_64 VMs on 64-bit Wintel.
Adding "?characterEncoding=utf8" (or "UTF-8") to the JDBC connection string didn't help.
Anything else we can try?
As an example, this is some content that's coming from the database
which is this in UTF-8:
ce 91 cf 81 cf 87 ce b5 ce af ce bf
But OSS is returning this:
c3 8e e2 80 98 c3 8f c2 81 c3 8f e2 80 a1 c3 8e
c2 b5 c3 8e c2 af c3 8e c2 bf
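For what it's worth, the two byte dumps above look like a double-encoding pattern: the UTF-8 bytes appear to have been decoded as a single-byte Windows-1252/latin1-style charset somewhere between the database and the index, and then re-encoded as UTF-8. The following is only a minimal Python sketch to check that hypothesis against the bytes quoted above. The helper name `decode_as_cp1252_lenient` is made up for illustration, and the sketch says nothing about which component (driver, JVM locale, or connection settings) actually performs the wrong decoding.

```python
def decode_as_cp1252_lenient(raw: bytes) -> str:
    """Decode each byte as Windows-1252, falling back to ISO-8859-1 for the
    byte values that Python's cp1252 codec leaves undefined (e.g. 0x81)."""
    chars = []
    for b in raw:
        single = bytes([b])
        try:
            chars.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: undefined bytes pass through as C1 control characters.
            chars.append(single.decode("latin-1"))
    return "".join(chars)


# Bytes as stored in the database (valid UTF-8), copied from the post above.
stored = bytes.fromhex("ce 91 cf 81 cf 87 ce b5 ce af ce bf")

# Bytes returned by OSS, copied from the post above.
returned = bytes.fromhex(
    "c3 8e e2 80 98 c3 8f c2 81 c3 8f e2 80 a1 c3 8e "
    "c2 b5 c3 8e c2 af c3 8e c2 bf"
)

# Simulate the suspected mistake: treat UTF-8 bytes as single-byte text,
# then encode the resulting characters as UTF-8 a second time.
double_encoded = decode_as_cp1252_lenient(stored).encode("utf-8")

print(stored.decode("utf-8"))      # the original text
print(double_encoded == returned)  # whether the hypothesis matches
```

If the comparison holds, the stored data itself is fine and the fix has to be on the reading side, i.e. making sure every layer that touches the bytes treats them as UTF-8.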
Same problem here. I'm using OpenSearchServer v1.5.3 - build 390f883772
and I put
but the results are still the same: characters with accents don't work.