The database has been created like this:
The OSS crawler retrieves the MySQL content using SQL like this:
But when running queries in the OSS admin interface (and from my own PHP app), OSS is returning garbled content (if the content is non-English).
I've been trying to use CONVERT() and CAST() in the crawler SQL but nothing seems to work.
What do I need to do to get OSS to index non-English content in a MySQL database correctly?
Hello Taka,
Which language is the content you are trying to index written in?
Does MySQL return the content as UTF-8 characters?
If you are trying to index content in a specific language,
you can try setting the "Language" option in the database crawler settings and then indexing again.
For example, if your content is in French, you can set "Language" to "French".
Naveen.A.N
The content is definitely stored correctly as UTF8 in the database.
It doesn't matter what language the content is, nor what the Language option setting is - the problem is not about indexing, it's almost certainly about how OSS is interpreting the content it retrieves from the database.
Running a query for an English word returns the correct documents, but the snippets are garbled. This happens both in the OSS admin interface and my own web app:
http://awasu.com/tmp/oss.png
http://awasu.com/tmp/oss2.png
This suggests to me that OSS has incorrectly retrieved the content from the database.
Hi Taka,
You're right. It is probably related to the use of the BLOB field type. I will run some tests and give you feedback later today.
Sorry, I just realized I pasted the wrong DDL - the table column being indexed by OSS is a TEXT column. But the problem remains.
In general, how does OSS know what encoding to use when interpreting the data being indexed?
Regarding encoding, OSS depends on the setup of your system environment (locales) and on the setup of the database.
OSS uses the JDBC driver. With OSS 1.4 we provide the MySQL JDBC driver (5.1.22).
http://dev.mysql.com/downloads/connector/j/
To avoid encoding issues, the best approach is to make sure UTF-8 is used everywhere. In most cases, if the database is already encoded in UTF-8, nothing else has to be done.
Can you check the status of the JVM? In the OpenSearchServer user interface, have a look at the Runtime/System/Properties tab and check these encoding-related variables:
file.encoding: UTF-8
sun.io.unicode.encoding: UnicodeBig
sun.jnu.encoding: US-ASCII
You may also check this page to see how to explicitly force the charset when connecting to MySQL:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
Current settings are:
file.encoding UTF-8
sun.jnu.encoding UTF-8
sun.io.unicode.encoding UnicodeLittle
Also:
java.version 1.7.0_17-b02
file.encoding.pkg sun.io
os.arch amd64
MySQL is "Ver 14.14 Distrib 5.6.10, for Linux (x86_64)"
Running "OpenSearchServer v1.3.1 - stable - rev 1974 - build 550" on CentOS x86_64 VMs on 64-bit Wintel.
Adding "?characterEncoding=utf8" (or "UTF-8") to the JDBC connection string didn't help.
Last edit: Taka Muraoka 2013-04-16
Anything else we can try?
As an example, this is some content that's coming from the database
Αρχείο
which is this in UTF-8:
ce 91 cf 81 cf 87 ce b5 ce af ce bf
But OSS is returning this:
c3 8e e2 80 98 c3 8f c2 81 c3 8f e2 80 a1 c3 8e
c2 b5 c3 8e c2 af c3 8e c2 bf
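The two byte dumps above are consistent with classic double encoding: the UTF-8 bytes are read one byte at a time as a single-byte Western European charset, and the resulting characters are then encoded as UTF-8 a second time. The following is a minimal Python sketch of that hypothesis, not a statement about what OSS or the JDBC driver actually does internally. It assumes the single-byte charset is cp1252 with the five bytes that cp1252 leaves undefined passed through as the same code point, which is how MySQL documents its latin1 charset. The function names are made up for illustration.

```python
def misread_as_mysql_latin1(raw: bytes) -> str:
    """Decode each byte as cp1252, passing undefined bytes through unchanged.

    Assumption: this mirrors MySQL's latin1, which maps the bytes cp1252
    leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d) to the same code point.
    """
    chars = []
    for byte in raw:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


def undo_misread(garbled: str) -> str:
    """Reverse misread_as_mysql_latin1 and decode the bytes as UTF-8."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # not something a single-byte misread could produce
            raw.append(ord(ch))
    return raw.decode("utf-8")


original = "Αρχείο"
stored = original.encode("utf-8")          # bytes as stored in the database
garbled = misread_as_mysql_latin1(stored)  # text after the hypothetical misread

print(stored.hex(" "))                     # correct UTF-8 bytes
print(garbled.encode("utf-8").hex(" "))    # bytes after the second encoding
print(undo_misread(garbled))               # round trip back to the original
```

Under these assumptions the second printed line reproduces the garbled byte sequence quoted above exactly, which suggests that somewhere between the table and the index the content is being treated as latin1 rather than UTF-8. The sketch does not show where in the chain (column charset, connection charset, or driver settings) that happens.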
Hello,
I have the same problem. I'm using OpenSearchServer v1.5.3 - build 390f883772
and I set
jdbc:mysql://nameserver:3306/database?useUnicode=true&characterEncoding=UTF-8
but the results are still the same: accented characters don't work.