Can't get OSS to properly index UTF-8 content from a MySQL database

2013-04-15
2014-06-24
  • Taka Muraoka

    Taka Muraoka - 2013-04-15

    The database has been created like this:

    CREATE DATABASE test
        DEFAULT CHARACTER SET utf8
        DEFAULT COLLATE utf8_general_ci ;
    SET NAMES 'utf8' ;
    
    CREATE TABLE my_table
    (
        content BLOB NOT NULL
    ) ENGINE innodb ;
    

    The OSS crawler retrieves the MySQL content using SQL like this:

    SELECT content FROM my_table ;
    

    But when running queries in the OSS admin interface (and from my own PHP app), OSS is returning garbled content (if the content is non-English).

    I've been trying to use CONVERT() and CAST() in the crawler SQL but nothing seems to work.

    What do I need to do to get OSS to index non-English content in a MySQL database correctly?

     
  • Naveen A.N

    Naveen A.N - 2013-04-15

    Hello Taka,

    Which language is the content you are trying to index written in?

    Does MySQL return the content as UTF-8 characters?

    If you are indexing content in a particular language, you can try setting the "Language" option in the database crawler settings and indexing again.
    For example, if your content is in French, you can set "Language" to "French".

    Naveen.A.N

     
  • Taka Muraoka

    Taka Muraoka - 2013-04-15

    The content is definitely stored correctly as UTF8 in the database.

    It doesn't matter what language the content is, nor what the Language option setting is - the problem is not about indexing, it's almost certainly about how OSS is interpreting the content it retrieves from the database.

    Running a query for an English word returns the correct documents, but the snippets are garbled. This happens both in the OSS admin interface and my own web app:
    http://awasu.com/tmp/oss.png
    http://awasu.com/tmp/oss2.png
    This suggests to me that OSS has incorrectly retrieved the content from the database.

     
  • Emmanuel Keller

    Emmanuel Keller - 2013-04-16

    Hi Taka,

    You're right. It is probably related to the use of the BLOB field type. I will run some tests and give you feedback later today.

     
  • Taka Muraoka

    Taka Muraoka - 2013-04-16

    Sorry, I just realized I pasted the wrong DDL - the table column being indexed by OSS is a TEXT column. But the problem remains.

    In general, how does OSS know what encoding to use when interpreting the data being indexed?

     
  • Emmanuel Keller

    Emmanuel Keller - 2013-04-16

    Regarding encoding, OSS depends on the setup of your system environment (locales) and on the setup of the database.

    OSS uses the JDBC driver. With OSS 1.4 we provide the MySQL JDBC driver (5.1.22).
    http://dev.mysql.com/downloads/connector/j/

    To avoid encoding issues, the best approach is to make sure UTF-8 is used everywhere. In most cases, if the database is already encoded in UTF-8, nothing has to be done.

    Can you check the status of the JVM? In the OpenSearchServer user interface, look at the Runtime/System/Properties tab panel and check the variables related to encoding:
    file.encoding: UTF-8
    sun.io.unicode.encoding: UnicodeBig
    sun.jnu.encoding: US-ASCII

    You may also check this page to see how to explicitly force the charset when connecting to MySQL:
    http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html

     
  • Taka Muraoka

    Taka Muraoka - 2013-04-16

    Current settings are:
    file.encoding UTF-8
    sun.jnu.encoding UTF-8
    sun.io.unicode.encoding UnicodeLittle

    Also:
    java.version 1.7.0_17-b02
    file.encoding.pkg sun.io
    os.arch amd64

    MySQL is "Ver 14.14 Distrib 5.6.10, for Linux (x86_64)"

    Running "OpenSearchServer v1.3.1 - stable - rev 1974 - build 550" on CentOS x86_64 VMs on 64-bit Wintel.

    Adding "?characterEncoding=utf8" (or "UTF-8") to the JDBC connection string didn't help either.

     
    Last edit: Taka Muraoka 2013-04-16
  • Taka Muraoka

    Taka Muraoka - 2013-04-25

    Anything else we can try?

    As an example, this is some content that's coming from the database
    Αρχείο
    which is this in UTF-8:
    ce 91 cf 81 cf 87 ce b5 ce af ce bf

    But OSS is returning this:
    c3 8e e2 80 98 c3 8f c2 81 c3 8f e2 80 a1 c3 8e
    c2 b5 c3 8e c2 af c3 8e c2 bf
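
The two byte dumps above are consistent with classic double encoding: the UTF-8 bytes look as if they were decoded with a single-byte Windows-1252-style charset and then encoded to UTF-8 a second time. The Python sketch below reproduces the reported bytes under that assumption and reverses the damage. The helper names are hypothetical, and the handling of the five bytes that cp1252 leaves undefined (mapped to the control characters with the same code point, which is how MySQL documents its latin1 charset) is an assumption. The thread does not establish which component performs the wrong decoding.

```python
# Bytes that Windows-1252 leaves undefined. Assumption: the layer doing the
# wrong decoding maps them to the control characters with the same code point.
CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_latin1_like(data: bytes) -> str:
    """Decode bytes one at a time as cp1252, passing undefined bytes through."""
    return "".join(
        chr(b) if b in CP1252_UNDEFINED else bytes([b]).decode("cp1252")
        for b in data
    )


def encode_latin1_like(text: str) -> bytes:
    """Inverse of decode_latin1_like()."""
    out = bytearray()
    for ch in text:
        if ord(ch) in CP1252_UNDEFINED:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return bytes(out)


original = "Αρχείο"
utf8 = original.encode("utf-8")

# Simulate the suspected fault: UTF-8 bytes misread as single-byte text,
# then written out as UTF-8 again.
garbled = decode_latin1_like(utf8).encode("utf-8")

# The bytes OSS returned, copied from the post above.
reported = bytes.fromhex(
    "c3 8e e2 80 98 c3 8f c2 81 c3 8f e2 80 a1 c3 8e"
    "c2 b5 c3 8e c2 af c3 8e c2 bf"
)

# Undo the double encoding to recover the original text.
repaired = encode_latin1_like(garbled.decode("utf-8")).decode("utf-8")

print(utf8.hex(" "))
print(garbled.hex(" "))
print(garbled == reported, repaired == original)
```

If the simulated bytes match the reported ones, the content is probably being read with a latin1/cp1252 interpretation somewhere between MySQL and the index, rather than being corrupted in storage.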

     
  • Milhau

    Milhau - 2014-06-24

    Hello,
    I have the same problem. I'm using OpenSearchServer v1.5.3 - build 390f883772
    and I set the connection string to
    jdbc:mysql://nameserver:3306/database?useUnicode=true&characterEncoding=UTF-8
    but the results are still the same: accented characters don't work.

     
