Mecab
mysqlftppc mecab plugin
The mecab plugin tokenizes a character sequence with the mecab library. How tokens are extracted depends on the mecab dictionary, so it is important to choose a dictionary suited to what you want to do. Several good dictionaries exist for natural Japanese text; naist-jdic is recommended. If you use Mac OS X, you might find a Korean dictionary. Note that mysqlftppc automatically transcodes the character encoding if necessary. If you hit a performance issue, make sure that the charset of the table and that of the mecab dictionary are the same.
After you install the plugin with INSTALL PLUGIN, you can check the status of the plugin as follows:
mysql> SHOW STATUS LIKE "Mecab_info";
+---------------+---------------------------------------+
| Variable_name | Value                                 |
+---------------+---------------------------------------+
| Mecab_info    | with mecab 0.97, ICU 4.0(Unicode 5.1) |
+---------------+---------------------------------------+
1 row in set (0.00 sec)
Several system variables control the plugin's behavior. With an ICU-enabled mecab plugin, you'll see the following variables:
mysql> SHOW VARIABLES LIKE "mecab%";
+-----------------------+---------+
| Variable_name         | Value   |
+-----------------------+---------+
| mecab_dicdir          |         |
| mecab_normalization   | OFF     |
| mecab_unicode_version | DEFAULT |
| mecab_userdic         |         |
+-----------------------+---------+
5 rows in set (0.00 sec)
Unicode normalization
You can improve query results by applying Unicode normalization, which is provided by the ICU library linked with the plugin. Note that normalization is applied to the index, not to the table data itself. If you build the plugin with ICU, two system variables are available: mecab_normalization and mecab_unicode_version. mecab_normalization is OFF by default, meaning that normalization is not performed.
To enable normalization, set the mecab_normalization variable as follows:
SET GLOBAL mecab_normalization="KC";
The argument is one of OFF, C, D, KC, and KD; the last four are the Unicode normalization forms defined in UAX #15. If your strings are based on Unicode 3.2, mecab_unicode_version might be useful: setting it makes normalization follow Unicode 3.2.
SET GLOBAL mecab_unicode_version="3.2";
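To get a feel for what the normalization forms listed above do to text, the following sketch may help. It uses Python's standard unicodedata module rather than the ICU library linked with the plugin, so it only illustrates the UAX #15 forms and is not the plugin's own code path. The mapping from the setting values (C, D, KC, KD) to the form names (NFC, NFD, NFKC, NFKD) and the helper name normalize_like_setting are assumptions made for this illustration.

```python
import unicodedata

# Assumed mapping from the plugin's setting values to UAX #15 form names.
FORMS = {"C": "NFC", "D": "NFD", "KC": "NFKC", "KD": "NFKD"}


def normalize_like_setting(setting: str, text: str) -> str:
    """Normalize text according to a setting value; OFF leaves it untouched."""
    if setting == "OFF":
        return text
    return unicodedata.normalize(FORMS[setting], text)


# Full-width Latin letters, then half-width katakana KA plus a voiced sound mark.
sample = "ＭｅＣａｂ ｶﾞ"

for setting in ("OFF", "C", "D", "KC", "KD"):
    result = normalize_like_setting(setting, sample)
    # Print code points so that combining characters stay visible.
    print(setting, [f"U+{ord(ch):04X}" for ch in result])
```

With the compatibility forms (KC, KD), full-width letters fold to their ASCII counterparts and half-width katakana fold to full-width katakana, so differently typed variants of the same word end up with the same normalized form. The canonical forms (C, D) leave these compatibility variants unchanged.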
Mecab & UTF-8 dictionary
You have to install mecab with UTF-8 support. Using the --enable-utf8-only configure option is recommended, and don't forget to install a dictionary in UTF-8. By default, mecab uses the dictionary configured in the mecabrc file. To use a different dictionary, set the mecab_dicdir and mecab_userdic system variables. They correspond to the dicdir and userdic options of the mecab command line.
SET GLOBAL mecab_dicdir="/path/to/dicdir";
SET GLOBAL mecab_userdic="/path/to/userdic";
Boolean mode syntax
Most of the rules are the same as those of the built-in parser.
You can escape a special character by prepending a backslash. To use a backslash as a literal character, escape it with another backslash. Be aware that backslashes may also be interpreted by the surrounding layers, for example when a query passes from the SQL command line to the SQL server, or from a PHP string literal through the PHP interpreter.
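The way backslashes multiply across layers can be sketched as follows. This is an illustration only: the set of special characters is an assumption (the boolean operators of the built-in MySQL parser plus the backslash itself), and both helper names are hypothetical. The SQL quoting step assumes the server's default mode, in which a backslash inside a string literal is an escape character.

```python
# Characters assumed to be special in boolean mode (hypothetical set).
BOOLEAN_SPECIALS = set('+-<>()~*"@\\')


def escape_boolean_term(term: str) -> str:
    """Layer 1: prepend a backslash to each special character of the term."""
    return "".join("\\" + ch if ch in BOOLEAN_SPECIALS else ch for ch in term)


def to_sql_literal(value: str) -> str:
    """Layer 2: quote the value as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


for term in ("C++", "\\"):
    layer1 = escape_boolean_term(term)
    layer2 = to_sql_literal(layer1)
    # Show how many characters each layer adds.
    print(f"term={term} boolean={layer1} sql={layer2}")
```

A single literal backslash therefore becomes two backslashes for the boolean mode parser and four inside the SQL string literal. Passing the query as a bound parameter from the client library avoids the second layer of escaping.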
Tokens in a boolean mode query are used as-is, without being tokenized. If what you entered is a phrase, i.e., you want it to be tokenized with mecab, place it in a phrase (enclose it in double quotes).
You can perform a phrase query as follows:
CREATE TABLE me (c TEXT, FULLTEXT(c) WITH PARSER mecab);
INSERT INTO me VALUES("今日の天気は晴れです。");
SELECT * FROM me WHERE MATCH(c) AGAINST('+"今日の天気"' IN BOOLEAN MODE);
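When the search text comes from an application, the boolean mode string for a phrase query like the one above could be assembled as in this sketch. The helper name build_phrase_query is hypothetical, and the function rejects phrases that contain a double quote rather than guessing how the plugin treats escapes inside a phrase.

```python
def build_phrase_query(phrases, required=True):
    """Build a boolean mode query string in which every item is a quoted phrase."""
    prefix = "+" if required else ""
    parts = []
    for phrase in phrases:
        if '"' in phrase:
            # Escaping inside a phrase is not covered here, so refuse instead.
            raise ValueError("phrase must not contain a double quote")
        parts.append(f'{prefix}"{phrase}"')
    return " ".join(parts)


query = build_phrase_query(["今日の天気"])
print(query)  # +"今日の天気"
```

The resulting string is what goes inside AGAINST(... IN BOOLEAN MODE); each quoted phrase is then tokenized by mecab instead of being matched as a single token.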
