Bigram

From mysqlftppc

mysqlftppc bigram plugin

The bigram plugin tokenizes a character sequence into a stream of bigram (2-gram) tokens. Note that it uses character n-grams, not word n-grams. Before using it, you should understand how character bigram tokens behave:

  • You cannot search for fewer than 2 characters, i.e., a single character. For example, you cannot find the strings that contain "a" by searching for "a" alone.
  • The index works well when the text draws on a wide variety of characters. For example, Japanese strings give better index cardinality than English strings, because Japanese commonly uses about 1945 distinct characters.
  • Always use a phrase query, and always query IN BOOLEAN MODE. Strictly speaking, a phrase query here is a phrase query over bigram tokens. A natural language mode query does not make sense with the bigram plugin.
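The points above follow from how character bigrams are formed. Here is a minimal Python sketch of the idea, not the plugin's actual code; the function name `char_bigrams` is made up for illustration, and treating spaces as ordinary characters is an assumption of the sketch.

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of text, in order."""
    # Assumption of this sketch: every character, including a space,
    # is treated as an ordinary character.
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(char_bigrams("ab def"))  # ['ab', 'b ', ' d', 'de', 'ef']
print(char_bigrams("a"))       # [] : a single character yields no token
print(char_bigrams("東京都"))   # ['東京', '京都']
```

A one-character string produces no token at all, which is why a single character cannot be searched.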

After you install the plugin with INSTALL PLUGIN, you can check its status as follows:

mysql> SHOW STATUS LIKE "Bigram_info";
+---------------+---------------------------+
| Variable_name | Value                     |
+---------------+---------------------------+
| Bigram_info   | with ICU 4.0(Unicode 5.1) |
+---------------+---------------------------+
1 row in set (0.00 sec)

Unicode Normalization

You can improve query results by applying Unicode normalization, which is provided by the ICU library linked with the plugin. Note that normalization is applied to the index, not to the table data itself. If you build the plugin with ICU, two system variables are available: bigram_normalization and bigram_unicode_version. bigram_normalization is OFF by default, meaning that no normalization is performed.

mysql> SHOW VARIABLES LIKE "bigram_%";
+------------------------+---------+
| Variable_name          | Value   |
+------------------------+---------+
| bigram_normalization   | OFF     |
| bigram_unicode_version | DEFAULT |
+------------------------+---------+
2 rows in set (0.01 sec)

To enable normalization, set the bigram_normalization variable as follows:

SET GLOBAL bigram_normalization="KC";

The value is one of OFF, C, D, KC, or KD; the last four are the Unicode normalization forms (UAX #15). If your strings are based on Unicode 3.2, bigram_unicode_version might be useful: setting it to 3.2 makes normalization follow Unicode 3.2.

SET GLOBAL bigram_unicode_version="3.2";
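To get a feel for what the normalization forms do to text before it is indexed, here is a small Python sketch. It uses Python's standard unicodedata module rather than ICU, so it only illustrates the UAX #15 forms; the mapping from the plugin's setting names to form names and the helper `normalize` are assumptions made for this example.

```python
import unicodedata

# Assumed mapping from the plugin's setting values to UAX #15 form names.
FORMS = {"C": "NFC", "D": "NFD", "KC": "NFKC", "KD": "NFKD"}


def normalize(text, setting):
    """Apply the normalization form named by setting; OFF leaves text as is."""
    if setting == "OFF":
        return text
    return unicodedata.normalize(FORMS[setting], text)


fullwidth = "ＡＢＣ"  # full-width Latin letters, common in Japanese text

print(normalize(fullwidth, "OFF") == "ABC")  # False
print(normalize(fullwidth, "C") == "ABC")    # False: C keeps compatibility characters
print(normalize(fullwidth, "KC") == "ABC")   # True: KC folds them to ASCII

# Python also ships a Unicode 3.2 database, loosely comparable in spirit
# to normalizing under bigram_unicode_version="3.2".
print(unicodedata.ucd_3_2_0.normalize("NFKC", fullwidth) == "ABC")  # True
```

With KC, a full-width and a half-width spelling of the same word produce the same bigrams, so either spelling can find the other.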

Boolean mode syntax

Most of the rules are the same as with the built-in parser. There are some changes due to the bigram algorithm.

You can escape a special character by prepending a backslash. If you want to use a backslash as a literal character, escape it with another backslash. Be aware that backslashes may also be evaluated by the surrounding layers before they reach the parser, for example when the SQL command line passes a string literal to the SQL server, or when the PHP interpreter evaluates a PHP string.

mysql> CREATE TABLE bi (c TEXT, FULLTEXT(c) WITH PARSER bigram);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO bi VALUES("ab d +ef");
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM bi WHERE MATCH(c) AGAINST('+b\\ d' IN BOOLEAN MODE);
+----------+
| c        |
+----------+
| ab d +ef |
+----------+
1 row in set (0.00 sec)

mysql> SELECT * FROM bi WHERE MATCH(c) AGAINST('+d\\ \\+e' IN BOOLEAN MODE);
+----------+
| c        |
+----------+
| ab d +ef |
+----------+
1 row in set (0.00 sec)
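The doubled backslashes in the queries above exist because the SQL string literal is unescaped once before the parser sees it. The following Python sketch approximates that layer. It is a simplification: `mysql_literal_unescape` is a hypothetical helper, it handles only a few escapes, and it assumes the NO_BACKSLASH_ESCAPES SQL mode is off.

```python
def mysql_literal_unescape(body):
    """Approximate how MySQL reads the body of a quoted string literal.

    Simplified sketch: only a handful of escapes are handled; special
    cases such as \\% and \\_ are not modeled.
    """
    special = {"n": "\n", "t": "\t", "0": "\0", "\\": "\\", "'": "'", '"': '"'}
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # For an unrecognized escape the backslash is dropped and the
            # following character is kept unchanged.
            out.append(special.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Raw strings, so Python itself leaves the backslashes alone.
print(mysql_literal_unescape(r"+b\\ d"))  # +b\ d : the parser sees an escaped space
print(mysql_literal_unescape(r"+b\ d"))   # +b d  : the backslash never arrives

# A host language adds one more layer: in a normal (non-raw) Python
# string, four backslashes are needed to send two to the server.
print("+b\\\\ d" == r"+b\\ d")  # True
```

The same reasoning applies to any client language that has its own string escaping.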

The double-quote character " creates a single plain string. You can include white space or backslashes in it. With the bigram parser, a query is always a phrase query against the bigram token stream.

mysql> CREATE TABLE bi (c TEXT, FULLTEXT(c) WITH PARSER bigram);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO bi VALUES("ab def");
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM bi WHERE MATCH(c) AGAINST('+"b d"' IN BOOLEAN MODE);
+--------+
| c      |
+--------+
| ab def |
+--------+
1 row in set (0.00 sec)
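One way to picture a phrase query against a bigram token stream: the phrase is split into bigrams, and those bigrams must appear consecutively, in the same order, in the document's bigram stream. This is a conceptual Python sketch, not the plugin's implementation; `char_bigrams` and `phrase_match` are hypothetical names, and spaces are assumed to be ordinary characters.

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_match(document, phrase):
    """True if the phrase's bigrams occur consecutively in the document."""
    doc = char_bigrams(document)
    want = char_bigrams(phrase)
    if not want:
        return False  # fewer than 2 characters: there is no token to look up
    n = len(want)
    return any(doc[i:i + n] == want for i in range(len(doc) - n + 1))


print(phrase_match("ab def", "b d"))  # True:  'b ' then ' d' are adjacent
print(phrase_match("ab def", "bd"))   # False: no 'bd' token in the document
print(phrase_match("ab def", "a"))    # False: a single character has no bigram
```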

The truncation operator * is disabled in the bigram parser. Indexed tokens are always exactly 2 characters long, so truncation could only work with a string shorter than 2 characters. Because that behavior would be confusing, the operator is disabled.