[groonga-talk:321] [ANN] Groonga 6.0.3

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi, 

Groonga 6.0.3 has been released!
  http://groonga.org/docs/news.html#release-6-0-3

Install: http://groonga.org/docs/install.html
Charcteristics:  http://groonga.org/docs/characteristic.html

The topics in this release:

  * Supported command version 3
  * Improved performance about between() without indexes
  * Added number and time plugins
  * Supported window function

This release includes some bug fixes, so please upgrade Groonga!

# Topics

## Supported command version 3

In this release, Groonga supports command version 3.

Groonga command supports revision by command version.

In the previous versions, output format is something like:

    > status
    [
      [0,1464342851.826973,0.0002036094665527344],
      {
        "alloc_count":273,
        "starttime":1464342849,
        "start_time":1464342849,
        "uptime":2,
        "version":"6.0.3",
        "n_queries":0,
        "cache_hit_rate":0.0,
        "command_version":1,
        "default_command_version":1,
        "max_command_version":3
      }
    ]

If you specify --comand_version 3, the format of response is changed
to object literal:

    > status --command_version 3
    {
      "header":{
        "return_code":0,
        "start_time":1464342859.177147,
        "elapsed_time":0.0001118183135986328
      },
      "body":{
        "alloc_count":276,
        "starttime":1464342849,
        "start_time":1464342849,
        "uptime":10,
        "version":"6.0.3",
        "n_queries":0,
        "cache_hit_rate":0.0,
        "command_version":3,
        "default_command_version":1,
        "max_command_version":3
      }
    }

It makes JavaScript friendly.

## Improved performance about between() without indexes

In this release, performance of between() without indexes is improved.

Here is the schema for benchmark.

table_create Entries TABLE_NO_KEY
column_create Entries rank COLUMN_SCALAR Int32

In the above sample, we expects that sequential search is used.

There are four test cases. The number of records are different.

  * 1,000
  * 10,000
  * 100,000
  * 1,000,000

The sequential number is assigned as rank.

During benchmark execution, we applied between() for the following cases.

  * 1,000 (select the number of rank is greater than 500, less than 600)
  * 10,000 (select the number of rank is greater than 5,000, less than 5,100)
  * 100,000 (select the number of rank is greater than 50,000, less than 50,100)
  * 1,000,000 (select the number of rank is greater than 500,000, less than 500,100)

Here is the query which is executed.

  select Entries --cache no --filter 'between(rank, 500, "exclude", 600, "include")'

Here is the benchmark result. Note that median is used.

  * 1,000 4.3ms -> 0.2590ms
  * 10,000 39.5ms -> 0.6440ms
  * 100,000 390ms -> 5.1ms
  * 1,000,000 3798.3ms -> 44.2ms

Here is the logarithmic graph. There is a great performance improvement.

  http://groonga.org/images/blog/ja/2016-05-29-groonga-between-optimization/benchmark-between-optimization.png

## Added number and time plugins

In this release, new plugins are added - time and number plugin.

It helps you to classify specific range of value as same value.

number plugin provides number_classify function.

Let's classify price every 100 value.

Here is the sample schema and data.

  plugin_register functions/number
  table_create Prices TABLE_PAT_KEY Int32

  load --table Prices
  [
  {"_key": 0},
  {"_key": 1},
  {"_key": 99},
  {"_key": 100},
  {"_key": 101},
  {"_key": 199},
  {"_key": 200},
  {"_key": 201}
  ]

Use the following query which uses number_classify:

  select Prices --sortby _id --limit -1 --output_columns '_key, number_classify(_key, 100)'

number_classify(_key, 100) classifies the value by _key and classifies
every 100.

  [
    [
      [8],
      [
        ["_key","Int32"],
        ["number_classify",null]
      ],
      [0,0],
      [1,0],
      [99,0],
      [100,100],
      [101,100],
      [199,100],
      [200,200],
      [201,200]
    ]
  ]

As every 100, data between 100 and 199 are treated as same range of
price.

time plugin provides similar functionality about timestamp. time
plugin provides the following functions:

  * time_classify_second
  * time_classify_minute
  * time_classify_hour
  * time_classify_day
  * time_classify_week
  * time_classify_month
  * time_classify_year

Next, classify timestamps every 10 minutes, here is the sample schema
and data.

  plugin_register functions/time
  table_create Timestamps TABLE_PAT_KEY Time
  load --table Timestamps
  [
  {"_key": "2016-05-05 22:29:59.999999"},
  {"_key": "2016-05-05 22:30:00.000000"},
  {"_key": "2016-05-05 22:30:00.000001"},
  {"_key": "2016-05-05 22:39:59.999999"},
  {"_key": "2016-05-05 22:40:00.000000"},
  {"_key": "2016-05-05 22:40:00.000001"}
  ]

To classify every minute, use time_classify_minute function.

  select Timestamps --sortby _id --limit -1 --output_columns '_key, time_classify_minute(_key, 10)'

Here is the result:

  [
    [
      [6],
      [
        ["_key","Time"],
        ["time_classify_minute",null]
      ],
      [1462454999.999999,1462454400.0],
      [1462455000.0,1462455000.0],
      [1462455000.000001,1462455000.0],
      [1462455599.999999,1462455000.0],
      [1462455600.0,1462455600.0],
      [1462455600.000001,1462455600.0]
    ]
  ]

The time interval is 10 minutes, data between 22:30 and 22:39 are
treated as same range of timestamp (1462455000.0).

## Supported window function

Supported window function

In this release, window function is supported.

For example, you can use record_number which returns nth number of record.

Here is the sample schema and data.

  table_create Items TABLE_HASH_KEY ShortText
  column_create Items price COLUMN_SCALAR UInt32

  load --table Items
  [
  {"_key": "item1", "price": 666},
  {"_key": "item2", "price": 999},
  {"_key": "item3", "price": 777},
  {"_key": "item4", "price": 111},
  {"_key": "item5", "price": 333},
  {"_key": "item6", "price": 222}
  ]

Let's search price data with nth number of record by ascending
order. Here is the query for it.

  select Items \
    --columns[nth_record].stage initial \
    --columns[nth_record].value 'record_number()' \

The important point is that the value of record_number() is assigned
to dynamic column nth_record.

  [
    [
      [6],
      [
        ["_key","ShortText"],
        ["price","UInt32"],
        ["nth_record","UInt32"]
      ],
      ["item1",666,4],
      ["item2",999,6],
      ["item3",777,5],
      ["item4",111,1],
      ["item5",333,3],
      ["item6",222,2]
    ]
  ]

It is obvious that each record is nth number of record.

## Improvements

  * [experimental] Added GRN_II_OVERLAP_TOKEN_SKIP_ENABLE and
        GRN_NGRAM_TOKENIZER_REMOVE_BLANK_DISABLE environment variables
        to improve performance of N-gram tokenizer. [GitHub#533][Patch
        by Naoya Murakami]

  * [table_create] Stopped to ignore nonexistent default tokenizer,
    normalizer or token filters. In the previous versions, Groonga
    ignored a typo in --default_tokenizer, --normalizer or
    --token_filters parameter silently. It caused a delay in finding
    problems.

  * [select] output_columns v1: Supported expression such as
     snippet_html(...) in output_columns.

  * [select] Removed a limitation about the number of labeled
    drilldowns. In the previous versions, the number of max labeled
    drilldowns is limited to 10.

  * [number_classify] Added a number plugin. Use number_classify
    function to classify by value.

  * Added a time plugin. Use time_classify_second,
    time_classify_minute, time_classify_hour, time_classify_day,
    time_classify_week, time_classify_month, time_classify_year
    function to classify by value.

  * [select] Supported dynamic column creation which is used in
    output_columns, drilldown or sortby
    [GitHub#539,#541,#542,#544,#545][Patch by Naoya Murakami]:

    select \
      --columns[LABEL].stage filtered \
      --columns[LABEL].type ShortText \
      --columns[LABEL].flags COLUMN_SCALAR \
      --columns[LABEL].value 'script syntax expression' \
      ...

  * [experimental][select] Improved performance for range/equal search
    and enough filtered case. Set
    GRN_TABLE_SELECT_ENOUGH_FILTERED_RATIO environment variable to
    enable this feature.

  * [select] Supported index used search for filtered tables.

  * Supported to detect changed database isn't closed. This feature is
    useful to check database corruption when Groonga is crashed
    unexpectedly.

  * [grndb] Supported detecting database wasn't closed successfully
    case.

  * Added --drilldown_filter.

  * Supported filter in labeled drilldown.

  * Improved performance for using [between] without index. By
    between() optimization, there is a case that range search is 100x
    faster than the previous version of between().

  * [record_number] Supported window function.

  * [experimental][select] Supported --slices.

  * [select] Deprecated --sortby and --drilldown_sortby. Use
    --sort_keys and -drilldown_sort_keys instead.

  * [select] Deprecated --drilldown[...]. Use --drilldowns[...]
     instead.

  * Added [Command version] 3. It uses object literal based envelope.

  * [groonga-httpd] Updated bundled nginx version to 1.11.0.

## Fixes

  * [select] output_columns v2: Fixed a bug that * isn't expand to
     columns correctly.

  * Fixed a bug that 1usec information is lost for time value.

  * Fixed a crash bug when a mruby plugin is initialized with multiple
    threads.

  * Fixed a bug that static indexing crashes if a posting list is very
    long. This bug may occur against enormous size of
    database. [GitHub#546]

## Thanks

  * Naoya Murakami

-- 
Kentaro Hayashi <ha...@cl...>

[groonga-talk:321] [ANN] Groonga 6.0.3

an embeddable full-text search engine library

[groonga-talk:321] [ANN] Groonga 6.0.3