#73 bif:search_excerpt does not handle unicode

v6.1.3
open
5
2013-01-11
2011-10-04
No

The problem is simple: passing a Unicode string to bif:search_excerpt makes the query fail:

select distinct ?r (bif:concat(bif:search_excerpt(bif:vector('constitución'), ?v))) as ?ex where {
?r nao:prefLabel ?v .
FILTER(bif:contains(?v, "'constitución'")) . }

"SR476: Function search_excerpt needs an array of VARCHAR as argument 1, not an array of NVARCHAR (225))"

There is a simple workaround which at least makes the query succeed:

select distinct ?r (bif:concat(bif:search_excerpt(bif:vector(bif:charset_recode('constitución', '_wide_', 'UTF-8')), ?v))) as ?ex where {
?r nao:prefLabel ?v .
FILTER(bif:contains(?v, "'constitución'")) . }

However, the resulting excerpt is unusable because it contains only the non-Unicode characters, which for a language such as Russian means no search excerpt at all.
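
To make the symptom concrete, here is a minimal Python sketch. It does not reproduce Virtuoso's internals; the helper `ascii_only` is a hypothetical stand-in that assumes "non-unicode chars" means plain ASCII, and it only mimics the observed effect of everything else being dropped from the excerpt.

```python
def ascii_only(text: str) -> str:
    """Keep only ASCII characters (hypothetical stand-in for the observed symptom)."""
    return "".join(ch for ch in text if ord(ch) < 128)


spanish = "constitución"
russian = "конституция"

# The accented letter takes two bytes in UTF-8, so the byte count
# differs from the character count.
print(len(spanish), len(spanish.encode("utf-8")))

# A Latin-script word loses only its accented letter...
print(ascii_only(spanish))

# ...while a Cyrillic word loses every character.
print(repr(ascii_only(russian)))
```

Under that assumption, a Spanish excerpt comes back damaged but readable, while a Russian one comes back empty, which matches the behaviour reported here.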

Discussion

  • Ivan Mikhailov - 2011-10-10

    I've fixed this (and improved excerpt generation in all cases, e.g., by preserving some delimiters such as commas). I'm not committing it right now because the function still lacks character-case and diacritic processing for multibyte Unicode characters.

     
  • Ivan Mikhailov - 2011-10-22

    A final version of the fix is on its way to the commercial release and to Virtuoso Open Source. It handles any combination of narrow / UTF-8 / wide words to highlight and narrow / UTF-8 / wide text of the document. The resulting excerpt is narrow if the document is narrow, and UTF-8 if the document is UTF-8 or wide.
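
The encoding rule stated in the last comment can be written down as a small Python sketch. The function name `excerpt_encoding` and the labels "narrow", "utf-8", and "wide" are hypothetical and chosen for illustration only; this is not Virtuoso code or its API, just the rule as described in the comment.

```python
from itertools import product

ENCODINGS = ("narrow", "utf-8", "wide")


def excerpt_encoding(word_encoding: str, document_encoding: str) -> str:
    """Return the excerpt encoding according to the rule described in the ticket.

    The encoding of the words to highlight is accepted in any form and,
    per the comment, does not influence the result.
    """
    if word_encoding not in ENCODINGS or document_encoding not in ENCODINGS:
        raise ValueError("unknown encoding label")
    # Narrow documents yield narrow excerpts; UTF-8 and wide documents yield UTF-8.
    return "narrow" if document_encoding == "narrow" else "utf-8"


# Print the result for every combination of word and document encoding.
for word_enc, doc_enc in product(ENCODINGS, repeat=2):
    print(f"words={word_enc:6} document={doc_enc:6} -> {excerpt_encoding(word_enc, doc_enc)}")
```

Read this way, only the document's encoding decides the form of the excerpt, and a wide document never produces a wide excerpt.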

     
