Thread: [Fez-developers] Boolean searching / Fulltext indexing


[Fez-developers] Boolean searching / Fulltext indexing

From: Jauslin K. <kai...@li...> - 2007-11-19 18:47:39

Hello Fez developers,

The current trunk version does not allow boolean searching (AND/OR/NOT)
in advanced search (e.g. to get all documents from author1 OR author2).
Is there a plan for implementation?

What about the integration of our MySQL fulltext indexing into the
trunk? It runs very well over here, with good performance (our fulltext
table currently has 11'000 entries). I also implemented highlighting of
query words in fulltext extracts and a document preview for the browse
lists. This also runs without much of a performance penalty.

The only thing that is not so easily possible with the current
implementation is the combination of advanced, field based search with
complex boolean fulltext search. I think for this functionality, another
indexing approach (e.g. using Zend Lucene) should be used.

Could you please tell me what you think about these problems and whether
there is anything going on in that direction?

Cheers from Zurich, Kai

-- 
Kai Jauslin, Dipl. Informatik-Ing. ETH, ETH Zürich, ETH-Bibliothek,
Rämistrasse 101, CH-8092 Zürich
kai...@li..., Tel +41-44-6324972, Büro STB F19

Re: [Fez-developers] Boolean searching / Fulltext indexing

From: Christiaan K. <c.k...@li...> - 2007-11-19 22:38:48

Hello Kai

I am currently in Paris having a week off for holidays after the SUN PASIG
meeting. I'll be back in Australia next week.

Have you seen my presentation slides? They talk about the fez 2 release and
what will be in fez 2.1:

http://espace.library.uq.edu.au/view.php?pid=UQ:119976

We are certainly going to bring your fulltext code into the Fez trunk, as
soon as we can - possibly in the next couple of weeks.

I am looking very seriously into PostgreSQL. It provides a much more
powerful fulltext search engine called 'TSearch2'. There is also a very
nice PHP fulltext query parser for TSearch2 from Digital Stratum, which
handles google-style and/not/or searching with brackets ():
http://digitalstratum.com/oss/fts_parser

However we will continue to support mysql as an equal option for the fez
index RDBMS. We could probably adapt the digital stratum parser to create
sql code for mysql as well as postgresql.
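
To illustrate the idea of one parsed query feeding both databases, here is a minimal sketch in Python. It is not the Digital Stratum parser and handles only required and excluded terms (no brackets, no OR). The function names and the input format (a list of terms, with a leading "-" marking an excluded term) are assumptions made for this example. The output strings use MySQL's boolean-mode operators (+, -) and PostgreSQL's tsquery operators (&, !).

```python
def to_mysql_boolean(terms):
    """Build a string for MATCH ... AGAINST (... IN BOOLEAN MODE).

    Hypothetical helper: a term starting with "-" is excluded,
    every other term is marked as required with "+".
    """
    return " ".join(t if t.startswith("-") else "+" + t for t in terms)


def to_tsquery(terms):
    """Build a string for PostgreSQL's to_tsquery().

    Hypothetical helper: terms are ANDed with "&", excluded terms
    are negated with "!".
    """
    return " & ".join("!" + t[1:] if t.startswith("-") else t for t in terms)


parsed = ["sky", "-sun"]
mysql_query = to_mysql_boolean(parsed)
pg_query = to_tsquery(parsed)
```

A real parser would also need to escape the terms and bind the resulting string as a query parameter rather than concatenating it into SQL.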

Cheers,
Christiaan 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Christiaan Kortekaas
Senior Library Open Sorcerer
Library Technology Service
The University of Queensland, Australia QLD 4072
Telephone : (+61) (7) 3346 4337
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: [Fez-developers] Boolean searching / Fulltext indexing

From: Kai J. <kai...@li...> - 2007-11-21 18:44:10

Hi Christiaan,

thanks for your answer. Nice slides... where can I get FezTube? :-)

As for the search engine: I had a look at the digital stratum filter, 
but was a little bit disappointed. We would have to adjust it to support 
UTF-8 for internationalization, and wildcards (* and ?). A rebuild using 
a phplexer/parser generator might be faster and more promising in the 
long term.

The main problem I see, however, is that structured and unstructured 
(fulltext) search cannot be combined with the current Fez2 search key 
structure (using the MySQL indexing for fulltext). Example: return all 
documents matching "+sky -sun". Combining the search key tables 
(attachment and title, for example) using unions may return a document 
that has the title 'The blue sky' and the fulltext "the sun was high in 
the sky". Reason: the boolean search 'rek_title like "sky" and not 
rek_title like "sun"' returns the row, and the UNION with the fulltext 
search (which ignores the document because of the occurrence of "sun") 
still keeps it in the result.
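
The effect can be reproduced with a small sketch. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module instead of MySQL, plain LIKE instead of MySQL fulltext matching, and two made-up tables (title_key, fulltext_key) standing in for the Fez search key tables. Only the column name rek_title comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the title and attachment search key tables
    CREATE TABLE title_key (pid TEXT, rek_title TEXT);
    CREATE TABLE fulltext_key (pid TEXT, body TEXT);
    INSERT INTO title_key VALUES ('doc:1', 'The blue sky');
    INSERT INTO fulltext_key VALUES ('doc:1', 'the sun was high in the sky');
""")

# "+sky -sun" evaluated separately per table, results combined with UNION.
# The title row matches, so the document is returned even though its
# fulltext contains "sun".
union_hits = [row[0] for row in conn.execute("""
    SELECT pid FROM title_key
     WHERE rek_title LIKE '%sky%' AND rek_title NOT LIKE '%sun%'
    UNION
    SELECT pid FROM fulltext_key
     WHERE body LIKE '%sky%' AND body NOT LIKE '%sun%'
""")]

# The same query evaluated over the document as a whole:
# "sky" must occur in some field, "sun" must occur in none.
document_hits = [row[0] for row in conn.execute("""
    SELECT t.pid
      FROM title_key t JOIN fulltext_key f ON f.pid = t.pid
     WHERE (t.rek_title LIKE '%sky%' OR f.body LIKE '%sky%')
       AND t.rek_title NOT LIKE '%sun%'
       AND f.body NOT LIKE '%sun%'
""")]
```

Here union_hits contains 'doc:1' while document_hits is empty. An index that stores all fields of a document together can evaluate the exclusion across fields, which the per-table union cannot.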

I am now going to take a closer look at Zend Lucene (TSearch2 might 
solve the problem, but I strongly favor database independence). So the 
current plan is to integrate Lucene as a parallel Fez index, using MySQL 
for the authorization indexing. This way, we would have an extremely 
solid and proven search engine in Fez. I'm going to investigate this 
possibility and keep you up-to-date.

What do you think about it?

Cheers, Kai


Re: [Fez-developers] Boolean searching / Fulltext indexing

From: Christiaan K. <c.k...@li...> - 2007-11-21 19:42:11

Hi Kai

If you have a look in eserv.php you'll see that eserv.php handles flash
video (.flv) files differently using an embedded flash video player:
http://dev-repo.library.uq.edu.au/websvn/filedetails.php?repname=fez&path=%2Ftrunk%2Feserv.php

We haven't yet added an automatic 'on-ingest' workflow to automatically add
dissemination copies of mpg/mpeg2 video (and other formats) as flash video
(flv) files, but plan to do this soon. Until then we have done it manually
using a cross-platform bit of software called 'muencode' (although I may
have got the spelling wrong). The new workflow would wrap a fez webservice
around mu-encode just like we do for imagemagick for image conversion.

As for the problems you see, yes, they are probably problems that all
applications with search engines come across, and a good solution may be
Zend Lucene. It would be very nice, though, if we could figure out a way
to put the authorization into the lucene index too; otherwise we run into
problems with result-list paging and with mass post-search authz
filtering. The Moodle project is looking into doing this too, using zend
lucene and putting the authz rules into the index, in a
google-summer-of-code project called the 'Global search module' for
Moodle. You can see their wiki for details and their 'talk' panel and
forums on this topic.
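
The paging problem can be shown with a minimal Python sketch. The function names, document ids, and the plain set of allowed ids are assumptions for illustration only; they are not Fez or Zend Lucene APIs.

```python
def page_then_filter(ranked_pids, allowed, page, page_size):
    """Cut a page from the raw index result, then drop unauthorized hits.

    The page can end up shorter than page_size (or empty).
    """
    start = page * page_size
    return [p for p in ranked_pids[start:start + page_size] if p in allowed]


def filter_then_page(ranked_pids, allowed, page, page_size):
    """Filter the whole result list first, then cut the page.

    Pages are full, but every hit has to be checked on each request.
    """
    visible = [p for p in ranked_pids if p in allowed]
    start = page * page_size
    return visible[start:start + page_size]


# Hypothetical ranked result from the search index and the set of
# documents the current user may see.
ranked = ["doc:1", "doc:2", "doc:3", "doc:4", "doc:5", "doc:6"]
allowed = {"doc:1", "doc:4", "doc:5", "doc:6"}

short_page = page_then_filter(ranked, allowed, 0, 3)
full_page = filter_then_page(ranked, allowed, 0, 3)
```

With these inputs the first approach returns a single hit for a page of three, while the second returns a full page at the cost of filtering the complete result list. Storing the authorization rules in the index itself would avoid both issues.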

I am very happy you are also looking into this and will be very interested
to see your progress.

Thanks for the information,

Christiaan


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Christiaan Kortekaas
Senior Library Open Sorcerer
Library Technology Service
The University of Queensland, Australia QLD 4072
Telephone : (+61) (7) 3346 4337
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~