
#328 Quotation marks not respected as phrase search marker in abstracts

Unknown
closed
None
6.4.3
Bug
7.3
Linux
2021-08-28
2021-03-29
Joachim
No

Hi Mark,

both in Quicksearch and Advanced search, I get strange search results with phrase searches in quotation marks. Consider the phrase search "we are iron man" in my WIKINDX, which should give exactly one resource, https://www.bobc.uni-bonn.de/index.php?action=resource_RESOURCEVIEW_CORE&id=13021&browserTabID=. In Quicksearch, I get 59, which is obviously quite unrealistic (I didn't look at all resources, but the first ones do not have this phrase). When searching for the same phrase in Advanced search in titles only, I get only one resource, the right one. When searching in abstracts, I get 58, in titles AND abstracts 59. I guess that the search in abstract fields does not consider the quotation marks.

Best
Joachim

Related

Bugs and feature requests : #336

Discussion

  • Mark Grimshaw

    Mark Grimshaw - 2021-03-29
    • assigned_to: Mark Grimshaw
     
  • Mark Grimshaw

    Mark Grimshaw - 2021-03-29

    Hi Joachim,

    That's going to be a difficult bug to track down, as I can't reproduce it. If I search for "virtual world" in my database in Quicksearch, I get 29 results, which is correct; this includes resources where the phrase appears only in the title, abstract, notes, etc.

    I will look further.

    Mark

     

    Last edit: Mark Grimshaw 2021-03-30
    • Stéphane Aulery

      I can reproduce the bug with this very simple query. I searched for the phrase on my own db, copied this subquery, and ran it against Joachim's db:

      SELECT
          `resourcetextId` AS rId,
          resourcetextAbstract
      FROM wkx_resource_text
      WHERE (MATCH(`resourcetextAbstract`) AGAINST('("we are iron man")' IN BOOLEAN MODE))
      

      This gives 57 results.
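One way to cross-check which of the returned rows really contain the phrase is to post-filter them outside the database. The following is only a minimal Python sketch, not WIKINDX code: the sample abstracts and the helper names (`tokenize`, `contains_phrase`, `contains_any_word`) are hypothetical, and it does not claim to explain why MySQL returns the extra rows. It contrasts a strict phrase match (the words must be contiguous and in order) with a loose match on any single word.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of word characters as tokens.
    return re.findall(r"\w+", text.lower())

def contains_phrase(text, phrase):
    # True only if the phrase's words occur contiguously and in order.
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def contains_any_word(text, phrase):
    # True if at least one word of the phrase occurs anywhere in the text.
    words = set(tokenize(text))
    return any(w in words for w in tokenize(phrase))

# Hypothetical sample rows standing in for resourcetextAbstract values.
abstracts = {
    1: "In the film, the crowd chants: we are Iron Man.",
    2: "We argue that iron tools are what made man a builder.",
    3: "A study of virtual worlds.",
}

phrase = "we are iron man"
phrase_hits = [rid for rid, text in abstracts.items() if contains_phrase(text, phrase)]
loose_hits = [rid for rid, text in abstracts.items() if contains_any_word(text, phrase)]

print("strict phrase match:", phrase_hits)
print("any-word match:", loose_hits)
```

Running the strict check over the 57 rows returned by the FULLTEXT query would show how many of them actually contain the quoted phrase.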

      Even more bizarre, if I copy the text of the first two results (6968 and 11129) to my database and run the query I only get one result (from id 11129).

      I also changed the innodb_ft_enable_stopword option on my server and got the same results.

      I searched the MySQL and MariaDB bug trackers for FULLTEXT search bugs and found dozens of them.

      Mark, you can test in your own database but I think it's a MySQL / MariaDB bug. I can open a bug report with the data pulled from Joachim's database if you want.

      If that's right, we shouldn't wait for a resolution for ... months or years.

       
  • Mark Grimshaw

    Mark Grimshaw - 2021-03-31
     
    • Stéphane Aulery

      Maybe we could consider evaluating, or taking inspiration from, TNTSearch, a full-text search engine written in pure PHP.

      Under the hood it uses SQLite. I ran a test tonight with SQLite and it worked like a breeze.

       
      • Stéphane Aulery

        Xapian is also worth studying.

         
  • Mark Grimshaw

    Mark Grimshaw - 2021-03-31

    I get bizarre results too. With:
    Špelda, D. (2017). The role of the telescope and microscope in the constitution of the Idea of scientific progress. The Seventeenth Century, 34(1), 107–126.

    in QUICKSEARCH and Advanced Search (on title):
    "the telescope" – positive
    "the telescope and" – positive
    "the telescope and the" – negative

    WIKINDX is not mishandling the returned results: copying the compiled SQL statement and running it directly produces the same results.

    Mark

     

    Last edit: Mark Grimshaw 2021-03-31
    • Stéphane Aulery

      Then this is yet another, separate error, because the title queries use a different MySQL operator (REGEXP rather than MATCH ... AGAINST).

       
      • Mark Grimshaw

        Mark Grimshaw - 2021-03-31

        Indeed:

        SELECT `rId`
        FROM (
            SELECT `resourceId` AS rId
            FROM wkx_resource
            WHERE ((CONCAT_WS(' ', `resourceNoSort`, `resourceTitle`, `resourceSubtitle`)
                REGEXP '[[:<:]]the telescope and the[[:>:]]'))
        ) AS wkx_u
        LEFT OUTER JOIN wkx_resource_creator ON `resourcecreatorResourceId` = `rId`
        LEFT OUTER JOIN wkx_creator ON `creatorId` = `resourcecreatorCreatorId`
        
         

        Last edit: Mark Grimshaw 2021-03-31
        • Stéphane Aulery

          But where do you see an error in these last results?

           
  • Mark Grimshaw

    Mark Grimshaw - 2021-03-31

    I think I know what one issue might be with search.

    Resource titles can be protected from changes in bibliographic capitalization using {...}.

    I've been able to get the results I want by removing the use of {...} in the title fields:

    SELECT `resourceId` AS rId FROM wkx_resource WHERE (
    REPLACE( REPLACE(CONCAT_WS(' ', `resourceNoSort`, `resourceTitle`, `resourceSubtitle`) , '{', ''), '}', '')
    REGEXP '[[:<:]]idea of scientific progress[[:>:]]')
    

    That explains the issue I've had but it's probably different to the main reported issues here.
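The effect of the braces on a word-boundary match can be sketched outside the database. This is a minimal Python illustration under stated assumptions, not the WIKINDX implementation: it assumes the stored title protects the word as `{Idea}`, the helper names (`strip_braces`, `phrase_in_title`) are hypothetical, and Python's `\b` stands in for MySQL's `[[:<:]]` and `[[:>:]]` markers.

```python
import re

def strip_braces(title):
    # Mirror the nested REPLACE(...) calls: drop '{' and '}' from the title.
    return title.replace("{", "").replace("}", "")

def phrase_in_title(title, phrase, strip=True):
    # Case-insensitive whole-phrase match delimited by word boundaries.
    haystack = strip_braces(title) if strip else title
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, haystack, flags=re.IGNORECASE) is not None

# Assumed stored form of the title, with the capitalised word protected by braces.
title = ("The role of the telescope and microscope in the constitution "
         "of the {Idea} of scientific progress")

phrase = "idea of scientific progress"
print("with braces kept:    ", phrase_in_title(title, phrase, strip=False))
print("with braces stripped:", phrase_in_title(title, phrase, strip=True))
```

With the braces left in place, the closing `}` sits between "Idea" and the following space, so the phrase cannot match; stripping the braces first removes that obstacle.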

    Mark

     
  • Stéphane Aulery

    • status: open --> wip
     
  • Mark Grimshaw

    Mark Grimshaw - 2021-08-28
    • status: wip --> closed
     
