WIKINDX / Bugs and feature requests / #616 Advanced search : special characters are dropped inside exact phrase

Mark Grimshaw - 2023-10-10

This can be fixed. We have a little issue that I think should be fixed first. In the resource input form, fields such as title and subtitle use the tinyMCE box in which case characters such as '&' get converted to '&'. This doesn't happen with non-tinyMCE boxes and so sometimes '&' is written to the database converted to an HTML entitiy and sometimes not. This makes it difficult to form a SQL query to search for a phrase containing '&.' This applies to all other potential HTML entities.

A number of solutions I can think of but one that does not require the database to be upgraded (and we don't know which entities should be encoded or decoded depending on the approach taken) would be to formulate the SQL search for both an entity-encoded search and a non-entity-encoded search. We know which fields (in the resource entry form) will have entities encoded or not.

What do you think Stéphane?

Mark

Last edit: Mark Grimshaw 2023-10-10

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-10
  
  HTML stored in the database should be minimally escaped. No more than the characters " ("), ' ('), < (<), > (>), and & (&). Otherwise, it must be corrected.
  
  For advanced security it would be necessary to escape more but to date it is more than sufficient. So I don't think it's the worst difficulty.
  
  I thought that the Sort/NoSort suffixed fields were used for this search.
  
  I would make two fields to put a copy of titles and subtitles free of HTML, escaping, and protection characters {...} on save, and search them instead. Same things for all HTML fields. The code used to extract raw text from attachment can be used to extract it from HTML safely.
  
  This is what advanced fulltext search engines do with filters like SQL Server.
  
  This conference of Richard Hipp explains how he wrote his own full text search for SQLIte website and removing HTML is part of it : Behind The Scenes of Full Text Search.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2023-10-11

It's not just the title field; it's any field stored in the database that comes from a tinyMCE textarea. For example, this is how this text in an abstract field:

Grimshaw & Aagaard é < > æø "blah" and 'more blah' ` «»

is stored in the database (it's tinyMCE that does the encoding, not us):

Grimshaw & Aagaard é < > æø "blah" and 'more blah' ` «»

Searchable textarea fields are:
title
subtitle
abstract
notes
quotes + quote comments
paraphrases + paraphrase comments
musings
ideas

so I would not suggest duplicating all of these in the database.

I think the best solution is to encode the search terms in the SQL statement using htmlspecialchars() and the ENT_NOQUOTES flag, but to encode only for those fields originating from tinyMCE textareas.

It should be simple enough. Unless you have any objection, I will work on this over the next day or two.

Regards,

Mark
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-11
  
  What I wrote is valid for any field storing HTML of course.
  
  By keeping the HTML you will not be able to find the exact expressions that overlap an HTML tag.
  
  Won't you also have problems calculating the lengths?
  
  If it were me I would create a table with 5 fields (resourceId, fieldName, FieldValue, DateCreate, DateUpdate) in which to put the values without HTML, then I use this table for searches.
  
  While waiting for the right solution you can follow your idea and see how that stick.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2023-10-11

OK. If we accept your point that a database field should not have entities in it but have the text as entered by the user.

Solution 1:
A new table explicitly for searching where the fields I listed above are duplicated but without HTML entities (only '&', '>', and '<' need decoding). Any subsequent editing of the fields would also need to update this table.

Solution 2:
When writing the fields to the existing tables, decode HTML entities (only '&', '>', and '<' need decoding).

Both solutions involve an upgrade to the database.

I would prefer solution 2 as it requires less maintenance (the subsequent editing) and is closer to the aim of "HTML stored in the database should be minimally escaped. No more than the characters " ("), ' ('), < (<), > (>), and & (&). Otherwise, it must be corrected."

If we agree, do I do it for this release (I think yes)?

Mark

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-11
  
  Solution 2 is wrong. It destroys the proper escaping of HTML fields.
  
  I will give a better answer from the current state of affairs.
  
  For HTML fields, TinyMCE already escapes well. Example:
  
  It's an **Escaping Test ** with quotes (") & other {OdDiTies} such as >, <, and Ⴈ signs
  
  Is stored as:
  
  It's an <strong>Escaping Test</strong> with quotes (") & other {OdDiTies} such as >, <, and Ⴈ signs
  
  " and ' are not encoded because there are not inside an HTML attributs. It's fine.
  
  I only have to fix for the release already stored values if there are more escaped.
  
  In the event that non-HTML fields contain HTML entities or HTML I must also correct them with this release.
  
  The proper solution is a table with fixed values minus HTML because you can't search "Escaping Test" with something like this: It's an <strong>Escaping</strong> Test with quotes (") & other {OdDiTies} such as >, <, and Ⴈ signs
  
  For a degraded solution you can just search all HTML entities equivalent to ", >,< as you first proposed.
  
  For the moment the degraded solution does the trick unless you think that the additional table is simpler to write queries. In your opinion, what is the easiest way to write queries in a short time?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2023-10-11

OK. Let's try duplicating certain searchable fields in a new table without HTML.

In addition to the fields I listed above, customLong can also have HTML entities.

So the fields to duplicate would be:
resourceNoSort resourceTitle (i.e. the two fields concatenated with space)
resourceSubtitle
resourcetextAbstract
resourcetextNote
resourcemetadataText
resourcecustomLong

The new table needs to store the fields above in addition to the following as minimum:
unique ID (auto incrementing)
resourceId
type (title, subtitle, abstract, note, metadata, custom or NULL)
text (from the fields listed above)
and (to be able to deal with the various types of metadata and the fact that each resource can have more than one metadatum):
metadataType (e.g., 'q', 'qc', 'p', 'pc', 'm', 'i' or NULL)
metadataId (a copy of resourcemetadataId or NULL)
and (to be able to deal with the fact that each resource can have more than one large custom field):
customId (a copy of resourcecustomCustomId)

The above should allow for updating the duplicated fields when the originals are edited.

It should be simple enough to change the search queries. If you look at QUICKSEARCH::fieldSql() you'll see what I mean. At a quick glance, Advanced Search is similar.

Mark

Last edit: Mark Grimshaw 2023-10-12

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-12
  
  Apart from the titles are there other fields with the protection sequence {...}, or other special syntax?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Grimshaw - 2023-10-12
    
    I'm pretty certain the answer is no.
    
    If you can write the function to prepare the data (whether in this initial transfer to the new table or when adding or editing such fields int he future), I can do the database upgrade and amend the search routines.
    
    What should we call the new table? 'search'?
    
    And a schema like?
    
    CREATE TABLE IF NOT EXISTS search (
    searchId int(11) NOT NULL AUTO_INCREMENT,
    searchResourceId int(11) DEFAULT NULL,
    searchMetadataId int(11) DEFAULT NULL,
    searchCustomId int(11) DEFAULT NULL,
    searchText mediumtext COLLATE utf8mb4_unicode_520_ci NOT NULL,
    searchType varchar(10) COLLATE utf8mb4_unicode_520_ci NOT NULL,
    searchMetadataType varchar(2) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
    searchMetadataPrivate varchar(1) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
    PRIMARY KEY (searchId),
    KEY (searchResourceId),
    KEY (searchMetadataId),
    KEY (searchCustomId),
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mark Grimshaw - 2023-10-13
      
      . . . just so you don't start doing it yourself, I've already starting drafting the code to upgrade the database. . .
      
      I think you would od a far better job than me in writing the code to clean-up the HTML.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Stéphane Aulery - 2023-10-13
        
        I would write a special function to filter the HTML.
        
        You can already create this function with fake code that returns the input text without modifying it and I will write the body this evening.
        
        In the table description, put the searchText fields at the end because it is very long. This can improve access time.
        
        searchType is for searchField?
        
        I don't really like mixing normal fields with metadata and custom fields, but keep it up and we'll see.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Stéphane Aulery - 2023-10-13
      
      I would call it search_resource because it is targeted.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Mark Grimshaw - 2023-10-13
        
        Hi Stéphane,
        
        We could have:
        search_resource (text field is varchar(1020))
        search_custom (text field is mediumtext) with FULLTEXT key
        search_metadata (text field is mediumtext) with FULLTEXT key
        
        search_metadata will also include ideas which are independent of resources.
        
        Splitting into three might help keep each table down in size and improve complex searches.
        
        You're the expert.
        
        I won't write the dummy HTML conversion code as it will take some time to get the table structure right and ensure the update code does the correct transfer in a way that can be easily used by search. For now I'm just using htmlspecialchars_decode($text, ENT_NOQUOTES) so I can at least check ampersands are not dropped. It's no problem to keep dropping the search table(s) and running UPDATEDATABASE again.
        
        Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2023-10-13
        
        I don't know where the ideas are stored anymore. I will try to find explanations in English, or write them myself, so that you understand how to structure a database. Merise is a good French method of database modeling.
        
        Splitting into three might help keep each table down in size and improve complex searches.
        
        Yes. In a table there must only be the fields of a single logical entity, an entity represented and distinguished by the fields of the primary key. For example the table the resource_metadata table mixes notes, summary, musings... This is not good. Restructuring that will be for another day. These three tables (search_resource, search_custom, search_metadata) are fine for the current schema.
        
        Wrap it in this function and I just have to write the real code:
        
        function filterHTML($text) { return htmlspecialchars_decode($text, ENT_NOQUOTES); }
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Grimshaw - 2023-10-13
        
        ideas are stored in resource_metadata where resourcemetadataType is 'i' — yes, they should not be stored there I know . . .
        
        resource_text is where notes and summaries/abstracts are stored—so they at least are separate from metadata.
        
        I will do the three search_xxx tables and get onto it tomorrow probably.
        
        Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2023-10-14

Hi Stéphane,

So far so good. However, in the database, with the new filterHTML(), this pre-filterHTML()

Grimshaw & Aagaard é < > æø "blah" and 'more blah' ` «»

becomes this post-filterHTML():

Grimshaw & Aagaard Ã© < > Ã¦Ã¸ "blah" and 'more blah' ` Â«Â»

which doesn't seem to be correct.

(I've decided on five seearch_xxx tables:
search_resources
search_abstractnotes
search_custom
search_ideas
search_metadata

so that different conceptual types are kept separate.)

Mark

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-14
  
  Hi Mark,
  
  Le 14/10/2023 à 10:17, Mark Grimshaw a écrit :
  
  So far so good. However, in the database, with the new filterHTML(), this pre-filterHTML()
  
  Grimshaw & Aagaard é < > æø "blah" and 'more blah' ` «»
  
  becomes this post-filterHTML():
  
  Grimshaw & Aagaard Ã© < > Ã¦Ã¸ "blah" and 'more blah' ` Â«Â»
  
  The characters Ã are weird.
  
  which doesn't seem to be correct.
  
  (I've decided on five seearch_xxx tables:
  search_resources
  search_abstractnotes
  search_custom
  search_ideas
  search_metadata
  
  so that different conceptual types are kept separate.)
  
  :-)
  
  --
  Stéphane Aulery
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Grimshaw - 2023-10-14
    
    They're to do with UTF-8. It's definitely something to do with filterHTML() as that is the only thing that is applied to the text.
    
    If the abstract text is:
    Grimshaw & Aagaard é < > æø "blah" and 'more blah' «»`
    
    in the form, it becomes
    Grimshaw & Aagaard é < > æø "blah" and 'more blah' «»
    
    in the resource_text database (correct) and
    Grimshaw & Aagaard Ã© < > Ã¦Ã¸ "blah" and 'more blah' Â«Â»`
    
    in the search_abstractnotes table (incorrect)
    
    search_abstractnotes should have:
    Grimshaw & Aagaard é < > æø "blah" and 'more blah' «»`
    
    That is, exactly what was entered in the form.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Stéphane Aulery - 2023-10-14
      
      Try the last commit please.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Mark Grimshaw - 2023-10-14
        
        Perfect. Thanks.
        
        Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2023-10-14
        
        I see you added mb_trim() to filterHTML(). As the contents of the tables are separated you can arrange the text as you want such as removing line breaks, double spaces...
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Grimshaw - 2023-10-14
        
        The issue I found when testing was that a field that had only HTML (e.g., just an image or perhaps an image with an HTML entitiy or two) should have been returned as ''. It wasn't and the consequence was that the field that should have been (default) NULL in search_xxx was not. mb_trim() solved it but perhaps you can think of something else.
        
        Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2023-10-15

Hi Stéphane,

In the latest SVN, I've implemented the use of search_xxx tables in quicksearch. There appears to be some weirdness in mySQL FULLTEXT searches in that any non word character is ignored and, in an exact phrase, these are removed and two words separated by a non-word character (like space) are treated as two separate words not an exact phrase.

e.g., if I have the unique phrase in an abstract "more & abstract", then this finds it but also other abstracts with the word 'abstract' in it:

SELECT searchabstractnotesResourceId AS rId FROM search_abstractnotes WHERE ( MATCH(searchabstractnotesText) AGAINST('("& abstract")' IN BOOLEAN MODE)) AND (searchabstractnotesType = 'abstract')

If I try a plain REGEXP, I get no results:

SELECT searchabstractnotesResourceId AS rId FROM search_abstractnotes WHERE (searchabstractnotesText REGEXP '\b& abstract\b') AND (searchabstractnotesType = 'abstract');

likewise with:

SELECT searchabstractnotesResourceId AS rId FROM search_abstractnotes WHERE (searchabstractnotesText REGEXP '\b[[.ampersand]] abstract\b') AND (searchabstractnotesType = 'abstract');

This works correctly:

SELECT searchabstractnotesResourceId, searchabstractnotesText AS rId FROM search_abstractnotes WHERE ( searchabstractnotesText LIKE '%& abstract%') AND (searchabstractnotesType = 'abstract');

I can't find a solution that does not involve changing mySQL character sets for individual characters or moving away from FULLTEXT searches and or regexps.

I can't remember why we decided to use regexps and FULLTEXT in the first place other than speed with large databases.

I suggest forgetting about FULLTEXT and regexp and to instead use LIKE %...%.

Any other ideas?

Mark

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-15
  
  I searched my emails. We are redoing the discussion from March 2018 whose title was "mysql regex searching". The starting point was the observation that MySQL REGEXPs are broken. Do you still have the messages?
  
  In 2021 I also proposed to test tntsearch.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2023-10-15
  
  But be careful, the MySQL syntax for matching phrase litteraly is not the one you use. Read Natural Language Full-Text Searches of MySQL doc
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Advanced search : special characters are dropped inside exact phrase

Virtual Research Environment / On-line Bibliography Manager

Target

Found in

Searches

Help

#616 Advanced search : special characters are dropped inside exact phrase

Discussion