WIKINDX / Bugs and feature requests / #646 Escaping error in HTML export

Stéphane Aulery - 2024-01-13

I think we have to create one resource of each time filed with a value for eache field which contains &. Then fix all export escaping and keep these resource as a separete file for futur tests.

Last edit: Stéphane Aulery 2024-01-13

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2024-01-14

Several possible solutions here.

Firstly, textinput (compared to tinyMCE textarea) has never been escaped when stored in the database. This includes the fields you list above.

We could go through at a later update and escape all these fields (which would then need to be unescaped when editing in a textinput).

But best, I think, to follow our principle of storing in the database as much as possible unchanged data as itheycomes from the web browser.

A solution to this, then, would be have an array of fields which have been escaped by tinyMCE (because it is a shorter list)—not forgetting custom fields— and, when exporting to HTML any field NOT in this list would be escaped with htmlspecialchars(). This would only be for HTML exporting.

Would that do it do you think?

Mark

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2024-01-14
  
  Firstly, textinput (compared to tinyMCE textarea) has never been escaped when stored in the database. This includes the fields you list above.
  
  We could go through at a later update and escape all these fields (which would then need to be unescaped when editing in a textinput).
  
  No, that would be a mistake.
  
  But best, I think, to follow our principle of storing in the database as much as possible unchanged data as itheycomes from the web browser.
  
  The criterion is not the source but the choice of field type (HTML, non-HTML, integer, etc.). If you have chosen that a field is HTML it is because you want rich formatting in this scenario. Whenever possible, the HTML field type should be avoided because it is more difficult to manipulate and causes security problems.
  
  A solution to this, then, would be have an array of fields which have been escaped by tinyMCE (because it is a shorter list)—not forgetting custom fields— and, when exporting to HTML any field NOT in this list would be escaped with htmlspecialchars(). This would only be for HTML exporting.
  
  Yes, we need to know which fields are HTML, whether by a list or something else. But no, you should not escape them because they are already escaped in the database. These are the others that you have to escape when you put them in HTML.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Grimshaw - 2024-01-14
    
    OK. Will deal with it in a few days (busy for the next two).
    
    Mark
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Stéphane Aulery - 2024-01-14
      
      Meanwhile I will create resources to fuzz all fields.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Stéphane Aulery - 2024-01-17
        
        Hi Mark,
        
        I find a PHP librarty to generate fake data (Faker) and created a Fuzzer section in debug Tools.
        
        My idea is to create a generator that returns a complete resource or multiple generators that return data for the resource_* tables in the database. Then use the debug tool to inject a batch of randomly generated references.
        
        Before attempting a complete write-up, does this give you any ideas?
        
        I put the code in SVN so you can test.
        
        Regards,
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Grimshaw - 2024-01-18
        
        It sounds great Stéphane. Just a few things come to mind:
        1. Use RESOURCEMAP (if you haven't already thought about it) to give you (most of) the fields that need to be filled in with each resource type. You might find it easier to compile a FUZZERMAP that more directly gives all fields for each resource type and also gives you information on the field data type (although I guess this could be a query on the database itself).
        2. Use the code in RESOURCEWRITE, particularly gatherInput() and writeTables().
        3. Use the tag table and resourcemiscTag to keep track of groups of resources added. This then makes it very easy to delete all those resources in one go using the code in DELETERESOURCE.
        
        You will run into trouble if you just add data to all possible fields without thinking of the resource type. This is because deleting a resource doesn't necessarily delete just rows (in most cases this is indeed done) but it also does all sorts of cross-checks with existing resources (e.g., their keywords etc.). The same with writing a new resource.
        
        Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2024-01-18
        
        Very good advice. I don't know if I can do this in a reasonable amount of time. Is it difficult to navigate RESOURCEMAP?
        
        I don't know where I can simplify compared to the actual code. The goal is not necessarily initially to create fully functional resources but to fuzz the import/export only with a disposable base.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Grimshaw - 2024-01-18
        
        Let me see if I can come up with a FUZZERMAP over the next few days. (Short response—in a meeting . . .)
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2024-01-17

Should now be fixed in SVN ([r4956]).

Related

Commit: [r4956]

Last edit: Stéphane Aulery 2024-01-27

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2024-01-27
  
  Hi Mark,
  
  I generate 100 random resource with the fuzzer and exported them to HTML.
  
  Fields are escaped but sometimes the escaping is broken.
  
  Seems when the end of a formated resource is a semicolon, the formater replace it by a point instead of adding a point and breaks the escaping.
  
  Regards,
  
  formated_resource.png
  
  html_error.png
  
  page_end_filed_content.png
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Grimshaw - 2024-01-27
    
    Will look into it. I know why it happens just need to figure out how to stop it.
    
    Last edit: Stéphane Aulery 2024-01-27
    
    alternate
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mark Grimshaw - 2024-01-27
      
      also with RTF (but not printing within wikindx).
      
      Last edit: Stéphane Aulery 2024-01-27
      
      alternate
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Mark Grimshaw - 2024-01-27
        
        With an input field that is &Ampersand&, I can see the error with RTF:
        Altm&<>an, R. (2023). Thesis title. Unpublished Dissertation PhD, Test, &Ampersand&amp.
        
        But exporting to HTML or viewing the resource in WIKINDX, I see no error:
        Altm&<>an, R. (2023). Thesis title. Unpublished Dissertation PhD, Test, &Ampersand&.
        source:
        Altm&<>an, R. (2023). <em>Thesis title</em>. Unpublished Dissertation PhD, Test, &Ampersand&.
        
        What bib style are you using?
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2024-01-27
        
        For this test APA. I did other tests. This is also true when the last character is < For example.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Grimshaw - 2024-01-27
        
        Strange. I'm using APA too. But duplicated it with magazine article and the endPage field. Seems it might not be on all fields. I'll look into it but it might be a day or two.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2024-01-27
        
        This is where the fuzzer shows its usefulness. I couldn't have created by hand 100 resources to find this bug.
        
        I have the impression that there are also problems with the BiBTeX export. But let's go step by step so as not to break anything. Except for good reason uses \HTML\escape() instead of htmlspecialchars().
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2024-01-27
        
        I also find the same error with data if field ID no. (ISBN etc.) and Publisher. Same something when an HTML entity is an the end of the formted text and not attached to a particular field.
        
        Greenshot-20240127225432.png
        
        Greenshot-20240127225503.png
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Grimshaw - 2024-01-28

I've found another issue.

First, ensure that hyperlinking the resource in a list to a view of that list is enabled in My Wikindx|Resources (with it off, there is no error).

Then view your resource (the one where an HTML element is the last character) in
1. A list (e.g., the FRONT or a basket, etc.)—the error appears.
2. In single view—no error.

However, I have just committed a fix that solves this issue and the HTML and RTF export.

On a related issue: if an HTML entity is added to a resource field (e.g. resourcepagesPageEnd) as &, it is stored like that in the database. That is, without change. If the resource is edited, the value then appears as '&' and is then stored as that in the database.

I assume we don't want this and that the field value should always be stored as originally entered (notwithstanding what tinyMCE does). Correct?

Mark

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2024-01-28
  
  On a related issue: if an HTML entity is added to a resource field (e.g. resourcepagesPageEnd) as &, it is stored like that in the database. That is, without change. If the resource is edited, the value then appears as '&' and is then stored as that in the database.
  
  First TinyMCE do the right thing. If a non encoded & is loaded inside TinyMCE, it return it encoded. So there are no problem with TinyMCE and this case is imaginary.
  
  There are six cases:
  
  a) HTML field receives HTML well encoded content = OK
  b) HTML field receives HTML misencoded content = KO
  c) HTML field receives Text content without encodage characters = OK
  
  d) Text field receives HTML well content = OK
  e) Text field receives HTML misencoded content = KO
  f) Text field receives Text content = OK
  
  We write the software for the cases a), c) and f), and escape content on output following this scheme.
  
  b) and e) are pathological. Content like this may be inserted from the wrong source or by hand and it's almost impossible to avoid it when it comes from outside. We are not dealing with this case. It’s up to us at least not to generate it internally.
  
  d) This case is legal. At the output the data will be displayed as raw text because it has been escaped.
  
  When the content comes from an import or a call via an API (we don't have the latter case but we can be spammed by robots) it's not the same thing.
  
  Each import has its own format and I don't see that the titles in other software are HTML. So they must be escaped on a case by case basis.
  
  For the output, putting HTML into HTML or XML does not require anything special because the HTML fragment is presumed to be well escaped, otherwise the content must be escaped.
  
  example_amp.png
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stéphane Aulery - 2024-01-28
  
  You solution is wrong. You should not mix in \HTML\escapeHTMLTemp() escaping code and logic. Use \HTML\escape() and keep the logic in the formatter.
  
  Atfer you fix there are double escaping. You have to undo what you did. Tell me what you want to do and I will try to help you. Otherwise it's a weird case that we can keep for after the release.
  
  double_escaping.png
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Grimshaw - 2024-01-28
    
    hmmm. \BIB\cleanItems() line 926 is the point where it all happens. That if{} block deals with adding the ultimate field from the template to the rest of the reference and ensuring there is no double punctuation. The ultimate field is a special field that comprises only punctuation and can only be at the end of the reference.
    
    At the same time, if someone has written & or similar at the end of the penultimate field then the ';' of that character should not count as punctuation and thus be removed. In other words, any HTML entities of &...; form that are at the end of the penultimate field must be detected and preserved when checking for double punctuation.
    
    I don't understand how your example comes to be because the &' is not at the end of the reference.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Stéphane Aulery - 2024-01-28
      
      In escapeHTMLTemp(), the pattern sjould end with $ to detect only an entity an the end of the string.
      
      I don't understand how your example comes to be because the &' is not at the end of the reference.
      
      Maybe it's an other case but I lost the data and generated an ohter batch of 1000 resources. I have an other exemple where the field end with "<,". The coma is removed and the ; lost.
      
      This is why I am doubtful of the solution using a regexp. Code that removes excess punctuation should also detect entities with a comparison of the last characters and not a substitution.
      
      Greenshot-20240128150319.png
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Mark Grimshaw - 2024-01-28
        
        Hah! I did have the '$' but didn't realise I'd deleted it. But should it be '$' rather than '$'?
        
        I don't think it makes any difference, but, in cleanItems(), if I have the following print statements:
        
        print htmlspecialchars($itemArray[$lastItemKey]) . BR; $lastItem = \HTML\escapeHTMLTemp(TRUE, $itemArray[$lastItemKey]); print htmlspecialchars($lastItem) . BR; $$lastItem = preg_replace("/(.*)$lastItemPattern/Uus", '${1}', $lastItem); print htmlspecialchars($lastItem) . BR; $itemArray[$lastItemKey] = \HTML\escapeHTMLTemp(FALSE, $lastItem); print htmlspecialchars($itemArray[$lastItemKey]);
        
        I get:
        
        &'2–&22& W!K!NDX_&'2–&22&_W!K!NDX W!K!NDX_&'2–&22&_W!K!NDX &'2–&22&
        
        In other words, the '$' is not anchoring but greedily taking the entire string.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Stéphane Aulery - 2024-01-28
        
        My last commits adds it. This prevents the substitution from taking place in another place with unanticipated consequences.
        
        In other words, the '$' is not anchoring but greedily taking the entire string.
        
        it's why regexp is a foot gun. I fixed that with {1}.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Escaping error in HTML export

Virtual Research Environment / On-line Bibliography Manager

Target

Found in

Searches

Help

#646 Escaping error in HTML export

Related

Discussion

Related