Menu

#646 Escaping error in HTML export

6.8.0
closed
None
Unknown
Bug
Unknow
Unknow
Unknow
2024-02-01
2024-01-13
No

Hi Mark,

I find some fields of Miscellaneous type not escaped in HTML export (Publisher, Medium, Authors).

Regards,

2 Attachments

Related

News: 2024/01/wikindx-v680-release-candidate-1
News: 2024/02/wikindx-v680-release-candidate-2
News: 2024/02/wikindx-v680

Discussion

1 2 > >> (Page 1 of 2)
  • Stéphane Aulery

    I think we have to create one resource of each time filed with a value for eache field which contains &. Then fix all export escaping and keep these resource as a separete file for futur tests.

     

    Last edit: Stéphane Aulery 2024-01-13
  • Mark Grimshaw

    Mark Grimshaw - 2024-01-14

    Several possible solutions here.

    Firstly, textinput (compared to tinyMCE textarea) has never been escaped when stored in the database. This includes the fields you list above.

    We could go through at a later update and escape all these fields (which would then need to be unescaped when editing in a textinput).

    But best, I think, to follow our principle of storing in the database as much as possible unchanged data as itheycomes from the web browser.

    A solution to this, then, would be have an array of fields which have been escaped by tinyMCE (because it is a shorter list)—not forgetting custom fields— and, when exporting to HTML any field NOT in this list would be escaped with htmlspecialchars(). This would only be for HTML exporting.

    Would that do it do you think?

    Mark

     
    • Stéphane Aulery

      Firstly, textinput (compared to tinyMCE textarea) has never been escaped when stored in the database. This includes the fields you list above.

      We could go through at a later update and escape all these fields (which would then need to be unescaped when editing in a textinput).

      No, that would be a mistake.

      But best, I think, to follow our principle of storing in the database as much as possible unchanged data as itheycomes from the web browser.

      The criterion is not the source but the choice of field type (HTML, non-HTML, integer, etc.). If you have chosen that a field is HTML it is because you want rich formatting in this scenario. Whenever possible, the HTML field type should be avoided because it is more difficult to manipulate and causes security problems.

      A solution to this, then, would be have an array of fields which have been escaped by tinyMCE (because it is a shorter list)—not forgetting custom fields— and, when exporting to HTML any field NOT in this list would be escaped with htmlspecialchars(). This would only be for HTML exporting.

      Yes, we need to know which fields are HTML, whether by a list or something else. But no, you should not escape them because they are already escaped in the database. These are the others that you have to escape when you put them in HTML.

       
      • Mark Grimshaw

        Mark Grimshaw - 2024-01-14

        OK. Will deal with it in a few days (busy for the next two).

        Mark

         
        • Stéphane Aulery

          Meanwhile I will create resources to fuzz all fields.

           
          • Stéphane Aulery

            Hi Mark,

            I find a PHP librarty to generate fake data (Faker) and created a Fuzzer section in debug Tools.

            My idea is to create a generator that returns a complete resource or multiple generators that return data for the resource_* tables in the database. Then use the debug tool to inject a batch of randomly generated references.

            Before attempting a complete write-up, does this give you any ideas?

            I put the code in SVN so you can test.

            Regards,

             
            • Mark Grimshaw

              Mark Grimshaw - 2024-01-18

              It sounds great Stéphane. Just a few things come to mind:
              1. Use RESOURCEMAP (if you haven't already thought about it) to give you (most of) the fields that need to be filled in with each resource type. You might find it easier to compile a FUZZERMAP that more directly gives all fields for each resource type and also gives you information on the field data type (although I guess this could be a query on the database itself).
              2. Use the code in RESOURCEWRITE, particularly gatherInput() and writeTables().
              3. Use the tag table and resourcemiscTag to keep track of groups of resources added. This then makes it very easy to delete all those resources in one go using the code in DELETERESOURCE.

              You will run into trouble if you just add data to all possible fields without thinking of the resource type. This is because deleting a resource doesn't necessarily delete just rows (in most cases this is indeed done) but it also does all sorts of cross-checks with existing resources (e.g., their keywords etc.). The same with writing a new resource.

              Mark

               
              • Stéphane Aulery

                Very good advice. I don't know if I can do this in a reasonable amount of time. Is it difficult to navigate RESOURCEMAP?

                I don't know where I can simplify compared to the actual code. The goal is not necessarily initially to create fully functional resources but to fuzz the import/export only with a disposable base.

                 
                • Mark Grimshaw

                  Mark Grimshaw - 2024-01-18

                  Let me see if I can come up with a FUZZERMAP over the next few days. (Short response—in a meeting . . .)

                   
  • Mark Grimshaw

    Mark Grimshaw - 2024-01-17

    Should now be fixed in SVN ([r4956]).

     

    Related

    Commit: [r4956]


    Last edit: Stéphane Aulery 2024-01-27
    • Stéphane Aulery

      Hi Mark,

      I generate 100 random resource with the fuzzer and exported them to HTML.

      Fields are escaped but sometimes the escaping is broken.

      Seems when the end of a formated resource is a semicolon, the formater replace it by a point instead of adding a point and breaks the escaping.

      Regards,

       
      • Mark Grimshaw

        Mark Grimshaw - 2024-01-27

        Will look into it. I know why it happens just need to figure out how to stop it.

         

        Last edit: Stéphane Aulery 2024-01-27
        • Mark Grimshaw

          Mark Grimshaw - 2024-01-27

          also with RTF (but not printing within wikindx).

           

          Last edit: Stéphane Aulery 2024-01-27
          • Mark Grimshaw

            Mark Grimshaw - 2024-01-27

            With an input field that is &Ampersand&, I can see the error with RTF:
            Altm&<>an, R. (2023). Thesis title. Unpublished Dissertation PhD, Test, &Ampersand&amp.

            But exporting to HTML or viewing the resource in WIKINDX, I see no error:
            Altm&<>an, R. (2023). Thesis title. Unpublished Dissertation PhD, Test, &Ampersand&.
            source:
            Altm&amp;&lt;&gt;an, R. (2023). <em>Thesis title</em>. Unpublished Dissertation PhD, Test, &amp;Ampersand&amp;.

            What bib style are you using?

             
            • Stéphane Aulery

              For this test APA. I did other tests. This is also true when the last character is < For example.

               
              • Mark Grimshaw

                Mark Grimshaw - 2024-01-27

                Strange. I'm using APA too. But duplicated it with magazine article and the endPage field. Seems it might not be on all fields. I'll look into it but it might be a day or two.

                 
                • Stéphane Aulery

                  This is where the fuzzer shows its usefulness. I couldn't have created by hand 100 resources to find this bug.

                  I have the impression that there are also problems with the BiBTeX export. But let's go step by step so as not to break anything. Except for good reason uses \HTML\escape() instead of htmlspecialchars().

                   
                  • Stéphane Aulery

                    I also find the same error with data if field ID no. (ISBN etc.) and Publisher. Same something when an HTML entity is an the end of the formted text and not attached to a particular field.

                     
  • Mark Grimshaw

    Mark Grimshaw - 2024-01-28

    I've found another issue.

    First, ensure that hyperlinking the resource in a list to a view of that list is enabled in My Wikindx|Resources (with it off, there is no error).

    Then view your resource (the one where an HTML element is the last character) in
    1. A list (e.g., the FRONT or a basket, etc.)—the error appears.
    2. In single view—no error.

    However, I have just committed a fix that solves this issue and the HTML and RTF export.

    On a related issue: if an HTML entity is added to a resource field (e.g. resourcepagesPageEnd) as &amp;, it is stored like that in the database. That is, without change. If the resource is edited, the value then appears as '&' and is then stored as that in the database.

    I assume we don't want this and that the field value should always be stored as originally entered (notwithstanding what tinyMCE does). Correct?

    Mark

     
    • Stéphane Aulery

      On a related issue: if an HTML entity is added to a resource field (e.g. resourcepagesPageEnd) as &, it is stored like that in the database. That is, without change. If the resource is edited, the value then appears as '&' and is then stored as that in the database.

      First TinyMCE do the right thing. If a non encoded & is loaded inside TinyMCE, it return it encoded. So there are no problem with TinyMCE and this case is imaginary.

      There are six cases:

      a) HTML field receives HTML well encoded content = OK
      b) HTML field receives HTML misencoded content = KO
      c) HTML field receives Text content without encodage characters = OK

      d) Text field receives HTML well content = OK
      e) Text field receives HTML misencoded content = KO
      f) Text field receives Text content = OK

      We write the software for the cases a), c) and f), and escape content on output following this scheme.

      b) and e) are pathological. Content like this may be inserted from the wrong source or by hand and it's almost impossible to avoid it when it comes from outside. We are not dealing with this case. It’s up to us at least not to generate it internally.

      d) This case is legal. At the output the data will be displayed as raw text because it has been escaped.

      When the content comes from an import or a call via an API (we don't have the latter case but we can be spammed by robots) it's not the same thing.

      Each import has its own format and I don't see that the titles in other software are HTML. So they must be escaped on a case by case basis.

      For the output, putting HTML into HTML or XML does not require anything special because the HTML fragment is presumed to be well escaped, otherwise the content must be escaped.

       
    • Stéphane Aulery

      You solution is wrong. You should not mix in \HTML\escapeHTMLTemp() escaping code and logic. Use \HTML\escape() and keep the logic in the formatter.

      Atfer you fix there are double escaping. You have to undo what you did. Tell me what you want to do and I will try to help you. Otherwise it's a weird case that we can keep for after the release.

       
      • Mark Grimshaw

        Mark Grimshaw - 2024-01-28

        hmmm. \BIB\cleanItems() line 926 is the point where it all happens. That if{} block deals with adding the ultimate field from the template to the rest of the reference and ensuring there is no double punctuation. The ultimate field is a special field that comprises only punctuation and can only be at the end of the reference.

        At the same time, if someone has written &amp; or similar at the end of the penultimate field then the ';' of that character should not count as punctuation and thus be removed. In other words, any HTML entities of &...; form that are at the end of the penultimate field must be detected and preserved when checking for double punctuation.

        I don't understand how your example comes to be because the &amp;&apos; is not at the end of the reference.

         
        • Stéphane Aulery

          In escapeHTMLTemp(), the pattern sjould end with $ to detect only an entity an the end of the string.

          I don't understand how your example comes to be because the &' is not at the end of the reference.

          Maybe it's an other case but I lost the data and generated an ohter batch of 1000 resources. I have an other exemple where the field end with "<,". The coma is removed and the ; lost.

          This is why I am doubtful of the solution using a regexp. Code that removes excess punctuation should also detect entities with a comparison of the last characters and not a substitution.

           
          • Mark Grimshaw

            Mark Grimshaw - 2024-01-28

            Hah! I did have the '$' but didn't realise I'd deleted it. But should it be '$' rather than '$'?

            I don't think it makes any difference, but, in cleanItems(), if I have the following print statements:

            print htmlspecialchars($itemArray[$lastItemKey]) . BR;
                            $lastItem = \HTML\escapeHTMLTemp(TRUE, $itemArray[$lastItemKey]);
            print htmlspecialchars($lastItem) . BR;
                            $$lastItem = preg_replace("/(.*)$lastItemPattern/Uus", '${1}', $lastItem);
            print htmlspecialchars($lastItem) . BR;
                            $itemArray[$lastItemKey] = \HTML\escapeHTMLTemp(FALSE, $lastItem);
            print htmlspecialchars($itemArray[$lastItemKey]);
            

            I get:

            &amp;&apos;2–&amp;22&amp;
            W!K!NDX_&amp;&apos;2–&amp;22&amp;_W!K!NDX
            W!K!NDX_&amp;&apos;2–&amp;22&amp;_W!K!NDX
            &amp;&apos;2–&amp;22&amp; 
            

            In other words, the '$' is not anchoring but greedily taking the entire string.

             
            • Stéphane Aulery

              My last commits adds it. This prevents the substitution from taking place in another place with unanticipated consequences.

              In other words, the '$' is not anchoring but greedily taking the entire string.

              it's why regexp is a foot gun. I fixed that with {1}.

               
1 2 > >> (Page 1 of 2)

Log in to post a comment.