Hi, while working on an issue very similar to bug #17 (just with two different characters in an XML file), I was working with the function isHTMLCharacter() in src/vsmime.c and wanted to ask:
What clamsap does in getByteType() (in this particular case) is basically check whether the character is printable or, if it isn't, whether it is accepted by a whitelist of bytes in the isHTMLCharacter() function. Both #17 and my case use multi-byte UTF-8 characters, and clamsap seems to iterate over them byte by byte.
In my case, the two characters are '•' (a bullet point) and '‘' (opening single quote), represented as E2 80 A2 and E2 80 98, respectively. Unfortunately, an update to 104.1 (from our 101.9) to use/test the wildcards suggested by Markus in #17 is not possible, so I tried adding the two offending bytes (0xA2 and 0x98; the others are already whitelisted) to the isHTMLCharacter() function, which solved the issue.
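To illustrate the behaviour described above, here is a minimal sketch of a byte-by-byte check against a whitelist. It is not the actual clamsap code: is_whitelisted_byte() and first_rejected_byte() are hypothetical names, and the whitelist contains only the four bytes mentioned in this ticket as already accepted (0xE2, 0x80, 0xA1, 0x97).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a byte whitelist. This is NOT the real
 * isHTMLCharacter(); it lists only bytes named in this ticket. */
static int is_whitelisted_byte(unsigned char b)
{
    switch (b) {
    case 0xE2: case 0x80: case 0xA1: case 0x97:
        return 1;
    default:
        return 0;
    }
}

/* Walk the buffer one byte at a time, as the ticket describes.
 * Returns the index of the first rejected byte, or -1 if all pass. */
static int first_rejected_byte(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        if (b < 0x80 && (isprint(b) || isspace(b)))
            continue;               /* printable ASCII or whitespace */
        if (is_whitelisted_byte(b))
            continue;               /* explicitly allowed byte */
        return (int)i;              /* neither printable nor whitelisted */
    }
    return -1;
}
```

With this sketch, E2 80 A2 and E2 80 98 are both rejected at index 2: the lead byte and first continuation byte pass, but the last continuation byte (0xA2 or 0x98) is not in the list, even though the sequence as a whole is valid UTF-8.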
Now my question is: what determines the contents of this function? Does it follow some standard (I was unable to find any)? I would think so, given that the byte list has not been modified in about 7 years, yet the function already contains bytes right next to these (e.g. 0xA1 and 0x97). Or was it maybe created ad hoc, without any backing document?
BTW, this was a legitimate xlsx document that was renamed to a .xml file.
In the long run, it would be nice to fully cover the UTF-8 standard.
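As a rough idea of what fuller UTF-8 coverage could look like, the sketch below validates whole sequences instead of whitelisting individual bytes. This is only an illustration under my own assumptions, not a proposed patch: utf8_sequence_length() and is_valid_utf8() are hypothetical helpers that do not exist in clamsap.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1..4) of the well-formed
 * UTF-8 sequence starting at p, or 0 if the sequence is invalid. */
static size_t utf8_sequence_length(const unsigned char *p, size_t remaining)
{
    unsigned long cp, min_cp;
    size_t need;

    if (remaining == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { need = 2; cp = p[0] & 0x1F; min_cp = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; min_cp = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; min_cp = 0x10000; }
    else
        return 0;                                   /* stray continuation or invalid lead byte */

    if (remaining < need)
        return 0;                                   /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_cp)
        return 0;                                   /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                   /* out of range or surrogate */
    return need;
}

/* Hypothetical helper: returns 1 if the whole buffer is well-formed UTF-8. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_sequence_length(buf + i, len - i);
        if (n == 0)
            return 0;
        i += n;
    }
    return 1;
}
```

Under this approach both E2 80 A2 (U+2022) and E2 80 98 (U+2018) are accepted as complete sequences, while a lone 0xA2 or a truncated E2 80 is rejected.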
Hi, I don't know whether this bug mixes several questions. Do you mean that you only renamed an Excel file to .xml, or did you convert an Excel file to .xml via export? I ask because if you simply use the xlsx file, it is internally a ZIP package, and because of the misleading extension the MIME detection will fail, and should fail. I tried this: internally the content is detected as Office via libmagic, but the file extension .xml is not compatible with Office, and therefore the check fails.
If you exported your content from Excel to .xml, then Excel uses plain XML syntax and this should work. So if in doubt, please add an example to this ticket.
Hi,
yes, the internal function is only valid for ASCII, so I will re-use code from file (magic) to detect the encoding and handle both ASCII and UTF-8.
kind regards
Hi, in general I would like to know whether the XML detection issue you mentioned at the beginning is now solved with library version 104.3, because there I enhanced the check to use libmagic. The internal function isHTMLCharacter was a shortcut before libmagic was used, and it still is a shortcut to check whether the syntax is XML with only ASCII characters. If any non-ASCII character is found, the libmagic check is now done, but the final decision should be application/xml regardless of the encoding of the content.
kind regards,
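The decision flow described in the comment above could be sketched roughly as follows. This is an assumption-laden illustration, not the 104.3 code: detect_mime(), is_ascii_only(), looks_like_xml() and stub_fallback() are hypothetical names, the "<?xml" prefix test is a simplified placeholder for the real syntax check, and the callback merely stands in for the libmagic-based check so that the sketch compiles without libmagic.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type standing in for the libmagic-based check. */
typedef const char *(*mime_fallback_fn)(const unsigned char *buf, size_t len);

/* Returns 1 if no byte has the high bit set. */
static int is_ascii_only(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return 0;
    return 1;
}

/* Simplified placeholder for the XML syntax shortcut (an assumption). */
static int looks_like_xml(const unsigned char *buf, size_t len)
{
    static const char prolog[] = "<?xml";
    size_t n = sizeof(prolog) - 1;
    return len >= n && memcmp(buf, prolog, n) == 0;
}

/* ASCII-only XML takes the shortcut; everything else goes to the fallback.
 * Sending ASCII content that fails the shortcut to the fallback is also
 * an assumption of this sketch. */
static const char *detect_mime(const unsigned char *buf, size_t len,
                               mime_fallback_fn fallback)
{
    if (is_ascii_only(buf, len) && looks_like_xml(buf, len))
        return "application/xml";
    return fallback ? fallback(buf, len) : "application/octet-stream";
}

/* Stub used only for illustration; a real fallback would query libmagic. */
static int fallback_calls = 0;
static const char *stub_fallback(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    fallback_calls++;
    return "application/xml";
}
```

In this sketch the result is application/xml in both cases; the only difference is which path produced it, which matches the stated goal that the final decision should not depend on the encoding.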
Hi, I noticed your patch earlier today, built a clamsap package with it and sent it for testing (unfortunately I do not have a reproducing machine at hand). I will update here once I receive feedback. Thanks.
OK, thanks for the response. If you need a package for RedHat or older SUSE OSes, you can use https://sourceforge.net/projects/clamsap/files/RPM/clamsap-0.104.3-1.x86_64.rpm/download, or if you tell me your architecture, I can check whether I am able to provide a test library for you.
I have received positive feedback on the patch for the original case (the bullet point and opening single quote) and have sent the patched packages to another reporter with different characters, but looking at the patch I assume the feedback will be positive there too. Thanks!
Thank you for the answer. I will close this bug.