HTML5 parsing problems - links without quotes
In HTML5 it is valid to write the following:
<link rel=canonical href=https://example.com/directory/>
The parser handles this as a self-closing tag and extracts the following URL:
https://example.com/directory
Expected behavior is that the parser extracts the following URL:
https://example.com/directory/
This only works if you write the href in quotes:
<link rel=canonical href="https://example.com/directory/">
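For illustration, here is a minimal sketch of the scanning rule that produces the expected result. It is not Jericho's code. It assumes the HTML5 tokenizer rule that an unquoted attribute value ends only at whitespace or ">", so a "/" directly before ">" still belongs to the value. The function names are hypothetical, and the naive "href=" lookup handles unquoted values only.

```python
from typing import Optional, Tuple

# Characters that end an unquoted attribute value, besides '>'.
HTML_WHITESPACE = " \t\n\f\r"


def read_unquoted_attribute_value(tag_text: str, start: int) -> Tuple[str, int]:
    """Scan an unquoted attribute value beginning at index `start`.

    The value runs until whitespace or '>'. A '/' is treated as an
    ordinary value character, even when it is the last one before '>'.
    Returns the value and the index just past it.
    """
    end = start
    while (
        end < len(tag_text)
        and tag_text[end] not in HTML_WHITESPACE
        and tag_text[end] != ">"
    ):
        end += 1
    return tag_text[start:end], end


def extract_unquoted_href(tag_text: str) -> Optional[str]:
    """Hypothetical helper: return the unquoted href value, or None if absent."""
    marker = "href="
    position = tag_text.find(marker)
    if position == -1:
        return None
    value, _ = read_unquoted_attribute_value(tag_text, position + len(marker))
    return value


tag = "<link rel=canonical href=https://example.com/directory/>"
print(extract_unquoted_href(tag))  # prints https://example.com/directory/
```

Under this rule the trailing slash is kept, which is the expected behavior described above.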
Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years.
There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.
Fixed in version 3.5.
Until version 3.5 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip
Hi Martin. Looks great. Our test for this is green now. Thank you very much for the fast reaction.
You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.
Hi Martin!
Our tests are not really in a format that we could share or that you could easily adapt. Our use case is rather specialized. With Audisto we operate a service for technical website audits, and most of our internal test cases check for reproducible results when crawling an internal test environment. Most of our tests refer to our so-called hints, which often come with additional logic on top of parsing.
If you want to create a better test suite, I suggest you start by looking at the W3C web platform tests and the test suites of Chromium and Firefox.
Best regards
Tobias
Last edit: Tobias Schwarz 2018-05-09
Hi Martin - do you have a timeline in place for version 3.5?
I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. It's still probably months rather than weeks away though, as I don't have a lot of spare time.
Thanks Martin!