
#88 HTML5 parsing problems - links without quotes

General
open-fixed
None
5
2020-07-18
2018-05-08
No

In HTML5 it is valid to write attribute values without quotes, like this:

<link rel=canonical href=https://example.com/directory/>  

The parser treats the trailing slash as the end of a self-closing tag and extracts the following URL:

https://example.com/directory

Expected behavior is that the parser extracts the following URL:

https://example.com/directory/

Currently this only works if the href value is written in quotes:

<link rel=canonical href="https://example.com/directory/">  
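For illustration, here is a minimal Java sketch of how the HTML5 tokenizer rules treat such a tag: an unquoted attribute value ends only at whitespace or `>`, so a trailing `/` stays part of the value. The class `StartTagAttributes` and its `parse` method are hypothetical names made up for this example. This is not Jericho's code or API, and it handles only a single, already isolated start tag without character references.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Jericho: reads the attributes of one start tag. */
final class StartTagAttributes {

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    /** Parses a single start tag such as {@code <link rel=canonical href=/a/>}. */
    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = tag.length();
        int i = 1; // skip the leading '<'
        // Skip the tag name.
        while (i < n && !isSpace(tag.charAt(i)) && "/>".indexOf(tag.charAt(i)) < 0) i++;
        while (i < n) {
            char c = tag.charAt(i);
            if (c == '>') break;
            // Between attributes, whitespace and '/' are skipped.
            if (isSpace(c) || c == '/') { i++; continue; }
            // Attribute name: runs until whitespace, '/', '>' or '='.
            int nameStart = i;
            while (i < n && !isSpace(tag.charAt(i)) && "/>=".indexOf(tag.charAt(i)) < 0) i++;
            String name = tag.substring(nameStart, i).toLowerCase(Locale.ROOT);
            while (i < n && isSpace(tag.charAt(i))) i++;
            String value = "";
            if (i < n && tag.charAt(i) == '=') {
                i++;
                while (i < n && isSpace(tag.charAt(i))) i++;
                if (i < n && (tag.charAt(i) == '"' || tag.charAt(i) == '\'')) {
                    // Quoted value: runs until the matching quote.
                    char quote = tag.charAt(i++);
                    int start = i;
                    while (i < n && tag.charAt(i) != quote) i++;
                    value = tag.substring(start, i);
                    if (i < n) i++; // consume the closing quote
                } else {
                    // Unquoted value: ends only at whitespace or '>', so '/' is kept.
                    int start = i;
                    while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                    value = tag.substring(start, i);
                }
            }
            attrs.putIfAbsent(name, value); // the first occurrence of a name wins
        }
        return attrs;
    }
}
```

Under these rules, calling `StartTagAttributes.parse` on either the unquoted or the quoted `<link>` tag above gives `https://example.com/directory/` for `href`. A `/` only acts as a self-closing marker when it appears outside an attribute value, for example after whitespace as in `<a href=/x />`.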

Discussion

  • Martin Jericho - 2018-05-08

    Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years.

    There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.

     
  • Martin Jericho - 2018-05-08

    Fixed in version 3.5.

    Until version 3.5 is officially released, the development version is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip

     
  • Tobias Schwarz - 2018-05-08

    Hi Martin. Looks great. Our test for this is green now. Thank you very much for the fast reaction.

     
  • Martin Jericho - 2018-05-09

    You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.

     
  • Tobias Schwarz - 2018-05-09

    Hi Martin!

    Our tests are not really in a format that we can share and that you could easily adapt. Our use case is kind of specialized. With Audisto we operate a service for technical website audits, and most of our internal test cases work by checking for reproducible results from crawls of an internal test environment. Most of our tests refer to our so-called hints, which often come with additional logic on top of parsing.

    If you want to create a better test suite, I suggest you start by looking at the W3C web-platform-tests and the test suites of Chromium and Firefox.

    Best regards
    Tobias

     

    Last edit: Tobias Schwarz 2018-05-09
  • Code Buddy - 2019-02-12

    Hi Martin - do you have a timeline in place for version 3.5?

     
  • Martin Jericho - 2019-02-12

    I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. It's still probably months rather than weeks away though, as I don't have a lot of spare time.

     
    • Code Buddy - 2019-02-12

      Thanks Martin!

       
  • Martin Jericho - 2020-07-18
    • status: unread --> open-fixed
     
