#28 -maxdepth does not load all 'inline' content

Status: open-fixed
Labels: Crawler (17)
Priority: 5
Updated: 2008-10-22
Created: 2008-07-20
Creator: Ger Hobbelt
Private: No

Take, for example, a run with -maxdepth=1 in which an HTML page at that level is grabbed. Pavuk will then grab 'inline content' (such as images, style sheets and JavaScript files) only ONE level BELOW that page 'to ensure that everything is loaded to show the page as it should', to very liberally paraphrase the manual.

Now assume one of those inline files is a JS file containing code that, say, [pre]loads a set of images for a rather dynamic site. Those image URLs will be detected, BUT they will be DISCARDED: they are 'inline', yet they sit two (2!) levels below the parent HTML page at the 'maxdepth' level.

This should be fixed by no longer assuming that the HTML parent is exactly one level up (level - 1); instead, look up the actual level of the HTML parent and check /that/ number against the maxdepth config parameter.
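
To make the difference concrete, here is a minimal sketch of the two checks. It is an illustration only, not pavuk code: the `sketch_url` record, its fields and all three function names are hypothetical, and pavuk's real `url` type and depth bookkeeping will differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified URL record (NOT pavuk's real `url` type). */
typedef struct sketch_url {
    const struct sketch_url *parent; /* document that referenced this URL; NULL for a start URL */
    int level;                       /* raw link distance from the start URL */
    bool is_inline;                  /* image, style sheet, script, ... */
} sketch_url;

/* Walk up through inline parents until the owning non-inline page is found,
 * and return that page's level. */
static int non_inlined_level(const sketch_url *u)
{
    while (u->is_inline && u->parent != NULL)
        u = u->parent;
    return u->level;
}

/* Behaviour described in this ticket: inline content simply gets one extra
 * level, which silently assumes its HTML parent is exactly one level up. */
static bool naive_allowed(const sketch_url *u, int maxdepth)
{
    int limit = u->is_inline ? maxdepth + 1 : maxdepth;
    return u->level <= limit;
}

/* Proposed behaviour: compare the level of the owning HTML page instead. */
static bool fixed_allowed(const sketch_url *u, int maxdepth)
{
    return non_inlined_level(u) <= maxdepth;
}
```

With a page at level 1, a script it references at level 2 and an image that script preloads at level 3, the naive check with maxdepth 1 accepts the script but rejects the image, while the parent-based check accepts both and still rejects an ordinary page at level 2.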

Discussion

  • Ger Hobbelt - 2008-07-20
    • assigned_to: nobody --> i_a
     
  • Ger Hobbelt - 2008-10-22

    Added a routine to determine the proper 'non-inline' level for comparison:

    int determine_non_inlined_url_level(url *urlp);

    The '-lmax' condition code uses it from now on.

    This should fix the issue.

     
  • Ger Hobbelt - 2008-10-22
    • status: open --> pending-fixed
     
  • Ger Hobbelt - 2008-10-22
    • labels: --> Crawler
    • status: pending-fixed --> open-fixed
     
