Thread: [Htmlparser-user] set URL for relative links
Brought to you by:
derrickoswald
From: jpdogg <jp...@gm...> - 2006-09-11 16:59:01
|
Hello,

I've cached some HTML pages in local files and would like to tell the Parser object what the original URLs were so that it can correctly interpret relative links.

As a simple example, say I do this:

Parser my_parser = new Parser("<html><img src='foo.jpg'></html>");

If I construct a filter to give me all of the ImageTags in this simple document, I get one. Unfortunately, it has the URL foo.jpg. If I know that this file was originally located at http://www.bar.com/foo.html, how do I inform the parser module? I want it to be able to report that the above image is located at http://www.bar.com/foo.jpg.

Thanks!
Jeff |
From: Garry H. <ga...@gm...> - 2006-09-11 17:07:28
|
Did you try my_parser.setURL("http://www.bar.com/"); ?

Just a thought.

Cheers,
Garry

> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
From: Jeffrey B. <jb...@cs...> - 2006-09-11 18:21:05
|
On 9/11/06, Garry Huang <ga...@gm...> wrote:
> Did you try my_parser.setURL("http://www.bar.com/"); ?

Yeah, I tried that.

If it's inserted before I call extractAllNodesThatMatch(img_filter); then http://www.bar.com is downloaded. If it's called after then the relative links aren't fixed.

It's possible that there's something subtle with the ordering that I could change, but I couldn't get it to work and it seems like it would be a hack...

Thanks for the suggestion though.

-Jeff |
From: Derrick O. <Der...@Ro...> - 2006-09-11 22:18:46
|
I believe you need to use setBaseUrl on the Page object.

parser.getLexer ().getPage ().setBaseUrl ("http://www.bar.com"); |
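For readers unsure what a base URL changes, here is a minimal sketch of relative-link resolution. It is an illustration only: it uses java.net.URI from the JDK rather than HTMLParser, and the class name BaseUrlSketch and the helper resolve are hypothetical names made up for this example. It assumes the base is the page's original address from the first message (http://www.bar.com/foo.html) and shows the result Jeff was asking for.

```java
import java.net.URI;

// Illustration only: plain JDK code, not HTMLParser's implementation.
// BaseUrlSketch and resolve() are hypothetical names for this sketch.
class BaseUrlSketch {

    // Resolve a link found in a cached page against the URL the page
    // originally came from. Absolute links are returned unchanged.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Original location of the cached page, taken from the question.
        String base = "http://www.bar.com/foo.html";

        // Relative to the page's directory.
        System.out.println(resolve(base, "foo.jpg"));
        // Relative to the site root.
        System.out.println(resolve(base, "/img/logo.png"));
        // Already absolute, so the base is ignored.
        System.out.println(resolve(base, "http://other.example/x.gif"));
    }
}
```

The first call prints http://www.bar.com/foo.jpg, which is the address the original question wanted reported for the image. Setting the base URL on the Page, as suggested above, gives the parser the same piece of information without making it fetch anything.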
From: Jeffrey B. <jb...@cs...> - 2006-09-11 22:28:08
|
Thanks Derrick,

Your suggestion worked perfectly!

-Jeff |