WARNING: URL does not contain text

Help
Miguel
2009-06-14
2013-04-27
  • Miguel

    Miguel - 2009-06-14

    Hi everyone,

    I'm trying to parse a very large number of web pages in order to extract the text, but some of them do not contain any, as the parser tells me by printing a message to the console. Is there a way to catch these messages when they are printed, and to handle those URLs properly?

    I create a parser from a well-formed URL like this:

    parser = new Parser(url);

    For some of those URLs, the message appears.

    Thanks in advance for your help!

     
    • Derrick Oswald

      Derrick Oswald - 2009-06-14

      You could try setting the Request header fields in the HTTP request to only accept text.
      See the documentation on org.htmlparser.http.ConnectionManager.setDefaultRequestProperties.

      However, this relies on the answering server being properly configured; otherwise you will still get servers sending image files marked as text, and so on.

      One way would be to prefetch the URL yourself and examine the MIME type, then just skip non-text pages.

      The best way is probably to extend the ConnectionMonitor interface to return a status indicating whether to continue or not. The check could then be done in-line by implementing the interface.

      I look forward to your code patch submission ;)
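
The prefetch idea from the reply above could look roughly like the following. This is only a sketch that uses plain java.net classes rather than anything from the HTML Parser library; the class name TextPageFilter, the helper names, and the 5-second timeouts are assumptions made for illustration. It sends a HEAD request, reads the Content-Type header, and treats anything outside text/* as a page to skip. Like the Accept-header approach, it still trusts the server to label its content correctly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper class, not part of the HTML Parser library.
public class TextPageFilter {

    /** Returns true when a Content-Type header value names a text/* media type. */
    public static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // no header at all: treat as non-text and skip
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    /**
     * Issues a HEAD request and reports whether the server labels the resource as text.
     * Assumes an http or https URL; other schemes would fail the cast below.
     */
    public static boolean looksLikeText(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        try {
            connection.setRequestMethod("HEAD"); // headers only, no body download
            connection.setConnectTimeout(5000);  // assumed timeout values
            connection.setReadTimeout(5000);
            return isTextContentType(connection.getContentType());
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        for (String url : args) {
            if (looksLikeText(url)) {
                System.out.println("parse: " + url); // e.g. create new Parser(url) here
            } else {
                System.out.println("skip:  " + url);
            }
        }
    }
}
```

Note that the text/* check is deliberately strict: a page served as application/xhtml+xml would be skipped, so the list of accepted types may need widening depending on the sites being crawled.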

       
