
HeadingTag does not work..

Help
Ptr_v
2006-05-08
2013-04-27
  • Ptr_v

    Ptr_v - 2006-05-08

    Take Google's results pages, for example..

    it prints nonsense like

    h1 style=margine  etc.  Has anyone used this tag?

    I need to extract h1 to h6... and it's pretty difficult :P

     
    • Derrick Oswald

      Derrick Oswald - 2006-05-09

      It's just another CompositeTag.
      Are you asking for getPlainTextString() ?
      Try getHtml().

      The actual heading should be a TextNode in the list of children - see getChildren ().
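To illustrate the point about children, here is a minimal stand-in model in plain Java. It does not use the HTMLParser classes: `Node`, `Text`, `Composite` and `plainText()` are hypothetical names invented for this sketch. It assumes a composite tag keeps its raw tag contents (the `h1 style=...` part) separately from its child nodes, so the heading text has to be collected from the children.

```java
import java.util.ArrayList;
import java.util.List;

public class HeadingModel {
    /** Hypothetical node type; not the HTMLParser Node interface. */
    interface Node {
        String plainText();
    }

    /** A run of text between tags. */
    static class Text implements Node {
        final String text;

        Text(String text) {
            this.text = text;
        }

        public String plainText() {
            return text;
        }
    }

    /** A tag with children, e.g. a heading that wraps text and nested tags. */
    static class Composite implements Node {
        final String rawTag; // what sits between '<' and '>', attributes included
        final List<Node> children = new ArrayList<>();

        Composite(String rawTag, Node... kids) {
            this.rawTag = rawTag;
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        /** Concatenates the text of all descendants and ignores the raw tag. */
        public String plainText() {
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                sb.append(child.plainText());
            }
            return sb.toString();
        }
    }

    /** Models <h1 style="margin:0">Search <b>results</b></h1>. */
    static Composite example() {
        return new Composite("h1 style=\"margin:0\"",
                new Text("Search "),
                new Composite("b", new Text("results")));
    }

    public static void main(String[] args) {
        Composite h1 = example();
        System.out.println("raw tag:    " + h1.rawTag);
        System.out.println("plain text: " + h1.plainText());
    }
}
```

Printing the raw tag gives the kind of "h1 style=..." output described in the question, while walking the children gives only the heading text.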

       
    • Ptr_v

      Ptr_v - 2006-05-09

      I used getString(), I think.. but getPlainTextString() works a little better.. there is still some rubbish in it, though.

      I have another, much bigger problem. I'm building a web crawler, and it runs out of memory.

      When I compile, I get the following warning:

      setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated

      I get this 3 times..

      Here is the code. Can someone please say why it eats all the memory and then dies? It works for 1 site, but after it has done roughly 100 sites from a pool, it starts giving errors about heap memory.

      http://rafb.net/paste/results/9WmmA153.html

      Could this have something to do with the

      setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated

      warnings?

       
      • Derrick Oswald

        Derrick Oswald - 2006-05-09

        I can't see where your problem is since not all the source is included, but you're probably hanging on to one or more nodes, which hang onto the page, which hang onto the source, which keeps the memory...

         
    • Ptr_v

      Ptr_v - 2006-05-09

      Hanging on? This is the complete spider class I made:

      http://rafb.net/paste/results/xfZ6Wh77.html

      It is run by ~10 threads from a pool, each thread doing at most 20 pages. If the program is run the 'normal' way, it starts spitting out 'thread-23 is out of heap memory'.. If I run it with -server or give it a bigger heap, it starts using the swap file until it reaches 1 GB...

      I don't really see what you mean by "hanging on" to nodes.

      Thanks,

      Peter

       
      • Derrick Oswald

        Derrick Oswald - 2006-05-10

        By 'hanging on' I mean 'holds a reference to'.
        The garbage collector can't reclaim something if there is another (static) object (indirectly) pointing at it.

        My guess is that it has to do with your declaration:
          private NodeList Nlist;
        This shouldn't be a class member since it's only used in processURL().

        Another possibility is that your inner class tags are holding on to the Spider instance, but I don't think this should be a problem. You might want to make them separate classes if the first suggestion doesn't help.
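Both suggestions above can be sketched in plain Java without the parser library. Everything here is a hypothetical stand-in (`ParsedPage`, `FieldSpider`, `LocalSpider`, `Outer`, `InnerTag`); none of it is taken from the pasted Spider class. The first pair of classes contrasts a result stored in a field with one kept in a local variable. The `Outer`/`InnerTag` pair uses reflection to show that a non-static inner class which calls a method of its enclosing instance carries a hidden field pointing back at that instance.

```java
import java.lang.reflect.Field;

public class ReferenceDemo {
    /** Hypothetical stand-in for a parsed page and the source text it keeps alive. */
    static class ParsedPage {
        final byte[] source = new byte[64 * 1024];
    }

    /** Variant 1: the result is stored in a field, so it stays reachable as long as the spider does. */
    static class FieldSpider {
        private ParsedPage lastPage;

        int process() {
            lastPage = new ParsedPage();
            return lastPage.source.length;
        }

        boolean stillHoldsPage() {
            return lastPage != null;
        }
    }

    /** Variant 2: the result is a local, so nothing refers to it once process() returns. */
    static class LocalSpider {
        int process() {
            ParsedPage page = new ParsedPage();
            return page.source.length;
        }
    }

    /** An enclosing class with a non-static inner class that calls back into it. */
    static class Outer {
        int handled;

        void handleLink(String url) {
            handled++;
        }

        class InnerTag {
            void found(String url) {
                handleLink(url); // goes through the hidden reference to the enclosing Outer
            }
        }
    }

    /** Reports whether the given class declares a field of type Outer (compiler-generated ones included). */
    static boolean holdsOuter(Class<?> type) {
        for (Field field : type.getDeclaredFields()) {
            if (field.getType() == Outer.class) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FieldSpider fieldSpider = new FieldSpider();
        fieldSpider.process();
        System.out.println("field variant still holds page: " + fieldSpider.stillHoldsPage());

        new LocalSpider().process();
        System.out.println("inner tag holds its spider: " + holdsOuter(Outer.InnerTag.class));
    }
}
```

Neither pattern is a leak on its own. It only matters when the object holding the reference (the spider, or a tag registered with a long-lived parser) outlives the pages it has processed.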

         
    • Ptr_v

      Ptr_v - 2006-05-10

      Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
          at java.io.BufferedWriter.<init>(BufferedWriter.java:87)
          at java.io.BufferedWriter.<init>(BufferedWriter.java:70)
          at java.io.PrintStream.init(PrintStream.java:83)
          at java.io.PrintStream.<init>(PrintStream.java:125)
          at sun.net.www.http.HttpClient.openServer(HttpClient.java:371)
          at sun.net.www.http.HttpClient.openServer(HttpClient.java:481)
          at sun.net.www.http.HttpClient.<init>(HttpClient.java:214)
          at sun.net.www.http.HttpClient.New(HttpClient.java:287)
          at sun.net.www.http.HttpClient.New(HttpClient.java:299)
          at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:785)
          at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:726)
          at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:651)
          at Ilocal.checkLink(Ilocal.java:104)
          at Ilocal.spiderFoundURL(Ilocal.java:114)
          at Spider.handleLink(Spider.java:343)
          at Spider$LocalLinkTag.doSemanticAction(Spider.java:233)
          at org.htmlparser.scanners.CompositeTagScanner.finishTag(CompositeTagScanner.java:305)
          at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:257)
          at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
          at Spider.processURL(Spider.java:175)
          at Spider.begin(Spider.java:330)
          at Spider.run(Spider.java:98)
          at WorkerThread.run(ThreadPool.java:133)

       
    • Ptr_v

      Ptr_v - 2006-05-12

      It wasn't the private NodeList Nlist;

      And I don't know how to separate the classes in this case.. they need functions from the Spider class and already extend another class.
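One common way to separate a class that already extends something and still needs a function from Spider is to hand it a small callback interface. The sketch below is an assumption about how that could look, not the poster's code: `LinkHandler` and `BaseLinkTag` are hypothetical names, and `BaseLinkTag` merely stands in for whatever library tag class the custom tag extends.

```java
import java.util.ArrayList;
import java.util.List;

public class SeparateTagDemo {
    /** Hypothetical callback: the only piece of Spider the tag actually needs. */
    interface LinkHandler {
        void handleLink(String url);
    }

    /** Stand-in for the library tag class that the custom tag already extends. */
    static class BaseLinkTag {
        private String link;

        void setLink(String link) {
            this.link = link;
        }

        String getLink() {
            return link;
        }

        void doSemanticAction() {
            // the base class does nothing here
        }
    }

    /** Formerly an inner class; now it only knows about the callback it was given. */
    static class LocalLinkTag extends BaseLinkTag {
        private final LinkHandler handler;

        LocalLinkTag(LinkHandler handler) {
            this.handler = handler;
        }

        @Override
        void doSemanticAction() {
            handler.handleLink(getLink());
        }
    }

    /** The spider implements the callback and collects the links it is told about. */
    static class Spider implements LinkHandler {
        final List<String> found = new ArrayList<>();

        @Override
        public void handleLink(String url) {
            found.add(url);
        }
    }

    public static void main(String[] args) {
        Spider spider = new Spider();
        LocalLinkTag tag = new LocalLinkTag(spider);
        tag.setLink("http://example.com/");
        tag.doSemanticAction();
        System.out.println(spider.found);
    }
}
```

If the Spider itself is passed as the handler, the tag still refers to it, just explicitly. The benefit is that something smaller, such as a queue of URLs, can be passed instead.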

       
    • Ptr_v

      Ptr_v - 2006-05-12

      Well, I separated the inner classes like this:

      http://rafb.net/paste/results/d0hbDL22.html

      For now they only print out the links..

      But the memory problem is still there. It can't garbage collect the Spider instance.

       
