the results of Google, for example...
it prints nonsense like
h1 style=margin etc. Has anyone used this tag?
I need to extract h1 to h6... and it's pretty difficult :P
It's just another CompositeTag.
Are you asking for getPlainTextString()?
Try getHtml().
The actual heading should be a TextNode in the list of children - see getChildren().
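For what it's worth, heading text can also be pulled out without the parser at all. This is a library-free sketch using java.util.regex (all names here are made up, not from htmlparser); it ignores comments and badly nested markup, which the real parser handles properly, so treat it as a fallback:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingExtractor {
    // Matches <h1>..</h1> through <h6>..</h6>, tolerating attributes
    // like the style="margin..." that showed up in the Google results.
    // The \1 backreference makes the closing tag level match the opening one.
    private static final Pattern HEADING =
        Pattern.compile("<h([1-6])[^>]*>(.*?)</h\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> headings = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            // Strip any tags nested inside the heading, e.g. <b>..</b>.
            headings.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return headings;
    }
}
```

For example, extract("<h1 style=\"margin:0\">Results</h1>") yields a list containing just "Results".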
I used getString() I think... but getPlainTextString() works a little better. Still some rubbish in it, though.
I have another, much bigger problem: I'm building a webcrawler, and it runs out of memory.
When I compile I get the following warning:
setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated
I get this 3 times.
Here is the code. Can someone please say why it eats all the memory and then dies? It works for one site, but after it has done roughly 100 sites from a pool it starts giving errors about heap memory.
http://rafb.net/paste/results/9WmmA153.html
So could this have to do with the
setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated
warnings??
I can't see where your problem is since not all of the source is included, but you're probably hanging on to one or more nodes, which hang on to the page, which hangs on to the source, which keeps the memory in use...
Hanging on? This is the complete Spider class I made:
http://rafb.net/paste/results/xfZ6Wh77.html
It is run by ~10 threads from a pool, each thread doing at most 20 pages. If the program is run the 'normal' way it starts spitting 'thread-23 is out of heap memory'... if I run it with -server or with a bigger heap size it starts using the swap file, up to 1 GB...
I don't really see what you mean by "hangs" on to nodes.
thanks,
peter
By 'hanging on' I mean 'holds a reference to'.
The garbage collector can't reclaim something if there is another (static) object (indirectly) pointing at it.
My guess is that it has to do with your declaration:
private NodeList Nlist;
This shouldn't be a class member since it's only used in processURL().
Another possibility is that your inner tag classes are holding on to the Spider instance, but I don't think this should be a problem. You might want to make them separate classes if the first suggestion doesn't help.
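A minimal sketch of the first suggestion, with hypothetical names standing in for the real Spider code: keep the parse results in a local variable so they become unreachable, and collectible, as soon as processURL() returns, instead of staying pinned to the Spider instance until the next parse:

```java
import java.util.ArrayList;
import java.util.List;

public class LeanSpider {
    // Before: a field like `private NodeList Nlist;` keeps the last page's
    // whole node tree (and the page source it references) reachable for
    // the entire life of the spider.
    // After: keep the list local to the only method that uses it.
    public int processURL(String url) {
        List<String> nodes = parse(url);   // local stand-in for the NodeList
        for (String node : nodes) {
            handle(node);
        }
        return nodes.size();               // `nodes` is garbage once we return
    }

    // Hypothetical stand-ins for htmlparser's parsing and the link handling.
    private List<String> parse(String url) {
        List<String> nodes = new ArrayList<>();
        nodes.add("<html>");
        nodes.add("<a href=\"" + url + "\">");
        return nodes;
    }

    private void handle(String node) { /* follow links, index text, ... */ }
}
```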
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedWriter.<init>(BufferedWriter.java:87)
at java.io.BufferedWriter.<init>(BufferedWriter.java:70)
at java.io.PrintStream.init(PrintStream.java:83)
at java.io.PrintStream.<init>(PrintStream.java:125)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:371)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:481)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:214)
at sun.net.www.http.HttpClient.New(HttpClient.java:287)
at sun.net.www.http.HttpClient.New(HttpClient.java:299)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:785)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:726)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:651)
at Ilocal.checkLink(Ilocal.java:104)
at Ilocal.spiderFoundURL(Ilocal.java:114)
at Spider.handleLink(Spider.java:343)
at Spider$LocalLinkTag.doSemanticAction(Spider.java:233)
at org.htmlparser.scanners.CompositeTagScanner.finishTag(CompositeTagScanner.java:305)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:257)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at Spider.processURL(Spider.java:175)
at Spider.begin(Spider.java:330)
at Spider.run(Spider.java:98)
at WorkerThread.run(ThreadPool.java:133)
It wasn't the private NodeList Nlist;
and I don't know how to separate the classes in this case... they need functions from the Spider class and are already extending another class.
Well, I separated the inner classes like this:
http://rafb.net/paste/results/d0hbDL22.html
only printing out the links...
but the memory problem is still there. It can't garbage-collect the Spider instance.
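For reference, a non-static inner class carries a hidden reference to its enclosing instance whenever it touches the outer object's state, which is what can keep a Spider reachable; a static nested class does not. A minimal illustration (the names are made up, not from the Spider code):

```java
public class Outer {
    private int id = 42;

    // Non-static inner class: because outerId() reads the enclosing `id`,
    // the compiler adds a hidden `this$0` field pointing at the Outer
    // instance, keeping it alive as long as the Inner object lives.
    class Inner {
        int outerId() { return id; }
    }

    // Static nested class: no hidden reference to Outer, so it can never
    // keep an Outer instance reachable on its own.
    static class Nested {
        int value() { return 7; }
    }
}
```

One way to see the hidden field is reflection: Outer.Inner.class.getDeclaredFields() reports the synthetic this$0, while Outer.Nested.class.getDeclaredFields() is empty. Moving the tag classes out of Spider entirely, or making them static nested classes and passing in just the data they need, breaks that reference chain.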