the results of Google, for example...
it prints nonsense like
h1 style=margin etc. Has anyone used this tag?
I need to extract h1 to h6... and it's pretty difficult :P
It's just another CompositeTag.
Are you asking for getPlainTextString()?
Try getHtml().
The actual heading should be a TextNode in the list of children - see getChildren().
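For what it's worth, heading text can also be pulled out without the parser at all. This is a library-free sketch using java.util.regex (all names here are made up, not from htmlparser); it ignores comments and badly nested markup, which the real parser handles properly, so treat it as a fallback:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingExtractor {
    // Matches <h1>..</h1> through <h6>..</h6>, tolerating attributes
    // like the style="margin..." that showed up in the Google results.
    // The \1 backreference makes the closing tag level match the opening one.
    private static final Pattern HEADING =
        Pattern.compile("<h([1-6])[^>]*>(.*?)</h\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> headings = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            // Strip any tags nested inside the heading, e.g. <b>..</b>.
            headings.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return headings;
    }
}
```

For example, extract("<h1 style=\"margin:0\">Results</h1>") yields a list containing just "Results".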
I used getString() I think... but getPlainTextString() works a little better. Still some rubbish in it, though.
I have another, much bigger problem: I'm building a webcrawler, and it runs out of memory.
When I compile I get the following warning:
setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated
I get this 3 times.
Here is the code. Can someone please say why it eats all the memory and then dies? It works for one site, but after it has done roughly 100 sites from a pool it starts giving errors about heap memory.
http://rafb.net/paste/results/9WmmA153.html
So could this have to do with the
setAttributes(java.util.Hashtable) in org.htmlparser.Tag has been deprecated
warnings??
I can't see where your problem is since not all of the source is included, but you're probably hanging on to one or more nodes, which hang on to the page, which hangs on to the source, which keeps the memory in use...
Hanging on? This is the complete Spider class I made:
http://rafb.net/paste/results/xfZ6Wh77.html
It is run by ~10 threads from a pool, each thread doing at most 20 pages. If the program is run the 'normal' way it starts spitting 'thread-23 is out of heap memory'... if I run it with -server or with a bigger heap size it starts using the swap file, up to 1 GB...
I don't really see what you mean by "hangs" on to nodes.
thanks,
peter
By 'hanging on' I mean 'holds a reference to'.
The garbage collector can't reclaim something if there is another (static) object (indirectly) pointing at it.
My guess is that it has to do with your declaration:
private NodeList Nlist;
This shouldn't be a class member since it's only used in processURL().
Another possibility is that your inner tag classes are holding on to the Spider instance, but I don't think this should be a problem. You might want to make them separate classes if the first suggestion doesn't help.
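A minimal sketch of the first suggestion, with hypothetical names standing in for the real Spider code: keep the parse results in a local variable so they become unreachable, and collectible, as soon as processURL() returns, instead of staying pinned to the Spider instance until the next parse:

```java
import java.util.ArrayList;
import java.util.List;

public class LeanSpider {
    // Before: a field like `private NodeList Nlist;` keeps the last page's
    // whole node tree (and the page source it references) reachable for
    // the entire life of the spider.
    // After: keep the list local to the only method that uses it.
    public int processURL(String url) {
        List<String> nodes = parse(url);   // local stand-in for the NodeList
        for (String node : nodes) {
            handle(node);
        }
        return nodes.size();               // `nodes` is garbage once we return
    }

    // Hypothetical stand-ins for htmlparser's parsing and the link handling.
    private List<String> parse(String url) {
        List<String> nodes = new ArrayList<>();
        nodes.add("<html>");
        nodes.add("<a href=\"" + url + "\">");
        return nodes;
    }

    private void handle(String node) { /* follow links, index text, ... */ }
}
```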
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedWriter.<init>(BufferedWriter.java:87)
at java.io.BufferedWriter.<init>(BufferedWriter.java:70)
at java.io.PrintStream.init(PrintStream.java:83)
at java.io.PrintStream.<init>(PrintStream.java:125)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:371)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:481)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:214)
at sun.net.www.http.HttpClient.New(HttpClient.java:287)
at sun.net.www.http.HttpClient.New(HttpClient.java:299)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:785)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:726)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:651)
at Ilocal.checkLink(Ilocal.java:104)
at Ilocal.spiderFoundURL(Ilocal.java:114)
at Spider.handleLink(Spider.java:343)
at Spider$LocalLinkTag.doSemanticAction(Spider.java:233)
at org.htmlparser.scanners.CompositeTagScanner.finishTag(CompositeTagScanner.java:305)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:257)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at Spider.processURL(Spider.java:175)
at Spider.begin(Spider.java:330)
at Spider.run(Spider.java:98)
at WorkerThread.run(ThreadPool.java:133)
It wasn't the private NodeList Nlist;
and I don't know how to separate the classes in this case... they need functions from the Spider class and are already extending another class.
Well, I separated the inner classes like this:
http://rafb.net/paste/results/d0hbDL22.html
only printing out the links...
but the memory problem is still there. It can't garbage-collect the Spider instance.
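For reference, a non-static inner class carries a hidden reference to its enclosing instance whenever it touches the outer object's state, which is what can keep a Spider reachable; a static nested class does not. A minimal illustration (the names are made up, not from the Spider code):

```java
public class Outer {
    private int id = 42;

    // Non-static inner class: because outerId() reads the enclosing `id`,
    // the compiler adds a hidden `this$0` field pointing at the Outer
    // instance, keeping it alive as long as the Inner object lives.
    class Inner {
        int outerId() { return id; }
    }

    // Static nested class: no hidden reference to Outer, so it can never
    // keep an Outer instance reachable on its own.
    static class Nested {
        int value() { return 7; }
    }
}
```

One way to see the hidden field is reflection: Outer.Inner.class.getDeclaredFields() reports the synthetic this$0, while Outer.Nested.class.getDeclaredFields() is empty. Moving the tag classes out of Spider entirely, or making them static nested classes and passing in just the data they need, breaks that reference chain.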