OutOfMemory Exception fix

  • windjunky

    windjunky - 2007-02-21

    After running JSpider on a very large site (20,000+ links), you will get an OutOfMemoryError. This happens because, when using memory storage, JSpider keeps the content of every URL in a Map. I configured JSpider to use JDBC storage instead of memory, but performance was very bad, so I ended up nulling out the content of links that had already been parsed.

    Originally, I would get the OOM at around 10,000 links (memory usage was around 0.500 GB). With my changes I was able to spider 22,000 links with memory usage of around 0.575 GB.

    I am using JSpider to create a Google sitemap with my custom plugin. In the visit methods for ResourceParsedEvent, ResourceIgnoredForFetchingEvent and ResourceIgnoredForParsingEvent, I added the following lines:

            if (event.getResource() instanceof ResourceInternal) {
                ResourceInternal ri = (ResourceInternal) event.getResource();
                ri.setBytes(null);
            }
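
    The idea can be tried outside JSpider with a minimal standalone sketch. PageRecord, releaseBytes and retainedBytes below are hypothetical names made up for illustration, not JSpider classes or methods; the sketch only assumes that each stored resource holds a reference to its fetched bytes, and that dropping that reference lets the garbage collector reclaim the content while the URL entry stays in the map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ContentReleaseSketch {

    // Hypothetical stand-in for a stored resource; NOT a JSpider class.
    static final class PageRecord {
        private final String url;
        private byte[] bytes; // fetched content, the bulk of the memory

        PageRecord(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }

        String getUrl() {
            return url;
        }

        // Drop the only reference to the content so it becomes collectable.
        void releaseBytes() {
            bytes = null;
        }

        int contentLength() {
            return bytes == null ? 0 : bytes.length;
        }
    }

    // Sums the content bytes still reachable through the map.
    static long retainedBytes(Map<String, PageRecord> store) {
        long total = 0;
        for (PageRecord record : store.values()) {
            total += record.contentLength();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, PageRecord> store = new LinkedHashMap<>();
        for (int i = 0; i < 100; i++) {
            String url = "http://example.com/page" + i;
            store.put(url, new PageRecord(url, new byte[1024]));
        }
        System.out.println("before: " + retainedBytes(store)); // before: 102400

        // Equivalent in spirit to calling setBytes(null) once a page is parsed.
        for (PageRecord record : store.values()) {
            record.releaseBytes();
        }
        System.out.println("after: " + retainedBytes(store));  // after: 0
        System.out.println("entries kept: " + store.size());   // entries kept: 100
    }
}
```

    Note that this only frees memory if nothing else still references the byte array, and the per-URL bookkeeping remains in the map either way, so memory use still grows with the number of links, just more slowly.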

    Here is my file in its entirety:

    import java.io.File;
    import java.net.URL;

    import net.javacoding.jspider.api.event.EventVisitor;
    import net.javacoding.jspider.api.event.JSpiderEvent;
    import net.javacoding.jspider.api.event.engine.EngineRelatedEvent;
    import net.javacoding.jspider.api.event.engine.SpideringStartedEvent;
    import net.javacoding.jspider.api.event.engine.SpideringStoppedEvent;
    import net.javacoding.jspider.api.event.folder.FolderDiscoveredEvent;
    import net.javacoding.jspider.api.event.folder.FolderRelatedEvent;
    import net.javacoding.jspider.api.event.resource.EMailAddressDiscoveredEvent;
    import net.javacoding.jspider.api.event.resource.EMailAddressReferenceDiscoveredEvent;
    import net.javacoding.jspider.api.event.resource.MalformedBaseURLFoundEvent;
    import net.javacoding.jspider.api.event.resource.MalformedURLFoundEvent;
    import net.javacoding.jspider.api.event.resource.ResourceDiscoveredEvent;
    import net.javacoding.jspider.api.event.resource.ResourceFetchErrorEvent;
    import net.javacoding.jspider.api.event.resource.ResourceFetchedEvent;
    import net.javacoding.jspider.api.event.resource.ResourceForbiddenEvent;
    import net.javacoding.jspider.api.event.resource.ResourceIgnoredForFetchingEvent;
    import net.javacoding.jspider.api.event.resource.ResourceIgnoredForParsingEvent;
    import net.javacoding.jspider.api.event.resource.ResourceParsedEvent;
    import net.javacoding.jspider.api.event.resource.ResourceReferenceDiscoveredEvent;
    import net.javacoding.jspider.api.event.resource.ResourceRelatedEvent;
    import net.javacoding.jspider.api.event.site.RobotsTXTFetchedEvent;
    import net.javacoding.jspider.api.event.site.RobotsTXTMissingEvent;
    import net.javacoding.jspider.api.event.site.SiteDiscoveredEvent;
    import net.javacoding.jspider.api.event.site.SiteRelatedEvent;
    import net.javacoding.jspider.api.event.site.UserAgentObeyedEvent;
    import net.javacoding.jspider.api.model.FetchIgnoredResource;
    import net.javacoding.jspider.api.model.ParseIgnoredResource;
    import net.javacoding.jspider.api.model.ParsedResource;
    import net.javacoding.jspider.core.logging.LogFactory;
    import net.javacoding.jspider.core.model.ResourceInternal;
    import net.javacoding.jspider.core.util.config.ConfigurationFactory;
    import net.javacoding.jspider.spi.Plugin;

    public class SiteMapPlugin implements Plugin, EventVisitor {

        private SFileWriter fileWriter = null;

        private int linkCounter = 0;

        private int fileCounter = 0;

        public String getName() {
            return "SiteMapPlugin";
        }

        public String getVersion() {
            return "1.0";
        }

        public String getDescription() {
            return getName();
        }

        public String getVendor() {
            return "";
        }

        public void initialize() {
            openFile("sitemap.xml");
        }

        private void openFile(String filename) {
            File file = new File(ConfigurationFactory.getConfiguration()
                    .getDefaultOutputFolder(), filename);

            fileWriter = new SFileWriter();
            fileWriter.begin(file);
            fileWriter.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            fileWriter
                    .write("<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\">");
        }

        private void closeFile() {
            fileWriter.write("</urlset>");
            fileWriter.end();
        }

        public void shutdown() {
            closeFile();
        }

        public void notify(JSpiderEvent event) {
            event.accept(this);
        }

        public void visit(JSpiderEvent event) {
        }

        public void visit(EngineRelatedEvent event) {
        }

        public void visit(SpideringStartedEvent event) {
        }

        public void visit(SpideringStoppedEvent event) {
        }

        public void visit(FolderRelatedEvent event) {
        }

        public void visit(FolderDiscoveredEvent event) {
        }

        public void visit(ResourceRelatedEvent event) {
        }

        public void visit(EMailAddressDiscoveredEvent event) {
        }

        public void visit(EMailAddressReferenceDiscoveredEvent event) {
        }

        public void visit(MalformedURLFoundEvent event) {
        }

        public void visit(MalformedBaseURLFoundEvent event) {
        }

        public void visit(ResourceDiscoveredEvent event) {
        }

        public void visit(ResourceFetchedEvent event) {
        }

        public void visit(ResourceFetchErrorEvent event) {
        }

        public void visit(ResourceForbiddenEvent event) {
        }

        public void visit(ResourceParsedEvent event) {

            // if we hit 9000 links, then start a new file
            if (fileCounter >= 9000) {
                this.closeFile();
                this.openFile("sitemap-" + linkCounter + ".xml");
                fileCounter = 0;
            }

            fileCounter++;
            linkCounter++;

            if (linkCounter % 50 == 0) {
                LogFactory.getLog(this.getClass()).info("URLS " + linkCounter);
                System.out.println(getClass().getName() + "; URLS " + linkCounter);
            }
            ParsedResource pr = event.getResource();
            URL url = pr.getURL();

            fileWriter.write("\t<url>");
            fileWriter.write("\t\t<loc>" + url.toString() + "</loc>");
            fileWriter.write("\t\t<changefreq>daily</changefreq>");
            fileWriter.write("\t\t<priority>0.8</priority>");
            fileWriter.write("\t</url>");

            /*
             * NULL contents of URL to reduce memory footprint
             */
            if (pr instanceof ResourceInternal) {
                ResourceInternal ri = (ResourceInternal) pr;
                ri.setBytes(null);
            }
        }

        public void visit(ResourceIgnoredForFetchingEvent event) {

            FetchIgnoredResource resource = event.getResource();

            /*
             * NULL contents of URL to reduce memory footprint
             */
            if (resource instanceof ResourceInternal) {
                ResourceInternal ri = (ResourceInternal) resource;
                ri.setBytes(null);
            }
        }

        public void visit(ResourceIgnoredForParsingEvent event) {
            ParseIgnoredResource resource = event.getResource();

            /*
             * NULL contents of URL to reduce memory footprint
             */
            if (resource instanceof ResourceInternal) {
                ResourceInternal ri = (ResourceInternal) resource;
                ri.setBytes(null);
            }
        }

        public void visit(ResourceReferenceDiscoveredEvent event) {
        }

        public void visit(SiteRelatedEvent event) {
        }

        public void visit(SiteDiscoveredEvent event) {
        }

        public void visit(RobotsTXTMissingEvent event) {
        }

        public void visit(RobotsTXTFetchedEvent event) {
        }

        public void visit(UserAgentObeyedEvent event) {
        }
    }
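
    One caveat with the plugin above: it writes url.toString() straight into the <loc> element. Sitemap URLs have to be entity-escaped, so a URL containing & (common in query strings) would produce invalid XML. A minimal sketch of an escaping helper follows; escapeXml and locElement are hypothetical names, not part of JSpider or of the plugin.

```java
final class SitemapEscapeSketch {

    // Replaces the five characters that must be escaped in XML text.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds the <loc> line in the same layout the plugin writes.
    static String locElement(String url) {
        return "\t\t<loc>" + escapeXml(url) + "</loc>";
    }

    public static void main(String[] args) {
        // Prints (after two tabs): <loc>http://example.com/list?a=1&amp;b=2</loc>
        System.out.println(locElement("http://example.com/list?a=1&b=2"));
    }
}
```

    In the plugin, the <loc> line would then be written with something like fileWriter.write(locElement(url.toString())), assuming the helper is copied into the plugin class.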

     
  • Andrew

    Andrew - 2013-07-06

    I ran into an OutOfMemoryError myself, and will try to apply your fix. Here is a screenshot of the error:
    http://i.imgur.com/kx3TRqK.jpg
    Also, I noticed an oddity when skimming through the code of SpiderHttpUrlTask.java at lines 76, 79, and 116. For some reason, a BufferedInputStream (line 76) is wrapped by another BufferedInputStream (line 79), and the original one is closed (line 116).

    Is there some reason for this? I'm removing the 2nd BIS for my personal build.
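
    For reference, a second BufferedInputStream around an already buffered stream adds no extra benefit, and closing the outer stream also closes the one it wraps. Below is a minimal sketch of reading a stream through a single buffer. It is not the actual SpiderHttpUrlTask code; SingleBufferSketch and readAll are hypothetical names used only for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class SingleBufferSketch {

    // Reads the whole stream through exactly one BufferedInputStream.
    // try-with-resources closes the wrapper, which closes the raw stream too.
    static byte[] readAll(InputStream raw) throws IOException {
        try (InputStream in = new BufferedInputStream(raw)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello spider".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readAll(new ByteArrayInputStream(data));
        System.out.println(new String(copy, StandardCharsets.UTF_8)); // hello spider
    }
}
```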

     
