
PHPCrawl is the best

Help
Anonymous
2014-01-28
  • Anonymous

    Anonymous - 2014-01-28

    I have worked with Hadoop and Nutch and some others. All this blah blah about MapReduce means nothing. PHPCrawl rocks. It does just what it should do: it gives you the raw data and you do what you need with it.

    There are a few things that would make it even more amazing, such as the ability to set a TOPN for a URL and break out. I understand that this goes against the way PHPCrawl works, but I can't figure out how to crawl multiple domains, I mean from a list of domains, and stop. When you start the beast she just goes and goes; there is no stopping her. If you hit a domain like witikdata you could be in there forever. There is no way to say "enough already, move on". With that, you could have a list of domains, feed it into PHPCrawl and forget about it. On the reverse inverted index there is no problem, since you can crawl just the domain and stop.

    PHPCrawl is fast, very fast. In CHILD MODE CALLS PARENT it just rocks. I really can't even think of anything else I would need. Yes, there are some small bugs; SQLite is a pain and should be replaced with some NoSQL backend like ES.

    Very nice job and we should all contribute some cash to the development of this project. I will make mine today.

    Thanks

     

    Last edit: Anonymous 2014-01-28
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-01-28

    Hi,

    thanks a lot! And thanks for the donation!

    What do you mean by "TOPN for a url"? And what is ES?
    Maybe we can put this on the list of feature-requests for the next releases.

     
  • Anonymous

    Anonymous - 2014-01-28

    Hi, my pleasure. With Apache Nutch you can limit how deep to crawl; TOP N is just how deep you want to crawl per domain. ES is Elasticsearch.

     

    Last edit: Anonymous 2014-01-28
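
The request in this thread (take a list of domains, crawl each one up to a fixed budget, then move on) can be sketched roughly as below. This is a library-independent illustration in Python, not PHPCrawl's API: `crawl_domains`, `fetch_links`, `max_pages_per_domain` and the `FAKE_WEB` link graph are all hypothetical names made up for the example, and the stub stands in for real HTTP fetching.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_domains(seeds, fetch_links, max_pages_per_domain):
    """Crawl each seed's domain breadth-first, then move on to the next seed.

    fetch_links is any callable that returns the links found on a URL
    (hypothetical here; a real crawler would download and parse the page).
    """
    results = {}
    for seed in seeds:
        domain = urlparse(seed).netloc
        queue = deque([seed])
        seen = {seed}
        pages = []
        # Stop when the domain is exhausted or its page budget is used up.
        while queue and len(pages) < max_pages_per_domain:
            url = queue.popleft()
            pages.append(url)
            for link in fetch_links(url):
                # Stay on the current domain and skip already queued URLs.
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
        results[domain] = pages
    return results


# Stub link graph standing in for the real web (assumption for the demo).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2", "http://b.example/"],
    "http://a.example/1": ["http://a.example/3"],
    "http://a.example/2": [],
    "http://a.example/3": [],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}


def fake_fetch_links(url):
    return FAKE_WEB.get(url, [])


result = crawl_domains(["http://a.example/", "http://b.example/"], fake_fetch_links, 2)
```

With a budget of 2, the loop leaves `a.example` after two pages even though more are queued, which is the "enough already, move on" behaviour asked for above. A depth limit instead of a page count would need the queue to carry each URL's depth alongside it.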





