How to crawl a sub-URL and query the result?

Help
Anonymous
2013-12-31
2014-01-05
  • Anonymous

    Anonymous - 2013-12-31

    I am stuck on the following problems:

    1. I want to crawl a sub-URL such as www.tribunnews.com/pemilu-2014/ and only follow URLs that contain "pemilu-2014".

    2. The site above is a news portal. I need to get news articles plus the readers' comments. How can I crawl an article together with its reader comments?

    3. I want to convert the crawled pages from HTML to plain text. I need this for my study on classifying words with a naive Bayes classifier.

    Thanks in advance, guys.

     
  • Anonymous

    Anonymous - 2014-01-02

    Hi!

    For your first question:
    Simply set the follow mode of the crawler to 3 with $crawler->setFollowMode(3) (see the docs, http://phpcrawl.cuab.de/classreferences/index.html).

    You can also just add a simple follow rule like $crawler->addURLFollowRule("#pemilu-2014#"), which has much the same effect here.
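
A minimal sketch of how such a rule selects URLs, assuming PHPCrawl evaluates follow rules as PCRE patterns (which is how its class reference describes them). The helper `matchesFollowRule` and the sample URLs are hypothetical and only illustrate the pattern; they are not part of PHPCrawl.

```php
<?php
// Hypothetical helper: returns true if $url matches the PCRE pattern $rule.
// This mirrors what a follow rule like "#pemilu-2014#" is assumed to do.
function matchesFollowRule(string $url, string $rule): bool
{
    return preg_match($rule, $url) === 1;
}

$rule = "#pemilu-2014#";

// Sample URLs for illustration only.
$urls = [
    "http://www.tribunnews.com/pemilu-2014/",
    "http://www.tribunnews.com/pemilu-2014/some-article",
    "http://www.tribunnews.com/sport/",
];

foreach ($urls as $url) {
    echo $url, " => ", matchesFollowRule($url, $rule) ? "follow" : "skip", PHP_EOL;
}
```

Note that the pattern matches "pemilu-2014" anywhere in the URL, not only in the path directly under the root URL, so it can be slightly broader than follow mode 3.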

    For your other two questions: I don't understand what you are trying to do. Could you explain your problem(s) in a little more detail?

     
    • Sysfotech

      Sysfotech - 2014-01-02

      Hello again, and sorry for the late reply! BTW, it's me :D. I forgot to log in.

      Can PHPCrawl get specific content from an HTML tag? For example, given a tag like

      <div id="comment">Hello</div>, I want to get "Hello". (This is question #2.)

      For question #3: can PHPCrawl convert an HTML page into plain text?

      I also have another question: how can I filter pages by status code in PHPCrawl? I only want pages with status code 200, not pages with code 302 or 404.

      Thanks in advance.

       
  • Sysfotech

    Sysfotech - 2014-01-02
     

    Last edit: Sysfotech 2014-01-02
  • Anonymous

    Anonymous - 2014-01-04

    Hey,

    for your last question:
    Just check inside your handleDocumentInfo method whether the status code of the received document is 200 ($DocInfo->http_status_code).
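
A minimal sketch of that check, written as a standalone function so it can be tried without the crawler. The helper name `isOkDocument` is hypothetical, and the objects below are stand-ins that only carry the `http_status_code` property mentioned above.

```php
<?php
// Hypothetical helper: true only for documents received with HTTP status 200.
// $docInfo is assumed to expose the http_status_code property.
function isOkDocument(object $docInfo): bool
{
    return (int) $docInfo->http_status_code === 200;
}

// Stand-in objects for illustration; in a real crawl the crawler passes
// its own document-info object to handleDocumentInfo().
$documents = [
    (object) ["http_status_code" => 200],
    (object) ["http_status_code" => 302],
    (object) ["http_status_code" => 404],
];

foreach ($documents as $doc) {
    echo $doc->http_status_code, " => ", isOkDocument($doc) ? "keep" : "skip", PHP_EOL;
}
```

Inside handleDocumentInfo you would call the helper with $DocInfo and return early when it yields false, so that only pages with status 200 are processed further.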

    To get specific content, tags, attributes, or text from HTML documents, you can use regular expressions or a DOM parser (like PHP's built-in DOMDocument).

    PHPCrawl itself doesn't contain methods to extract or manipulate data from received documents.
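
Since the extraction has to happen outside the crawler, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The helper names `loadHtmlQuietly`, `extractTextById`, and `htmlToText` are hypothetical, and the id "comment" is taken from the example in the question; the real markup of a given site will differ.

```php
<?php
// Hypothetical helper: parse HTML without emitting warnings for the
// not-quite-valid markup that real pages often contain.
function loadHtmlQuietly(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Returns the trimmed text of the first element with the given id,
// or null if there is none. $id must not contain a double quote,
// because it is embedded in the XPath expression as-is.
function extractTextById(string $html, string $id): ?string
{
    $xpath = new DOMXPath(loadHtmlQuietly($html));
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Converts an HTML page to plain text: drops script and style elements,
// takes the remaining text, and collapses runs of whitespace.
function htmlToText(string $html): string
{
    $dom = loadHtmlQuietly($html);
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//script|//style')) as $node) {
        $node->parentNode->removeChild($node);
    }
    if ($dom->documentElement === null) {
        return '';
    }
    return trim((string) preg_replace('/\s+/', ' ', $dom->documentElement->textContent));
}

echo extractTextById('<div id="comment">Hello</div>', 'comment'), PHP_EOL;
```

In a crawl, the HTML string would be the source of the received document; the resulting plain text can then be fed to whatever classifier you use.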

     
    • Sysfotech

      Sysfotech - 2014-01-05

      Okay, thanks for the help, buddy. :D

       

      Last edit: Sysfotech 2014-01-05
