I'm confused about the following problem:
I want to crawl only one section of a site, e.g. www.tribunnews.com/pemilu-2014/, and follow only URLs that contain "pemilu-2014".
The site above is a news portal. I need to collect news articles together with the readers' comments. How can I crawl both the article and its comments?
I also want to convert the crawled pages from HTML to plain text. I need this for a study on classifying words with a naive Bayes classifier.
Thanks in advance, guys.
Hi!
For your first question:
Simply set the follow mode of the crawler to 3 with $crawler->setFollowMode(3) (see the docs, http://phpcrawl.cuab.de/classreferences/index.html). In this mode the crawler only follows links located in or under the path of the root URL.
Alternatively, you can set a simple follow rule like $crawler->addURLFollowRule("#pemilu-2014#"), which has the same effect in this case.
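The two options above could be wired into a crawler roughly as follows. This is only a minimal sketch: the include path libs/PHPCrawler.class.php, the class name PemiluCrawler, and the page limit of 50 are assumptions for illustration, not something taken from this thread.

```php
<?php
// Assumption: PHPCrawl is unpacked in ./libs; adjust the path to your installation.
require_once 'libs/PHPCrawler.class.php';

// Hypothetical subclass name; PHPCrawl expects you to override handleDocumentInfo().
class PemiluCrawler extends PHPCrawler
{
    // Called once for every document the crawler requests.
    public function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->http_status_code . ' ' . $DocInfo->url . "\n";
    }
}

$crawler = new PemiluCrawler();
$crawler->setURL('www.tribunnews.com/pemilu-2014/');

// Only receive the content of HTML documents.
$crawler->addContentTypeReceiveRule('#text/html#');

// Option 1: follow only links located in or under the path of the root URL.
$crawler->setFollowMode(3);

// Option 2 (alternative): follow only URLs matching this regular expression.
// $crawler->addURLFollowRule('#pemilu-2014#');

// Assumption: a small page limit to keep test runs short and polite.
$crawler->setPageLimit(50);

$crawler->go();
```

Use one of the two options rather than both; the follow rule is the more flexible choice if article URLs contain "pemilu-2014" somewhere other than directly under the start path.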
For your other two questions: I don't understand what you are trying to do. Could you explain your problem(s) in a little more detail?
Hello again, sorry for the late reply! BTW, it's me :D. I forgot to log in.
Question #2: can PHPCrawl get specific content from an HTML tag? For example, given a tag like
<div id="comment">Hello</div>, I want to get "Hello".
Question #3: can PHPCrawl convert an HTML page into plain text?
I also have another question: how can I filter pages by status code in PHPCrawl? I only want pages with code 200, not pages with code 302 or 404.
Thanks in advance.
Last edit: Sysfotech 2014-01-02
Hey,
For your last question:
Just check inside your handleDocumentInfo() method whether the status code of the received document is 200 ($DocInfo->http_status_code).
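A status check like the one described might look like the sketch below. The include path and the class name StatusFilteringCrawler are assumptions for illustration; only handleDocumentInfo() and the http_status_code property come from the answer above.

```php
<?php
// Assumption: PHPCrawl is unpacked in ./libs; adjust the path to your installation.
require_once 'libs/PHPCrawler.class.php';

// Hypothetical subclass name used for illustration.
class StatusFilteringCrawler extends PHPCrawler
{
    public function handleDocumentInfo($DocInfo)
    {
        // Skip redirects (302), missing pages (404) and anything else that is not 200.
        if ((int) $DocInfo->http_status_code !== 200) {
            return;
        }

        // Only documents answered with "200 OK" reach this point.
        echo 'Processing ' . $DocInfo->url . "\n";
    }
}
```

Note that this only skips the processing of non-200 documents; the crawler still requests them.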
To get specific content, tags, attributes, or text from HTML documents, you can use regular expressions or a DOM parser (such as PHP's built-in DOMDocument).
PHPCrawl itself doesn't provide methods to extract or manipulate data from the documents it receives.
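As a rough illustration of the DOM-parser route, the sketch below uses only PHP's built-in DOMDocument and DOMXPath. The function names extractTextById() and htmlToText() are made up for this example, and the id "comment" is taken from the question; the real markup of the news site will differ.

```php
<?php
// Parse an HTML string into a DOMDocument, tolerating the invalid markup
// that real-world pages usually contain. Character-encoding handling is omitted.
function loadHtmlDocument(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Return the trimmed text of the first element with the given id, or null if none exists.
// Assumption: $id contains no double quotes.
function extractTextById(string $html, string $id): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $xpath = new DOMXPath(loadHtmlDocument($html));
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Convert a page to plain text: keep the text inside <body>, drop script and
// style content, and join the pieces with single spaces.
function htmlToText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $xpath = new DOMXPath(loadHtmlDocument($html));
    $parts = [];
    $query = '//body//text()[not(ancestor::script or ancestor::style)]';
    foreach ($xpath->query($query) as $textNode) {
        $text = trim(preg_replace('/\s+/u', ' ', $textNode->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}
```

For the snippet from the question, extractTextById('<div id="comment">Hello</div>', 'comment') should return "Hello". Joining text nodes with spaces keeps words from neighbouring elements apart, which suits word-based classification, at the cost of splitting a word whose letters are spread across inline tags.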
Okay. Thanks for the help, buddy. :D
Last edit: Sysfotech 2014-01-05