PHPCrawl / Forum / Help: How to crawl suburl and query result ?

Anonymous - 2013-12-31

I'm confused about the following problems:

1. I want to crawl a sub-URL such as www.tribunnews.com/pemilu-2014/ and only crawl URLs that contain "pemilu-2014".

2. The site above is a news portal. I need to get some news articles plus the readers' comments. How can I crawl the articles and the readers' comments?

3. I also want to convert the crawled pages from HTML to text. I need this for my study on classifying words with naive Bayes classification.

Thanks in advance, guys.

Anonymous - 2014-01-02

Hi!

For your first question:
Simply set the follow mode of the crawler to 3 with $crawler->setFollowMode(3), so that it only follows links in or below the path of the start URL (see the docs, http://phpcrawl.cuab.de/classreferences/index.html).

You can also just set a simple follow rule like $crawler->addURLFollowRule("#pemilu-2014#"), which has the same effect.
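
A rough sketch of both suggestions, in case it helps. The two function names are hypothetical helpers made up for this example, and the first one assumes that PHPCrawl applies a follow rule as a plain PCRE match against the URL. Only setURL(), setFollowMode() and addURLFollowRule() are PHPCrawl calls; the type declarations need PHP 7.1 or later.

```php
<?php
// Assumption: a follow rule is applied as a PCRE match against the URL,
// so preg_match() with the same pattern shows which links would be followed.
function passesFollowRule(string $url, string $rule = "#pemilu-2014#"): bool
{
    return preg_match($rule, $url) === 1;
}

// Hypothetical helper that applies both suggestions to a crawler object.
// $crawler is expected to be an instance of a class extending PHPCrawler;
// it is left untyped so this file also loads without the library.
function restrictToSection($crawler): void
{
    $crawler->setURL("http://www.tribunnews.com/pemilu-2014/");
    $crawler->setFollowMode(3);                  // only links in or below the start path
    $crawler->addURLFollowRule("#pemilu-2014#"); // only URLs matching the pattern
}
```

With your own crawler class you would call restrictToSection($crawler) before $crawler->go(). According to the reply above, either of the two calls should be enough on its own; using both just makes the restriction explicit.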

For your other two questions: I don't understand what you are trying to do. Could you explain your problem(s) in a little more detail?

- Sysfotech - 2014-01-02
  
  Hello again, sorry for not replying sooner! BTW, it's me :D. I forgot to log in.
  
  For question #2: can PHPCrawl get specific content from an HTML tag? For example, when there is a tag like
  
  <div id="comment">Hello</div>
  
  I want to get "Hello".
  
  For question #3: can PHPCrawl convert an HTML page into plain text?
  
  I also have another question: how can I filter pages by status code in PHPCrawl? I only want pages with code 200, not pages with code 302 or 404.
  
  Thanks in advance.
  
Last edit: Sysfotech 2014-01-02

Anonymous - 2014-01-04

Hey,

for your last question:
Just check whether the status code of the received document is 200 inside your handleDocumentInfo() method ($DocInfo->http_status_code).
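
Something along these lines might work as a starting point. The check is written as a standalone function so it can be tried without the library; keepIfOk() and the $kept list are names invented for this sketch, and only the url and http_status_code properties of the document info object are used.

```php
<?php
// Hypothetical helper holding the body of handleDocumentInfo().
// $DocInfo is the document info object PHPCrawl passes to the handler;
// $kept collects the URLs of all documents answered with status 200.
function keepIfOk($DocInfo, array &$kept): void
{
    if ((int) $DocInfo->http_status_code !== 200) {
        return; // ignore 302, 404 and every other non-200 response
    }
    $kept[] = $DocInfo->url;
}

// Inside your own crawler class (which extends PHPCrawler) the wiring
// could look like this, with $this->kept being a property you define:
//
//   public function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
//   {
//       keepIfOk($DocInfo, $this->kept);
//   }
```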

To get specific content, tags, attributes, or text from HTML documents, you can use regular expressions or a DOM parser (like PHP's built-in DOMDocument).

PHPCrawl itself doesn't contain methods to extract or manipulate data from received documents.
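
For the DOMDocument route, a minimal sketch could look like the following. It uses only PHP's built-in DOM extension; the three function names are made up for this example, and the HTML-to-text conversion is deliberately rough (it drops script and style elements and collapses whitespace). Pages with non-ASCII text may need extra charset handling, which is not covered here.

```php
<?php
// Parses an HTML string, tolerating the invalid markup common on real pages.
function loadHtmlDocument(string $html): ?DOMDocument
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings silently
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $loaded ? $doc : null;
}

// Text inside the element with the given id, or null if there is none.
// $id is assumed not to contain double quotes.
function textById(string $html, string $id): ?string
{
    $doc = loadHtmlDocument($html);
    if ($doc === null) {
        return null;
    }
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Rough HTML-to-text conversion: drop <script> and <style>, keep the text,
// and collapse runs of whitespace into single spaces.
function htmlToText(string $html): string
{
    $doc = loadHtmlDocument($html);
    if ($doc === null || $doc->documentElement === null) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }
    $text = $doc->documentElement->textContent;
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

For the example from the question, textById('<div id="comment">Hello</div>', 'comment') returns "Hello". Inside handleDocumentInfo() you would pass the received page source to these functions instead of a literal string.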

- Sysfotech - 2014-01-05
  
  Okay, thanks for the help, buddy. :D
  
  Last edit: Sysfotech 2014-01-05
  