Charset info in PHPCrawlDocumentInfo-class

Status: Beta

Brought to you by: huni

#13 Charset info in PHPCrawlDocumentInfo-class

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2014-11-25

Created: 2012-12-07

Creator: Uwe Hunfeld

Private: No

A user reported that it would be nice to have information about the charset of the page content the crawler
delivers (for example, a new property such as PHPCrawlerDocumentInfo::content-charset or similar).

Discussion

Anonymous - 2014-01-09

subscribe

SpiderBro - 2014-11-25

This would be especially useful as many text processing scripts will fail with the wrong character set.

Uwe Hunfeld - 2014-11-25

The problem here is that you simply can't rely on the charset information that the server sends in the header or that the page states in its meta attributes.
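To illustrate the point, here is a minimal sketch (in Python, not phpcrawl code) that reads the charset declared in a `Content-Type` header and the one declared in a `<meta>` tag. The function name `declared_charsets` and the 4096-byte scan window are assumptions made for this example. Nothing forces the two values to agree with each other or with the actual bytes of the page.

```python
import re
from typing import Optional, Tuple

# Matches "charset=..." inside a Content-Type header value.
_HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches both <meta charset="..."> and the http-equiv/content form.
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def declared_charsets(content_type: str, body: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (header_charset, meta_charset), each lowercased or None.

    Hypothetical helper for illustration only. It reports what the server
    and the page *claim*, without checking either claim against the content.
    """
    header_match = _HEADER_RE.search(content_type or "")
    # Only scan the start of the document (window size is an assumption).
    meta_match = _META_RE.search(body[:4096])
    header = header_match.group(1).lower() if header_match else None
    meta = meta_match.group(1).decode("ascii").lower() if meta_match else None
    return header, meta


# A server header and a page that contradict each other:
header, meta = declared_charsets(
    "text/html; charset=ISO-8859-1",
    b'<html><head><meta charset="utf-8"></head></html>',
)
print(header, meta)
```

When the two declarations differ, or when neither is present, the only remaining evidence is the content itself.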

The best way would be to let the crawler detect the document's charset automatically by statistically examining the content itself (as some good text editors do), but so far I haven't come across a good routine for this written in PHP. Besides, I guess that would affect overall performance a lot, since these routines are very CPU-intensive.
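As a rough illustration of content-based checking, the sketch below (Python, not phpcrawl code) uses a much cheaper heuristic than the statistical detection described above: look for a byte-order mark, then try a strict UTF-8 decode, then fall back to the declared charset only if the bytes are valid in it. The function name `guess_charset` and the `windows-1252` fallback are assumptions for this example. It does not tell single-byte encodings apart and ignores UTF-32.

```python
import codecs
from typing import Optional

# Byte-order marks checked first (UTF-32 is deliberately left out).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_charset(body: bytes, declared: Optional[str] = None,
                  fallback: str = "windows-1252") -> str:
    """Heuristic charset guess; hypothetical helper, not a statistical detector."""
    # 1. A BOM is the strongest hint the content can give.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. Non-ASCII text in a legacy encoding rarely forms valid UTF-8,
    #    so a successful strict decode is good evidence (pure ASCII also passes).
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Trust the declared charset only if it is known and the bytes fit it.
    if declared:
        try:
            codecs.lookup(declared)
            body.decode(declared)
            return declared.lower()
        except (LookupError, UnicodeDecodeError):
            pass
    # 4. Last resort (the default is an assumption of this sketch).
    return fallback


print(guess_charset("Grüße".encode("utf-8"), declared="ISO-8859-1"))
print(guess_charset("Grüße".encode("latin-1"), declared="ISO-8859-1"))
```

Each step is a single linear pass over the bytes, so the cost stays modest compared with frequency-based detection, at the price of weaker results for non-UTF-8 pages.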
