A user suggested that it would be useful to have information about the charset of the page content the crawler
delivers (for example, a new property such as PHPCrawlerDocumentInfo::content-charset or similar).
The problem here is that you simply can't rely on the charset information the server sends in the header or that the page declares in its meta attributes.
The best way would be to let the crawler detect the document's charset statistically by examining the actual content (as some good text editors do), but so far I haven't come across a good routine for this written in PHP. Besides, I guess it would affect overall performance a lot, since these routines are very CPU-intensive.
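As a rough illustration of what a cheap content-based check could look like (this is not the statistical detection described above, and it is not part of PHPCrawl), here is a minimal sketch. The function name guess_charset() and the ISO-8859-1 fallback are assumptions made for the example, and it assumes PHP 7 or later with the mbstring extension loaded. It only separates plain ASCII and valid UTF-8 from everything else; it cannot tell single-byte charsets apart.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: a validity-based charset guess.
function guess_charset(string $content, string $fallback = 'ISO-8859-1'): string
{
    // No byte above 0x7F: the content is plain ASCII, which is valid
    // in practically every charset a web page is likely to use.
    if (preg_match('/[\x80-\xFF]/', $content) === 0) {
        return 'ASCII';
    }

    // Multi-byte UTF-8 sequences follow strict rules, so text in a legacy
    // single-byte charset rarely passes this check by accident.
    if (mb_check_encoding($content, 'UTF-8')) {
        return 'UTF-8';
    }

    // Single-byte charsets cannot be distinguished by validity alone,
    // so this is only an assumed default, not a detection result.
    return $fallback;
}

echo guess_charset("caf\xC3\xA9"), "\n"; // UTF-8 encoded input
echo guess_charset("caf\xE9"), "\n";     // falls back to the assumed default
```

Both checks are single linear passes over the content, so the cost should be small compared to real statistical detection, at the price of treating every non-UTF-8 document the same way.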
This would be especially useful as many text processing scripts will fail with the wrong character set.