A user suggested that it would be nice to have information about the charset of the page content the crawler
delivers (for example a new property PHPCrawlerDocumentInfo::content-charset or similar).
This would be especially useful as many text processing scripts will fail with the wrong character set.
The problem here is that you simply can't rely on the charset information the server sends in the header or that the page declares in its meta tags.
The best way would be to let the crawler automatically detect the document's charset statistically by examining the content itself (like some good text editors do), but I haven't come across a good routine written in PHP so far. Besides, that would probably hurt overall performance a lot, since these routines are very CPU-intensive.
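
As a cheaper middle ground than full statistical detection, something like the following sketch might be enough for many pages. It is not part of PHPCrawl: the function name guess_charset() and its parameters are made up for illustration, and it assumes the mbstring extension is loaded. The idea is to trust the content only where that is cheap (a body with non-ASCII bytes that is valid UTF-8 is very likely UTF-8) and otherwise fall back to the declared charset, which may still be wrong.

```php
<?php
// Hypothetical helper, not part of PHPCrawl. Requires the mbstring extension.
// Returns an upper-cased charset name, or null if nothing could be determined.
function guess_charset(string $contentTypeHeader, string $body): ?string
{
    // 1. Cheap content check: the body contains non-ASCII bytes and is
    //    well-formed UTF-8. Legacy 8-bit text rarely passes this by accident.
    if (preg_match('/[\x80-\xFF]/', $body) && mb_check_encoding($body, 'UTF-8')) {
        return 'UTF-8';
    }

    // 2. Charset declared in the Content-Type header value (unreliable).
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Charset declared in a meta tag near the top of the document (unreliable).
    $head = substr($body, 0, 1024);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    return null;
}
```

Here guess_charset('text/html; charset=iso-8859-1', $body) would report UTF-8 for a body that is actually valid UTF-8 despite the header, and ISO-8859-1 otherwise. It cannot tell the various 8-bit charsets apart, so it does not replace a real statistical detector.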