
#13 Charset info in PHPCrawlerDocumentInfo class

open
nobody
None
5
2014-11-25
2012-12-07
Uwe Hunfeld
No

A user reported that it would be useful to have information about the charset of the page content that the crawler
delivers (for example, a new property PHPCrawlerDocumentInfo::content-charset or similar).

Discussion

  • SpiderBro

    SpiderBro - 2014-11-25

    This would be especially useful as many text processing scripts will fail with the wrong character set.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    The problem here is that you simply can't rely on the charset information that the server sends in the header or that the page states in its meta attributes.

    The best way would be to let the crawler detect the document's charset automatically from the actual content, by examining it statistically (as some good text editors do), but so far I haven't come across a good routine for this written in PHP. Besides, I guess it would affect overall performance a lot, since these routines are very CPU-intensive.
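
To make the point above concrete, here is a minimal sketch of a cheap middle ground: treat the declared charset as a hint only and check it against the bytes that were actually received. It is not the statistical detection described above and not part of PHPCrawl. It is written in Python for brevity, and the names `declared_charsets` and `guess_charset` as well as the fallback list are assumptions made for illustration. The same order of checks should carry over to PHP, for instance with `mb_check_encoding`.

```python
import re


def declared_charsets(content_type_header, body):
    """Return (header_charset, meta_charset); either value may be None."""
    header_match = re.search(
        r'charset=["\']?([\w.:-]+)', content_type_header or "", re.I
    )
    # Only the first few KB are scanned: a meta declaration sits in <head>.
    meta_match = re.search(
        rb'<meta[^>]+charset=["\']?([\w.:-]+)', body[:4096], re.I
    )
    return (
        header_match.group(1).lower() if header_match else None,
        meta_match.group(1).decode("ascii").lower() if meta_match else None,
    )


def guess_charset(content_type_header, body, fallbacks=("windows-1252",)):
    """Pick a charset for raw bytes, treating declarations as hints only."""
    # Strict UTF-8 validation is very selective: text in a legacy 8-bit
    # encoding that contains non-ASCII bytes almost never passes it.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeError:
        pass

    # Otherwise try what the server and the page claim, then the fallbacks.
    candidates = [c for c in declared_charsets(content_type_header, body) if c]
    candidates += [f for f in fallbacks if f not in candidates]
    for name in candidates:
        try:
            body.decode(name)
        except (LookupError, UnicodeError):
            # Unknown charset name, or bytes that are invalid in that charset.
            continue
        return name
    return None


# The header claims ISO-8859-1, but the bytes are valid UTF-8.
page = "<html><body>Grüße</body></html>".encode("utf-8")
print(guess_charset("text/html; charset=ISO-8859-1", page))
```

This only rules out declarations that are impossible for the given bytes. It cannot tell similar single-byte encodings apart (for example ISO-8859-1 and ISO-8859-15), which is where the CPU-intensive statistical analysis would be needed. The cost here is at most a few decoding passes over the document.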

     
