#13 Charset info in PHPCrawlerDocumentInfo class

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-11-25
Created: 2012-12-07
Creator: Uwe Hunfeld
Private: No

A user reported that it would be nice to have information about the charset of the page content the crawler
delivers (for example, a new property such as PHPCrawlerDocumentInfo::content-charset or similar).

Discussion

  • SpiderBro

    SpiderBro - 2014-11-25

    This would be especially useful as many text processing scripts will fail with the wrong character set.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    The problem here is that you simply can't rely on the charset information that the server sends in the header or that the page states in its meta attributes.

    The best way would be to let the crawler detect the document's charset automatically by statistically examining the content itself (like some good text editors do), but I haven't come across a good routine written in PHP so far. Besides, I guess that would affect the overall performance a lot, since these routines are very CPU-intensive.
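
To illustrate the trade-off discussed above, here is a minimal sketch of a cheap middle ground: read the declared charset from the header or a meta tag, then only sanity-check it against the bytes instead of running a full statistical detection. This is not part of PHPCrawl. The function names `declared_charset` and `guess_charset` and the `WINDOWS-1252` fallback are assumptions made for this example, and it relies on the mbstring extension being available.

```php
<?php
// Hypothetical helpers, NOT part of PHPCrawl. Requires PHP 7.1+ and mbstring.

/**
 * Return the charset declared in the Content-Type header value or in a
 * <meta> tag, upper-cased, or null if nothing is declared.
 * The result is only a hint and may be wrong.
 */
function declared_charset(?string $contentTypeHeader, string $html): ?string
{
    // Header value such as "text/html; charset=ISO-8859-1"
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // <meta charset="..."> or the http-equiv form; only the first 1024 bytes
    // of the document are inspected.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', substr($html, 0, 1024), $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

/**
 * Guess the charset of $html. This is a validity check, not a statistical
 * detector: it can confirm or reject UTF-8 but cannot tell single-byte
 * charsets apart.
 */
function guess_charset(?string $contentTypeHeader, string $html, string $fallback = 'WINDOWS-1252'): string
{
    $declared = declared_charset($contentTypeHeader, $html);

    // Pure ASCII content decodes the same way in most charsets,
    // so the declaration (if any) is as good as anything else.
    if (!preg_match('/[\x80-\xFF]/', $html)) {
        return $declared ?? 'UTF-8';
    }

    // Non-ASCII bytes that form valid UTF-8 are very unlikely to be
    // anything else, whatever the server or the page claims.
    if (mb_check_encoding($html, 'UTF-8')) {
        return 'UTF-8';
    }

    // The bytes are not valid UTF-8, so a UTF-8 declaration must be wrong.
    // Any other declaration cannot be verified cheaply and is taken as is.
    if ($declared !== null && $declared !== 'UTF-8') {
        return $declared;
    }
    return $fallback;
}
```

With this sketch, a page declared as ISO-8859-1 whose body is actually valid UTF-8 is reported as UTF-8, and a page declared as UTF-8 that contains invalid byte sequences falls back to the default. Distinguishing between single-byte charsets would still need the CPU-intensive statistical approach mentioned above.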

     

