UTF-8 encode all results

Anonymous
2014-01-09
2014-08-21
  • Anonymous

    Anonymous - 2014-01-09

    Hi, first of all, I love PHPCrawl.
    But (there is always a but) I've got a little problem. I'm saving all my results into MongoDB, and Mongo only accepts UTF-8 strings. Where is the best place to convert all strings to UTF-8?

    Best regards Julian

     
  • Anonymous

    Anonymous - 2014-01-09

    Hi Julian,

    First you'll have to find out (yourself) the encoding of the content of a page.
    PHPCrawl delivers the content "as it is", meaning it doesn't convert anything.

    The encoding may be found in the header (e.g. Content-Type:text/html; charset=ISOxyz)
    and/or in the HTML as a meta tag (e.g. <meta http-equiv="Content-Type" content="text/html; charset=ISOxyz" />). If it is in neither, you may try mb_detect_encoding(), but be careful with that: its results are simply wrong sometimes.

    Then you can use iconv() or mb_convert_encoding() to convert the content to UTF-8 before dumping it into MongoDB.

    There's still an open feature request for PHPCrawl regarding this (charset/encoding information of a page):
    http://sourceforge.net/p/phpcrawl/feature-requests/13/
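
    The steps above could look roughly like this. It is only a minimal sketch: detectCharset() and toUtf8() are made-up helper names (not part of PHPCrawl), the mbstring extension is assumed to be loaded, and the declared charset is assumed to be one that mbstring knows.

```php
<?php
// Sketch: find the declared charset of a page, then convert the content to UTF-8.
// detectCharset() and toUtf8() are hypothetical helpers, not PHPCrawl functions.

function detectCharset(string $header, string $html): ?string
{
    // 1. HTTP header, e.g. "Content-Type: text/html; charset=ISO-8859-1"
    if (preg_match('/^Content-Type:[^\r\n]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/mi', $header, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Meta tag, both the http-equiv form and the short <meta charset="..."> form
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared
}

function toUtf8(string $content, ?string $charset): string
{
    if ($charset === null) {
        // Nothing declared: guess from a short candidate list. This can be wrong.
        $guess = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
        $charset = $guess !== false ? $guess : 'ISO-8859-1';
    }
    if ($charset === 'UTF-8' && mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, leave untouched
    }
    // Assumes $charset is an encoding name that mbstring recognises.
    return mb_convert_encoding($content, 'UTF-8', $charset);
}
```

    The header is checked before the meta tag here; that order is a choice made for the sketch, and pages that declare the wrong charset will still come out garbled.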

     

    Last edit: Anonymous 2014-01-09
  • Anonymous

    Anonymous - 2014-01-09

    Thanks for the fast reply. You're right, but where in the code can I get at the raw downloaded HTML and the header? I think that is the best point in the crawling process to detect and change the encoding.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-01-09

    Hi!

    The best point is right inside your handleDocumentInfo() method.
    Everything you need will be there:

    Content/HTML: $DocInfo->content
    Header: $DocInfo->header
    Meta-Tags: $DocInfo->meta_attributes

    Just take a look at the docs:
    http://phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html

    I recommend that you do NOT modify the PHPCrawl source code itself!

     
  • Anonymous

    Anonymous - 2014-01-10

    Hi Uwe, that's where I'm doing it, but I also have to convert the links found and so on (because of special chars in the link text, etc.).
    Thanks for the answers!

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-01-10

    The links found, including their link text, are also available in the handleDocumentInfo() method:
    $DocInfo->links_found_url_descriptors

    Can't you simply convert everything right before you dump it into the DB?
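
    Converting everything in one go right before the dump could be done with a small recursive helper. This is just a sketch under assumptions: utf8Deep() is a made-up name, the source charset of the page is already known, mbstring is loaded, and objects are simply flattened to arrays of their public properties.

```php
<?php
// Sketch: recursively convert every string in a nested result to UTF-8
// right before it is stored. utf8Deep() is a hypothetical helper name.

function utf8Deep($value, string $fromCharset)
{
    if (is_string($value)) {
        return mb_convert_encoding($value, 'UTF-8', $fromCharset);
    }
    if (is_object($value)) {
        // Objects are reduced to an array of their public properties.
        $value = get_object_vars($value);
    }
    if (is_array($value)) {
        $out = [];
        foreach ($value as $key => $item) {
            // String keys are converted too, numeric keys stay as they are.
            if (is_string($key)) {
                $key = mb_convert_encoding($key, 'UTF-8', $fromCharset);
            }
            $out[$key] = utf8Deep($item, $fromCharset);
        }
        return $out;
    }
    return $value; // ints, floats, bools and null pass through unchanged
}

// Possible use inside your own handleDocumentInfo($DocInfo), with $charset
// determined beforehand:
//
// $record = utf8Deep([
//     'content' => $DocInfo->content,
//     'links'   => $DocInfo->links_found_url_descriptors,
// ], $charset);
```

    The result is a plain array, so it can be handed to whatever insert call you already use for MongoDB.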

     
