Hi, first of all I love PHPCrawl.
But (there is always a but) I've got a little problem. I'm saving all my results into MongoDB, and Mongo only accepts UTF-8 strings. Where is the best place to convert all strings to UTF-8?
Best regards, Julian
Hi Julian,
first you'll have to find out (yourself) the encoding of the content of a page.
phpcrawl delivers the content "as it is", which means it doesn't convert anything.
The encoding may be found in the HTTP header (e.g. Content-Type: text/html; charset=ISOxyz) and/or in the HTML as a meta tag (e.g. <meta http-equiv="Content-Type" content="text/html; charset=ISOxyz" />). If neither is present, you may try mb_detect_encoding(), but be careful with that: its results are simply wrong sometimes.
Then you can use iconv() or mb_convert_encoding() to convert the content to UTF-8 before dumping it into MongoDB.
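To illustrate, here is a minimal PHP sketch of that idea (the function name and the list of candidate encodings are my own assumptions, not part of phpcrawl). It checks the header first, then a meta tag, then falls back to mb_detect_encoding():

<?php
// Hypothetical helper: detect a page's charset and convert the content to UTF-8.
// $content and $header are assumed to hold the raw HTML and the raw HTTP header.
function convert_to_utf8($content, $header)
{
    $charset = null;

    // 1. Charset from the HTTP header (Content-Type: text/html; charset=...)
    if (preg_match('/Content-Type:[^\r\n]*charset=([a-zA-Z0-9_\-]+)/i', $header, $m)) {
        $charset = $m[1];
    }
    // 2. Otherwise: charset from a meta tag in the HTML
    elseif (preg_match('/<meta[^>]+charset=["\']?([a-zA-Z0-9_\-]+)/i', $content, $m)) {
        $charset = $m[1];
    }
    // 3. Last resort: guess it (results may be wrong, as noted above)
    if ($charset === null) {
        $charset = mb_detect_encoding($content, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
    }

    // Nothing detected or already UTF-8: return the content unchanged
    if ($charset === false || strtoupper($charset) === 'UTF-8') {
        return $content;
    }

    // iconv($charset, 'UTF-8//IGNORE', $content) would work here as well
    return mb_convert_encoding($content, 'UTF-8', $charset);
}
?>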
There's still an open feature request for phpcrawl regarding this (charset/encoding information of a page):
http://sourceforge.net/p/phpcrawl/feature-requests/13/
Thanks for the fast reply, you're right, but where in the code are the raw downloaded HTML and the header? I think that's the best point in the crawling process to detect and change the encoding.
Hi!
The best point is just inside your handleDocumentInfo() method.
Everything you need will be there:
Content/HTML: $DocInfo->content
Header: $DocInfo->header
Meta-Tags: $DocInfo->meta_attributes
Just take a look at the docs:
http://phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html
I recommend that you do NOT modify the phpcrawl source code itself!
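A minimal sketch of that approach could look like this (the class name, the convert_to_utf8() helper from the earlier sketch and the MongoDB collection handling are assumptions; insert() here is the legacy PHP Mongo driver call):

<?php
// Hypothetical crawler: convert each page to UTF-8 inside handleDocumentInfo()
// before writing it to MongoDB.
class MyCrawler extends PHPCrawler
{
    public $collection; // e.g. a MongoCollection injected from outside

    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Raw HTML and header, delivered by phpcrawl "as it is"
        $html   = $DocInfo->content;
        $header = $DocInfo->header;

        // Convert the content to UTF-8 using the helper sketched earlier
        $utf8_html = convert_to_utf8($html, $header);

        // Store only UTF-8 strings in MongoDB
        $this->collection->insert(array(
            'url'     => $DocInfo->url,
            'content' => $utf8_html,
        ));
    }
}
?>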
Hi Uwe, that's where I'm doing it, but I also have to convert the links found and so on (because of special chars in the link text, etc.).
Thanks for the answers!
The found links, including the link text, are also available in the handleDocumentInfo() method:
$DocInfo->links_found_url_descriptors
Can't you just convert everything right before you dump it into the DB?
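For example, the link texts could be converted the same way just before the insert. A sketch, assuming the descriptor objects expose url_rebuild and linktext properties and reusing the hypothetical convert_to_utf8() helper from above:

<?php
// Hypothetical helper: build an array of UTF-8-converted links from the
// descriptors phpcrawl provides for a page.
function extract_utf8_links(PHPCrawlerDocumentInfo $DocInfo)
{
    $links = array();

    foreach ($DocInfo->links_found_url_descriptors as $descriptor) {
        $links[] = array(
            'url'      => $descriptor->url_rebuild,                                 // rebuilt, fully qualified URL
            'linktext' => convert_to_utf8($descriptor->linktext, $DocInfo->header), // link text as UTF-8
        );
    }

    return $links;
}
?>

The returned array could then be stored alongside the converted page content in the same MongoDB insert.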