Crawler randomly scrambles or truncates binary content mid-file and yields a wrong total byte count.
Effect: Damaged files (images)
Reason: Unknown
Frequency: Rare, but high on specific sites
Example:
http://www.trigon-film.org/en/movies/Theeb/photos/large/theeb_00.jpg
CURL: Downloads 7'694'713 bytes [ok]
IE: Downloads 7'694'713 bytes [ok]
phpcrawl:
Download 1: 7'698'648 bytes [bad]
Download 2: 7'699'896 bytes [bad]
Download 3: 114'670 bytes [bad]
Download 4: 7'701'649 bytes [bad]
http://www.trigon-film.org/en/movies/Theeb/photos/large/theb_04.jpg
CURL: Downloads 2'084'505 bytes [ok]
IE: Downloads 2'084'505 bytes [ok]
phpcrawl:
Download 1: 2'085'454 bytes [bad]
Download 2: 2'084'505 bytes [OK] <!!!!
Download 3: 2'084'726 bytes [bad]
Download 4: 909'150 bytes [bad]
Anonymous
Bug applies to PHPCrawl 0.83
...the bug seems to have to do with Apache sometimes setting
Transfer-Encoding: chunked
see:
http://www.trigon-film.org/en/movies/Centaur/photos/large/Centaur_02.jpg
I was able to work around the problem by forcing the crawler to use HTTP 1.0 (servers must not send chunked responses to HTTP 1.0 requests) by setting
$crawler->setHTTPProtocolVersion(PHPCrawlerHTTPProtocols::HTTP_1_0);
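For anyone who wants to check whether a damaged download is really an undecoded chunked body, here is a minimal sketch of how a chunked body is supposed to be decoded. Note that decodeChunkedBody() is a hypothetical helper written for illustration only; it is not part of PHPCrawl and not how the library handles it internally. In a chunked body every piece of data is preceded by its size as a hex number plus CRLF, so if those size lines are not stripped out, the saved file would be somewhat bigger than the real one and corrupted in the middle. That would fit the byte counts above, but it is only an assumption about the cause.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): decodes a complete HTTP/1.1
// chunked message body that has already been read into a string.
function decodeChunkedBody(string $raw): string
{
    $decoded = '';
    $offset  = 0;
    $length  = strlen($raw);

    while ($offset < $length) {
        // Each chunk starts with a line holding its size in hex,
        // optionally followed by ";extensions", terminated by CRLF.
        $lineEnd = strpos($raw, "\r\n", $offset);
        if ($lineEnd === false) {
            throw new RuntimeException('Malformed chunk header');
        }
        $sizeLine = substr($raw, $offset, $lineEnd - $offset);
        $sizeHex  = trim(explode(';', $sizeLine, 2)[0]);
        if ($sizeHex === '' || !ctype_xdigit($sizeHex)) {
            throw new RuntimeException('Invalid chunk size');
        }
        $size   = (int) hexdec($sizeHex);
        $offset = $lineEnd + 2;

        if ($size === 0) {
            break; // last chunk reached; any trailer headers are ignored
        }
        if ($offset + $size > $length) {
            throw new RuntimeException('Truncated chunk');
        }

        // Copy only the payload, then skip the CRLF that follows it.
        $decoded .= substr($raw, $offset, $size);
        $offset  += $size + 2;
    }

    return $decoded;
}

// Small example: two chunks ("Wiki" and "pedia") plus the terminating chunk.
$raw  = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
$body = decodeChunkedBody($raw);
echo $body, "\n";
```

If running a broken download through something like this gives back a file of the same size as the CURL download, then the chunk headers were left in the data. If it throws, the damage comes from somewhere else.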