
#98 Binary data per site randomly incomplete or scrambled

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2017-09-04
Created: 2016-02-23
Creator: Anonymous
Private: No

The crawler randomly scrambles or truncates binary content mid-file and reports a wrong total byte count.

Effect: Damaged files (images)
Reason: Unknown
Frequency: Rare, but high on specific sites

Example:
http://www.trigon-film.org/en/movies/Theeb/photos/large/theeb_00.jpg
CURL: Downloads 7'694'713 bytes [ok]
IE: Downloads 7'694'713 bytes [ok]
phpcrawl:
Download 1: 7'698'648 bytes [bad]
Download 2: 7'699'896 bytes [bad]
Download 3: 114'670 bytes [bad]
Download 4: 7'701'649 bytes [bad]

http://www.trigon-film.org/en/movies/Theeb/photos/large/theb_04.jpg
CURL: Downloads 2'084'505 bytes [ok]
IE: Downloads 2'084'505 bytes [ok]
phpcrawl:
Download 1: 2'085'454 bytes [bad]
Download 2: 2'084'505 bytes [ok] (the only correct phpcrawl download)
Download 3: 2'084'726 bytes [bad]
Download 4: 909'150 bytes [bad]

Discussion

  • Anonymous

    Anonymous - 2016-02-23

    Bug applies to PHPCrawl 0.83

     
  • Anonymous

    Anonymous - 2017-09-04

    The bug has to do with Apache sometimes sending the response with
    Transfer-Encoding: chunked
    See, for example:
    http://www.trigon-film.org/en/movies/Centaur/photos/large/Centaur_02.jpg

    I was able to work around the problem by forcing the crawler to use HTTP 1.0, which does not support chunked transfer encoding:
    $crawler->setHTTPProtocolVersion(PHPCrawlerHTTPProtocols::HTTP_1_0);
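
For readers unfamiliar with chunked transfer encoding, here is a minimal sketch of how a chunked body is framed and decoded. It is not PHPCrawl's implementation. The helper name decodeChunkedBody and the sample data are made up for illustration. The idea that framing bytes left in the stream would account for the oversized downloads reported above is an assumption and has not been confirmed in this ticket.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): turn a raw body sent with
// "Transfer-Encoding: chunked" back into the original payload.
function decodeChunkedBody(string $raw): string
{
    $decoded = '';
    $pos = 0;
    $len = strlen($raw);

    while ($pos < $len) {
        // Each chunk starts with its size in hexadecimal, terminated by CRLF.
        $lineEnd = strpos($raw, "\r\n", $pos);
        if ($lineEnd === false) {
            throw new RuntimeException('Truncated chunk header');
        }

        // Ignore optional chunk extensions after ';'.
        $header = substr($raw, $pos, $lineEnd - $pos);
        $sizeField = trim(explode(';', $header, 2)[0]);
        if ($sizeField === '' || !ctype_xdigit($sizeField)) {
            throw new RuntimeException('Invalid chunk size');
        }
        $size = (int) hexdec($sizeField);
        $pos = $lineEnd + 2;

        if ($size === 0) {
            // A zero-size chunk ends the body; trailers are ignored here.
            return $decoded;
        }
        if ($pos + $size > $len) {
            throw new RuntimeException('Truncated chunk data');
        }

        $decoded .= substr($raw, $pos, $size);
        $pos += $size + 2; // skip the chunk data and its trailing CRLF
    }

    throw new RuntimeException('Missing terminating zero-size chunk');
}

// Made-up sample: two chunks ("Wiki", "pedia") followed by the end marker.
$raw = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
$body = decodeChunkedBody($raw);
echo strlen($raw), " raw bytes, ", strlen($body), " payload bytes\n";
```

The raw stream is always longer than the payload because of the size lines and CRLF separators, so a client that stores the stream without decoding it would write a file that is both too large and corrupted. With HTTP 1.0 the server cannot use this framing at all, which fits the workaround above.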

     
