#98 Binary data per site randomly incomplete or scrambled

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2017-09-04
Created: 2016-02-23
Creator: Anonymous
Private: No

The crawler randomly scrambles or truncates binary content mid-file and reports a wrong total byte count.

Effect: Damaged files (images)
Reason: Unknown
Frequency: Rare, but high on specific sites

Example:
http://www.trigon-film.org/en/movies/Theeb/photos/large/theeb_00.jpg
CURL: Downloads 7'694'713 bytes [ok]
IE: Downloads 7'694'713 bytes [ok]
phpcrawl:
Download 1: 7'698'648 bytes [bad]
Download 2: 7'699'896 bytes [bad]
Download 3: 114'670 bytes [bad]
Download 4: 7'701'649 bytes [bad]

http://www.trigon-film.org/en/movies/Theeb/photos/large/theb_04.jpg
CURL: Downloads 2'084'505 bytes [ok]
IE: Downloads 2'084'505 bytes [ok]
phpcrawl:
Download 1: 2'085'454 bytes [bad]
Download 2: 2'084'505 bytes [ok] (the only correct download)
Download 3: 2'084'726 bytes [bad]
Download 4: 909'150 bytes [bad]

Discussion

  • Anonymous - 2016-02-23

    Bug applies to PHPCrawl 0.83

     
  • Anonymous - 2017-09-04

    The bug appears to be related to Apache sometimes sending the response with
    Transfer-Encoding: chunked
    See, for example:
    http://www.trigon-film.org/en/movies/Centaur/photos/large/Centaur_02.jpg

    I was able to work around the problem by forcing the crawler to use HTTP 1.0,
    which does not support chunked transfer encoding:
    $crawler->setHTTPProtocolVersion(PHPCrawlerHTTPProtocols::HTTP_1_0);
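
For anyone trying to confirm the diagnosis, here is a minimal sketch of how a chunked body is decoded. It is not PHPCrawl's code; `decodeChunkedBody` is a hypothetical helper written only for illustration. If chunk framing (hex size lines and CRLFs) were left in a saved file, the file would come out slightly larger than the real payload, which would be consistent with the "bad" byte counts above, though that is an assumption and not something verified against the crawler's source.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): decodes an HTTP/1.1
// "Transfer-Encoding: chunked" message body into the raw payload.
function decodeChunkedBody(string $body): string
{
    $decoded = '';
    $offset  = 0;
    $length  = strlen($body);

    while ($offset < $length) {
        // Each chunk starts with "<hex size>[;extensions]\r\n".
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            throw new RuntimeException('Truncated chunk header');
        }
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $sizeHex  = trim(explode(';', $sizeLine, 2)[0]);
        if ($sizeHex === '' || !ctype_xdigit($sizeHex)) {
            throw new RuntimeException('Invalid chunk size');
        }
        $size   = (int) hexdec($sizeHex);
        $offset = $lineEnd + 2;

        if ($size === 0) {
            break; // Last chunk; optional trailer headers are ignored here.
        }
        if ($offset + $size > $length) {
            throw new RuntimeException('Truncated chunk data');
        }
        $decoded .= substr($body, $offset, $size);
        $offset  += $size + 2; // Skip the chunk data and its trailing CRLF.
    }

    return $decoded;
}

// Tiny made-up example body: two chunks followed by the terminating chunk.
$raw     = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
$payload = decodeChunkedBody($raw);
printf("raw: %d bytes, decoded: %d bytes\n", strlen($raw), strlen($payload));
```

The raw string is longer than the decoded payload because of the framing bytes, so comparing a damaged download against a good copy for stray hex size lines may be a quick way to check whether this is what is happening.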

     
