Extract .pdf, .doc, .docx, etc. attached files

Help
Anonymous
2013-12-05
2017-08-02
  • Anonymous

    Anonymous - 2013-12-05

    Hi everybody!
    If a page has attached documents in formats such as .pdf, .doc, or .docx, is it possible to crawl these documents?
    Thanks in advance.
    Jorge von Rudno

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-05

    Hi Jorge,

    yes, it's possible.

    By default the crawler receives every kind of document (as long as you haven't added any rules for it).

    To specify what kinds of documents should get received, use the addContentTypeReceiveRule() method, for example (http://cuab.de/classreferences/index.html)

     
  • Anonymous

    Anonymous - 2013-12-09

    Hi Mr. Uwe,

    Thanks a lot for your answer. I have been trying to implement your suggestion by adding these two instructions:

    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addContentTypeReceiveRule("#text/pdf#");

    The behavior I want is that when the crawler finds a link to a PDF file in the HTML, it extracts the content of that PDF file. At the moment I only get the HTML content, not the PDF file.

    What am I doing wrong? Is the syntax of the second instruction perhaps wrong?

    Thanks a lot for your support.

    Regards.

    Jorge von Rudno

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-09

    Hi Jorge again,

    the mime-type of pdf-documents isn't "text/pdf", it's "application/pdf".

    So set $crawler->addContentTypeReceiveRule("#application/pdf#") and it should
    do the trick.
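
    To illustrate why "#text/pdf#" never matched, here is a minimal sketch. It assumes the rule is applied as a regular expression against the Content-Type value the server sends, and that a PDF is served as "application/pdf". matchesAnyRule() is a hypothetical helper written only for this illustration; it is not part of PHPCrawl.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: returns true when a Content-Type
// value matches at least one of the given PCRE rules.
function matchesAnyRule(string $contentType, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $contentType) === 1) {
            return true;
        }
    }
    return false;
}

// Assumed Content-Type value for a PDF document.
$pdfHeader = "application/pdf";

// The original rule set: neither pattern occurs in "application/pdf".
var_dump(matchesAnyRule($pdfHeader, ["#text/html#", "#text/pdf#"]));

// The corrected rule set: the second pattern matches.
var_dump(matchesAnyRule($pdfHeader, ["#text/html#", "#application/pdf#"]));
```

    With the first rule set the PDF is filtered out, which would explain why only the HTML content came back.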

     
  • Anonymous

    Anonymous - 2013-12-09

    Hi Uwe,

    I am very grateful for your help. I just made the change and now I can crawl the attached PDF files, but I have another problem that perhaps you can help me with.
    When I crawl a document, for example:
    http://www.geopro.com/fileadmin/geopro/downloads/BIBLIOGRAPHY_Makris_et_al_1970-2012.pdf
    I just get something like this:
    %PDF-1.5
    %????
    1 0 obj
    <</Type>>>
    endobj
    2 0 obj
    <>
    endobj
    3 0 obj
    <</Type>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>
    endobj
    4 0 obj
    <>
    stream
    [binary stream data, not human-readable]

    I guess my problem is that I would need to open the file in Adobe to read its content when I use the crawler.

    Many thanks for your time.

    Kind regards.

    Jorge von Rudno

     
  • Anonymous

    Anonymous - 2013-12-10

    Set the correct header before you output any data:
    header('Content-Type: application/pdf');
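
    The header only matters when the script passes the PDF on to a browser. If the aim is to keep the file instead, a minimal sketch, assuming the received document body is available as a PHP string, is to check for the "%PDF-" signature and write the bytes to disk unchanged. looksLikePdf(), savePdf(), the sample string, and the temporary path are hypothetical and only for illustration; none of them is part of PHPCrawl.

```php
<?php
// Hypothetical helpers for illustration; neither is part of PHPCrawl.

// A PDF file starts with the ASCII signature "%PDF-".
function looksLikePdf(string $bytes): bool
{
    return strncmp($bytes, "%PDF-", 5) === 0;
}

// Writes the bytes unchanged to $path. Returns the number of bytes written,
// or false if the data is not a PDF or the write fails.
function savePdf(string $bytes, string $path)
{
    if (!looksLikePdf($bytes)) {
        return false;
    }
    return file_put_contents($path, $bytes);
}

// Stand-in for the document body received by the crawler.
$sample = "%PDF-1.5\n1 0 obj\nendobj\n";

// Placeholder output location.
$target = tempnam(sys_get_temp_dir(), "pdf");

$written = savePdf($sample, $target);
```

    The saved file is still binary PDF data. Getting readable text out of it needs a separate PDF-to-text tool, which this sketch does not cover.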

     
  • Anonymous

    Anonymous - 2017-08-02

    Searching for only PDFs does not work.

     
  • Anonymous

    Anonymous - 2017-08-02

    What have you tried?

     
