PHPCrawl / Forum / Help: Extract .pdf, .doc, .docx, etc. attached files

Anonymous - 2013-12-05

Hi everybody!
If a page has attached documents in formats such as .pdf, .doc, or .docx, is it possible to crawl these documents?
Thanks in advance.
Jorge von Rudno

Uwe Hunfeld - 2013-12-05

Hi Jorge,

yes, it's possible.

By default the crawler receives every kind of document (as long as you haven't added any rules that restrict it).

To specify which kinds of documents should be received, use for example the addContentTypeReceiveRule() method (http://cuab.de/classreferences/index.html).

Anonymous - 2013-12-09

Hi Mr. Uwe,

Thanks a lot for your answer. I have tried to implement your suggestion by adding these two instructions:

$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#text/pdf#");

The behavior I want is that when the crawler finds a link to a PDF file in the HTML, it extracts the content of that PDF file. At the moment I only get the HTML content, not the PDF file.

What am I doing wrong? Perhaps the syntax of the second instruction is wrong?

Thanks a lot for your support.

Regards.

Jorge von Rudno

Uwe Hunfeld - 2013-12-09

Hi Jorge again,

the MIME type of PDF documents isn't "text/pdf", it's "application/pdf".

So set $crawler->addContentTypeReceiveRule("#application/pdf#") and it should do the trick.
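
To see why the rule "#text/pdf#" never matched, here is a minimal sketch in plain PHP that does not depend on PHPCrawl. It assumes that each receive rule is a PCRE pattern tested against the Content-Type header value, which is how the rules in this thread are written. The helper name matchesAnyReceiveRule() is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns true if the
// Content-Type header value matches at least one PCRE rule.
function matchesAnyReceiveRule(string $contentType, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $contentType) === 1) {
            return true;
        }
    }
    return false;
}

// Rules as first posted, and with the corrected MIME type.
$oldRules = ["#text/html#", "#text/pdf#"];
$newRules = ["#text/html#", "#application/pdf#"];

var_dump(matchesAnyReceiveRule("application/pdf", $oldRules));          // bool(false)
var_dump(matchesAnyReceiveRule("application/pdf", $newRules));          // bool(true)
var_dump(matchesAnyReceiveRule("text/html; charset=utf-8", $newRules)); // bool(true)
```

A server sends PDF files with the header value "application/pdf", so only a pattern that matches that string lets the document through.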

Anonymous - 2013-12-09

Hi Uwe,

I am very grateful for your help. I just made the change and now I can crawl the attached PDF file, but now I have another problem and perhaps you can help me.
When I crawl a URL, for example:
http://www.geopro.com/fileadmin/geopro/downloads/BIBLIOGRAPHY_Makris_et_al_1970-2012.pdf
I just get something like this:
%PDF-1.5
%????
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(de-DE) /StructTreeRoot 60 0 R/MarkInfo<>>>
endobj
2 0 obj
<>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<>
stream
[binary stream data, displayed as unreadable characters]

I guess my problem is that I need to open the file with Adobe to read its content when I use the crawler.

Many thanks for your time.

Kind regards.

Jorge von Rudno
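
The output quoted above is the raw byte stream of the PDF file, not its readable text. As a minimal sketch in plain PHP, independent of PHPCrawl, the received bytes could be recognised by the "%PDF-" signature visible in the first line of that output and written to disk unchanged, so that a separate PDF text-extraction tool can process the file later. The function names and the output path are hypothetical.

```php
<?php
// Hypothetical helper: PDF files begin with the signature "%PDF-".
function looksLikePdf(string $bytes): bool
{
    return strncmp($bytes, "%PDF-", 5) === 0;
}

// Hypothetical helper: store the bytes unchanged instead of echoing them.
// Returns the number of bytes written, or false for non-PDF input or on failure.
function savePdfBytes(string $bytes, string $path)
{
    if (!looksLikePdf($bytes)) {
        return false;
    }
    return file_put_contents($path, $bytes);
}

// Stand-in for content received by a crawler (not a complete PDF).
$sample = "%PDF-1.5\n1 0 obj\n<</Type/Catalog>>\nendobj\n";
$target = sys_get_temp_dir() . "/crawled-sample.pdf";

var_dump(looksLikePdf($sample));          // bool(true)
var_dump(looksLikePdf("<html></html>"));  // bool(false)
var_dump(savePdfBytes($sample, $target) === strlen($sample));
```

Echoing such bytes into an HTML page or a terminal produces the unreadable characters shown in the post. Saving them as a file keeps the document intact.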

Anonymous - 2013-12-10

Set the correct header before you output any data:
header('Content-Type: application/pdf');

Anonymous - 2017-08-02

Searching for only PDF files does not work.

Anonymous - 2017-08-02

What have you tried?
