Hi everybody!
If a page has attached documents in formats like .pdf, .doc, .docx, etc., is it possible to crawl these documents?
Thanks in advance.
Jorge von Rudno
Hi Jorge,
yes, it's possible.
By default the crawler receives every kind of document (as long as you haven't added any rules for it).
To specify which kinds of documents should get received, use the addContentTypeReceiveRule() method, for example (http://cuab.de/classreferences/index.html).
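A minimal sketch of such a setup (untested; the class, method, and property names are taken from the class reference linked above, so treat the details as assumptions for your version of the library):

<?php
// Sketch of a basic PHPCrawl setup with a content-type receive rule.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // Called once for every document the crawler received.
  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    echo $DocInfo->url . " (" . $DocInfo->content_type . ")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com/");

// Without any rules, every content-type gets received.
// This rule restricts receiving to HTML pages only:
$crawler->addContentTypeReceiveRule("#text/html#");

$crawler->go();
?>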
Hi Mr. Uwe,
Thanks a lot for your answer. I have tried to implement your suggestion by adding these two instructions:
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#text/pdf#")
The behavior I want is that when the crawler finds a link in the HTML that points to a PDF file, it extracts the content of that PDF file. But at the moment I only get the HTML content, not the PDF file.
What am I doing wrong? Perhaps the syntax in the second instruction is wrong?
Thanks a lot for your support.
Regards.
Jorge von Rudno
Hi Jorge again,
the mime-type of PDF documents isn't "text/pdf", it's "application/pdf".
So set $crawler->addContentTypeReceiveRule("#application/pdf#") and it should
do the trick.
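In other words, the two rules from above become (only the second one changes):

$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#application/pdf#");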
Hi Uwe,
I am very grateful for your help. I just made the change and now I can crawl attached PDF files, but now I have another problem, and perhaps you can help me.
When I crawl a website, for example:
http://www.geopro.com/fileadmin/geopro/downloads/BIBLIOGRAPHY_Makris_et_al_1970-2012.pdf
I get just something like this:
%PDF-1.5
1 0 obj
<</Type ...>>
endobj
...
4 0 obj
stream
[... garbled binary/compressed stream data ...]
I guess my problem is that I would need to open the file in Adobe Reader to read its content, but the crawler only gives me the raw data.
Many thanks for your time.
Kind regards.
Jorge von Rudno
Set the correct headers before outputting any data:
call header('Content-Type: application/pdf'); before you output the data.
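For example (a sketch; it assumes the raw PDF bytes are available in $DocInfo->source, as per the class reference):

<?php
// Sketch: hand the received PDF through to the browser unchanged.
header('Content-Type: application/pdf');
echo $DocInfo->source;
?>

If the goal is to extract the text rather than display the file, the raw %PDF bytes have to go through a PDF parser first. A sketch using the external pdftotext tool from poppler-utils (an assumption here; any PDF parsing library would work as well):

<?php
// Sketch: write the received bytes to a temp file and let
// pdftotext convert them; "-" sends the extracted text to stdout.
file_put_contents("/tmp/received.pdf", $DocInfo->source);
$text = shell_exec("pdftotext /tmp/received.pdf -");
echo $text;
?>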
Searching for only PDF files does not work.
What have you tried?