Hi!
I ran into a problem when trying to crawl WebDAV-generated XML pages. I get content like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?>
<!DOCTYPE svn [
  <!ELEMENT svn (index)>
  <!ATTLIST svn version CDATA #REQUIRED
                href    CDATA #REQUIRED>
  <!ELEMENT index (updir?, (file | dir)*)>
  <!ATTLIST index name CDATA #IMPLIED
                  path CDATA #IMPLIED
                  rev  CDATA #IMPLIED
                  base CDATA #IMPLIED>
  <!ELEMENT updir EMPTY>
  <!ELEMENT file EMPTY>
  <!ATTLIST file name CDATA #REQUIRED
                 href CDATA #REQUIRED>
  <!ELEMENT dir EMPTY>
  <!ATTLIST dir name CDATA #REQUIRED
                href CDATA #REQUIRED>
]>
<svn version="1.5.5 (r34862)" href="http://subversion.tigris.org/">
  <index rev="52098" path="/" base="storage">
    <dir name="Storage" href="Storage/" />
    <dir name="Projects" href="Projects/" />
    <dir name="Users" href="Users/" />
  </index>
</svn>
The crawler doesn't find any links in this page. I tried adding the "dir" tag, but it didn't change anything:
$crawler->addLinkExtractionTags("dir", "href");
Am I doing something wrong, or does phpcrawler not support XML pages?
Thanks in advance, Artyom
You're right: by default, phpcrawl doesn't search for links in XML documents. It only checks documents of type "text/html" for links.
The content type of XML documents is usually "text/xml" (as far as I know).
A small modification to the source code should do the trick. Just try changing lines 335 and 376 in phpcrawlerpagerequest.class.php from
if (preg_match("/text\/html/ i", $actual_content_type))
to
if (preg_match("/text\// i", $actual_content_type))
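To see what the change actually lets through, here is a minimal standalone sketch that compares the two patterns against a few content types. The helper matchesContentType() is hypothetical and not part of phpcrawl, and the sample header values are only assumptions about what a server might send.

```php
<?php
// Hypothetical helper, not part of phpcrawl: wraps the content-type test
// so the stock pattern and the broadened one can be compared side by side.
function matchesContentType(string $pattern, string $contentType): bool
{
    return preg_match($pattern, $contentType) === 1;
}

$original  = "/text\/html/i"; // stock check: HTML documents only
$broadened = "/text\//i";     // modified check: any text/* content type

// Sample header values (assumed for illustration).
$samples = ["text/html", "text/xml; charset=utf-8", "application/xml"];

foreach ($samples as $type) {
    printf(
        "%-26s original=%s broadened=%s\n",
        $type,
        var_export(matchesContentType($original, $type), true),
        var_export(matchesContentType($broadened, $type), true)
    );
}
```

Note that the broadened pattern also accepts other text/* types such as text/plain and text/css, while a server that labels its XML as "application/xml" would still be skipped.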
Hope it works!
Best regards!
huni, thank you very much for the advice, it helped!
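For anyone who would rather not patch the library, the links in a listing like the one above can also be pulled out with PHP's DOM extension. This is only a rough sketch, independent of phpcrawl: extractSvnIndexLinks() is a hypothetical helper, the base URL is a placeholder, and relative hrefs are resolved by simple concatenation rather than full URL resolution.

```php
<?php
// Hypothetical helper, independent of phpcrawl: collect the href attribute
// of every <dir> and <file> element in an svn index listing.
function extractSvnIndexLinks(string $xml, string $baseUrl): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return []; // input was not well-formed XML
    }

    $links = [];
    foreach (['dir', 'file'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // Entries are relative to the listing, so prefix the base URL.
            $links[] = rtrim($baseUrl, '/') . '/' . $node->getAttribute('href');
        }
    }
    return $links;
}

// Shortened version of the listing from the question.
$xml = '<svn version="1.5.5 (r34862)" href="http://subversion.tigris.org/">'
     . '<index rev="52098" path="/" base="storage">'
     . '<dir name="Storage" href="Storage/" />'
     . '<dir name="Projects" href="Projects/" />'
     . '</index></svn>';

// Placeholder base URL (assumption for illustration).
print_r(extractSvnIndexLinks($xml, 'http://example.com/svn/storage'));
```

The resulting URLs could then be fetched or queued however you like. The <updir> element is ignored on purpose so the walk never climbs above the starting directory.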