PHPCrawl / Forum / Help: Crawling WebDAV XML pages

Hi!

I ran into a problem when trying to crawl WebDAV-generated XML pages. I get content like this:

<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?> 
<!DOCTYPE svn [
  <!ELEMENT svn   (index)> 
  <!ATTLIST svn   version CDATA #REQUIRED
                  href    CDATA #REQUIRED> 
  <!ELEMENT index (updir?, (file | dir)*)> 
  <!ATTLIST index name    CDATA #IMPLIED
                  path    CDATA #IMPLIED
                  rev     CDATA #IMPLIED
                  base    CDATA #IMPLIED> 
  <!ELEMENT updir EMPTY> 
  <!ELEMENT file  EMPTY> 
  <!ATTLIST file  name    CDATA #REQUIRED
                  href    CDATA #REQUIRED> 
  <!ELEMENT dir   EMPTY> 
  <!ATTLIST dir   name    CDATA #REQUIRED
                  href    CDATA #REQUIRED> 
]>
<svn version="1.5.5 (r34862)"
     href="http://subversion.tigris.org/"> 
  <index rev="52098" path="/" base="storage"> 
    <dir name="Storage" href="Storage/" /> 
    <dir name="Projects" href="Projects/" /> 
    <dir name="Users" href="Users/" /> 
  </index> 
</svn>

The crawler doesn't find any links in this page. I tried adding the "dir" tag, but it didn't change anything:

$crawler->addLinkExtractionTags("dir", "href");

Am I doing something wrong, or does PHPCrawl not support XML pages?

Thanks in advance,
Artyom

Uwe Hunfeld - 2011-05-15

Hi!

You are right: by default, phpcrawl doesn't search for links in XML documents; it
only checks documents of the type "text/html" for links.

The content type of XML docs is usually "text/xml" (as far as I know).

A little mod to the source code should do the trick.
Just try changing lines 335 and 376 in phpcrawlerpagerequest.class.php from

if (preg_match("/text\/html/ i", $actual_content_type))

to

if (preg_match("/text\// i", $actual_content_type))
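To see what the change does, here is a minimal sketch that runs both patterns against a few sample Content-Type values. The two helper functions are hypothetical names used only for illustration; they are not part of phpcrawl, and the sample header values are assumptions. Note that the relaxed pattern still would not match a server that sends "application/xml".

```php
<?php
// Hypothetical helpers (not part of phpcrawl) wrapping the two patterns above.

// Original check: only HTML documents are searched for links.
function matchesOriginalCheck(string $contentType): bool
{
    return preg_match("/text\/html/i", $contentType) === 1;
}

// Relaxed check: any "text/..." content type is searched for links.
function matchesRelaxedCheck(string $contentType): bool
{
    return preg_match("/text\//i", $contentType) === 1;
}

// Sample Content-Type header values (assumed for illustration).
$samples = ["text/html; charset=utf-8", "text/xml", "application/xml"];

foreach ($samples as $type) {
    printf(
        "%-26s original=%s relaxed=%s\n",
        $type,
        var_export(matchesOriginalCheck($type), true),
        var_export(matchesRelaxedCheck($type), true)
    );
}
```

If the server turned out to send "application/xml", the pattern would need to be widened further; that case is not covered by the change above.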

Hope it works!

Best regards!

Nobody/Anonymous - 2011-05-19

huni,
thank you very much for the advice, it helped!

