
Crawling WebDAV XML pages

Help
2011-05-13
2013-04-09
  • Nobody/Anonymous

    Hi!

    I ran into a problem when trying to crawl WebDAV-generated XML pages. I get content like this:

    <?xml version="1.0"?> 
    <?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?> 
    <!DOCTYPE svn [
      <!ELEMENT svn   (index)> 
      <!ATTLIST svn   version CDATA #REQUIRED
                      href    CDATA #REQUIRED> 
      <!ELEMENT index (updir?, (file | dir)*)> 
      <!ATTLIST index name    CDATA #IMPLIED
                      path    CDATA #IMPLIED
                      rev     CDATA #IMPLIED
                      base    CDATA #IMPLIED> 
      <!ELEMENT updir EMPTY> 
      <!ELEMENT file  EMPTY> 
      <!ATTLIST file  name    CDATA #REQUIRED
                      href    CDATA #REQUIRED> 
      <!ELEMENT dir   EMPTY> 
      <!ATTLIST dir   name    CDATA #REQUIRED
                      href    CDATA #REQUIRED> 
    ]>
    <svn version="1.5.5 (r34862)"
         href="http://subversion.tigris.org/"> 
      <index rev="52098" path="/" base="storage"> 
        <dir name="Storage" href="Storage/" /> 
        <dir name="Projects" href="Projects/" /> 
        <dir name="Users" href="Users/" /> 
      </index> 
    </svn>
    

    The crawler doesn't find any links on this page. I tried adding the "dir" tag, but it didn't change anything:

    $crawler->addLinkExtractionTags("dir", "href");
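
    For reference, the href values in a listing like the one above can be read with PHP's built-in SimpleXML extension, independently of phpcrawl. This is only a minimal sketch: the function name extractSvnIndexLinks is hypothetical, and the listing is assumed to be available as a string (the XML declaration and DTD are left out of the sample for brevity).

    ```php
    <?php
    // Hypothetical helper, not part of phpcrawl: collects the href values of
    // the <dir> and <file> entries found under <svn><index>.
    function extractSvnIndexLinks(string $xml): array
    {
        $doc = simplexml_load_string($xml);
        if ($doc === false || !isset($doc->index)) {
            return []; // not well-formed, or not an svn index listing
        }
        $links = [];
        foreach ($doc->index->children() as $entry) {
            $name = $entry->getName();
            if ($name === 'dir' || $name === 'file') {
                $links[] = (string) $entry['href'];
            }
        }
        return $links;
    }

    // Sample data copied from the listing above.
    $xml = '<svn version="1.5.5 (r34862)" href="http://subversion.tigris.org/">'
         . '<index rev="52098" path="/" base="storage">'
         . '<dir name="Storage" href="Storage/" />'
         . '<dir name="Projects" href="Projects/" />'
         . '<dir name="Users" href="Users/" />'
         . '</index>'
         . '</svn>';

    print_r(extractSvnIndexLinks($xml));
    ```

    The returned hrefs are relative, so they would still need to be resolved against the URL of the listing before being requested.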
    

    Am I doing something wrong, or does phpcrawler not support XML pages?

    Thanks in advance,
    Artyom

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2011-05-15

    Hi!

    You are right: by default, phpcrawl doesn't search for links in XML documents;
    it only checks documents of the type "text/html" for links.

    The content type of XML docs is usually "text/xml" (as far as I know).

    A small modification to the source code should do the trick.
    Just try changing lines 335 and 376 in phpcrawlerpagerequest.class.php from

    if (preg_match("/text\/html/ i", $actual_content_type))
    

    to

    if (preg_match("/text\// i", $actual_content_type))
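
    The effect of that change can be checked in isolation. This is a sketch only, not phpcrawl code: matchesPattern is a hypothetical helper, the content-type strings are made-up samples, and the two patterns are written here without the space before the "i" modifier.

    ```php
    <?php
    // Hypothetical helper that mirrors the content-type check quoted above.
    function matchesPattern(string $pattern, string $contentType): bool
    {
        return preg_match($pattern, $contentType) === 1;
    }

    $original = '/text\/html/i'; // check before the change
    $modified = '/text\//i';     // check after the change

    // Print, for each sample content type, whether each pattern matches it.
    foreach (['text/html', 'text/xml', 'application/xml'] as $type) {
        printf(
            "%-16s original=%s modified=%s\n",
            $type,
            var_export(matchesPattern($original, $type), true),
            var_export(matchesPattern($modified, $type), true)
        );
    }
    ```

    Note that the modified pattern accepts every "text/..." type (including text/plain and text/css), while an XML document served as "application/xml" would still be skipped.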
    

    Hope it works!

    Best regards!

     
  • Nobody/Anonymous

    huni,
    thank you very much for the advice, it helped!

     
