autodiscover domain to crawl in the same index

Help
a_l_e
2014-03-18
2014-04-03
  • a_l_e

    a_l_e - 2014-03-18

    Hi all,
    I would like to know how I could implement a sort of "autodiscover" function.
    I have to crawl a series of "selflinked" domains. For this task, I would like to enter only one domain and let OpenSearchServer recursively add every domain it discovers in the pages and crawl it later in the same index.

    Thank you for your support!

     
  • Alexandre Toyer

    Alexandre Toyer - 2014-03-21

    Hi a_l_e,

    You can achieve this by unchecking the "Enabled" checkbox in the "Pattern list" tab of the Web Crawler. But take care! This could result in crawling the whole web...

    If you already know which domains are involved, you can instead leave the "Enabled" checkbox checked and add every domain (with a final "/*") to the pattern list.

    Regards,
    Alexandre

     
    Last edit: Alexandre Toyer 2014-03-21
  • a_l_e

    a_l_e - 2014-03-21

    Hi, thank you for your reply.

    I unchecked the "Enabled" checkbox in the "Pattern list" tab and it all seemed to work!
    BUT some of our web developers sometimes insert links to external domains, and the crawler started crawling the web...

    I do not know the list of domains a priori, but I know they have to match a format (i.e. mysite * .tld). I tried to use this in the pattern list, something like "http * ://mysite * .tld" (the spaces are added so that this forum does not escape the characters), but this doesn't seem to work.

    Suggestions?

    Regards,
    Ale

     
    Last edit: a_l_e 2014-03-21
  • Alexandre Toyer

    Alexandre Toyer - 2014-03-24

    Hello Ale,

    Yes, this feature would be interesting. We logged this request some months ago: https://github.com/jaeksoft/opensearchserver/issues/14
    Unfortunately it has not been implemented yet; we'll try to do it quickly.

    Regards,
    Alexandre

     
  • a_l_e

    a_l_e - 2014-03-29

    Hi Alexandre,
    I think this feature will be really useful. While waiting for an implementation, I would like to share a dirty workaround I'm using. I configured OpenSearchServer to use a local proxy and implemented, via Squid, a redirection to a local blank page if the domain name doesn't match the required criteria.

    I hope to hear from you soon.
    Thanks
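
    A minimal sketch of the kind of hostname check such a proxy rule relies on. It is not the Squid configuration itself; the glob "mysite*.tld" is taken from the format mentioned earlier in the thread, and the function name host_allowed is made up for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Assumed criterion, based on the "mysite*.tld" format described in the thread.
ALLOWED_HOST_GLOB = "mysite*.tld"


def host_allowed(url: str, host_glob: str = ALLOWED_HOST_GLOB) -> bool:
    """Return True when the URL's hostname matches the glob."""
    # urlsplit lower-cases the hostname and returns None when there is none.
    host = urlsplit(url).hostname
    return host is not None and fnmatchcase(host, host_glob)


if __name__ == "__main__":
    for candidate in ("http://mysite-shop.tld/page", "https://example.com/"):
        print(candidate, host_allowed(candidate))
```

    A proxy would answer with the blank page whenever this kind of check fails, so the crawler never sees links from outside domains.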

     
  • a_l_e

    a_l_e - 2014-03-31

    Hi,
    I installed version 1.5.3 and am trying to crawl only domains with the ".local" TLD (in my project all domains are without hostname and have the .local TLD. Example: first.local, second.local...)

    In this scenario, I'm using this pattern list:
    http:// *.local/ *
    https:// *.local/ *
    http://first.local/index.html

    I use "first.local" so that the crawler can start, but it crawls only the index.html page and doesn't follow the links inside it.

    opensearchserver logs:
    Unable to extract URL from http:// *.local/ *

    Suggestions?
    Thanks

     
    Last edit: a_l_e 2014-03-31
    • Emmanuel Keller

      Emmanuel Keller - 2014-04-01

      You can safely ignore this warning:
      Unable to extract URL from http:// *.local/ *

      I saw that there are spaces in your pattern. Maybe that is the issue?
      http://[ ]*.local/[ ]*

      About first.local, it is exactly what you have to do. As OpenSearchServer is not able to generate a URL from the wildcard pattern (which causes the warning), it needs an entry point to start the crawl.
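
      To illustrate the distinction, here is a small sketch that separates wildcard patterns (usable only as filters) from concrete URLs (usable as entry points). The helper split_patterns is hypothetical and is not part of OpenSearchServer.

```python
def split_patterns(patterns):
    """Separate wildcard patterns from concrete URLs.

    A pattern containing "*" cannot be turned into a single URL, so it can
    only filter links; a pattern without "*" is itself a crawlable URL.
    """
    wildcards = [p for p in patterns if "*" in p]
    entry_points = [p for p in patterns if "*" not in p]
    return wildcards, entry_points


patterns = [
    "http://*.local/*",
    "https://*.local/*",
    "http://first.local/index.html",
]
wildcards, entry_points = split_patterns(patterns)

if __name__ == "__main__":
    print("filters:", wildcards)
    print("entry points:", entry_points)
```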

       
      Last edit: Emmanuel Keller 2014-04-01
      • a_l_e

        a_l_e - 2014-04-01

        Hi,
        Unfortunately the spaces in my pattern are only an artifact of this post, from writing the * and / characters.

        My "pattern list", enabled, is:
        http://*.local/*
        http://first.local/index2.html

        index2.html contains links to contents in other *.local domains

        Only index2.html is crawled.
        http://first.local/index2.html is the only permalink in "url browser" tab.
        first.local is the only hostname in "hostnames" tab.

        PS: I noticed that with "http://*.local/*" in the pattern list, the URL "http://first.local/index2.html" is NOT crawled with "manual crawl" either.
        After adding "http://first.local/*" to the pattern list, "http://first.local/index2.html" is crawled with "manual crawl" as expected.

         
        Last edit: a_l_e 2014-04-01
        • Emmanuel Keller

          Emmanuel Keller - 2014-04-02

          OK, we found a bug. We group patterns by top domain to handle large lists of patterns efficiently. This optimization did not correctly handle wildcards in the host part of the URL. It is fixed now.

          You can test the latest build here:
          http://www.open-search-server.com/ftp/OpenSearchServer_1.5/build-1.5-b557/
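
          For readers curious how this class of bug can arise, here is a rough sketch. It is not OpenSearchServer's actual code: it assumes patterns are indexed by host, and the names build_index, is_allowed and WILDCARD_BUCKET are invented for illustration. If wildcard hosts such as "*.local" were stored under their literal host key, a lookup for "first.local" would never find them, so the sketch keeps them in a separate bucket that is always consulted.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

WILDCARD_BUCKET = "*"  # hypothetical key for patterns whose host contains "*"


def host_of(pattern_or_url):
    # urlsplit does not validate the host, so "*.local" is returned as-is.
    return urlsplit(pattern_or_url).netloc.lower()


def build_index(patterns):
    """Group patterns by host; wildcard hosts go to a shared bucket."""
    index = defaultdict(list)
    for pattern in patterns:
        host = host_of(pattern)
        index[WILDCARD_BUCKET if "*" in host else host].append(pattern)
    return index


def matches(url, pattern):
    """Compare scheme, host and path separately so "*" in the host
    cannot swallow a "/" from the path."""
    u, p = urlsplit(url), urlsplit(pattern)
    return (u.scheme == p.scheme
            and fnmatchcase(u.netloc.lower(), p.netloc.lower())
            and fnmatchcase(u.path or "/", p.path or "/"))


def is_allowed(url, index):
    # Look at the exact-host group first, then at the wildcard bucket.
    candidates = index.get(host_of(url), []) + index.get(WILDCARD_BUCKET, [])
    return any(matches(url, p) for p in candidates)
```

          With the pattern list from this thread, a URL on second.local is found through the wildcard bucket even though no group exists for that host.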

           
          • a_l_e

            a_l_e - 2014-04-02

            Perfect! Now it's working as expected.
            Thank you so much for your support.

             
  • Andrew Fordred

    Andrew Fordred - 2014-04-01

    Hello everyone

    I also tried this in 1.5.3 and it did not work, e.g. http://.somewebsite.com/

    The star does not show up after the /

    Should "Enabled" be checked or not in the pattern list?

    Thanks

     
    Last edit: Andrew Fordred 2014-04-01
  • Emmanuel Keller

    Emmanuel Keller - 2014-04-01

    Andrew, you should use these patterns:

    http://*.somewebsite.com/*
    http://www.somewebsite.com/
    

    The first line describes which patterns are allowed.
    The second line is the crawl starting point.
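
    The interplay of the two lines can be sketched with a toy crawl over an in-memory link graph. Everything here is illustrative: the LINKS data is made up, and allowed and crawl are hypothetical helpers, not OpenSearchServer functions.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def allowed(url, patterns):
    """True when scheme, host and path of the URL match any pattern."""
    u = urlsplit(url)
    for pattern in patterns:
        p = urlsplit(pattern)
        if (u.scheme == p.scheme
                and fnmatchcase(u.netloc.lower(), p.netloc.lower())
                and fnmatchcase(u.path or "/", p.path or "/")):
            return True
    return False


def crawl(start, patterns, links):
    """Breadth-first walk over `links` (url -> list of outgoing urls),
    following only links that the pattern list allows."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen and allowed(target, patterns):
                seen.add(target)
                queue.append(target)
    return order


PATTERNS = ["http://*.somewebsite.com/*", "http://www.somewebsite.com/"]
LINKS = {  # made-up link graph standing in for real pages
    "http://www.somewebsite.com/": [
        "http://blog.somewebsite.com/post",
        "http://external.example.org/",
    ],
    "http://blog.somewebsite.com/post": ["http://www.somewebsite.com/"],
}

if __name__ == "__main__":
    print(crawl("http://www.somewebsite.com/", PATTERNS, LINKS))
```

    The external link is dropped because it matches neither pattern, while the subdomain page is reached from the starting point.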

     
    Last edit: Emmanuel Keller 2014-04-02
  • Andrew Fordred

    Andrew Fordred - 2014-04-03

    Will this help if, for instance, a site has a database? It is an open database, i.e. it does not require a log-in but has a search field. Will OSS now crawl the database using the autodiscover?

     
