#171 OSS 1.4.1beta3 is not able to crawl http://www.cytoscape.org/* completely

v1.4
open
nobody
None
1
2013-05-04
2013-02-01
melchiaros
No

On trying to crawl the pattern http://www.cytoscape.org/*

(* stands for all content under the URL)

OSS 1.4.1beta3 failed right after the homepage (1 page processed) and declared the crawl finished.
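As a rough illustration of the wildcard semantics described above, the trailing `*` can be read as shell-style globbing ("any suffix"). This is a sketch under that assumption, not the actual OSS pattern matcher:

```python
from fnmatch import fnmatch

# Assumed semantics: trailing * matches any suffix under the site root.
pattern = "http://www.cytoscape.org/*"

matches_download = fnmatch("http://www.cytoscape.org/download.html", pattern)
matches_other = fnmatch("http://example.org/index.html", pattern)
print(matches_download, matches_other)  # the first URL matches, the second does not
```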

I attach the oss.log

1 Attachment

Discussion

  • melchiaros

    melchiaros - 2013-02-01

    The log shows a

    13:58:58,073 root - com.jaeksoft.searchlib.SearchLibException: File not found
    com.jaeksoft.searchlib.web.ServletException: com.jaeksoft.searchlib.SearchLibException: File not found
        at com.jaeksoft.searchlib.web.ScreenshotServlet.doRequest(ScreenshotServlet.java:132)

    but that timestamp and content are associated with the screenshot function I had used some minutes before (for this ticket it was deactivated again and should not affect this issue).

     
  • melchiaros

    melchiaros - 2013-02-01

    More entries correlated with that time:

    14:03:45,076 root - RELOAD - Hourly - Fri Feb 01 13:00:00 CET 2013 - Count:3 - Average:8.666667 - Min:2 - Max:12
    14:05:11,598 root - The element type "link" must be terminated by the matching end-tag "</link>".
    14:05:12,186 root - Expected scheme-specific part at index 7: mailto:

     
  • melchiaros

    melchiaros - 2013-02-01

    I have made some more attempts.

    Sometimes the crawl fails. Right now one is doing its job (44 pages so far; the limit is 100 pages per host).

    I have also observed this behaviour with OSS 1.3.1.

     
  • melchiaros

    melchiaros - 2013-02-01

    The last crawl failed at 42 pages (not 44 as mentioned above).

    A look at the logs again shows:

    14:25:33,260 root - The element type "link" must be terminated by the matching end-tag "</link>".
    14:25:43,025 root - The element type "link" must be terminated by the matching end-tag "</link>".
    14:25:43,135 root - Illegal character in path at index 25: http://www.cytoscape.org/<?= $latest_download_link ?>
    14:25:43,135 root - Illegal character in path at index 25: http://www.cytoscape.org/<?= $latest_release_notes_link ?>

    The full .log will be attached.
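Both log messages can be reproduced in miniature. The sketch below assumes a strict XML parser (as the "must be terminated by the matching end-tag" message suggests) rejecting an unterminated `<link>`, and shows that the leaked PHP template placeholder contains characters that are not allowed in a URI path per RFC 3986. It is an illustration, not the actual OSS parsing code:

```python
import xml.etree.ElementTree as ET

# An unterminated <link> element, as a strict XML parser would see it.
bad_feed = "<rss><channel><link>http://www.cytoscape.org/</channel></rss>"

try:
    ET.fromstring(bad_feed)
    parse_ok = True
except ET.ParseError as err:
    parse_ok = False
    print("parse failed:", err)  # mismatched tag, analogous to the OSS log line

# The PHP placeholder leaked into an href; space, '<' and '>' are illegal
# in a URI path (RFC 3986), matching the "Illegal character in path" message.
leaked = "http://www.cytoscape.org/<?= $latest_download_link ?>"
bad_chars = sorted({c for c in leaked if c in ' <>"'})
print("illegal path characters:", bad_chars)
```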

     
  • melchiaros

    melchiaros - 2013-02-01

    So it seems the missing end tag is what makes OSS abort.

     
  • melchiaros

    melchiaros - 2013-02-01

    A third check also shows:

    14:36:44,329 root - The element type "link" must be terminated by the matching end-tag "</link>".

     
  • Naveen A.N

    Naveen A.N - 2013-02-05

    Hello,

    Do you mean the crawler terminates with "Aborted" status?

    I made a test crawl for the same URL and 8 documents were indexed.

    All the other URLs are images.

    The URL browser shows 84 URLs.

    --Naveen.A.N

     
  • melchiaros

    melchiaros - 2013-02-09

    No, the status "Aborted" is never shown.

    The crawl ends with a different number of documents each time (using different crawlers: Indices -> Create a new index -> and so on).

    You can also retest this with one crawler:

    run it once, then run it again manually. You will see that the number of documents increases.

     
  • Naveen A.N

    Naveen A.N - 2013-02-15

    Hello,

    Do you mean that you started the crawler with the "RunOnce" option, and once it stopped you started it again with "RunOnce", and the number of documents increased?

    --Naveen.A.N

     
  • melchiaros

    melchiaros - 2013-02-15

    Exactly.


    Additionally:

    You can also show this with different crawlers -> different numbers of documents for the URL given above.

     
  • Naveen A.N

    Naveen A.N - 2013-02-15

    Hello,

    Here is the explanation for why the number of documents increases with each crawl session using the "RunOnce" option.

    The pattern is http://www.cytoscape.org/*, which means the crawler will crawl all the URLs of the website.

    In the first crawl session, the crawler crawls the homepage URL http://www.cytoscape.org/ and stops because of the "RunOnce" option, but the links extracted from the homepage are added to the URL browser. You can check the URLs under Crawler/Web/URL browser.

    When the crawler is started for the next crawl session, it crawls the URLs extracted by the first session.

    That is why the number of documents increases with each crawl session.

    --Naveen.A.N
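The session-by-session behaviour Naveen describes can be sketched as a frontier queue that only grows between sessions. The link graph below is invented for illustration; the real crawler discovers links by fetching pages:

```python
from collections import deque

# Hypothetical link graph standing in for www.cytoscape.org.
LINKS = {
    "http://site/": ["http://site/a", "http://site/b"],
    "http://site/a": ["http://site/c"],
    "http://site/b": [],
    "http://site/c": [],
}

def run_once(frontier, indexed):
    """One 'RunOnce' session: crawl only what is queued now; queue what is found."""
    discovered = []
    while frontier:
        url = frontier.popleft()
        if url in indexed:
            continue
        indexed.add(url)                       # the document gets indexed
        discovered.extend(LINKS.get(url, []))  # extracted links go to the URL browser
    frontier.extend(discovered)                # crawled only in the NEXT session

frontier, indexed = deque(["http://site/"]), set()
counts = []
for _ in range(3):
    run_once(frontier, indexed)
    counts.append(len(indexed))
print(counts)  # grows session by session: [1, 3, 4]
```

Each RunOnce session indexes only the URLs already queued, so the document count rises until the frontier is exhausted.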

     
  • melchiaros

    melchiaros - 2013-02-15

    Aha, the crawler works incrementally.

    That was not clear to me. I had thought it would run only once (fetching everything available) and be done.

     
  • melchiaros

    melchiaros - 2013-02-15

    I have retested with two different crawlers.

    Both work as you describe.

    My statement about different end points on every crawl was wrong.

    Sorry, I had mixed it up in my mind.

    Thank you for the clarification, Naveen.

    This ticket can be closed. (Maybe you want to take something from this for the documentation.)

     
