A Crawler sometimes decides to skip a certain document,
e.g. because it matches an exclude pattern or because
its file size exceeds a certain threshold.
Currently, such resources are completely ignored. This
may cause confusion among users: on several occasions,
some of our customers complained about not being able
to find a document, and it took some time to discover
that it had been skipped for such a reason.
Adding a method to CrawlerHandler that informs it when
a certain URL gets skipped would enable application
developers to, e.g., log this information somewhere
users can find it.
Besides the skipped URL, some sort of SkipCode (similar
to ExitCode) could be specified, indicating why the
resource was skipped (outside domain, too large, ...).
So the method signature would be:
skipped(String url, SkipCode skipCode);
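
For illustration, here is a minimal sketch of what this
could look like, assuming CrawlerHandler is an interface
and SkipCode is an enum modeled after ExitCode; the enum
constants below are hypothetical examples, not an
existing API:

    // Hypothetical SkipCode enum; constant names are illustrative only.
    public enum SkipCode {
        MATCHES_EXCLUDE_PATTERN, // URL matched a configured exclude pattern
        FILE_TOO_LARGE,          // file size exceeded the configured maximum
        OUTSIDE_DOMAIN,          // URL lies outside the crawl domain
        MAX_DEPTH_EXCEEDED       // URL is nested deeper than the maximum crawl depth
    }

    public interface CrawlerHandler {
        // ... existing CrawlerHandler callbacks ...

        // Notifies the handler that the Crawler decided not to
        // access the given URL, together with the reason.
        void skipped(String url, SkipCode skipCode);
    }

An application could then implement skipped() to write
these events to a log file that users or administrators
can consult.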
There are limits to which URLs a Crawler can reasonably
report as skipped. For example, it cannot be expected to
report all files nested more deeply than the maximum
crawl depth, as this would require crawling beyond that
depth just to report those files as skipped. However,
some reporting is probably better than nothing,
especially since it concerns skipped documents whose
siblings may have been crawled. In general: when the
Crawler already knows a certain URL but decides not to
access it for some reason, this should be reported.
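
As a rough sketch of where such calls could sit on the
crawler side (ExampleCrawler, processUrl, the helper
methods and the fields below are all assumptions for
illustration, not actual Crawler internals):

    // Hypothetical sketch of the decision points inside a Crawler
    // implementation; all names below are illustrative assumptions.
    public class ExampleCrawler {
        private CrawlerHandler handler;
        private long maxFileSize;
        private int maxDepth;

        private void processUrl(String url, int depth) {
            if (matchesExcludePattern(url)) {
                handler.skipped(url, SkipCode.MATCHES_EXCLUDE_PATTERN);
                return;
            }
            if (getFileSize(url) > maxFileSize) {
                handler.skipped(url, SkipCode.FILE_TOO_LARGE);
                return;
            }
            if (depth > maxDepth) {
                // This URL is already known here, so reporting it costs
                // nothing extra; URLs nested even deeper are never seen
                // and thus cannot be reported.
                handler.skipped(url, SkipCode.MAX_DEPTH_EXCEEDED);
                return;
            }
            accessResource(url, depth); // passed all checks, access normally
        }

        private boolean matchesExcludePattern(String url) { /* ... */ return false; }
        private long getFileSize(String url) { /* ... */ return 0; }
        private void accessResource(String url, int depth) { /* ... */ }
    }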
Leo: I support this request