Hi,
First of all, many thanks for a great piece of software :)
I followed the instructions in this thread:
http://sourceforge.net/p/opensearchserve/discussion/947147/thread/f29f13f6
Then stopped the crawler, deleted all URLs, set fetch interval to 1 minute,
modified one of the pages in the crawl, and restarted it.
After this, still none of the pages have a last modified date set in the URL browser.
What am I missing?
Also, was the request for support for the actual last-modified header info ever followed up on (the request of Henry Sudhof in that thread)?
Thanks.
(edit: running OpenSearchServer v1.5.12 - build 3d77781724)
Last edit: Dr. Blurb 2015-05-26
Thanks! We try to do our best :)
There is new information regarding this point.
We now have two different fields in the Web crawler.
- lastModifiedDate just returns the date from the HTTP header.
- contentUpdateDate returns a date related to the content. The first time the page is crawled, it returns the current time. Then, it is updated only if content has changed.
When you say "deleted all URLs", do you mean you removed the URLs from the URL browser?
Also in the version I'm running?
yes
I left the inclusion patterns in place and started the crawl again with one seed URL.
Yes, v1.5.12 contains the new fields.
When the Web crawler downloads a page, an MD5 hash of the content is stored in the URL database (URL browser). The next time the same page is crawled, the new MD5 is compared to the old one. If the MD5 has changed, the contentLastUpdate is updated with the current time.
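The comparison described above can be sketched roughly as follows. This is only an illustration in Python, not OpenSearchServer's actual code: `UrlRecord`, `record_crawl`, and the attribute names are hypothetical and exist only for this example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class UrlRecord:
    """Hypothetical stand-in for one entry in the URL database."""
    url: str
    content_md5: Optional[str] = None
    content_update_date: Optional[datetime] = None


def record_crawl(record: UrlRecord, content: bytes,
                 now: Optional[datetime] = None) -> bool:
    """Store the MD5 of the content; change the date only if the hash changed.

    Returns True when the date was set (first crawl or changed content).
    """
    now = now or datetime.now(timezone.utc)
    new_md5 = hashlib.md5(content).hexdigest()
    if new_md5 == record.content_md5:
        return False  # identical content: keep the previous date
    record.content_md5 = new_md5
    record.content_update_date = now
    return True
```

On the first crawl the stored hash is empty, so the date is set to the crawl time; later crawls leave it alone until the page body actually differs.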
You should not remove the URL from the URL browser. The Web crawler only downloads URLs that exist in the URL database. When you enter an inclusion pattern, a URL is automatically copied into the URL database.
Thanks.
Which field populates the "Mod. date" field in the URL browser? I'm guessing the contentLastUpdate?
Given this code:
I get the following output:
Attached is a screenshot of the field mappings for this index/crawler.
I must still be missing something :-/
thanks.
A followup on this one:
I'm seeing correct last modified timestamps both in the URL browser and in the returned field now, but not for all pages.
There are quite a few web pages for which the URL browser has: "Index error"
(see attached screenshot)
(you'd think they still should get a "last modified date" though).
If I then do a search using terms unique to one of the pages with an index error, the result is still returned :-/
Last edit: Dr. Blurb 2015-05-27
Is it possible to add this field mapping (lastModifiedDate -> lastModifiedDate) using the API?
edit: is it possible to set crawl process parameters (e.g. the fetch interval) using the API?
Last edit: Dr. Blurb 2015-05-28
The last modified date from the HTTP header.
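For illustration, here is a small Python sketch of how a Last-Modified header value can be turned into a timestamp. It is not taken from OpenSearchServer; the function name `last_modified` and the plain dictionary of headers are assumptions made for this example.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or invalid."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                return None  # header present but not a parseable HTTP date
    return None
```

A server that does not send the header at all yields `None` here, which would be one possible reason for a page to have no such date.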
Can you run the same query in the OSS user interface and check what is returned?
Not yet, we have to create the API. You may create a feature request on GitHub.
https://github.com/jaeksoft/opensearchserver/issues
So in the end you do not use contentUpdateDate?
hi Emmanuel,
thanks for getting back :)
Then each entry in the URL browser should have one, no?
(a timestamp rather than a date). It does look like each
entry that has "Index error" in the URL browser is missing this timestamp.
I'm getting zero results and two errors (see attached screenshots). The first one: "ZK Could not initialize class com.jaeksoft.searchlib.query.parser.BooleanQueryLexer"
Ok, will do.
No, I'm only interested in the timestamp in the HTTP header.
Can you confirm you're using "contentUpdateDate" and "contentLastUpdate" for the same thing? (both appear in messages above)
thanks again :-)