how to get lastModifiedDate

Help
Dr. Blurb
2015-05-26
2015-06-03
  • Dr. Blurb

    Dr. Blurb - 2015-05-26

    Hi,

    First of all, many thanks for a great piece of software :)

    I followed the instructions in this thread:
    http://sourceforge.net/p/opensearchserve/discussion/947147/thread/f29f13f6

    Then stopped the crawler, deleted all URLs, set fetch interval to 1 minute,
    modified one of the pages in the crawl, and restarted it.

    After this, still none of the pages have a last modified date set in the URL browser.

    What am I missing?

    Also, was the request for support for the actual Last-Modified header info ever followed up on (Henry Sudhof's request in that thread)?

    Thanks.

    (edit: running OpenSearchServer v1.5.12 - build 3d77781724)

     

    Last edit: Dr. Blurb 2015-05-26
  • Emmanuel Keller

    Emmanuel Keller - 2015-05-26

    Thanks! We try to do our best :)

    There is new information regarding this point.

    We have now two different fields in the Web crawler.
    - lastModifiedDate just returns the date from the HTTP header.
    - contentUpdateDate returns a date related to the content. The first time the page is crawled, it returns the current time. Then, it is updated only if content has changed.

    When you say "deleted all URLs", do you mean you removed the URLs from the URL browser?

     
  • Dr. Blurb

    Dr. Blurb - 2015-05-26

    We have now two different fields in the Web crawler.

    Also in the version I'm running?

    When you say "deleted all URLs", do you mean you removed the URLs from the URL browser?

    yes

    I left the inclusion patterns, and start the crawl again with one seed URL

     
  • Emmanuel Keller

    Emmanuel Keller - 2015-05-26

    Also in the version I'm running?

    Yes, v1.5.12 contains the new fields.

    When the Web crawler downloads a page, an MD5 hash of the content is stored in the URL database (URL browser). The next time the same page is crawled, the new MD5 is compared to the old one. If the MD5 changes, the contentLastUpdate is updated with the current time.

    You should not remove the URL from the URL browser. The Web crawler only downloads URLs that exist in the URL database. When you enter an inclusion pattern, a URL is automatically copied into the URL database.
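
The change detection described in this reply can be sketched roughly as follows. This is only an illustration under assumptions, not OpenSearchServer's actual code: the `recordCrawl` function and the in-memory `$db` array standing in for the URL database are hypothetical, and the field is called `contentUpdateDate` here although the thread also uses the name `contentLastUpdate`.

```php
<?php
// Hypothetical in-memory stand-in for the URL database; not OpenSearchServer code.
// Each entry keeps the last seen MD5 of the content and a contentUpdateDate.

/**
 * Records a crawl of $url and returns the entry's contentUpdateDate.
 * The date is set on the first crawl and afterwards only when the MD5 changes.
 */
function recordCrawl(array &$urlDb, string $url, string $content, int $now): int
{
    $hash = md5($content);

    // First crawl, or content changed since the previous crawl:
    // store the new hash together with the current time.
    if (!isset($urlDb[$url]) || $urlDb[$url]['md5'] !== $hash) {
        $urlDb[$url] = ['md5' => $hash, 'contentUpdateDate' => $now];
    }
    // Unchanged content leaves the stored date untouched.

    return $urlDb[$url]['contentUpdateDate'];
}

$db = [];
$first  = recordCrawl($db, 'http://example.com/', 'hello', 1000);  // first crawl
$second = recordCrawl($db, 'http://example.com/', 'hello', 2000);  // same content
$third  = recordCrawl($db, 'http://example.com/', 'hello!', 3000); // changed content

echo "$first $second $third\n";
```

In this sketch, removing an entry from `$db` discards the stored hash, so the next crawl of that URL is treated as a first crawl again.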

     
  • Dr. Blurb

    Dr. Blurb - 2015-05-27

    Thanks.

    Which field populates the "Mod. date" field in the URL browser? I'm guessing the contentLastUpdate?

    lastModifiedDate just returns the date from the HTTP header.

    Given this code:

    // $oss_api is assumed to be an already configured OpenSearchServer client handler.
    $request = new OpenSearchServer\Search\Field\Search();
    $request->index('index11')
      ->searchField('content')
      ->query('alfa')
      ->rows(1)
      // Fields to return with each result.
      ->returnedFields(array('content', 'url', 'crawlDate', 'lastModifiedDate', 'host', 'lang'));
    
    $response = $oss_api->submit($request);
    
    $results = $response->getResults();
    
    foreach ($results as $index => $result) {
    
      // Read the returned fields of this result.
      $crawl_date = $result->getField('crawlDate');
      $last_mod = $result->getField('lastModifiedDate');
      $host = $result->getField('host');
      $lang = $result->getField('lang');
    
      print "$index, crawl date [$crawl_date], last mod [$last_mod] host [$host] lang [$lang]\n";
    
      ...
    

    I get the following output:

    0, crawl date [], last mod [] host [www.<some host>.com] lang [en]
    

    Attached is a screenshot of the field mappings for this index/crawler.

    I must still be missing something :-/

    thanks.

     
  • Dr. Blurb

    Dr. Blurb - 2015-05-27

    A followup on this one:

    I'm seeing correct last modified timestamps both in the URL browser and in the returned field now, but not for all pages.

    There are quite a few web pages for which the URL browser has: "Index error"
    (see attached screenshot)
    (you'd think they still should get a "last modified date" though).

    If I then do a search using terms unique to one of the pages with an index error, the result is still returned :-/

     

    Last edit: Dr. Blurb 2015-05-27
  • Dr. Blurb

    Dr. Blurb - 2015-05-28

    Is it possible to add this field mapping (lastModifiedDate -> lastModifiedDate) using the API?

    edit: is it possible to set crawl process parameters (e.g. the fetch interval) using the API?

     

    Last edit: Dr. Blurb 2015-05-28
  • Emmanuel Keller

    Emmanuel Keller - 2015-06-03

    Which field populates the "Mod. date" field in the URL browser? I'm guessing the contentLastUpdate?

    The last modified date from the HTTP header.

    get the following output:

    Can you do the same query in the OSS user interface and check what is returned ?

    using the API?

    Not yet, we have to create the API. You may create a feature request on GitHub.
    https://github.com/jaeksoft/opensearchserver/issues

    You finally do not use the contentUpdateDate ?

     
  • Dr. Blurb

    Dr. Blurb - 2015-06-03

    hi Emmanuel,
    thanks for getting back :)

    The last modified date from the HTTP header.

    Then each entry in the URL browser should have one, no?
    (a timestamp rather than a date). It does look like each
    entry that has "Index error" in the URL browser is missing this timestamp.
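
For context on the question above: `Last-Modified` is an optional HTTP response header, so a server may simply not send it for some pages. Whether that explains the "Index error" entries here is not known. A minimal sketch for checking what a given response contains, assuming the raw header lines are available as an array of strings (the `lastModifiedTimestamp` helper is hypothetical and not part of the OpenSearchServer client):

```php
<?php
/**
 * Returns the Last-Modified value from raw response header lines as a Unix
 * timestamp, or null when the header is absent or cannot be parsed.
 */
function lastModifiedTimestamp(array $headerLines): ?int
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive; split on the first colon only.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Last-Modified') === 0) {
            $ts = strtotime(trim($parts[1]));
            return $ts === false ? null : $ts;
        }
    }
    return null;
}

// Sample header sets, made up for illustration.
$withHeader = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
    'Last-Modified: Tue, 26 May 2015 10:00:00 GMT',
];
$withoutHeader = ['HTTP/1.1 200 OK', 'Content-Type: text/html'];

$found   = lastModifiedTimestamp($withHeader);
$missing = lastModifiedTimestamp($withoutHeader);

echo $found === null ? "no Last-Modified\n" : gmdate('Y-m-d H:i:s', $found) . " UTC\n";
echo $missing === null ? "no Last-Modified\n" : gmdate('Y-m-d H:i:s', $missing) . " UTC\n";
```

For a live check, the array returned by PHP's built-in `get_headers($url)` could be passed to the helper, after first checking that the call did not return `false`.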

    Can you do the same query in the OSS user interface and check what is
    returned ?

    I'm getting zero results and two errors (see attached screenshots). The first one: "ZK Could not initialize class com.jaeksoft.searchlib.query.parser.BooleanQueryLexer"

    Not yet, we have to create the API. You may create a
    feature request on GitHub.

    Ok, will do.

    You finally do not use the contentUpdateDate ?

    No, I'm only interested in the timestamp in the HTTP header.

    Can you confirm you're using "contentUpdateDate" and "contentLastUpdate" for the same thing? (both appear in messages above)

    thanks again :-)

     
