On Fri, Jan 23, 2004 at 07:20:01PM +0000, Neil Ireson wrote:
Mark and I were discussing the score-caching problem on the drive home
last night and have come up with a number of points for discussion. It
is quite a complex problem and in the end I think we will have to make a
compromise. Of course what, if anything, gets implemented is completely
dependent on Basile...
I started implementing caching based on the following ideas. What is
cached is scores, indexed by
  - the filter name (e.g. imagefilter or englishfilter)
  - the ETag (in the sense of the HTTP header) or MD5 sum of the URL contents
However, I won't add this code to the monitor (so I won't finish the
caching code) until PoesiaSoft/Documentation/ has been completed
(by others) under CVS. So please write your (system & user)
documentation, as I did. Fill in the relevant files in LaTeX.
Caching should always be an optimisation: if a given entry is not
found, the score is recomputed as usual, and this has no effect on the
accept/reject decision (only on speed). A filter can ask for
a given score to be uncached.
The cache machinery may remove cache entries at will. In practice,
cache entries have a limited duration.
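
A minimal sketch of these ideas in Python, only to make the intended
behaviour concrete. It is not the monitor code: every name here
(ScoreCache, score_for, ttl_seconds, compute_score) is hypothetical,
and the one-hour lifetime is an arbitrary assumption.

```python
import hashlib
import time


class ScoreCache:
    """Best-effort score cache keyed by (filter name, content key)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # assumed lifetime of an entry, in seconds
        self.clock = clock      # injectable clock, convenient for testing
        self.entries = {}       # (filter_name, key) -> (score, expiry)

    @staticmethod
    def content_key(etag, body):
        # Prefer the HTTP ETag when there is one, else hash the contents.
        # The prefix keeps the two kinds of key from colliding.
        if etag:
            return "etag:" + etag
        return "md5:" + hashlib.md5(body).hexdigest()

    def get(self, filter_name, key):
        entry = self.entries.get((filter_name, key))
        if entry is None:
            return None
        score, expiry = entry
        if self.clock() >= expiry:
            # Expired entries are dropped; the caller just recomputes.
            del self.entries[(filter_name, key)]
            return None
        return score

    def put(self, filter_name, key, score):
        self.entries[(filter_name, key)] = (score, self.clock() + self.ttl)

    def uncache(self, filter_name, key):
        # A filter may ask for a given score to be forgotten.
        self.entries.pop((filter_name, key), None)


def score_for(cache, filter_name, etag, body, compute_score):
    """Return the cached score, or recompute it as usual on a miss."""
    key = ScoreCache.content_key(etag, body)
    score = cache.get(filter_name, key)
    if score is None:
        score = compute_score(body)
        cache.put(filter_name, key, score)
    return score
```

Since a miss always falls back to compute_score, dropping any entry
(or the whole cache) changes only the speed, never the resulting score.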
The most complicated caching issue relates to the image filter. As I
understand the system (and Christophe can correct me if I'm wrong) when
the browser requests a page containing images, the Image Filter makes a
number of sub-requests for the images. Once enough images have been
processed to provide a score for the page the score is sent to the
Decision Mechanism and the sub-request processing is stopped. If the
page is blocked then everything is fine. However, if the page is
accepted then the browser requests all the images for the page and the
Image Filter must filter each image again, as there is no way of
strictly associating an image with its reference page.
Since (as I understand it, but there is still a lack of documentation
and I am waiting for others to write their part) only a full HTML page
has an image score, only that is cached. There is no means to cache
individual "subscores" (actually, internal computation results for the
image filter), since caching would be done (and is only doable)
while the monitor processes SCORE messages.
I am not sure that complex images are shared by different HTML pages.
I tend to believe it is rare in practice. Only simple glyphs
(e.g. arrows) are shared by a lot of pages.
The only way to prevent the need for double filtering of images is to
cache the decision and not the scores. This is because acceptance of
each image on a page relates to the decision on the page and not an
image score. An image score might be high but if the page is accepted
then all the images for that page should also be accepted. [...]
I did not understand all of the details of Neil's message.
What does ENIC think about this?
I'm still waiting for PoesiaSoft/Documentation to be filled by others.