From: Basile S. <ba...@st...> - 2004-01-22 12:48:00
|
Dear All,

In case you have a problem with CVS, I have just made available a snapshot (including the latest bug fixes) of the Poesia monitor source code (as a gzipped GNU tar archive) at
http://starynkevitch.net/Basile/poesiamonitor_2004_Jan_22_13h.tar.gz

The md5sum is adb08f90c5d3ce429f619dee287ffd07

But please use SourceForge CVS when available.

Regards.
-- 
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net
aliases: basile<at>tunes<dot>org = bstarynk<at>nerim<dot>net
8, rue de la Faïencerie, 92340 Bourg La Reine, France |
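For anyone who wants to check the downloaded snapshot against the md5sum quoted in the message above, here is a minimal sketch in Python. The function names `md5_of_file` and `matches_published_md5` are made up for this illustration and are not part of the Poesia code base.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare the computed digest with a published one, ignoring case and spaces."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

With the archive saved locally under its original name, the check would be `matches_published_md5("poesiamonitor_2004_Jan_22_13h.tar.gz", "adb08f90c5d3ce429f619dee287ffd07")`.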
From: Neil I. <N.I...@dc...> - 2004-01-23 19:20:11
|
Dear all,

Mark and I were discussing the score-caching problem on the drive home last night and have come up with a number of points for discussion. It is quite a complex problem, and in the end I think we will have to make a compromise. Of course what, if anything, gets implemented is completely dependent on Basile...

The most complicated caching issue relates to the image filter. As I understand the system (and Christophe can correct me if I'm wrong), when the browser requests a page containing images, the Image Filter makes a number of sub-requests for the images. Once enough images have been processed to provide a score for the page, the score is sent to the Decision Mechanism and the sub-request processing is stopped. If the page is blocked then everything is fine. However, if the page is accepted then the browser requests all the images for the page and the Image Filter must filter each image again, as there is no way of strictly associating an image with its reference page.

The only way to prevent the need for double filtering of images is to cache the decision and not the scores. This is because acceptance of each image on a page relates to the decision on the page and not to an image score. An image score might be high, but if the page is accepted then all the images for that page should also be accepted. Thus score caching WILL NOT solve the double filtering of images. In fact, after puzzling our way through many different alternatives, the following was the only workable solution we found...

Scenario for Monitor Decision caching

The browser makes a request for an HTML page URL www.stuff.com, containing text & images.
The text gets sent to the language identifier, then on to a language filter.
The language filter sends a score to the Decision Mechanism.
By some method the monitor stores the fact that a page refers to all the images in that page, i.e.
URL www.stuff.com -> www.other.com/picture1.jpg, www.stuff.com/logo.tif, www.acme.org/banner.gif.
This could be done automatically by the monitor or by some caching request made by the image filter.
The HTML gets sent to the image filter.
The image filter sends sub-requests for some or all of the page images to the monitor.
The monitor returns the sub-requested images to the image filter.
The image filter sends a combined score to the Decision Mechanism.
Ideally any individual image score should be cached with the image URL so that images are only processed once, i.e.
URL www.stuff.com -> www.other.com/picture1.jpg (img-score=57), www.stuff.com/logo.tif (img-score=23), www.acme.org/banner.gif.
The Decision Mechanism returns a decision to the monitor.
The Monitor caches the decision related to the page URL.

If the page is accepted...
URL www.stuff.com -> accepted
The monitor caches all the images referred to by that page as accepted images, i.e.
URL www.other.com/picture1.jpg (img-score=57) -> accepted
URL www.stuff.com/logo.tif (img-score=23) -> accepted
URL www.acme.org/banner.gif -> accepted
The Browser then requests all the page's images.
The Monitor checks the cache for each image URL, finds the cached accept and sends an accepted message back to the Browser.
Next time the page is requested, the page is accepted by the cached decision, as are all the page's images.

If the page is rejected...
URL www.stuff.com -> rejected
The monitor caches the page decision and no further requests are made.
The monitor caches all the images that have an individual score, but not the associated page rejected decision, i.e.
URL www.other.com/picture1.jpg (img-score=57),
URL www.stuff.com/logo.tif (img-score=23)
Next time the page is requested, the page is rejected and no further requests are made.
NOTE: If the rejected decision were stored for images, the next time an individual image was requested it would be "incorrectly" rejected even though it may have a very low score.
Thus if the requested image has not been previously displayed it will have to be individually assessed by the image filter.

SAME IMAGE, DIFFERENT PAGE

Suppose a request is made for another page that contains an image which has been previously filtered.
The images get sent to the image filter.
The image filter sends sub-requests for some or all of the page images to the monitor.
The monitor checks the cache and, for any image that has a cached individual score, returns that score to the image filter; otherwise the monitor returns the sub-requested images to the image filter.
The image filter sends a combined score to the Decision Mechanism.
If the page is accepted, the image is associated with an accepted value.
If the page is rejected, the image's decision value is unchanged, i.e. if it was previously accepted it is still associated with an accepted decision.

In this scenario the monitor will automatically accept any image that has been on an accepted page if it is requested as an individual image. The only way to view a high scoring harmful image is to place it on a seemingly innocuous page, i.e. a page with "normal text" and/or other non-fleshy images. Then if this image were requested individually it would be "incorrectly" accepted; however we have already previously failed to filter this image "correctly".

NOTE: the above approach does not exclude the use of score caching. All the filters could also store their scores, i.e.
URL www.stuff.com (en-light-score=refer, en-heavy-score=low) -> accepted
Invoking POESIA with a new Decision strategy would only require the purging of the decision values associated with the cached URL.

I think this is acceptable as a first approach. What do you all (especially Christophe) think?

N |
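The decision-caching scenario in the message above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class `DecisionCache`, its method names and its dictionaries are hypothetical and do not describe the actual monitor, and decisions are stored as the plain strings "accepted" and "rejected".

```python
class DecisionCache:
    """Hypothetical monitor-side cache following the scenario in the message."""

    def __init__(self):
        self.page_images = {}   # page URL -> set of image URLs the page refers to
        self.image_scores = {}  # image URL -> individual image score, if any
        self.decisions = {}     # page or image URL -> "accepted" / "rejected"

    def record_page_images(self, page_url, image_urls):
        """Remember which images a page refers to."""
        self.page_images[page_url] = set(image_urls)

    def record_image_score(self, image_url, score):
        """Keep an individual image score so the image is only processed once."""
        self.image_scores[image_url] = score

    def record_page_decision(self, page_url, accepted):
        """Cache the page decision; only an accept is propagated to the images."""
        self.decisions[page_url] = "accepted" if accepted else "rejected"
        if accepted:
            for image_url in self.page_images.get(page_url, ()):
                self.decisions[image_url] = "accepted"
        # On rejection the images' decision values are left unchanged, so an
        # image accepted earlier via another page stays accepted.

    def lookup(self, url):
        """Return the cached decision for a URL, or None if there is none."""
        return self.decisions.get(url)

    def purge_decisions(self):
        """Drop all decisions (e.g. for a new decision strategy), keep scores."""
        self.decisions.clear()
```

A cache miss from `lookup` means the request has to go through the filters as usual. Keeping scores and decisions in separate dictionaries is one way to make the purge described in the final NOTE cheap.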
From: Basile S. <ba...@st...> - 2004-01-23 19:48:10
|
On Fri, Jan 23, 2004 at 07:20:01PM +0000, Neil Ireson wrote:
> Dear all,
>
> Mark and I were discussing the score-caching problem on the drive home
> last night and have come up with a number of points for discussion. It
> is a quite complex problem and in the end I think we will have to make a
> compromise. Of course what, if anything, gets implemented is completely
> dependant on Basile...

I started implementing caching based on the following ideas. What is cached is scores, indexed by
  the filter name (e.g. imagefilter or englishfilter)
  the URL
  the ETag (in the sense of the HTTP header) or md5sum of the URL contents

However, I won't add code (so I won't finish the caching code) into the monitor as long as PoesiaSoft/Documentation/ is not completed (by others) under CVS. So please write your (system & user) documentation, as I did. Fill in the relevant files in LaTeX.

Caching should always be an optimisation: if a given entry is not found, the score is recomputed as usual, and this has no effect on the accepting/rejecting decision (except for speed). A filter can ask for a given score to be uncached.

The cache machinery may remove cache entries at will. In practice, cache entries have a limited duration.

>
> The most complicated caching issue relates to the image filter. As I
> understand the system (and Christophe can correct me if I'm wrong) when
> the browser requests a page containing images, the Image Filter makes a
> number of sub-requests for the images. Once enough images have been
> processed to provide a score for the page the score is sent to the
> Decision Mechanism and the sub-request processing is stopped. If the
> page is blocked then everything is fine, however, if the page is
> accepted then the browser requests all the images for the page and the
> Image Filter must again filter each image: as there is no way of
> strictly associating an image with its reference page.

Since (as I understand it, but there is still a lack of documentation and I am waiting for others to write their part) only a full HTML page has an image score, only that is cached. There is no means to cache individual "subscores" (actually, internal computation results of the image filter), since caching would be done (and is only doable) while the monitor processes SCORE messages.

I am not sure that complex images are shared by different HTML pages. I tend to believe it is rare in practice. Only simple glyphs (e.g. arrows) are shared by a lot of pages.

>
> The only way to prevent the need for double filtering of images is to
> cache the decision and not the scores. This is because acceptance of
> each image on a page relates to the decision on the page and not an
> image score. An image score might be high but if the page is accepted
> then all the images for that page should also be accepted. [...]

I did not understand all of the details of Neil's message.

What does ENIC think about this?

I'm still waiting for PoesiaSoft/Documentation to be filled in by others.

-- 
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net
aliases: basile<at>tunes<dot>org = bstarynk<at>nerim<dot>net
8, rue de la Faïencerie, 92340 Bourg La Reine, France |
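The score cache described in this message (entries indexed by filter name, URL, and ETag or md5sum; limited lifetime; explicit uncaching; a miss simply triggers recomputation) could look roughly like the Python sketch below. It is not the monitor's code: `ScoreCache`, `score_for`, the `validator` parameter (standing for the ETag or md5sum) and the default lifetime are all assumptions made for illustration.

```python
import time


class ScoreCache:
    """Hypothetical score cache keyed by (filter name, URL, ETag or md5sum)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock     # injectable so expiry can be tested
        self._entries = {}      # key -> (score, expiry time)

    def get(self, filter_name, url, validator):
        """Return the cached score, or None if it is absent or has expired."""
        key = (filter_name, url, validator)
        entry = self._entries.get(key)
        if entry is None:
            return None
        score, expiry = entry
        if self._clock() >= expiry:
            del self._entries[key]  # entries have a limited duration
            return None
        return score

    def put(self, filter_name, url, validator, score):
        """Store a score with a fresh expiry time."""
        key = (filter_name, url, validator)
        self._entries[key] = (score, self._clock() + self._ttl)

    def uncache(self, filter_name, url, validator):
        """Remove an entry on request, e.g. when a filter asks for it."""
        self._entries.pop((filter_name, url, validator), None)


def score_for(cache, filter_name, url, validator, compute_score):
    """Use the cache purely as an optimisation: a miss recomputes the score."""
    cached = cache.get(filter_name, url, validator)
    if cached is not None:
        return cached
    score = compute_score()
    cache.put(filter_name, url, validator, score)
    return score
```

Because the ETag or md5sum is part of the key, a page whose contents change gets a new key and is scored again, while the stale entry simply ages out.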
From: Mohamed D. <da...@en...> - 2004-01-26 14:33:21
|
Dear All,

As we decided during the workshop, all the specifications of the filters and the DM should be finished before the end of this week. The success of an open source project such as Poesia depends on the documents that we are going to provide to the developers. This message concerns all the developers. Basile cannot help us if he does not have these specifications.

Best regards,
Mohamed

Basile STARYNKEVITCH wrote:

>On Fri, Jan 23, 2004 at 07:20:01PM +0000, Neil Ireson wrote:
>
>>Dear all,
>>
>>Mark and I were discussing the score-caching problem on the drive home
>>last night and have come up with a number of points for discussion. It
>>is a quite complex problem and in the end I think we will have to make a
>>compromise. Of course what, if anything, gets implemented is completely
>>dependant on Basile...
>>
>
>I started implementing caching on the following ideas. What is cached
>is scores indexed by
>  the filter name (eg imagefilter or englishfilter)
>  the URL
>  the Etag (in the sense of HTTP header) or Md5sum of Url contents
>
>However, I won't add code (so I won't terminate the caching code) into
>the monitor as long as the PoesiaSoft/Documentation/ is not completed
>(by others) under CVS. So please write your (system & user)
>documentation, as I did. Fill up relevant files in LaTeX.
>
>Caching should always be an optimisation: if a given entry is not
>found, score is recomputed as usual, and this has no issue on the
>accepting/rejecting decision (except for speed). A filter can ask for
>a given score to be uncached.
>
>The cache machinery may remove cache entries at will. In practice,
>cache entries have a limited duration.
>
>>The most complicated caching issue relates to the image filter. As I
>>understand the system (and Christophe can correct me if I'm wrong) when
>>the browser requests a page containing images, the Image Filter makes a
>>number of sub-requests for the images. Once enough images have been
>>processed to provide a score for the page the score is sent to the
>>Decision Mechanism and the sub-request processing is stopped. If the
>>page is blocked then everything is fine, however, if the page is
>>accepted then the browser requests all the images for the page and the
>>Image Filter must again filter each image: as there is no way of
>>strictly associating an image with its reference page.
>>
>
>Since (as I understand it, but there is still lack of documentation
>and I am waiting for others to write their part) only a full HTML page
>has an image score, only that is cached. There is no mean to cache
>individual "subscores" (actually, internal computation results for the
>image filter) - since caching would be done (and is only doable)
>during processing by the monitor of SCORE messages.
>
>I am not sure that complex images are shared by different HTML pages.
>I tend to believe it is rare in practice. Only simple glyphs
>(e.g. arrows) are shared by lot of pages.
>
>>The only way to prevent the need for double filtering of images is to
>>cache the decision and not the scores. This is because acceptance of
>>each image on a page relates to the decision on the page and not an
>>image score. An image score might be high but if the page is accepted
>>then all the images for that page should also be accepted. [...]
>>
>
>I did not understand all of the details of Neil's message.
>
>What does ENIC think about this?
>
>I'm still waiting for PoesiaSoft/Documentation to be filled by others.
>
> |
From: Riadh E. <ri...@me...> - 2004-01-23 22:48:25
|
Hi Neil,

I have a comment on what you wrote:

> [...] however, if the page is
> accepted then the browser requests all the images for the page and the
> Image Filter must again filter each image: as there is no way of
> strictly associating an image with its reference page.
>

There is a way to associate an image (or any object) with its reference page. When a web browser makes a request for an image included in a web page, it includes the header "Referer: www.site.com/index.html" in the request headers. The ICAP protocol includes the request headers, the response headers and the body, so the monitor can know which web page referred to an image.

I think this is a simple way to avoid double filtering: when the monitor gets an ICAP request for an image referred to by a non-blocked HTML page, it accepts the image without calling the image filter.

Best regards,
Riadh. |
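The Referer-based shortcut suggested in the message above might be sketched as follows. The function name `should_skip_image_filter` and the `page_decisions` dictionary (page URL mapped to "accepted" or "rejected") are hypothetical, and the sketch assumes the monitor has already parsed the request headers carried in the ICAP message into a plain dictionary.

```python
def should_skip_image_filter(request_headers, page_decisions):
    """Return True when the image's referring page is known to be accepted.

    request_headers: dict of HTTP request header names to values.
    page_decisions:  dict of page URL to "accepted" or "rejected" (assumed).
    """
    referer = None
    for name, value in request_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "referer":
            referer = value.strip()
            break
    if referer is None:
        # No referring page given: fall back to the normal image filtering.
        return False
    return page_decisions.get(referer) == "accepted"
```

Since the Referer header is optional and is supplied by the client, a missing or unknown referring page falls back to normal filtering in this sketch rather than being treated as an accept.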