Hello everybody, I'm Jose Maria's other student working on POESIA.
Mohamed, I'll try to answer your questions inline. Let's go.

On 8/30/06, Daoudi Mohamed <daoudi@enic.fr> wrote:
Hi Riadh and Yaiza,

>
> Thank you for your help and interest :)
>
> I'll try to explain our idea to you:
>
> At this moment, whether a page is filtered or not is a decision
> taken by just one filter: the langid decides the language of the
> page, and the filter for that language is the only one deciding
> if the page must be filtered or not.

-------> No, the decision to filter or not is taken by more than one
filter: the image filter and the text filter, for example.

>
> The situation now is the following: we have developed two different
> filters for Spanish and two others for German (porn and gambling),
> so the decision is more complicated, because a Spanish page could
> pass the porn filter but not the gambling one (but should be
> filtered anyway).


This is not very clear to me. Could you please explain these filters?

:: We used Weka to build those classifiers, so for each one we have a file called (for example) germangambling.dat, which is a classifier in Weka's internal format. The gambling classifiers were trained on a collection of harmful documents, that is, text extracted from gambling web pages (one collection in German and one in Spanish). For the porn classifiers we did the same, but with a collection of porn web pages.

So now we have a few Java classes that read and instantiate those classifiers and then build the appropriate filters.
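
To make the "more than one filter per language" situation concrete, here is a minimal sketch of how the verdicts of several filters for the same language could be combined: the page is blocked as soon as any of them flags it. This is not the actual POESIA code. The names TextFilter, KeywordFilter and LanguageFilterSet are hypothetical, and the keyword test is only a stand-in for asking the loaded Weka classifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a filter built from one Weka classifier.
interface TextFilter {
    String name();
    boolean isHarmful(String pageText);
}

// Toy filter for illustration only: flags a page if it contains a keyword.
// A real filter would query its Weka classifier here instead.
final class KeywordFilter implements TextFilter {
    private final String name;
    private final String keyword;

    KeywordFilter(String name, String keyword) {
        this.name = name;
        this.keyword = keyword.toLowerCase(Locale.ROOT);
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean isHarmful(String pageText) {
        return pageText.toLowerCase(Locale.ROOT).contains(keyword);
    }
}

// Combines all filters registered for one language (assumed design).
final class LanguageFilterSet {
    // Returns the names of every filter that flags the page, in order.
    static List<String> flaggedBy(List<TextFilter> filters, String pageText) {
        List<String> flagged = new ArrayList<>();
        for (TextFilter filter : filters) {
            if (filter.isHarmful(pageText)) {
                flagged.add(filter.name());
            }
        }
        return flagged;
    }

    // The page must be blocked if at least one filter flags it.
    static boolean mustBlock(List<TextFilter> filters, String pageText) {
        return !flaggedBy(filters, pageText).isEmpty();
    }
}
```

With a Spanish porn filter and a Spanish gambling filter in the list, a page that passes the first but not the second is still blocked, which is the behaviour described in the quoted paragraph.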

>
> Also, you have to keep in mind that now it will be easy to add new
> filters (we are also developing a GUI for adding them and
> configuring other aspects of POESIA; I will show you an alpha
> version soon), so the number of them can change easily.


---      Sorry, I do not understand. The architecture of POESIA is
exactly what you want to do!!

----    There is a separation between the filters and the monitor, and we
can (normally) add a new filter without any problem!!!!!

:: As I said a paragraph earlier, there is no need to hand-code every new filter added to the system: we have a text filter manager that instantiates each one based on the filters' configuration file.
As an example:
We have:
 - the germangambling.dat and germanporn.dat Weka classifiers.
 - the text filter manager Java classes.
 - an XML file with the names of the filters, their locations and the appropriate configuration parameters.

What it does:
When POESIA starts, it reads the config file and instantiates each filter, creating a new connection with the monitor.
Now we have a German gambling filter and a German porn filter running on POESIA.
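
The startup step described above could be sketched roughly as follows. This is only an illustration under assumptions: the XML layout (a "filter" element with "name", "language" and "model" attributes) and the classes FilterSpec and FilterConfigReader are hypothetical, not the real POESIA configuration format. The sketch only reads the list of filters; loading each classifier and opening the connection with the monitor are left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical description of one filter entry in the config file.
final class FilterSpec {
    final String name;
    final String language;
    final String modelPath;

    FilterSpec(String name, String language, String modelPath) {
        this.name = name;
        this.language = language;
        this.modelPath = modelPath;
    }
}

// Reads the list of filters from XML instead of hard-coding it.
final class FilterConfigReader {
    static List<FilterSpec> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Assumed layout: <filters><filter name=".." language=".." model=".."/></filters>
            NodeList nodes = doc.getElementsByTagName("filter");
            List<FilterSpec> specs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                specs.add(new FilterSpec(
                        element.getAttribute("name"),
                        element.getAttribute("language"),
                        element.getAttribute("model")));
            }
            return specs;
        } catch (Exception e) {
            // Surface any parse problem as a configuration error.
            throw new IllegalArgumentException("Invalid filter configuration", e);
        }
    }
}
```

Adding a new filter then means adding one more entry to the file; the manager would loop over the returned list and instantiate one filter per entry.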

The task now is not to code a new Java class and then integrate it into the system, but to work with Weka to create a classifier (to change the domain, just change the harmful input collection before training) and add it with the Front End utility. We are also writing a tutorial to ease the process; it will be finished in less than a month.

>
> So, which things would have to be changed?
>
> First, it would be appropriate to separate the configuration of the
> filters from the configuration of the monitor. That is, the
> monitor_config.xml file should be split into two: one file for
> the monitor and another one for the filters.


YES.

>
> So, when the monitor starts and instantiates all the filters, the
> list of them should be read from a file instead of hard-coded. We
> could maintain the structure of the "second part" of the
> monitor_config.xml file for managing the instantiation of the filters.
>
> Another new feature is the possibility of adding a black list and a
> white one. That is, a list of URLs that would be filtered (or
> allowed) directly without any analysis. The monitor would have to
> read these lists at the beginning of POESIA execution, and would
> look up the requested URL in them before calling the langid (because
> if the URL is in one of the lists, no analysis would be necessary).

----> OK, but a black list filter already exists in the monitor (to be verified!!)
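
As a minimal sketch of the proposed lookup (not the existing monitor code, which still has to be checked), the lists could be consulted before the langid is called. The names UrlLists and ListVerdict are hypothetical, and two points are assumptions: URLs are matched as exact strings, and the white list takes priority over the black list if a URL appears in both.

```java
import java.util.HashSet;
import java.util.Set;

// Outcome of the list lookup; ANALYZE means "call the langid and the filters".
enum ListVerdict { ALLOW, BLOCK, ANALYZE }

// Holds both lists, loaded once when POESIA starts (assumed design).
final class UrlLists {
    private final Set<String> whiteList;
    private final Set<String> blackList;

    UrlLists(Set<String> whiteList, Set<String> blackList) {
        // Defensive copies so later changes to the inputs do not affect lookups.
        this.whiteList = new HashSet<>(whiteList);
        this.blackList = new HashSet<>(blackList);
    }

    // Exact-match lookup; the white list is checked first (assumption).
    ListVerdict check(String url) {
        if (whiteList.contains(url)) {
            return ListVerdict.ALLOW;
        }
        if (blackList.contains(url)) {
            return ListVerdict.BLOCK;
        }
        return ListVerdict.ANALYZE;
    }
}
```

Only when check returns ANALYZE would the monitor go on to call the langid and the language filters.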


Best regards


Mohamed


Best wishes,

Pablo.