Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

Hi Somik,

Somik Raha wrote:

>By the way... I'd written earlier about this - what is your opinion on using
>Bayesian networks to have a rule-based learning system, that gets better
>over time ? i.e. right now the tag identification mechanism is linear-
>there is only so far that can go. But with the sort of dirty html we get,
>the system has to be self-learning. I am thinking of an approach where we'd
>try to eliminate a lot of the hard-coded rules, with a learning network. Of
>course, we'd have our tests to verify that we haven't broken anything, and
>from there, it should only get better. It would be great to have your
>insight on this.
>
I think I don't quite understand enough about htmlparser to see how 
Bayesian networks would be applicable.  I have only recently worked out 
how your scanners work, or rather, that you have scanners for different 
types of tags and can then avoid processing those tags that you are not 
interested in.  You say above that your tag identification mechanism is 
linear, but linear with respect to what?

Can you give me an example of the hard-coded rules you are using now, 
and a couple of examples of dirty html pages that cause them to be 
sub-optimal?

Using learning in a system to increase efficiency is usually very 
difficult to do well.  Learning systems basically have more flexibility 
than other systems, but as a consequence you have more free parameters. 
It is easy to add a learning framework but then spend all your time 
just trying to adjust the system parameters, and then to discover that 
exploring the space of possible parameters for your learner is just too 
expensive.

Nonetheless I am always fascinated by the problem of adding learning to a 
system, precisely because it is so difficult to do well.  If you can 
give me some concrete examples, I will do my best to help you select an 
appropriate learning mechanism.

>>p.s. I'm impressed by the frequency with which you are releasing
>>htmlParser, and your process of having multiple candidates etc.  I
>>struggle to release often as the release process itself still seems a
>>little cumbersome (sourceforge has got better) ....  have you any tips
>>for streamlining it ....?  I guess what I really need is ant methods
>>like
>>
>>ant release-bug-fix version
>>ant create-new-version-release
>>ant create-new-candidate-release
>>
>>which handle all the necessary communication with sourceforge,
>>uploading, packaging and handling of release numbers ....
>>
>
>Ha ha! I am not sure if you'll believe this, but I was inspired to structure
>the htmlparser project based on the neurogrid project-  you had ant scripts
>long before we did. 
>
Interesting.  The ant scripts for neurogrid were originally made by Rick 
Knowles, and I'm still only just getting a really good feel for ant.  I 
remember trying to set up something in my ng scripts that would add the 
date to the jar file name, like you sometimes do, and failing.  My own 
fault really; I rarely read the manual and always try to learn by 
modifying the operation of an existing system (kind of an evolutionary 
approach ....)
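
For illustration, date-stamping a jar name in ant is usually done with the tstamp task, which sets the DSTAMP property. This is only a minimal sketch, not the neurogrid or htmlparser build file: the project name "example" and the build/classes and dist directories are assumed placeholders.

```xml
<project name="example" default="jar" basedir=".">
  <!-- Assumed layout: compiled classes already exist in build/classes -->
  <target name="jar">
    <!-- tstamp sets DSTAMP (yyyyMMdd), TSTAMP (HHmm) and TODAY -->
    <tstamp/>
    <mkdir dir="dist"/>
    <!-- Produces e.g. dist/example-20030115.jar, depending on the build date -->
    <jar destfile="dist/example-${DSTAMP}.jar" basedir="build/classes"/>
  </target>
</project>
```

The tstamp task has to run before the jar task, otherwise ${DSTAMP} is left unexpanded in the file name.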

>Of course, ant scripts are so important to do the job
>automatically - but I like keeping things simple - in the sense, there is no
>separate bug-fix version, but the next integration release (Candidate).
>
>I am not yet a fan of branches - they're ok if they don't live more than two
>weeks (I've been thinking real hard about it for a while). I'm planning to
>get the production release out this week - so we can all move on to 1.3
>(instead of having two versions - we'll live with 1.3 integration releases).
>I'd hate to make the same bug fixes twice.
>
ok, but how much do you use the ant tasks to collect together all the 
things required for a release, or do you access sourceforge yourself each 
time? Given that the code is ready to go, how long does it take you to 
do a release?  5 minutes, 30?  I'm imagining an ant task that would 
require one command and then you'd leave it to run ....  Really I will 
have to sit down and spend some time looking at your ant scripts again 
to work that out, but I thought it might be interesting to hear from you 
about whether the release process feels cumbersome or not...
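
The kind of one-command release imagined above could be sketched as a chain of dependent ant targets. This is a hypothetical outline only: the target names, the src and dist directories, and the version property are assumptions, and the upload to sourceforge is deliberately left out because it depends on how the project's file release system is accessed.

```xml
<project name="example" default="release" basedir=".">
  <!-- Placeholder default; override on the command line -->
  <property name="version" value="0.0.0"/>

  <target name="clean">
    <delete dir="build"/>
    <delete dir="dist"/>
  </target>

  <target name="compile" depends="clean">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes"/>
  </target>

  <target name="jar" depends="compile">
    <mkdir dir="dist"/>
    <jar destfile="dist/example-${version}.jar" basedir="build/classes"/>
  </target>

  <!-- One entry point: clean, compile, jar, then package the sources -->
  <target name="release" depends="jar">
    <zip destfile="dist/example-${version}-src.zip" basedir="src"/>
    <!-- Upload step omitted: site-specific, not shown here -->
  </target>
</project>
```

Running "ant -Dversion=1.3 release" would then build everything in one go, since a property set on the command line takes precedence over the default in the build file.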

CHEERS> SAM