Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )
From: Sam J. <ga...@yh...> - 2002-12-23 06:07:54
Hi Somik,

Somik Raha wrote:

>By the way... I'd written earlier about this - what is your opinion on using
>Bayesian networks to have a rule-based learning system, that gets better
>over time? i.e. right now the tag identification mechanism is linear -
>there is only so far that can go. But with the sort of dirty html we get,
>the system has to be self-learning. I am thinking of an approach where we'd
>try to eliminate a lot of the hard-coded rules, with a learning network. Of
>course, we'd have our tests to verify that we haven't broken anything, and
>from there, it should only get better. It would be great to have your
>insight on this.
>

I don't think I understand enough about htmlparser to see how Bayesian
networks would be applicable. I have only recently worked out how your
scanners work, or rather, that you have scanners for different types of
tags and can then avoid processing those tags that you are not
interested in. You say above that your tag identification mechanism is
linear, but linear with respect to what? Can you give me an example of
the hard-coded rules you are using now, and a couple of examples of
dirty html pages that cause them to be sub-optimal?

Using learning in a system to increase efficiency is usually very
difficult to do well. Learning systems basically have more flexibility
than other systems, but as a consequence you have more free parameters.
It is easy to add a learning framework, then spend all your time just
trying to adjust the system parameters, and then discover that
exploring the space of possible parameters for your learner is just too
expensive.

Nonetheless I am always fascinated by the problem of adding learning to
a system, precisely because it is so difficult to do well. If you can
give me some concrete examples, I will do my best to help you select an
appropriate learning mechanism.

>>p.s. I'm impressed by the frequency with which you are releasing
>>htmlParser, and your process of having multiple candidates etc. I
>>struggle to release often as the release process itself still seems a
>>little cumbersome (sourceforge has got better) .... have you any tips
>>for streamlining it ....? I guess what I really need is ant methods
>>like
>>
>>ant release-bug-fix version
>>ant create-new-version-release
>>ant create-new-candidate-release
>>
>>which handle all the necessary communication with sourceforge,
>>uploading, packaging and handling of release numbers ....
>>
>
>Ha ha! I am not sure if you'll believe this, but I was inspired to structure
>the htmlparser project based on the neurogrid project - you had ant scripts
>long before we did.
>

Interesting. The ant scripts for neurogrid were originally made by Rick
Knowles, and I'm still only just getting a really good feel for ant. I
remember trying to set up something in my ng scripts that would add the
date to the jar file name, like you sometimes do, and failing. My own
fault really; I rarely read the manual and always try to learn by
modifying the operation of an existing system (kind of an evolutionary
approach ....)

>Of course, ant scripts are so important to do the job
>automatically - but I like keeping things simple - in the sense, there is no
>separate bug-fix version, but the next integration release (Candidate).
>
>I am not yet a fan of branches - they're ok if they don't live more than two
>weeks (I've been thinking real hard about it for a while). I'm planning to
>get the production release out this week - so we can all move on to 1.3
>(instead of having two versions - we'll live with 1.3 integration releases).
>I'd hate to make the same bug fixes twice.
>

OK, but how much do you use the ant tasks to collect together all the
things required for a release, or do you access sourceforge yourself
each time? Given that the code is ready to go, how long does it take
you to do a release? 5 minutes, 30? I'm imagining an ant task that
would require one command and then you'd leave it to run ....

Really I will have to sit down and spend some time looking at your ant
scripts again to work that out, but I thought it might be interesting
to hear from you about whether the release process feels cumbersome or
not...

CHEERS,
SAM
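
On adding the date to the jar file name: a minimal Ant sketch, not taken from the neurogrid or htmlparser build files. The project name, the directory paths and the "neurogrid" jar base name are all assumptions for illustration. It relies on Ant's built-in tstamp task, which sets the DSTAMP property (format yyyyMMdd) that the jar task can then use in its destination file name.

```xml
<project name="release-sketch" default="dated-jar" basedir=".">
  <!-- Hypothetical locations; adjust to the real project layout. -->
  <property name="classes.dir" value="build/classes"/>
  <property name="dist.dir" value="dist"/>

  <target name="dated-jar">
    <!-- Sets DSTAMP (yyyyMMdd), TSTAMP (hhmm) and TODAY. -->
    <tstamp/>
    <mkdir dir="${dist.dir}"/>
    <!-- Produces e.g. dist/neurogrid-20021223.jar.
         Assumes the classes were already compiled into ${classes.dir}. -->
    <jar destfile="${dist.dir}/neurogrid-${DSTAMP}.jar"
         basedir="${classes.dir}"/>
  </target>
</project>
```

Running "ant dated-jar" would then build the dated jar in one command; uploading to sourceforge is not covered by this sketch.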