I wonder what TagSoup does to RDFa and other interesting bits we could extract.

IMHO, extracting RDFa or microformats from HTML would give the end user much more value from Aperture than extracting plain text from bad HTML.


It was Grant Ingersoll who said at the right time 23.05.2008 20:54 the following words:
On May 23, 2008, at 7:52 AM, Antoni Myłka wrote:

Grant Ingersoll wrote:
Has anyone tried TagSoup (http://ccil.org/~cowan/XML/tagsoup/) on  

We haven't. Have you? Is it better? For our purposes reliability is
more important than speed. The webcrawler is usually constrained by
bandwidth and not by the local HTML processing, so if it is what it
says it is, it might not be a bad idea to check it out.

I haven't tried it. A friend of mine said he liked it, but that is the
only reference.  It suggests it deals with crappy HTML really well,
but many parsers claim that, so I've grown immune to it.  I find such
claims a bit dubious anyway, unless you are running at really large
scale, since they are usually based on someone who has a small set
(a few hundred at most) of "bad" documents that they've validated
against before making such claims.

At any rate, maybe sometime in the future I will have the cycles to  
try it out and write it up as an option for Aperture.
Aperture-devel mailing list


DI Leo Sauermann       http://www.dfki.de/~sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo.sauermann@dfki.de
