Re: [Htmlparser-developer] Extracting a single purpose library
From: Martin K. <Mar...@St...> - 2005-04-06 21:20:40
Hi Derrick,

> Hmmm, another project.
> Despite the large number of developers in htmlparser, it's really only me
> that is active. So in effect you are asking me if I would like to take on
> another project.
> I'm not sure.
> It's a bit heavy-weight for such a utility.

I don't think it is that difficult to set up a new project. I would
expect about 6 to 8 classes and a high-quality unit-test suite. Since
the HTML 4.0 standard does not seem to be subject to change, I guess
only bug hunting would be left (and I would not expect to see more than
a couple of bugs in that area).

> Are there other projects that would use it besides htmlparser and spring?

Jakarta projects may also be a target audience. Googling for character
entity references, I found about 5 or 6 Java projects implementing their
own solutions to this problem.

> Would spring even agree to use it?

I can ask those folks, if you wouldn't mind. They said that they would
like to replace the current implementation with a 3rd party library.
But you are right, we should make sure that they would actually use
such a small special-purpose library.

> While conceptually I agree with the concept of coalescence of disparate
> code streams, I wonder if it would gain the traction necessary in the open
> source bazaar given the ease with which a 'good enough' solution can be
> created. Is there another way?

What exactly do you mean by 'another way'?

> The encoding/decoding of URLs (%20 for space etc.) derives from a
> different origin RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
> Is that what you mean by encodeUrl() or am I missing something?

Right. The problem I mean is that you currently have to use two special
classes, URLEncoder and URLDecoder. Also, every time you use them you
end up specifying UTF-8 as the encoding scheme (since the
decode/encode(String) methods are deprecated).

Cheers,

Martin (Kersten)

>> Hi Derrick,
>>
>>> HTML Parser has a similar class as well...
>>> org.htmlparser.util.Translate.java:
>>
>> This is the file I was originally thinking about, that's right.
>>
>>> This file was arrived at via a similar mechanism to your own. It's not
>>> stand-alone, relying on a sort utility, to avoid two copies of each
>>> reference in the class, and two reference classes, for a total of about
>>> 5 classes.
>>
>> I use a class called CharacterEntityReferences storing a collection
>> of the references (simply a map of String to Integer).
>>
>> To determine if a given character is a special character (an entity
>> reference exists), I use a binary field (int type with each bit
>> representing a character reference between [0...1000] and
>> [8000...10000]). Works very well.
>>
>> But these are implementation details we can discuss later.
>> The goal is to provide a single solution which is well tested
>> and reliable. I found a bug within the original Spring version
>> (numeric references from &#0; to &#9; are not processed
>> in the right way).
>>
>> Also, looking around, I found some implementations that only
>> support character references for characters <= 255. So
>> these implementations are not complete in terms of the
>> specifications.
>>
>> So I guess there is a need for that library.
>>
>>> There was a patch provided by *Karsten Pawlik* that loaded the table
>>> from a resource:
>>>
>>> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401
>>> but this was never integrated.
>>
>> I don't like my own version using a secondary resource either.
>> You know, I need to use a TokenStream, which makes it quite
>> complex. (I use the DTD definitions provided by w3.org.)
>> But it should make a great unit-test case, since this resource
>> contains every entity reference: converting this file and checking
>> that all entity references are converted would be a necessary test,
>> which is the desired test situation.
>>
>> I currently favour a version using something like sequences to shorten
>> the number of lines of code needed to set up all entities.
>> But which implementation is finally chosen does not matter
>> much to me anyway, as long as the version is highly reliable and
>> quite fast when it comes to the actual conversion.
>>
>>> This could be broken out into a separate jar. Is that what you are
>>> suggesting?
>>
>> Yes. I would like to have a special library only for encoding and
>> decoding strings to and from HTML. Also I would like to add
>> encode/decodeURL methods that use "UTF-8" as the default character
>> encoding, to avoid using URLEncoder and URLDecoder directly,
>> since URLDecoder.decode(String) is deprecated.
>>
>> So something like:
>>
>> HtmlUtils.encode(String normalString) : htmlString
>> HtmlUtils.decode(String htmlString) : normal string
>> HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
>> HtmlUtils.decodeUrl(String url) : String normalString
>>
>> Maybe renaming HtmlUtils to HtmlCoder or something similar
>> would also be appreciated (or HtmlConverter).
>>
>> Is it possible to put this library under a separate SourceForge
>> project? Like an HtmlCoder project or whatever? I guess there is
>> some more functionality to add in order to support streams and
>> readers to simplify its use. If we get this library right,
>> there is a chance that even more projects will use it
>> instead of their own (possibly limited) solutions.
>>
>> As for the ownership of the project: I would suggest that you guys
>> handle it. You do well with the html-parser and I guess
>> you have what it takes. Of course I would like to contribute
>> all my code and knowledge. Also I would like to do some
>> testing to decide which implementation of the decoding/encoding
>> algorithms is finally used, and review/develop a complete
>> set of unit tests (I don't know your test coverage, so maybe
>> your HtmlParser codebase already has everything it
>> takes).
>>
>> So you see this is actually quite a complex field :-)
>>
>>
>> Cheers,
>>
>> Martin (Kersten)
>>
>> PS: I would also like to use named entity references
>> when encoding an HTML string (currently the Spring version
>> only uses numeric formats, which aren't that handy
>> if you need to read the encoded HTML).
>>
>>>
>>> Derrick
>>>
>>> Martin Kersten wrote:
>>>
>>>> Dear Html-Parser developers,
>>>>
>>>> my name is Martin Kersten and I am looking for a library doing HTML
>>>> related conversions. Originally I started to refactor the HtmlUtils
>>>> class of the Spring framework. But thinking about it, it would be best
>>>> if such a capability (decode/encode strings from / to HTML) were
>>>> provided by a special and tiny library. Such a library would be a
>>>> relief, I guess. Also I wouldn't like to reinvent the wheel... .
>>>>
>>>> Also, through my own refactoring efforts, I ended up with a special
>>>> class encapsulating the named entity references and loading them from
>>>> file (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html files).
>>>> I don't know if this effort is worth it, but from an OOP standpoint it
>>>> looks nice :-).
>>>>
>>>> Anyway, I would be happy if such a highly focused library were
>>>> out there.
>>>> So what do you think? Any chance that such a library can be created?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Martin (Kersten)
>>>>
>>>> PS: I am not a Spring developer, I am just a Spring user who cares... .
>>>> :-)
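
For illustration only, the encodeUrl()/decodeUrl() helpers proposed in this thread could be sketched roughly as below. This is a sketch under assumptions, not code from htmlparser or Spring: the class name HtmlCoderSketch is hypothetical (the thread only floats HtmlUtils, HtmlCoder and HtmlConverter), and it simply wraps the standard java.net.URLEncoder and URLDecoder with UTF-8 fixed as the default. Note that these JDK classes implement form encoding (application/x-www-form-urlencoded), so a space comes out as "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical class name; not part of htmlparser or Spring.
final class HtmlCoderSketch {

    // Assumption: UTF-8 is the fixed default encoding, as suggested in the thread.
    private static final String DEFAULT_ENCODING = "UTF-8";

    private HtmlCoderSketch() {
        // static helpers only, no instances
    }

    /** Form-encodes a plain string using UTF-8, so callers never pass an encoding. */
    static String encodeUrl(String normalString) {
        try {
            return URLEncoder.encode(normalString, DEFAULT_ENCODING);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    /** Decodes a form-encoded string using UTF-8. */
    static String decodeUrl(String url) {
        try {
            return URLDecoder.decode(url, DEFAULT_ENCODING);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }
}
```

With this sketch, HtmlCoderSketch.encodeUrl("a b&c") returns "a+b%26c", and decodeUrl() applied to that result gives back "a b&c". Wrapping the checked UnsupportedEncodingException is a design choice made here so that callers need no try/catch for an encoding that is always available.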