Re: [Htmlparser-developer] Extracting a single purpose library
From: Derrick O. <Der...@Ro...> - 2005-04-06 11:19:43
Martin,

Hmmm, another project.

Despite the large number of developers in htmlparser, it's really only me that is active. So in effect you are asking me if I would like to take on another project. I'm not sure.

It's a bit heavy-weight for such a utility. Are there other projects that would use it besides htmlparser and spring? Would spring even agree to use it? While I agree in principle with the coalescence of disparate code streams, I wonder if it would gain the traction necessary in the open source bazaar given the ease with which a 'good enough' solution can be created. Is there another way?

The encoding/decoding of URLs (%20 for space etc.) derives from a different origin, RFC 2396:

http://www.ietf.org/rfc/rfc2396.txt

Is that what you mean by encodeUrl() or am I missing something?

Derrick

Martin Kersten wrote:

> Hi Derrick,
>
>> HTML Parser has a similar class as well...
>> org.htmlparser.util.Translate.java:
>
> This file I was originally thinking about, that's right.
>
>> This file was arrived at via a similar mechanism to your own. It's
>> not stand-alone, relying on a sort utility, to avoid two copies of
>> each reference in the class, and two reference classes, for a total
>> of about 5 classes.
>
> I use a class called CharacterEntityReferences storing a collection
> of the references (simply a map of String to Integer).
>
> To determine if a given character is a special character (an entity
> reference exists), I use a binary field (int type with each bit
> representing a character reference between [0...1000] and
> [8000...10000]). Works very well.
>
> But these are implementation details we can discuss later.
> The goal is to provide a single solution which is well tested
> and reliable. I found a bug within the original spring version
> (numeric references of � till of 	 are not processed
> in the right way).
>
> Also I was looking around finding some implementations only
> supporting character references for characters <=255.
> So these implementations are not complete in terms of the
> specifications.
>
> So I guess there is a need for that library.
>
>> There was a patch provided by *Karsten Pawlik* that loaded the table
>> from a resource:
>>
>> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401
>>
>> but this was never integrated.
>
> I don't like my version using a secondary resource either.
> You know I need to use a TokenStream which makes it quite
> complex. (I use the DTD definitions provided by w3.org).
> But it should give a great unit-test case, since this resource
> contains every entity reference, so converting this file and checking
> that all entity references are converted would be a necessary test,
> which is the desired test situation.
>
> I currently favour a version using something like sequences to shorten
> the needed amount of lines of code to set up all entities.
> But which implementation is finally chosen does not matter
> much to me anyway, as long as the version is highly reliable and
> quite fast when it comes to the actual conversion.
>
>> This could be broken out into a separate jar. Is that what you are
>> suggesting?
>
> Yes. I would like to have a special library for only encoding and
> decoding strings to HTML. Also I would like to add
> encode/decodeURL methods to avoid using URLEncoder and
> URLDecoder by also adding "UTF-8" as default encoding character
> set, since URLDecoder.decode(String) is deprecated.
>
> So something like:
>
> HtmlUtils.encode(String normalString) : htmlString
> HtmlUtils.decode(String htmlString) : normal string
> HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
> HtmlUtils.decodeUrl(String url) : String normalString
>
> Maybe renaming the HtmlUtils to HtmlCoder or something similar
> would also be appreciated. (or HtmlConverter)
>
> Is it possible to put this library under a special sourceforge
> project? Like HtmlCoder project or whatever?
> I guess there is
> some more functionality to add in order to support streams and
> readers to simplify its use. If we do this library right,
> there is a chance that even more projects will use this library
> instead of their own (possibly limited) solutions.
>
> For the ownership of the project: I would suggest that you guys
> handle it. You do well with the html-parser and I guess
> you have what it takes. Of course I would like to contribute
> all my code and knowledge. Also I would like to do some
> testing of which implementation of the decoding/encoding
> algorithms is finally used and review/develop a complete
> set of unit-tests (I don't know your test coverage so maybe
> your HtmlParser codebase already has everything it
> takes).
>
> So you see this is actually a quite complex field :-)
>
>
> Cheers,
>
> Martin (Kersten)
>
> PS: I would also like to use named entity references
> when encoding a html-string (currently the Spring version
> only uses numeric formats, which aren't that handy
> if you need to read the encoded html).
>
>>
>> Derrick
>>
>> Martin Kersten wrote:
>>
>>> Dear Html-Parser developers,
>>>
>>> my name is Martin Kersten and I am looking for a library doing
>>> HTML related conversions. Originally I started to refactor the
>>> HtmlUtils
>>> class of the Spring framework. But thinking about it, it would be best
>>> if such a capability (decode/encode strings from / to Html) would be
>>> provided by a special and tiny library. Such a library would be a
>>> relief, I guess. Also I wouldn't like to invent the wheel twice... .
>>>
>>> Also, through my own refactoring efforts, I ended up with a special class
>>> encapsulating the named entity references and loading them from a file
>>> (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html files).
>>> I don't know if this effort is worth it but from an OOP standpoint
>>> it looks nice :-).
>>>
>>> Anyways, I would be happy if such a highly focused library would
>>> be out there.
>>> So what do you think? Any chance that such a library can be created?
>>>
>>>
>>> Cheers,
>>>
>>> Martin (Kersten)
>>>
>>> PS: I am not a Spring developer, I am just a Spring user who
>>> cares... . :-)
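
For illustration, the four methods proposed in the thread (encode, decode, encodeUrl, decodeUrl) could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not the htmlparser Translate class and not Spring's HtmlUtils: the class name HtmlCoderSketch, the five-entry entity table and the helper names register and lookup are hypothetical, and a real library would load the complete HTML 4 entity set. Note also that java.net.URLEncoder produces form encoding (a space becomes '+'), which differs from the %20 escaping of RFC 2396 mentioned above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch class; the name and the tiny entity table are assumptions.
final class HtmlCoderSketch {
    private static final Map<String, Integer> NAME_TO_CODE = new HashMap<String, Integer>();
    private static final Map<Integer, String> CODE_TO_NAME = new HashMap<Integer, String>();

    static {
        // Illustrative subset only; includes one entity above 255 (euro).
        register("amp", '&');
        register("lt", '<');
        register("gt", '>');
        register("quot", '"');
        register("euro", 0x20AC);
    }

    private static void register(String name, int code) {
        NAME_TO_CODE.put(name, code);
        CODE_TO_NAME.put(code, name);
    }

    // Uses a named reference when one is known, a numeric one for other non-ASCII.
    static String encode(String plain) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < plain.length(); ) {
            int cp = plain.codePointAt(i);
            i += Character.charCount(cp);
            String name = CODE_TO_NAME.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else if (cp > 126) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }

    // Resolves named, decimal and hexadecimal references; unknown ones are kept as-is.
    static String decode(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '&') ? html.indexOf(';', i) : -1;
            Integer code = (end > i + 1) ? lookup(html.substring(i + 1, end)) : null;
            if (code == null) {
                out.append(c);
                i++;
            } else {
                out.appendCodePoint(code);
                i = end + 1;
            }
        }
        return out.toString();
    }

    // Returns the code point for a reference body such as "lt", "#233" or "#xE9".
    private static Integer lookup(String ref) {
        if (!ref.startsWith("#")) {
            return NAME_TO_CODE.get(ref);
        }
        try {
            boolean hex = ref.length() > 1 && (ref.charAt(1) == 'x' || ref.charAt(1) == 'X');
            int code = hex ? Integer.parseInt(ref.substring(2), 16)
                           : Integer.parseInt(ref.substring(1));
            return Character.isValidCodePoint(code) ? Integer.valueOf(code) : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Form-style URL encoding with UTF-8 as the fixed character set.
    static String encodeUrl(String plain) {
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decodeUrl(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Fixing the character set inside encodeUrl and decodeUrl is what avoids the deprecated single-argument URLDecoder.decode(String) mentioned in the thread. Using named references on output follows Martin's PS; whether that is preferable to purely numeric output is a design choice the thread leaves open.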