Re: [Htmlparser-developer] Extracting a single purpose library
From: Martin K. <Mar...@St...> - 2005-04-06 00:33:30
Hi Derrick,

> HTML Parser has a similar class as well...
> org.htmlparser.util.Translate.java:

That is the file I was originally thinking about, that's right.

> This file was arrived at via a similar mechanism to your own. It's not
> stand-alone, relying on a sort utility, to avoid two copies of each
> reference in the class, and two reference classes, for a total of about 5
> classes.

I use a class called CharacterEntityReferences that stores the collection of references (simply a map of String to Integer). To determine whether a given character is a special character (i.e. an entity reference exists for it), I use a bit field (int type, with each bit representing a character reference in [0...1000] or [8000...10000]). It works very well. But these are implementation details we can discuss later.

The goal is to provide a single solution which is well tested and reliable. I found a bug in the original Spring version (numeric references from � to 	 are not processed correctly). While looking around I also found some implementations that only support character references for characters <= 255, so these implementations are not complete with respect to the specification. So I guess there is a need for such a library.

> There was a patch provided by *Karsten Pawlik* that loaded the table from
> a resource:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401
> but this was never integrated.

I don't like that my version uses a secondary resource either. I need a TokenStream to read it, which makes it quite complex. (I use the DTD definitions provided by w3.org.) But the resource should make a great unit-test case: it contains every entity reference, so converting this file and checking that all entity references are converted would be exactly the test we want. I currently favour a version that uses something like sequences to reduce the number of lines of code needed to set up all entities.
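For illustration only, the bit-field check described above (one bit per character in [0...1000] and [8000...10000]) might look roughly like the sketch below. It swaps in java.util.BitSet for a hand-rolled int field, and the class name EntityBitField, its methods, and the way the two ranges are packed are all assumptions made up for this sketch, not the code discussed in this thread.

```java
import java.util.BitSet;

/** Hypothetical helper: one bit per code point that has an entity reference. */
final class EntityBitField {
    // The two windows of code points mentioned in the mail.
    private static final int LOW_MAX = 1000;
    private static final int HIGH_MIN = 8000;
    private static final int HIGH_MAX = 10000;

    // Both windows are packed back to back into a single BitSet (an assumption).
    private final BitSet bits =
            new BitSet((LOW_MAX + 1) + (HIGH_MAX - HIGH_MIN + 1));

    /** Maps a code point to its bit index, or -1 if it lies outside both windows. */
    private static int index(int codePoint) {
        if (codePoint >= 0 && codePoint <= LOW_MAX) {
            return codePoint;
        }
        if (codePoint >= HIGH_MIN && codePoint <= HIGH_MAX) {
            return LOW_MAX + 1 + (codePoint - HIGH_MIN);
        }
        return -1;
    }

    /** Records that an entity reference exists for the given code point. */
    void mark(int codePoint) {
        int i = index(codePoint);
        if (i < 0) {
            throw new IllegalArgumentException("outside tracked ranges: " + codePoint);
        }
        bits.set(i);
    }

    /** Returns true if the code point was marked; anything outside the windows is false. */
    boolean hasEntity(int codePoint) {
        int i = index(codePoint);
        return i >= 0 && bits.get(i);
    }

    public static void main(String[] args) {
        EntityBitField field = new EntityBitField();
        field.mark(160);   // nbsp in the HTML 4 entity list
        field.mark(8364);  // euro in the HTML 4 entity list
        System.out.println(field.hasEntity(160));
        System.out.println(field.hasEntity('a'));
        System.out.println(field.hasEntity(8364));
    }
}
```

The point of such a lookup is that the common case (a character with no entity reference) is answered without touching the String-to-Integer map at all.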
But which implementation is finally chosen does not matter much to me, as long as it is highly reliable and quite fast when it comes to the actual conversion.

> This could be broken out into a separate jar. Is that what you are
> suggesting?

Yes. I would like to have a dedicated library only for encoding and decoding strings to and from HTML. I would also like to add encode/decode URL methods with "UTF-8" as the default character encoding, so that callers do not have to use URLEncoder and URLDecoder directly, since URLDecoder.decode(String) is deprecated. So something like:

  HtmlUtils.encode(String normalString) : htmlString
  HtmlUtils.decode(String htmlString) : normalString
  HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
  HtmlUtils.decodeUrl(String url) : normalString

Renaming HtmlUtils to HtmlCoder or something similar (or HtmlConverter) would also be welcome.

Is it possible to put this library under its own SourceForge project? Like an HtmlCoder project or whatever? I guess there is some more functionality to add, such as support for streams and readers, to simplify its use. If we do this library right, there is a chance that even more projects will use it instead of their own (possibly limited) solutions.

As for ownership of the project: I would suggest that you guys handle it, if you can. You do well with the HTML parser and I guess you have what it takes. Of course I would like to contribute all my code and knowledge. I would also like to test which implementations of the decoding/encoding algorithms are finally used and to review/develop a complete set of unit tests (I don't know your test coverage, so maybe your HtmlParser codebase already has everything it takes).
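As a rough sketch of the URL half of the proposed API: the two methods could simply wrap the two-argument URLEncoder.encode and URLDecoder.decode calls with "UTF-8" fixed as the encoding. The class name UrlCoder is a placeholder invented for this example (the mail leaves the final name open), and turning the checked exception into an unchecked one is just one possible design choice.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical wrapper; only encodeUrl/decodeUrl are named in the proposal. */
final class UrlCoder {
    private static final String DEFAULT_ENCODING = "UTF-8";

    private UrlCoder() {
        // static helpers only
    }

    /** URL-encodes a plain string using UTF-8. */
    static String encodeUrl(String normalString) {
        try {
            return URLEncoder.encode(normalString, DEFAULT_ENCODING);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    /** Decodes a URL-encoded string, interpreting escaped bytes as UTF-8. */
    static String decodeUrl(String url) {
        try {
            return URLDecoder.decode(url, DEFAULT_ENCODING);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeUrl("a b&\u00fc");
        System.out.println(encoded);
        System.out.println(decodeUrl(encoded).equals("a b&\u00fc"));
    }
}
```

Callers then never see the encoding name or the checked UnsupportedEncodingException, which is the convenience the proposal is after.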
So you see, this is actually quite a complex field :-)

Cheers,

Martin (Kersten)

PS: I would also like to use named entity references when encoding an HTML string (currently the Spring version only uses numeric references, which aren't that handy if you need to read the encoded HTML).

> Derrick
>
> Martin Kersten wrote:
>
>> Dear Html-Parser developers,
>>
>> my name is Martin Kersten and I am looking for a library doing HTML
>> related conversions. Originally I started to refactor the HtmlUtils
>> class of the Spring framework. But thinking about it, it would be best
>> if such a capability (decode/encode strings from / to HTML) were
>> provided by a dedicated, tiny library. Such a library would be a relief,
>> I guess. Also, I wouldn't like to reinvent the wheel.
>>
>> In my own refactoring efforts, I ended up with a special class
>> encapsulating the named entity references and loading them from a file
>> (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html files).
>> I don't know if this effort is worth it, but from an OOP standpoint it
>> looks nice :-).
>>
>> Anyway, I would be happy if such a highly focused library were
>> out there.
>> So what do you think? Any chance that such a library can be created?
>>
>>
>> Cheers,
>>
>> Martin (Kersten)
>>
>> PS: I am not a Spring developer, I am just a Spring user who cares...
>> :-)