Re: [Htmlparser-developer] Extracting a single purpose library


Creating the project isn't a problem. Maintaining it is.
Another way might be to keep the translate library in the htmlparser 
code stream and then it gets the build, distribute, testing etc. for free.

Martin Kersten wrote:

> Hi Derrick,
>
>> Hmmm, another project.
>> Despite the large number of developers in htmlparser, it's really 
>> only me that is active. So in effect you are asking me if I would 
>> like to take on another project.
>> I'm not sure.
>> It's a bit heavy-weight for such a utility.
>
>
> I don't think it is that difficult to form a new project. I
> would expect to see about 6 or 8 classes and a high quality
> unit-test suite. Since the HTML 4.0 standard does not seem to be
> subject to change, I guess there would be only bug hunting left (I would
> guess there isn't much chance of seeing more than a
> couple of bugs in that area).
>
>> Are there other projects that would use it besides htmlparser and 
>> spring?
>
>
> Maybe the Jakarta projects would also be a target audience. Googling
> for character entity references, I found about 5 or 6 Java projects
> implementing their own solutions to this problem.
>
>> Would spring even agree to use it?
>
>
> I can ask those folks, if you don't mind. They said
> that they would like to replace the current implementation
> with a 3rd party library. But you are right, we should
> make sure that they would use such a small special-purpose
> library.
>
>> While conceptually I agree with the concept of coalescence of 
>> disparate code streams, I wonder if it would gain the traction 
>> necessary in the open source bazaar given the ease with which a 'good 
>> enough' solution can be created. Is there another way?
>
>
> What exactly do you mean by 'another way'?
>
>> The encoding/decoding of URLs (%20 for space etc.) derives from a 
>> different origin, RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
>> Is that what you mean by encodeUrl() or am I missing something?
>
>
> Right. The problem I mean is that currently you have two special classes
> to use: URLEncoder and URLDecoder. Also, every time you use them, you
> end up specifying UTF-8 as the encoding scheme (since the
> decode/encode(String) methods are deprecated).
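
A minimal sketch of the wrapper described above, assuming the hypothetical names UrlCoder, encodeUrl and decodeUrl (none of them exist in htmlparser or Spring). It only fixes the encoding to UTF-8 and hides the checked exception. Note that java.net.URLEncoder produces application/x-www-form-urlencoded output, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper: one class instead of URLEncoder plus URLDecoder,
// with UTF-8 as the fixed character encoding.
final class UrlCoder {
    private static final String UTF_8 = "UTF-8";

    private UrlCoder() {
    }

    static String encodeUrl(String plain) {
        try {
            return URLEncoder.encode(plain, UTF_8);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    static String decodeUrl(String encoded) {
        try {
            return URLDecoder.decode(encoded, UTF_8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Callers would then write UrlCoder.encodeUrl("a b&c") and never mention the encoding name themselves.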
>
>
> Cheers,
>
> Martin (Kersten)
>
>>> Hi Derrick,
>>>
>>>> HTML Parser has a similar class as well... 
>>>> org.htmlparser.util.Translate.java:
>>>
>>>
>>>
>>> This file is the one I was originally thinking about, that's right.
>>>
>>>> This file was arrived at via a similar mechanism to your own. It's 
>>>> not stand-alone, relying on a sort utility, to avoid two copies of 
>>>> each reference in the class, and two reference classes, for a total 
>>>> of about 5 classes.
>>>
>>>
>>>
>>> I use a class called CharacterEntityReferences storing a collection
>>> of the references (simply map of String to Integer).
>>>
>>> To determine if a given character is a special character (an entity
>>> reference exists), I use a bit field (int type, with each bit
>>> representing a character reference in [0...1000] or
>>> [8000...10000]). Works very well.
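
The membership test described above could look roughly like the sketch below. It swaps the hand-rolled int bit field for java.util.BitSet, which stores one bit per index in the same way. The class name EntityCharSet and the six sample code points are assumptions for illustration; a real table would register every HTML 4.0 entity.

```java
import java.util.BitSet;

// Hypothetical sketch: one bit per character code, set when an entity
// reference exists for that character.
final class EntityCharSet {
    private static final BitSet HAS_REFERENCE = new BitSet(10001);

    static {
        // Sample only: quot, amp, lt, gt, nbsp, euro.
        int[] sample = {34, 38, 60, 62, 160, 8364};
        for (int code : sample) {
            HAS_REFERENCE.set(code);
        }
    }

    private EntityCharSet() {
    }

    static boolean hasReference(char c) {
        // BitSet.get returns false for indices that were never set.
        return HAS_REFERENCE.get(c);
    }
}
```

The lookup is a constant-time bit test, so an encoder can skip the name lookup entirely for ordinary characters.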
>>>
>>> But these are implementation details we can discuss later.
>>> The goal is to provide a single solution which is well tested
>>> and reliable. I found a bug within the original Spring version
>>> (numeric references from &#0; through &#9; are not processed
>>> correctly).
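
For illustration, here is a small decoder for decimal numeric references that treats single-digit references such as &#9; like any other. This is only a sketch under assumptions: the class name NumericReferenceDecoder is made up, it is not the Spring code, and hexadecimal references (&#x...;) and named references are deliberately left untouched.

```java
// Hypothetical sketch: decodes decimal numeric character references.
final class NumericReferenceDecoder {

    private NumericReferenceDecoder() {
    }

    static String decode(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            if (html.startsWith("&#", i)) {
                int end = html.indexOf(';', i + 2);
                // At least one digit is required between "&#" and ";".
                if (end > i + 2) {
                    int code = parseDecimal(html, i + 2, end);
                    if (code >= 0) {
                        out.appendCodePoint(code);
                        i = end + 1;
                        continue;
                    }
                }
            }
            // Not a well-formed decimal reference: copy the character as-is.
            out.append(html.charAt(i));
            i++;
        }
        return out.toString();
    }

    // Returns the value of the digits in s[from, to), or -1 if any character
    // is not a decimal digit or the value exceeds the Unicode range.
    private static int parseDecimal(String s, int from, int to) {
        int value = 0;
        for (int k = from; k < to; k++) {
            char c = s.charAt(k);
            if (c < '0' || c > '9') {
                return -1;
            }
            value = value * 10 + (c - '0');
            if (value > Character.MAX_CODE_POINT) {
                return -1;
            }
        }
        return value;
    }
}
```

A unit test for the bug mentioned above would feed in each of &#0; through &#9; and check that exactly one character with the matching code comes back.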
>>>
>>> Also, looking around, I found some implementations that only
>>> support character references for characters <= 255. So
>>> these implementations are not complete in terms of the
>>> specification.
>>>
>>> So I guess there is a need for that library.
>>>
>>>> There was a patch provided by *Karsten Pawlik* that loaded the 
>>>> table from a resource:
>>>>
>>>> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401 
>>>>
>>>> but this was never integrated.
>>>
>>>
>>>
>>> I don't like my version using a secondary resource either.
>>> I need to use a TokenStream, which makes it quite
>>> complex. (I use the DTD definitions provided by w3.org.)
>>> But it should make a great unit-test case: since this resource
>>> contains every entity reference, converting this file and checking
>>> that all entity references are converted would be exactly the
>>> test we need.
>>>
>>> I currently favour a version using something like sequences to reduce
>>> the number of lines of code needed to set up all entities.
>>> But which implementation is finally chosen does not matter
>>> much to me anyway, as long as the version is highly reliable and
>>> quite fast when it comes to the actual conversion.
>>>
>>>> This could be broken out into a separate jar. Is that what you are 
>>>> suggesting?
>>>
>>>
>>>
>>> Yes. I would like to have a dedicated library only for encoding and
>>> decoding strings to and from HTML. I would also like to add
>>> encodeUrl/decodeUrl methods that default to "UTF-8" as the character
>>> encoding, to avoid using URLEncoder and URLDecoder directly,
>>> since URLDecoder.decode(String) is deprecated.
>>>
>>> So something like:
>>>
>>> HtmlUtils.encode(String normalString) : htmlString
>>> HtmlUtils.decode(String htmlString) : normal string
>>> HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
>>> HtmlUtils.decodeUrl(String url) : String normalString
>>>
>>> Maybe renaming the HtmlUtils to HtmlCoder or something similar
>>> would also be appreciated. (or HtmlConverter)
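
To make the proposed encode/decode pair concrete, here is a rough sketch that prefers named entity references and falls back to numeric ones. Everything in it is an assumption for illustration: the class name HtmlCoderSketch, the six-entry table (a real library would load all HTML 4.0 entities), and the choice to emit numeric references for other non-ASCII characters. The decode method only resolves named references from the table; numeric references are left as they are.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed encode/decode API with a tiny table.
final class HtmlCoderSketch {
    private static final Map<Integer, String> NAME_BY_CODE = new HashMap<>();
    private static final Map<String, Integer> CODE_BY_NAME = new HashMap<>();

    static {
        // Sample entries only, not the full HTML 4.0 entity set.
        register("quot", '"');
        register("amp", '&');
        register("lt", '<');
        register("gt", '>');
        register("nbsp", 0x00A0);
        register("euro", 0x20AC);
    }

    private HtmlCoderSketch() {
    }

    private static void register(String name, int code) {
        NAME_BY_CODE.put(code, name);
        CODE_BY_NAME.put(name, code);
    }

    static String encode(String plain) {
        StringBuilder out = new StringBuilder(plain.length() + 16);
        for (int i = 0; i < plain.length(); ) {
            int cp = plain.codePointAt(i);
            i += Character.charCount(cp);
            String name = NAME_BY_CODE.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';');   // named reference
            } else if (cp > 127) {
                out.append("&#").append(cp).append(';');    // numeric fallback
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    static String decode(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '&') ? html.indexOf(';', i + 1) : -1;
            Integer code = (end > i + 1)
                    ? CODE_BY_NAME.get(html.substring(i + 1, end))
                    : null;
            if (code != null) {
                out.appendCodePoint(code);
                i = end + 1;
            } else {
                out.append(c);   // unknown or malformed reference: keep text
                i++;
            }
        }
        return out.toString();
    }
}
```

Keeping both lookup directions in one register call means the encode and decode tables cannot drift apart.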
>>>
>>> Is it possible to put this library under a separate SourceForge
>>> project? Like an HtmlCoder project or whatever? I guess there is
>>> some more functionality to add in order to support streams and
>>> readers to simplify its use. If we do this library right,
>>> there is a chance that even more projects will use it
>>> instead of their own (possibly limited) solutions.
>>>
>>> For the ownership of the project: I would suggest that you guys
>>> handle it. You do well with the html-parser and I guess
>>> you have what it takes. Of course I would like to contribute
>>> all my code and knowledge. I would also like to do some
>>> testing to decide which implementation of the decoding/encoding
>>> algorithms is finally used, and review/develop a complete
>>> set of unit tests (I don't know your test coverage, so maybe
>>> your HtmlParser codebase already has everything it
>>> takes).
>>>
>>> So you see this is actually a quite complex field :-)
>>>
>>>
>>> Cheers,
>>>
>>> Martin (Kersten)
>>>
>>> PS: I would also like to use named entity references
>>> when encoding an HTML string (currently the Spring version
>>> only uses numeric references, which aren't that handy
>>> if you need to read the encoded HTML).
>>>
>>>>
>>>> Derrick
>>>>
>>>> Martin Kersten wrote:
>>>>
>>>>> Dear Html-Parser developers,
>>>>>
>>>>>   my name is Martin Kersten and I am looking for a library doing
>>>>> HTML-related conversions. Originally I started to refactor the
>>>>> HtmlUtils class of the Spring framework. But thinking about it,
>>>>> it would be best if such a capability (decoding/encoding strings
>>>>> from/to HTML) were provided by a dedicated, tiny library. Such a
>>>>> library would be a relief, I guess. Also, I wouldn't like to
>>>>> reinvent the wheel.
>>>>>
>>>>> Also, through my own refactoring efforts, I ended up with a special
>>>>> class encapsulating the named entity references and loading them
>>>>> from a file
>>>>> (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html
>>>>> files).
>>>>> I don't know if this effort is worth it, but from an OOP
>>>>> standpoint it looks nice :-).
>>>>>
>>>>> Anyways, I would be happy if such a highly focused library would
>>>>> be out there.
>>>>> So what do you think? Any chance that such a library can be created?
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Martin (Kersten)
>>>>>
>>>>> PS: I am not a Spring developer, I am just a Spring user who 
>>>>> cares... . :-)
>>>>>
>