Re: [Htmlparser-developer] Extracting a single purpose library


Hi Derrick,

> Hmmm, another project.
> Despite the large number of developers in htmlparser, it's really only me 
> that is active. So in effect you are asking me if I would like to take on 
> another project.
> I'm not sure.
> It's a bit heavy-weight for such a utility.

I don't think it is that difficult to form a new project. I
would expect to see about 6 or 8 classes and a high-quality
unit-test suite. Since the HTML 4.0 standard does not seem to be
subject to change, I guess there would be only bug hunting left (and
I would guess there isn't much chance of seeing more than a
couple of bugs in that area).

> Are there other projects that would use it besides htmlparser and spring?

Jakarta projects may also be a target audience. Googling
for character entity references, I found about 5 or 6 Java projects
implementing their own solutions to this problem.

> Would spring even agree to use it?

I can ask those folks, if you don't mind. They said
that they would like to replace the current implementation
with a 3rd-party library. But you are right, we should
make sure that they would actually use such a small
special-purpose library.
> While conceptually I agree with the concept of coalescence of disparate 
> code streams, I wonder if it would gain the traction necessary in the open 
> source bazaar given the ease with which a 'good enough' solution can be 
> created. Is there another way?

What exactly do you mean by 'another way'?

> The encoding/decoding of URLs (%20 for space etc.) derives from a 
> different origin RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
> Is that what you mean by encodeUrl() or am I missing something?

Right. The problem I mean is that currently you have two special classes
to use: URLEncoder and URLDecoder. Also, every time you use them, you
end up specifying UTF-8 as the encoding scheme (since the
decode/encode(String) methods are deprecated).
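
To make this concrete, here is a minimal sketch of what I have in mind.
The class name UrlCodec and its method names are only placeholders, not
an existing API of htmlparser or Spring. It simply wraps the two JDK
classes and fixes the encoding to UTF-8. Note that URLEncoder produces
the application/x-www-form-urlencoded form (a space becomes "+"), which
is not exactly the escaping described in RFC 2396.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper: the name and API are assumptions for illustration. */
final class UrlCodec {

    private static final String UTF_8 = "UTF-8";

    private UrlCodec() {
        // static utility, not meant to be instantiated
    }

    /** Encodes a plain string using URLEncoder with UTF-8. */
    static String encodeUrl(String plain) {
        try {
            return URLEncoder.encode(plain, UTF_8);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Decodes a string using URLDecoder with UTF-8. */
    static String decodeUrl(String encoded) {
        try {
            return URLDecoder.decode(encoded, UTF_8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this, callers write UrlCodec.encodeUrl("a b&c") and no longer have
to pass the encoding name or handle the checked exception themselves.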

Cheers,

Martin (Kersten)

>> Hi Derrick,
>>
>>> HTML Parser has a similar class as well... 
>>> org.htmlparser.util.Translate.java:
>>
>>
>> This is the file I was originally thinking of, that's right.
>>
>>> This file was arrived at via a similar mechanism to your own. It's not 
>>> stand-alone, relying on a sort utility, to avoid two copies of each 
>>> reference in the class, and two reference classes, for a total of about 
>>> 5 classes.
>>
>>
>> I use a class called CharacterEntityReferences storing a collection
>> of the references (simply a map of String to Integer).
>>
>> To determine if a given character is a special character (i.e. an
>> entity reference exists for it), I use a bit field (int type with each
>> bit representing a character reference in the ranges [0...1000] and
>> [8000...10000]). Works very well.
>>
>> But these are implementation details we can discuss later.
>> The goal is to provide a single solution which is well tested
>> and reliable. I found a bug in the original Spring version
>> (the numeric references &#0; through &#9; are not processed
>> correctly).
>>
>> Also, while looking around, I found some implementations that only
>> support character references for characters <= 255. So
>> these implementations are not complete in terms of the
>> specification.
>>
>> So I guess there is a need for that library.
>>
>>> There was a patch provided by *Karsten Pawlik* that loaded the table 
>>> from a resource:
>>>
>>> http://sourceforge.net/tracker/index.php?func=detail&aid=897297&group_id=24399&atid=381401
>>> but this was never integrated.
>>
>>
>> I don't like that my version uses a secondary resource either.
>> You know, I need to use a TokenStream, which makes it quite
>> complex. (I use the DTD definitions provided by w3.org.)
>> But it should make a great unit-test case: since this resource
>> contains every entity reference, converting this file and checking
>> that all entity references are converted would be a necessary test,
>> and exactly the kind of test we want.
>>
>> I currently favour a version using something like sequences to reduce
>> the number of lines of code needed to set up all entities.
>> But which implementation is finally chosen does not matter
>> much to me anyway, as long as the version is highly reliable and
>> quite fast when it comes to the actual conversion.
>>
>>> This could be broken out into a separate jar. Is that what you are 
>>> suggesting?
>>
>>
>> Yes. I would like to have a special library only for encoding and
>> decoding strings to and from HTML. Also I would like to add
>> encodeUrl/decodeUrl methods that default to "UTF-8" as the character
>> encoding, to avoid using URLEncoder and URLDecoder directly,
>> since URLDecoder.decode(String) is deprecated.
>>
>> So something like:
>>
>> HtmlUtils.encode(String normalString) : htmlString
>> HtmlUtils.decode(String htmlString) : normalString
>> HtmlUtils.encodeUrl(String normalString) : URL (UTF-8)
>> HtmlUtils.decodeUrl(String url) : String normalString
>>
>> Maybe renaming the HtmlUtils to HtmlCoder or something similar
>> would also be appreciated. (or HtmlConverter)
>>
>> Is it possible to put this library under a separate SourceForge
>> project? Like an HtmlCoder project or whatever? I guess there is
>> some more functionality to add in order to support streams and
>> readers to simplify its use. If we do this library right,
>> there is a chance that even more projects will use it
>> instead of their own (possibly limited) solutions.
>>
>> As for the ownership of the project: I would suggest that you guys
>> handle it, if you can. You do well with the html-parser and I guess
>> you have what it takes. Of course I would like to contribute
>> all my code and knowledge. Also I would like to do some
>> testing to decide which implementation of the decoding/encoding
>> algorithms is finally used, and review/develop a complete
>> set of unit tests (I don't know your test coverage, so maybe
>> your HtmlParser codebase already has everything it
>> takes).
>>
>> So you see this is actually a quite complex field :-)
>>
>>
>> Cheers,
>>
>> Martin (Kersten)
>>
>> PS: I would also like to use named entity references
>> when encoding an HTML string (currently the Spring version
>> only uses numeric references, which aren't that handy
>> if you need to read the encoded HTML).
>>
>>>
>>> Derrick
>>>
>>> Martin Kersten wrote:
>>>
>>>> Dear Html-Parser developers,
>>>>
>>>>   my name is Martin Kersten and I am looking for a library doing HTML 
>>>> related conversions. Originally I started to refactor the HtmlUtils
>>>> class of the Spring framework. But thinking about it, it would be best
>>>> if such a capability (decoding/encoding strings from/to HTML) were
>>>> provided by a special and tiny library. Such a library would be a 
>>>> relief, I guess. Also I wouldn't like to reinvent the wheel... .
>>>>
>>>> Also, through my own refactoring efforts, I ended up with a special class
>>>> encapsulating the named entity references and loading them from a file
>>>> (I used the http://www.w3.org/TR/REC-html40/sgml/entities.html files).
>>>> I don't know if this effort is worth it, but from an OOP standpoint it 
>>>> looks nice :-).
>>>>
>>>> Anyways, I would be happy if such a highly focused library would
>>>> be out there.
>>>> So what do you think? Any chance that such a library can be created?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Martin (Kersten)
>>>>
>>>> PS: I am not a Spring developer, I am just a Spring user who cares... . 
>>>> :-)
>>>>