Re: [Htmlunit-user] Handling of entities in htmlunit?

we now have
(3) <verifyxpath xpath="/html/body/h2" text="Resultate f&#x00fc;r das Team 33" />
(4) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;#x00fc;r das Team 33" />

Solution (3) is equivalent to solution (1), in that the webtest method
only sees the single character 'ü', whereas in solution (4) the test
method sees the eight characters '&' '#' 'x' '0' '0' 'f' 'c' ';' and must
convert this character sequence to the single 'ü'.
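
A minimal sketch of that difference, assuming the standard JAXP DOM parser stands in for the xml parser used by ant. The class CharRefDemo and the helper resolveNumericRefs are hypothetical names for illustration, not part of webtest or htmlunit:

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {
    // Matches hexadecimal (&#xFC;) and decimal (&#252;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    /** Returns the "text" attribute of the root element as the XML parser delivers it. */
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("text");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    /**
     * Hypothetical helper: replaces numeric character references with the
     * characters they denote. Out-of-range values are not validated here.
     */
    static String resolveNumericRefs(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            out.appendCodePoint(codePoint);
            last = m.end();
        }
        return out.append(s.substring(last)).toString();
    }

    public static void main(String[] args) {
        String v3 = parsedText(
                "<verifyxpath xpath=\"/html/body/h2\" text=\"Resultate f&#x00fc;r das Team 33\" />");
        String v4 = parsedText(
                "<verifyxpath xpath=\"/html/body/h2\" text=\"Resultate f&amp;#x00fc;r das Team 33\" />");
        // (3): the parser has already resolved the reference to one character.
        System.out.println(v3.equals("Resultate f\u00fcr das Team 33"));
        // (4): the parser only unescapes &amp;, the reference stays literal text.
        System.out.println(v4.equals("Resultate f&#x00fc;r das Team 33"));
        // The extra conversion step that (4) requires from the test method.
        System.out.println(resolveNumericRefs(v4).equals(v3));
    }
}
```

In this sketch the conversion for (4) happens on the test side, so nothing in htmlunit has to change.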

Isn't it interesting that, as Brad noted, his e-mail application
transformed (3) into (1)? It shows that keeping text valid across
platforms entails far more than 'a character-set declaration and a good
editor': I've also seen text files corrupted when transferred over cvs
(i.e. checked in from a pc, checked out on a solaris account), copied to a
usb stick, filtered with ant, ...
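
The kind of corruption meant here can be reproduced in a few lines. This is only a sketch under the assumption that the damage comes from bytes written in one encoding being read in another; EncodingMismatchDemo and misread are hypothetical names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    /**
     * Encodes the text with one charset and decodes the resulting bytes with
     * another, as happens when a tool reads a file with the wrong encoding.
     */
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // "\u00fc" is the u-umlaut; the escape keeps this source file pure ASCII.
        String word = "f\u00fcr";

        // Written as UTF-8 (two bytes for the umlaut), read as ISO-8859-1:
        // the umlaut shows up as two unrelated characters.
        System.out.println(misread(word, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // Written as ISO-8859-1 (single byte 0xFC), read as UTF-8:
        // the byte is malformed UTF-8 and is replaced by U+FFFD.
        System.out.println(misread(word, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Pure ASCII text survives either way, which is the appeal of entities.
        System.out.println(misread("f&uuml;r", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```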

...

So the experts' opinion seems to be not to change htmlunit but to 'play'
with the webtest's parameters instead.
I'll try this :-)

Thank you for your help!
	dna

On 5 Apr. 05, at 09:23, Marc Guillemot wrote:

> Hi,
>
> I think there are two issues here: entity resolution and character
> encoding.
>
> Conversion from &uuml; to ü is "only" entity resolution, and I don't
> think that the character encoding matters here (except that my email
> is iso-8859-1 encoded).
> Solution (3) should probably have been:
> (3) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;#x00fc;r das Team 33" />
> In this case, when the text is interpreted as html, the ü is specified
> by its unicode code, but that doesn't mean that the character encoding
> of the file is unicode.
>
> I agree with Brad that it should be the job of webtest to convert (2)
> or (3) to (1). As for the character encoding of ant files, it
> should be declared in the first line of the file, like:
> <?xml version="1.0" encoding="ISO-8859-1"?>
> and serious editors should respect it on all OSes.
>
> The issue that could concern htmlunit would be: do we want to be able
> to know how the source of a node was written, or are we only
> interested in its representation in the page's encoding? Personally I
> don't see the point of being able to determine whether, for instance, a
> <b>für</b> was coded as <b>für</b>, <b>f&uuml;r</b> or
> <b>f&#x00fc;r</b>. The only case in which I could imagine it being
> of interest is for someone who would like to ensure that ONLY
> ascii characters are used in pages (but why?). In this case I would
> rather perform the test on the whole page's code than on single
> nodes.
>
> Marc.
>
> Brad Clarke wrote:
>> It seems like (1) would be the appropriate thing to do. While I
>> understand your cross-platform concern, I think the appropriate place
>> to address that concern would be as close to the problem as possible:
>> create a way for webtest to read something like (2) or (3) and turn it
>> into (1) before passing it along to htmlunit (I know little of webtest
>> or ant's encoding issues but being xml based I'm actually quite
>> surprised you can't do this already).
>> Also, after briefly looking at the character set determination in
>> htmlunit it's entirely possible that we're interpreting something
>> incorrectly since I don't see anything that will use a char set in the
>> html file (which seems a little backwards anyway but it should be
>> possible).
>> Brad C
>> PS Yahoo Mail converted your (3) into (1) when I replied to this
>> message! :D
>> --- "Denis N. Antonioli" <den...@ca...> wrote:
>>> what I really want is to write simple, robust tests that handle
>>> extended character sets (internationalization?) without excessive
>>> magic.
>>>
>>> In the present situation, I have an html page that comes from the
>>> server with an encoding and gets translated by htmlunit into a
>>> different encoding. As far as I know, that last encoding can be
>>> determined by the web server, by the content of the html page or by
>>> some default.
>>> On the other side, I write tests in xml documents which are again
>>> translated by an xml parser (ant) into a different encoding.
>>>
>>> The first problem I have is: how to code non-ascii characters
>>> (such as an uuml or a nbsp) in the xml document?
>>>
>>> (1) <verifyxpath xpath="/html/body/h2" text="Resultate für das Team 33" />
>>> (2) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;uuml;r das Team 33" />
>>> (3) <verifyxpath xpath="/html/body/h2" text="Resultate für das Team 33" />
>>>
>>>
>>> Then comes: how to do it in a way that makes the xml document
>>> robust in a multi-platform environment? I have seen enough files
>>> with ü or é written on a PC becoming unreadable once moved to, e.g.,
>>> a Mac or a Linux computer. I'd rather not use solution (1).
>>>
>>> So, in solution (2), there is a comparison of the ascii characters
>>> '&uuml;', which will most probably be the same in all situations.
>>> In solution (3), am I sure that htmlunit will always translate to
>>> unicode?
>>>
>>>
>>> Best
>>> 	dna
>>>
>>>
>>> On 1 Apr. 05, at 21:09, Brad Clarke wrote:
>>>
>>>
>>>> I don't really understand why you'd want this. Once the document
>>>> has already been parsed, why does it matter if it was an entity or
>>>> not? What exactly are you testing that you need to know an entity
>>>> was used?
>>>>
>>>> The test failures are a known bug in a supporting library that will
>>>> be fixed when that library is released again.
>>>>
>>>> Brad C
>>>>
>>>> --- "Denis N. Antonioli" <den...@ca...> wrote:
>>>>
>>>>
>>>>> Hi Marc
>>>>>
>>>>> I had to find the time to look into it before posting again...
>>>>>
>>>>> On a practical level: I locally modified the head htmlunit from
>>>>> cvs to behave as proposed, and the changes amount to less than 10
>>>>> statements in the production code, including a setter/getter
>>>>> pair. The included unit tests, at least, show no impact on the
>>>>> rest of the functionality.
>>>>>
>>>>> I see your point: most (all?) common browsers give us only one view
>>>>> of the original document. I do think though that the functionality
>>>>> belongs in htmlunit, for three main reasons.
>>>>>
>>>>> First, I find it easier to leave the entities untouched, rather than
>>>>> converting them to characters in htmlunit and converting the
>>>>> characters back to entities in webtest.
>>>>>
>>>>> Second, the two conversions won't be equivalent to the original
>>>>> document. Should the original document contain a mix of entities
>>>>> and characters, the final document will contain only characters (as
>>>>> now) or only entities.
>>>>>
>>>>> Third, the need for unconverted entities may arise for other users of
>>>>> htmlunit. The necessary conversion code would then be replicated.
>>>>>
>>>>>
>>>>> By the way, I could not test htmlunit with maven 1.0.2: maven
>>>>> complained about attempting to execute scripts that had been
>>>>> garbage collected. Is this known?
>>>>>
>>>>>
>>>>> Best
>>>>> 	dna
>>>>>
>>>>> On 25 Mar. 05, at 13:53, Marc Guillemot wrote:
>>>>>
>>>>>
>>>>>> Hi Denis,
>>>>>>
>>>>>> I now think that htmlunit should resolve the entities as it does,
>>>>>> and that it would be wrong to have the entity code "as is". My
>>>>>> motivation comes from the comparison with browsers: except view
>>>>>> source, which is comparable to WebResponse.getContentAsString(),
>>>>>> the different methods to access the source show the resolved
>>>>>> entity: in js innerHTML or innerText (for IE) and View selection
>>>>>> source (for Mozilla). Therefore I think that htmlunit behaves like
>>>>>> browsers, which is correct, and that we should handle it only in
>>>>>> webtest.
>>>>>>
>>>>>> Marc.
>>>>>>
>>>>>> Denis N. Antonioli wrote:
>>>>>>
>>>>>>> Hi
>>>>>>> I'm using htmlunit through webtest (<http://webtest.canoo.com>,
>>>>>>> for those that don't know it).
>>>>>>> In the present case, Webtest lets htmlunit generate a dom of an
>>>>>>> html page before querying the document with xpath.
>>>>>>> For example <verifyxpath xpath="/html/body/h2" text="Resultate"/>
>>>>>>> makes sure that a h2 header displays the text 'Resultate'.
>>>>>>> My problem is that, sometimes, the text I want webtest to
>>>>>>> verify uses html entities:
>>>>>>> <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;uuml;r das Team 33"/>
>>>>>>> With the help of Marc, I've found that htmlunit always generates
>>>>>>> a dom where all entities have been resolved.
>>>>>>> nekohtml seems to provide a feature
>>>>>>> (http://apache.org/xml/features/scanner/notify-builtin-refs) to
>>>>>>> tell when/where the source contains entities (see description at
>>>>>>> <http://cvs.apache.org/~andyc/neko/doc/html/settings.html#notify-builtin-html-refs>).
>>>>>>> Does someone know if it is possible to get a dom tree in which
>>>>>>> the text nodes contain entities instead of characters?
>>>>>>> Does it make sense?
>>>>>>> Was it already tried?
>>>>>>> Would it be difficult?
>>>>>>>    dna
>> -------------------------------------------------------
>> SF email is sponsored by - The IT Product Guide
>> Read honest & candid reviews on hundreds of IT Products from real
>> users.
>> Discover which products truly live up to the hype. Start reading now.
>> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> _______________________________________________
>> Htmlunit-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlunit-user
-- 
in a flash of inspiration she had discovered the comic possibilities of
the semi-colon, and of this she had made abundant and exquisite use.
   -- W. Somerset Maugham, The creative impulse
