From: Denis N. A. <den...@ca...> - 2005-04-05 21:04:14
we now have

(3) <verifyxpath xpath="/html/body/h2" text="Resultate f&#x00fc;r das Team 33" />
(4) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;#x00fc;r das Team 33" />

Solution (3) is equivalent to solution (1), in that the webtest method only sees the single character 'ü', whereas in solution (4) the test method sees the eight characters '&' '#' 'x' '0' '0' 'f' 'c' ';' and must convert this character sequence to the single 'ü'.

Isn't it interesting that, as Brad noted, his e-mail application transformed (3) into (1)? It shows that keeping text valid across platforms entails far more than 'a character-set declaration and a good editor': I've also seen text files corrupted when transferred over cvs (i.e. check in from a pc, check out on a solaris account), copied on a usb stick, filtered with ant, ...

So the experts' opinion seems to be not to change htmlunit and to 'play' with the webtest's parameters. I'll try this :-)

Thank you for your help!
dna

On 5 Apr. 05, at 09:23, Marc Guillemot wrote:

> Hi,
>
> I think that there are 2 issues here: entity resolution and character encoding.
>
> Conversion from &uuml; to ü is "only" entity resolution and I don't think that the character encoding matters here (except that my email is iso-8859-1 encoded).
> Solution (3) should probably have been:
> (3) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;#x00fc;r das Team 33" />
> In this case when the text is interpreted as html, the ü is specified with its unicode code but it doesn't mean that the character encoding of the file is unicode.
>
> I agree with Brad that it should be the job of webtest to convert (2) or (3) to (1). As for the character encoding of ant files, it should be placed in the first line of the file like:
> <?xml version="1.0" encoding="ISO-8859-1"?>
> and serious editors should take care of it on all OS.
>
> The issue that could concern htmlunit would be: do we want to have the possibility to know how the source of a node was written or are we only interested in its representation in the page's encoding? Personally I don't see the interest in being able to determine if for instance a <b>für</b> was coded as <b>für</b>, <b>f&uuml;r</b> or <b>f&#x00fc;r</b>. The only case in which I could imagine that it may be of interest is for someone who would like to ensure that ONLY ascii characters are used in pages (but why?). In this case I would rather perform the test on the whole page's code rather than on single nodes.
>
> Marc.
>
> Brad Clarke wrote:
>> It seems like (1) would be the appropriate thing to do. While I understand your cross-platform concern I think the appropriate place to address that concern would be as close to the problem as possible: create a way for webtest to read something like (2) or (3) and turn it into (1) before passing it along to htmlunit (I know little of webtest or ant's encoding issues but being xml based I'm actually quite surprised you can't do this already).
>> Also, after briefly looking at the character set determination in htmlunit it's entirely possible that we're interpreting something incorrectly since I don't see anything that will use a char set in the html file (which seems a little backwards anyway but it should be possible).
>> Brad C
>> PS Yahoo Mail converted your (3) into (1) when I replied to this message! :D
>> --- "Denis N. Antonioli" <den...@ca...> wrote:
>>> what I really want is to write simple, robust tests that handle extended character sets (internationalization?) without excessive magic.
>>>
>>> In the present situation, I have an html page that comes from the server with an encoding and gets translated by htmlunit into a different encoding.
>>> As far as I know, that last encoding can be determined by the web server, by the content of the html page or by some default.
>>> On the other side, I write tests in xml documents which are again translated by an xml parser (ant) into a different encoding.
>>>
>>> The first problem I have is: how to code non-ascii characters (such as an uuml or a nbsp) in the xml document?
>>>
>>> (1) <verifyxpath xpath="/html/body/h2" text="Resultate für das Team 33" />
>>> (2) <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;uuml;r das Team 33" />
>>> (3) <verifyxpath xpath="/html/body/h2" text="Resultate für das Team 33" />
>>>
>>> Then comes: And how to do it in a way that makes the xml document robust in a multi-platform environment? I have seen enough files with ü or é written on a PC becoming unreadable once moved to, e.g., a Mac or a Linux computer. I'd rather not use solution (1).
>>>
>>> So, in solution (2), there is a comparison of the ascii characters '&uuml;', which will most probably be the same in all situations.
>>> In solution (3), am I sure that htmlunit will always translate to unicode?
>>>
>>> Best
>>> dna
>>>
>>> On 1 Apr. 05, at 21:09, Brad Clarke wrote:
>>>
>>>> I don't really understand why you'd want this. Once the document has already been parsed why does it matter if it was an entity or not? What exactly are you testing that you need to know an entity was used?
>>>>
>>>> The test failures are a known bug in a supporting library that will be fixed when that library is released again.
>>>>
>>>> Brad C
>>>>
>>>> --- "Denis N. Antonioli" <den...@ca...> wrote:
>>>>
>>>>> Hi Marc
>>>>>
>>>>> I had to find the time to look into it before posting again...
>>>>>
>>>>> On a practical level: I locally modified the htmlunit head from cvs to behave as proposed, and the changes amount to less than 10 statements in the production code, including a setter/getter pair. The included unit tests, at least, show no impact on the rest of the functionality.
>>>>>
>>>>> I see your point, most (all?) common browsers give us only a view of the original document. I do think though that the functionality belongs to htmlunit for three main reasons.
>>>>>
>>>>> First, I find it easier to leave the entities untouched, rather than converting them to characters in htmlunit and converting the characters back to entities in webtest.
>>>>>
>>>>> Second, the two conversions won't be equivalent to the original document. Should the original document contain a mix of entities and characters, the final document will contain only characters (as now) or only entities.
>>>>>
>>>>> Third, the need for unconverted entities may arise for other users of htmlunit. The necessary conversion code would then be replicated.
>>>>>
>>>>> By the way, I could not test htmlunit with maven 1.0.2, maven complained about attempting to execute scripts that had been garbage collected. Is this known?
>>>>>
>>>>> Best
>>>>> dna
>>>>>
>>>>> On 25 March 05, at 13:53, Marc Guillemot wrote:
>>>>>
>>>>>> Hi Denis,
>>>>>>
>>>>>> I now think that htmlunit should resolve the entities as it does and that it would be wrong to have the entity code "as is".
>>>>>> My motivation comes from the comparison with browsers: except view source, which is comparable to WebResponse.getContentAsString(), the different methods to access the source show the resolved entity: in js innerHTML or innerText (for IE) and View selection source (for Mozilla). Therefore I think that htmlunit behaves like browsers, which is correct, and that we should handle it only in webtest.
>>>>>>
>>>>>> Marc.
>>>>>>
>>>>>> Denis N. Antonioli wrote:
>>>>>>
>>>>>>> Hi
>>>>>>> I'm using htmlunit through webtest (<http://webtest.canoo.com>, for those that don't know it).
>>>>>>> In the present case, Webtest lets htmlunit generate a dom of an html page before querying the document with xpath.
>>>>>>> For example <verifyxpath xpath="/html/body/h2" text="Resultate"/> makes sure that a h2 header displays the text 'Resultate'.
>>>>>>> I have the problem that, at times, the text I want webtest to verify is using html entities:
>>>>>>> <verifyxpath xpath="/html/body/h2" text="Resultate f&amp;uuml;r das Team 33"/>
>>>>>>> With the help of Marc, I've found that htmlunit is always generating a dom where all entities have been resolved.
>>>>>>> nekohtml seems to provide a feature (http://apache.org/xml/features/scanner/notify-builtin-refs) to tell when/where the source contains entities (see description at <http://cvs.apache.org/~andyc/neko/doc/html/settings.html#notify-builtin-html-refs>).
>>>>>>> Does someone know if it is possible to get a dom tree in which the text nodes contain entities instead of characters?
>>>>>>> Does it make sense?
>>>>>>> Was it already tried?
>>>>>>> Would it be difficult?
>>>>>>> dna
>> _______________________________________________
>> Htmlunit-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlunit-user

--
in a flash of inspiration she had discovered the comic possibilities of the semi-colon, and of this she had made abundant and exquisite use.
-- W. Somerset Maugham, The creative impulse
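
The two issues separated in this thread (entity resolution and character encoding) can be illustrated with a small sketch. It is written in Python rather than with ant/webtest, which is purely an assumption for illustration: xml.etree.ElementTree stands in for the XML parser that reads the test file, and html.unescape stands in for the extra conversion step that webtest would have to perform. Nothing here is webtest or htmlunit code.

```python
import html
import xml.etree.ElementTree as ET

# Numeric character reference: the XML parser resolves it, so the
# test step only ever sees the single character u-umlaut.
step3 = ET.fromstring(
    '<verifyxpath xpath="/html/body/h2" text="Resultate f&#x00fc;r das Team 33" />'
)
text3 = step3.get("text")

# Escaped ampersand: the parser hands over the literal characters
# '&#x00fc;', and a second, HTML-style conversion is needed.
step4 = ET.fromstring(
    '<verifyxpath xpath="/html/body/h2" text="Resultate f&amp;#x00fc;r das Team 33" />'
)
text4 = step4.get("text")
text4_resolved = html.unescape(text4)  # stand-in for a conversion in webtest

# Character encoding is a separate matter: the raw byte 0xFC is only
# readable when the file declares a matching encoding.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><verifyxpath text="f\xfcr" />'
text_latin1 = ET.fromstring(latin1_doc).get("text")

# The same byte without a declaration is read as (invalid) UTF-8.
try:
    ET.fromstring(b'<verifyxpath text="f\xfcr" />')
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(ascii(text3))
print(ascii(text4))
print(ascii(text4_resolved))
print(ascii(text_latin1))
print(undeclared_ok)
```

The values are printed with ascii() so that the output itself stays pure ASCII, which sidesteps the cross-platform corruption described above.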