From: Steve B. <Ste...@zv...> - 2003-02-17 20:24:20
Moisei RABINOVITCH wrote:

> I would say that after 4 years experience of working with TCL and DOM XML I
> feel completely lost between different versions of TCL,XML,DOM etc.
> I am reading the posts to this list regularly and I understand that I am not
> alone with this feeling...

Any suggestions for improved documentation on the website are welcome. I try to make conference papers, etc., available to help fill in people's knowledge.

> OK, here is my question and maybe somebody could show me the light at the end
> of this tunnel:
>
> my environment is Windows 2000
> I am ready to use any version of TCL 8.* (I would prefer 8.3)
> I am ready to use any version of XML/DOM parser (I would prefer compiled
> version rather than pure TCL because of performance issue)
>
> what I need is to parse Unicode XML file attached below.
> The file is Unicode - UTF-16 file that contains Hebrew letters.

Roy Nurmi has sent me (private) email over the last couple of weeks to address the issue of handling documents with different character encodings. The problems stem from the fact that Tcl (very nicely) handles the encodings when the document is read, but then libxml2 also attempts to transcode the characters. This can be further confused if the XML document itself contains an 'encoding' attribute in the XML declaration, i.e.

<?xml version='1.0' encoding='utf-16'?>

We have been tossing around some ideas to address the problem, and Roy has his own workarounds. Unfortunately I have been rather busy over the last couple of weeks and so have not been able to give this matter my full attention.

> Please, if somebody (Bill, Andreas?) could help me to find a properly
> pre-compiled version of the needed software, I would very much appreciate this.

Try TclDevKit from ActiveState. I just downloaded v2.0 beta2 yesterday and it has the latest CVS HEAD changes. That is, although it has a TclDOM/libxml2 package that identifies itself as v2.5, it is actually more advanced than that and is closer to v2.6.
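
To illustrate the encoding clash described above: once Tcl has decoded the file, the string is no longer UTF-16, yet its XML declaration may still say encoding='utf-16'. A minimal sketch of one possible workaround, assuming the parser is handed the already-decoded string, is to rewrite the declaration so that it matches what is actually passed on. The proc name normalizeXmlDecl is hypothetical and is not part of TclXML or TclDOM, and whether this alone is enough for the libxml2 binding is untested here.

```tcl
# Sketch only. normalizeXmlDecl is a hypothetical helper, not a TclXML/TclDOM command.
# It makes the XML declaration of an already-decoded string name a chosen encoding.
proc normalizeXmlDecl {xml {newEncoding utf-8}} {
    # Drop a leading byte-order mark if the channel left one in the string.
    set xml [string trimleft $xml \uFEFF]
    # Rewrite encoding='...' (or "...") inside the leading <?xml ...?> declaration.
    # If there is no such declaration, the string is returned unchanged.
    regsub {^(<\?xml[^>?]*encoding=)(["'])[^"']*(["'])} $xml \
        "\\1\\2$newEncoding\\3" xml
    return $xml
}

# In-memory stand-in for the decoded contents of a UTF-16 file with Hebrew text.
set sample "\uFEFF<?xml version='1.0' encoding='utf-16'?><doc>\u05E9\u05DC\u05D5\u05DD</doc>"
set fixed [normalizeXmlDecl $sample]

# If a library wants UTF-8 bytes rather than a Tcl string (an assumption),
# the standard "encoding convertto" command produces them.
set bytes [encoding convertto utf-8 $fixed]
```

In the quoted script, the helper would be applied to the result of [read $fh] before the string is given to the parser.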
[BTW, I'm planning on changing my policy on management of the CVS repository to provide more stability.]

> 1. I am able to view the file in IE (6) with all Hebrew letters
>
> 2. I am able to read/write this file through TCL and file is not changed:
> -----------
> set f {c:/1.xml}
> set fh [open $f]
> fconfigure $fh -encoding unicode -translation auto
> set xml [read $fh]
> close $fh
>
> set f {c:/2.xml}
> set fh [open $f "w"]
> fconfigure $fh -encoding unicode -translation auto
> puts -nonewline $fh $xml
> close $fh
> -----------
> The file 2.xml is completely the same as 1.xml

That's what I would expect, since Tcl is very good at interpreting different character encodings.

> 3. I am unable to parse this file in any TCL XML/DOM parser known to me:
>
> TCL 8.3.5 from ActiveState:
<snip/>

Well, I'm not surprised. As I said to Roy, I have pretty much punted on character encodings up until now. We need to check the libxml2 API to make sure that the data is passed in a correct encoding (utf-8) and that the library knows which encoding is being used.

Cheers,
Steve Ball

--
Steve Ball            |   XSLT Standard Library   | Training & Seminars
Zveno Pty Ltd         |     Web Tcl Complete      | XML XSL Schemas
http://www.zveno.com/ |      TclXML TclDOM        | Tcl, Web Development
Ste...@zv...          +---------------------------+---------------------
Ph. +61 2 6242 4099   | Mobile (0413) 594 462     | Fax +61 2 6242 4099