From: Nuno L. <lu...@nl...> - 2004-09-20 23:01:04
Attachments:
BOM-fix.patch
|
This is a quick fix to avoid stupid errors when stupid editors decide to add a Byte Order Mark (BOM) to the config file. Our weak XML library doesn't like that.

I confess I haven't tested it, but it is trivial enough.

Regards,
~Nuno Lucas
|
From: Martin K. <ka...@po...> - 2004-09-21 15:47:53
|
Hi,

in one package I used to parse XML, the BOM was defined differently for UTF-16 and UTF-8. For UTF-8 it is: \xEF \xBB \xBF

Martin

Nuno Lucas wrote:
> This is a quick fix to avoid stupid errors when stupid editors decide to
> add a Byte Order Marker to the config file. Our weak XML library doesn't
> like that.
....
>
> +    /* Check presence of a BOM marker.
> +     * Our XML library doesn't like Byte Order Markers */
> +    if ( (text[0] == '\xFF' && text[1] == '\xFE')
> +      || (text[0] == '\xFE' && text[1] == '\xFF') )
> +        text += 2; // skip it
> +
|
From: Nuno L. <lu...@nl...> - 2004-09-21 16:14:23
|
Martin Kanich, jumping for joy, wrote:
> Hi,
>
> in one package i used to parse XML was BOM defined differently for
> UTF-16 and UTF-8. for UTF8: \xEF \xBB \xBF

Hmm, that's the result of patching without thinking too much about it. Disregard this patch; this time I will read the Unicode spec before posting a new one.

Thanks for the heads up.

Regards,
~Nuno Lucas
|
From: Sean B. <sea...@so...> - 2004-09-21 18:20:58
|
Hi,

I think the patch is correct. See here:
http://www.unicode.org/faq/utf_bom.html#22

Also, take a look here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

Scroll down to the part that reads:

  It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
  as a signature to mark the beginning of a UTF-8 file. This practice
  should definitely not be used on POSIX systems for several reasons

etc.

HIH

Nuno Lucas wrote:
> Martin Kanich, jumping for joy, wrote:
>
>> Hi,
>>
>> in one package i used to parse XML was BOM defined differently for
>> UTF-16 and UTF-8. for UTF8: \xEF \xBB \xBF
>
>
> mm, that's the result of patching without thinking too much on it.
> Disregard this patch, I will read the unicode spec before posting a new
> one, this time.
>
> Thanks for the heads up.
>
> Regards,
> ~Nuno Lucas
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> Project Admins to receive an Apple iPod Mini FREE for your judgement on
> who ports your project to Linux PPC the best. Sponsored by IBM.
> Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> _______________________________________________
> coLinux-devel mailing list
> coL...@li...
> https://lists.sourceforge.net/lists/listinfo/colinux-devel
>
|
From: Nuno L. <lu...@nl...> - 2004-09-21 18:39:18
|
Sean Brook, jumping for joy, wrote:
> Hi,
>
> I think the patch is correct. See here:
> http://www.unicode.org/faq/utf_bom.html#22

Unfortunately not, as I explain below.

> Also, take a look here:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
>
> scroll down to the part that reads:
> It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
> as a signature to mark the beginning of a UTF-8 file. This practice
> should definitely not be used on POSIX systems for several reasons etc

The patch is intended to work around those "brain dead" editors that add a BOM to a UTF-8 encoded file, not to write one ourselves. The XML library we use is a very simple one, and chokes on this.

We don't support UCS-2, UTF-16 or UTF-32 encoded XML files, so there is no need to check for other BOMs, as parsing will probably fail anyway.

Regards,
~Nuno Lucas
|
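For readers following the thread, here is a minimal sketch in C of a check aimed at the UTF-8 BOM only, which is what the message above says the fix should cover. The helper name `skip_utf8_bom` and its signature are assumptions made for illustration; this is not the content of BOM-fix.patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from BOM-fix.patch): returns a pointer just
 * past a leading UTF-8 BOM (EF BB BF), or `text` unchanged if there is none.
 * `len` is the number of valid bytes in `text`. */
const char *skip_utf8_bom(const char *text, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(text != NULL);
    /* memcmp compares bytes as unsigned char, so the signedness of plain
     * char does not matter here. */
    if (len >= sizeof bom && memcmp(text, bom, sizeof bom) == 0)
        return text + sizeof bom;
    return text;
}
```

A caller would pass the loaded config buffer and its length, then hand the returned pointer to the XML parser. The length check keeps the comparison from reading past the end of a file shorter than three bytes.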
From: Martin K. <ka...@po...> - 2004-09-22 06:24:55
|
> > Also, take a look here:
> > http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
> >
> > scroll down to the part that reads:
> > It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
> > as a signature to mark the beginning of a UTF-8 file. This practice
> > should definitely not be used on POSIX systems for several reasons etc

Well, you have to read it on the Windows side, too :-) I mean, if you have coLinux for Windows. And, as already mentioned, it is the reader's problem to get at the content.

> The patch is intended to fix those "brain dead" editors that add a BOM
> to an UTF-8 encoded file, not to encode one ourselfs.
> The XML library we use is a very simple one, and chokes on this.

Brain dead editors just ignore the fact that an XML document written on a big-endian system has a different meaning than on a little-endian one. So, if your Notepad on WinXP just ignores this fact, you're using the M$ Ignorant Editor (note that Notepad can now do UTF-8).

Regards,
Martin
|
From: Nuno L. <lu...@nl...> - 2004-09-22 17:55:13
|
Martin Kanich, jumping for joy, wrote:
>> The patch is intended to fix those "brain dead" editors that add a BOM
>> to an UTF-8 encoded file, not to encode one ourselfs.
>> The XML library we use is a very simple one, and chokes on this.
>
> Brain dead editors just ignore the fact, that the xml document written
> on Big-Endiand system has another sense as on Low-Endian. So, if your
> Notepad on WinXP just ignore this fact, you're using the M$ Ignorant
> Editor (note, that notepad can now UTF8).

I'm not sure I understood your mail, so I just wanted to add that a UTF-8 encoded file is always the same, on both big- and little-endian systems.

That's why it's idiotic to add a BOM to an already platform-independent file. The BOM is, of course, always the same: "\xEF\xBB\xBF".

Regards,
~Nuno Lucas
|
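The point about endianness can be illustrated with a small sketch. The two function names below are made up for this example. UTF-8 is defined as a byte sequence, so U+FEFF always encodes to the same three bytes, while a UTF-16 code unit has to be written in one byte order or the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a BMP code point (U+0000..U+FFFF, surrogates excluded by the
 * caller) as UTF-8. Returns the number of bytes written (1 to 3).
 * The result depends only on the code point, never on host byte order. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Serialize one UTF-16 code unit. Here the writer must pick a byte order,
 * which is exactly what a UTF-16 BOM lets the reader find out. */
void utf16_encode_unit(uint16_t unit, int big_endian, unsigned char out[2])
{
    assert(out != NULL);
    out[big_endian ? 0 : 1] = (unsigned char)(unit >> 8);
    out[big_endian ? 1 : 0] = (unsigned char)(unit & 0xFF);
}
```

Encoding U+FEFF with `utf8_encode_bmp` gives EF BB BF on any machine, whereas `utf16_encode_unit` gives FE FF or FF FE depending on the order requested. That is why the UTF-8 form carries no byte-order information and can only act as an encoding signature.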
From: Sam L. <sa...@li...> - 2004-09-23 00:24:20
|
Nuno Lucas wrote:
> Martin Kanich, jumping for joy, wrote:
>
>>> The patch is intended to fix those "brain dead" editors that add a BOM
>>> to an UTF-8 encoded file, not to encode one ourselfs.
>>> The XML library we use is a very simple one, and chokes on this.
>>
>>
>> Brain dead editors just ignore the fact, that the xml document written
>> on Big-Endiand system has another sense as on Low-Endian. So, if your
>> Notepad on WinXP just ignore this fact, you're using the M$ Ignorant
>> Editor (note, that notepad can now UTF8).
>
>
> I'm not sure I understood your mail, so just wanted to add that an UTF-8
> encoded file is always the same, either in big- or little-endian
> systems.
>
> That's why it's idiot to add a BOM to an already platform independent
> file. The BOM is, off course, the same: "\xEF\xBB\xBF".

Notepad doesn't add a BOM if you save the file as ANSI text.

If you save as UTF text, it does add the BOM, merely as an encoding indicator, so that applications know how to decode some of the upper characters: as UTF-8 instead of according to some default code page.

Sam
|
From: Doctor B. <do...@gm...> - 2004-09-23 13:34:17
|
The purpose of the BOM here is not to tell the reader the byte ordering, but to tell the reader that the file is UTF-8.

The XML 1.0 spec tells us that the encoding of a UTF-8 file without a BOM will be recognized as follows:

  UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any
  other 7-bit, 8-bit, or mixed-width encoding which ensures that the
  characters of ASCII have their normal positions, width, and values;
  the actual encoding declaration must be read to detect which of these
  applies, but since all of these encodings use the same bit patterns
  for the relevant ASCII characters, the encoding declaration itself may
  be read reliably

The net effect is that, without the BOM, if you want to test whether you can decode the XML file, you have to actually start parsing and process the encoding parameter, rather than just checking the first few bytes.

Processing an encoding parameter is itself a pain, since there is no universal standard for how the contents of the string are converted into a locale name. This varies by platform, and sometimes even by revision of the platform. Does that encoding map to a locale of "English", "en_US", or "en"?

Consequently, correctly implementing a parser that supports the encoding attribute often requires installing an extra library, while correctly parsing the BOM requires little more than a switch block.

Bill

On Thu, 23 Sep 2004 01:24:32 +0100, Sam Liddicott <sa...@li...> wrote:
> Nuno Lucas wrote:
>
> > Martin Kanich, jumping for joy, wrote:
> >
> >>> The patch is intended to fix those "brain dead" editors that add a BOM
> >>> to an UTF-8 encoded file, not to encode one ourselfs.
> >>> The XML library we use is a very simple one, and chokes on this.
> >>
> >>
> >> Brain dead editors just ignore the fact, that the xml document written
> >> on Big-Endiand system has another sense as on Low-Endian. So, if your
> >> Notepad on WinXP just ignore this fact, you're using the M$ Ignorant
> >> Editor (note, that notepad can now UTF8).
> >
> >
> > I'm not sure I understood your mail, so just wanted to add that an UTF-8
> > encoded file is always the same, either in big- or little-endian
> > systems.
> >
> > That's why it's idiot to add a BOM to an already platform independent
> > file. The BOM is, off course, the same: "\xEF\xBB\xBF".
>
> Notepad doesn't add BOM if you save the file as ANSI text.
>
> If you save as UTF text it does add the BOM merely as an encoding
> indicator so that applications know how to decode some of the upper
> characters - as UTF8 instead of according to some default code page.
>
> Sam
>
|
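The "little more than a switch block" remark can be made concrete with a rough sketch. The enum and function names are invented for this example, and UTF-32 BOMs are deliberately left out to keep it short, so a UTF-32 little-endian file (which starts with FF FE 00 00) would be reported here as UTF-16 little-endian.

```c
#include <assert.h>
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Classify the first bytes of a buffer. On return, *bom_len holds the
 * number of BOM bytes to skip (0 when no BOM was recognised).
 * Assumption for brevity: UTF-32 BOMs are not distinguished. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);
    *bom_len = 0;
    if (len < 2)
        return BOM_NONE;

    switch (buf[0]) {
    case 0xEF:                      /* UTF-8 signature: EF BB BF */
        if (len >= 3 && buf[1] == 0xBB && buf[2] == 0xBF) {
            *bom_len = 3;
            return BOM_UTF8;
        }
        break;
    case 0xFE:                      /* UTF-16 big-endian: FE FF */
        if (buf[1] == 0xFF) {
            *bom_len = 2;
            return BOM_UTF16_BE;
        }
        break;
    case 0xFF:                      /* UTF-16 little-endian: FF FE */
        if (buf[1] == 0xFE) {
            *bom_len = 2;
            return BOM_UTF16_LE;
        }
        break;
    default:
        break;
    }
    return BOM_NONE;
}
```

A loader that only supports UTF-8 could skip `*bom_len` bytes for `BOM_UTF8`, accept `BOM_NONE` as is, and reject the UTF-16 cases with a clear message instead of a generic load failure.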
From: Nuno L. <lu...@nl...> - 2004-09-24 12:54:12
|
I don't think people understand the problem here.

All we want is to avoid having people complain that coLinux isn't booting because of a mysterious error (can't load file).

To achieve this goal we have (at least) the following options:

1- Follow the XML specification:

1.a) Use another XML library that would give us a standards-compliant implementation.
1.b) Patch the XML library ourselves.

2- Don't follow the XML specification:

2.a) Just handle the few issues users face with small correction code.
2.b) Stop using the XML library and use other formats (like basic .rc files or even .ini files).

IMHO, I would never go for 1.a), because I don't see why we should need a 1MB library (I'm exaggerating, of course) just to read config files.

1.b) would be just dumb, as we would never reach full compliance without very hard work, and it would probably end up as 1.a) anyway.

2.a) was the intention of my patch: just make sure we don't get users complaining, by ignoring the UTF-8 BOM. If users decide to use UCS-2, UTF-16 or UTF-32, that's their problem, as they should know better.

2.b) would be the medium-term solution, but I don't see any reason to modify the code just for this. If anyone feels it's a must, they are always free to send a patch.

Regards,
~Nuno Lucas
|
From: Dan A. <da...@co...> - 2004-10-01 06:44:03
|
On Fri, Sep 24, 2004 at 01:54:00PM +0100, Nuno Lucas wrote:
> I don't think people understand the problem here.
> All we want is to avoid to have people complain colinux isn't
> booting because of a mysterious error (can't load file).
> To achieve this goal we have (at least) the following options:
>
> 1- Follow the XML specification:
>
> 1.a) Use another XML library that would give us a standard compliant
> implementation.
> 1.b) Patch the XML library ourselves.

How about mailing the library's maintainer? Hopefully the next version would still be light, but with fewer bugs in it.

--
Dan Aloni
da...@co...
|
From: Henry N. <Henry.Ne@Arcor.de> - 2004-10-01 17:24:14
|
Dan Aloni wrote:
> On Fri, Sep 24, 2004 at 01:54:00PM +0100, Nuno Lucas wrote:
>
>>I don't think people understand the problem here.
>>All we want is to avoid to have people complain colinux isn't
>>booting because of a mysterious error (can't load file).
>>To achieve this goal we have (at least) the following options:
>>
>>1- Follow the XML specification:
>>
>>1.a) Use another XML library that would give us a standard compliant
>> implementation.
>>1.b) Patch the XML library ourselves.
>
>
> How mailing the library's maintainer? Hopefully the next version
> would be still light but with less bugs in it.
>

Why don't we use a standard XML library, linked as a static library? I think this would work better than this small version of mxml. http://mxml.sourceforge.net/ says it's only a "DOM oriented library".

See also a good comparison covering BOM and DOM:
http://www.oracle.com/technology/pub/notes/technote_encodings.html

Currently we use UTF-8 without a BOM. If we accept a BOM, we also need a reader for 2 bytes per character. See "What are some of the differences between the UTFs" in
http://www.unicode.org/unicode/faq/utf_bom.html

I don't have an XML editor, but I found that we can force the encoding in the header:

<?xml version='1.0' encoding='UTF-8' ?>

This should make all editors produce the right format for us. The UTF-8 BOM is EF BB BF.

Can anybody send me a UTF-16 file created with an XML editor? (Please zip the file before sending, so I can see all the bytes!)

--
Henry Nestler

XML, for German readers only:
http://www.sql-und-xml.de/xml-lernen/internationalisierung-unicode-sonderzeichen.html
|
From: Nuno L. <lu...@nl...> - 2004-10-02 08:49:04
|
Dan Aloni, jumping for joy, wrote:
> How mailing the library's maintainer? Hopefully the next version
> would be still light but with less bugs in it.

My idea would be to just drop XML files in favor of simple .ini/.cfg files. The code needed to implement it is trivial enough (I have generic .ini file reader code somewhere and don't mind providing it; the only problem is that it is in C++).

Even if we use a fully XML-compliant library, we would need to check the encoding of the host OS file names and do the appropriate conversions.

Different Windows versions use different encodings, depending on the country. So, if we use Notepad to edit our config file, we are using the local encoding of the OS (CP-1250 and local variants, in most places).

On Linux it is the same, as we can have a full UTF-8 system (like RedHat, if I remember right), and a file open operation will expect a UTF-8 path.

So, in the end, we need an XML library that will return us a path encoded in the local OS "code page", or use something like Unicode, which is really many standards (choose from UTF-8, UTF-16, UTF-32, UCS-2, ...). But there are not many editors that do the conversions right.

If we use simple .CFG files (I prefer this suffix over .INI), we can just specify that they need to be pure ASCII (7-bit) or encoded with the local OS encoding (by using Notepad, for example).

Then we can open a file without problems, because if the user edited it with Notepad (or vi), it will be encoded in the same way as the OS file system.

The only difference is that we can't exchange .cfg files safely between systems anymore (as we could with XML), but that doesn't matter to us.

Another disadvantage concerns XML processing, as it would be easier to implement a config file editor when using an XML config file. But that matters more for complex config files (like Makefiles), and doesn't buy us anything in the simple case.

Now we have one less dependency and "we are in control" ;)

I really think that for simple tasks like this, XML files are overkill, and end up giving more headaches than their use justifies (no more end tag and BOM problems).

If everyone agrees on this, I don't mind implementing it. If C++ is undesirable for the job, I'll do it in C.

An example "config.cfg" file:
--------------------------------------------------------------
# NOTES:
#
# I'm starting from cobd1, because I also think this could be
# a good place to start reserving cobd0 for other things (like the
# fake partition table we could implement later).
#
# No need for a "\DosDevice\" prefix, as our loader is a bit smarter.
#
# Make file references relative to the config file location (so we can
# have a generic config file and simply copy it to another directory
# with other images with same name)
#
[Kernel]
bootparams = root=/dev/cobd0 ro
initrd = initrd.gz
vmlinux = vmlinux
memory = 64

[Block Devices]
cobd1 = Gentoo-2gb.ext2
# To disable a device, just comment it ;)
#cobd2 = ../debian/Debian-1gb.ext2
cobd3 = \Device\Cdrom0
#cobd3 = /dev/cdrom # Linux equivalent
cobd4 = \Device\HarddiskVolume7
#cobd4 = /dev/hda7 # Linux equivalent
cobd5 = swap_768MB.sw

[Network]
eth0 = CoLinux TAP
#eth1 = WinPCAP

[CoLinux TAP]
name = Local Network 2
type = tap
mac = 00:FF:11:22:33:44
enabled = true

[WinPCAP]
name = Local Network 3
type = bridge
mac = 00:FF:11:22:33:44
enabled = false
--------------------------------------------------------------

This is just a sample syntax. It is much more readable and easier to edit than any XML file.

Waiting for your comments...

~Nuno Lucas
|
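To give a feel for how small a reader for the proposed syntax could be, here is a rough sketch of a per-line parser in C. It is not the C++ reader mentioned in the message; the names `cfg_parse_line` and `enum cfg_line` are invented, and it assumes that only whole-line `#` comments are supported, so a `#` that follows a value is kept as part of that value.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum cfg_line { CFG_BLANK, CFG_SECTION, CFG_PAIR, CFG_ERROR };

/* Trim leading and trailing whitespace in place; returns the new start. */
static char *trim(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Parse one line of the sample .cfg syntax, modifying `line` in place.
 *   CFG_BLANK:   empty line or whole-line '#' comment
 *   CFG_SECTION: *key points at the section name, e.g. "Block Devices"
 *   CFG_PAIR:    *key and *value point at the trimmed key and value
 *   CFG_ERROR:   anything else
 * The value is split at the first '=', so "bootparams = root=/dev/cobd0 ro"
 * keeps the second '=' inside the value. */
enum cfg_line cfg_parse_line(char *line, char **key, char **value)
{
    char *s, *eq, *rb;

    assert(line != NULL && key != NULL && value != NULL);
    *key = *value = NULL;
    s = trim(line);
    if (*s == '\0' || *s == '#')
        return CFG_BLANK;
    if (*s == '[') {
        rb = strchr(s, ']');
        if (rb == NULL || rb[1] != '\0')
            return CFG_ERROR;
        *rb = '\0';
        *key = trim(s + 1);
        return CFG_SECTION;
    }
    eq = strchr(s, '=');
    if (eq == NULL)
        return CFG_ERROR;
    *eq = '\0';
    *key = trim(s);
    *value = trim(eq + 1);
    return **key ? CFG_PAIR : CFG_ERROR;
}
```

A caller would read the file with `fgets`, pass each line to `cfg_parse_line`, remember the most recent section name, and store each pair under it. Because the bytes of keys and values are passed through untouched, paths stay in whatever local encoding the editor used, which is the property the proposal relies on.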
From: Dan A. <da...@co...> - 2004-10-02 09:00:23
|
On Sat, Oct 02, 2004 at 09:49:13AM +0100, Nuno Lucas wrote:
> Dan Aloni, jumping for joy, wrote:
> >How mailing the library's maintainer? Hopefully the next version
> >would be still light but with less bugs in it.
>
> My idea would be to just drop xml files in favor of simple .ini/.cfg
> files. The code needed to implement it is trivial enough (I have a
> generic .ini file reader code somewhere, don't mind to provide it, only
> problem is in C++).
>
>[..]
>
> This is just a sample sintax. This is much more readable and easier to
> edit than any XML file.

Well, if we are going to retreat from the XML configuration solution, we should rather come up with something compatible with User Mode Linux, i.e., use no configuration file and pass all configuration on the command line.

The reason is that User Mode Linux is the closest kind of virtual Linux implementation to coLinux, and it already has a large established user base. Even simple command-line and interface compatibility could attract a large number of User Mode Linux users and developers.

--
Dan Aloni
da...@co...
|
From: Henry N. <Henry.Ne@Arcor.de> - 2004-10-03 14:33:05
|
Dan Aloni wrote:
> On Sat, Oct 02, 2004 at 09:49:13AM +0100, Nuno Lucas wrote:
>
>>Dan Aloni, jumping for joy, wrote:
>>
>>>How mailing the library's maintainer? Hopefully the next version
>>>would be still light but with less bugs in it.
>>
>>My idea would be to just drop xml files in favor of simple .ini/.cfg
>>files. The code needed to implement it is trivial enough (I have a
>>generic .ini file reader code somewhere, don't mind to provide it, only
>>problem is in C++).
>>
>>[..]
>>
>>This is just a sample sintax. This is much more readable and easier to
>>edit than any XML file.
>
>
> Well, if we are going to retreat from the XML configuration solution, we
> should rather come up with something compatible to User Mode Linux, i.e,
> use no configuration file and pass all configuration using the command
> line.
>
> The reason is that User Mode Linux is the closest kind of a virtual Linux
> implementation to coLinux, and it already has a large established user
> base. Even a simple command line and interface compatibility can attract
> a large amounts of User Mode Linux users and developers.
>

I think INI or CFG files are easier to read and edit for everyone, on Linux and Windows, and simple to parse. In my case, if I click on an XML file, Windows shows only an empty file :-(, so I must always use "Open with ...".

Command-line options should always override a config file option.

An INI file is very welcome.

--
Henry Nestler
|
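The two suggestions above (command-line configuration, and command-line options taking precedence over the config file) could be combined roughly as follows. This is only a sketch: the `struct setting` type, the `apply_overrides` function and the `key=value` argument form are assumptions for illustration, and are not the actual option syntax of coLinux or User Mode Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one option loaded from a config file. */
struct setting {
    const char *key;
    const char *value;
};

/* Apply `key=value` command-line arguments on top of settings that were
 * already loaded from a config file. Arguments without '=' or with an
 * unknown key are skipped. Returns the number of overrides applied.
 * The new values point into argv, which outlives the call in a normal
 * program. */
int apply_overrides(struct setting *settings, size_t count,
                    int argc, char **argv)
{
    int applied = 0;
    int i;
    size_t j;

    assert(settings != NULL && argv != NULL);
    for (i = 1; i < argc; i++) {        /* argv[0] is the program name */
        const char *eq = strchr(argv[i], '=');
        size_t klen;

        if (eq == NULL || eq == argv[i])
            continue;
        klen = (size_t)(eq - argv[i]);
        for (j = 0; j < count; j++) {
            if (strlen(settings[j].key) == klen &&
                strncmp(settings[j].key, argv[i], klen) == 0) {
                settings[j].value = eq + 1;   /* command line wins */
                applied++;
                break;
            }
        }
    }
    return applied;
}
```

With a table such as `{ "mem", "64" }` loaded from the file, running the program with the argument `mem=128` would leave the setting at `"128"`. Running with no config file at all then becomes the special case where every setting starts from a built-in default.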