[coLinux-devel] [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-09-20 23:01:04

Attachments: BOM-fix.patch

This is a quick fix to avoid stupid errors when stupid editors decide to
add a Byte Order Marker to the config file. Our weak XML library doesn't
like that.

I confess I haven't tested it, but it is trivial enough.

Regards,
~Nuno Lucas

[coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Martin K. <ka...@po...> - 2004-09-21 15:47:53

Hi,

In one package I used to parse XML, the BOM was defined differently for
UTF-16 and UTF-8. For UTF-8 it is: \xEF \xBB \xBF

Martin

Nuno Lucas wrote:
> This is a quick fix to avoid stupid errors when stupid editors decide to
> add a Byte Order Marker to the config file. Our weak XML library doesn't
> like that.
....
> 
> +	/* Check presence of a BOM marker.
> +	 * Our XML library doesn't like Byte Order Markers */
> +	if ( (text[0] == '\xFF' && text[1] == '\xFE')
> +			|| (text[0] == '\xFE' && text[1] == '\xFF') )
> +		text += 2;	// skip it
> +

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-09-21 16:14:23

Martin Kanich, jumping for joy, wrote:
> Hi,
> 
> In one package I used to parse XML, the BOM was defined differently for
> UTF-16 and UTF-8. For UTF-8 it is: \xEF \xBB \xBF

Hmm, that's the result of patching without thinking too much about it.
Disregard this patch; I will read the Unicode spec before posting a new
one this time.

Thanks for the heads up.

Regards,
~Nuno Lucas

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Sean B. <sea...@so...> - 2004-09-21 18:20:58

Hi,

I think the patch is correct. See here: 
http://www.unicode.org/faq/utf_bom.html#22

Also, take a look here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

scroll down to the part that reads:
It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) 
as a signature to mark the beginning of a UTF-8 file. This practice 
should definitely not be used on POSIX systems for several reasons etc

HIH


Nuno Lucas wrote:
> Martin Kanich, jumping for joy, wrote:
> 
>> Hi,
>>
>> In one package I used to parse XML, the BOM was defined differently for
>> UTF-16 and UTF-8. For UTF-8 it is: \xEF \xBB \xBF
> 
> 
> Hmm, that's the result of patching without thinking too much about it.
> Disregard this patch; I will read the Unicode spec before posting a new
> one this time.
> 
> Thanks for the heads up.
> 
> Regards,
> ~Nuno Lucas
> 
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> Project Admins to receive an Apple iPod Mini FREE for your judgement on
> who ports your project to Linux PPC the best. Sponsored by IBM.
> Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> _______________________________________________
> coLinux-devel mailing list
> coL...@li...
> https://lists.sourceforge.net/lists/listinfo/colinux-devel
>

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-09-21 18:39:18

Sean Brook, jumping for joy, wrote:
 > Hi,
 >
 > I think the patch is correct. See here:
 > http://www.unicode.org/faq/utf_bom.html#22

Unfortunately not, as I explain below.

 > Also, take a look here:
 > http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
 >
 > scroll down to the part that reads:
 > It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
 > as a signature to mark the beginning of a UTF-8 file. This practice
 > should definitely not be used on POSIX systems for several reasons etc

The patch is intended to fix those "brain dead" editors that add a BOM
to a UTF-8 encoded file, not to encode one ourselves.
The XML library we use is a very simple one, and chokes on this.

We don't support UCS2, UTF-16 or UTF-32 encoded XML files, so no need to
check for other BOMs, as it will probably fail anyway.

Regards,
~Nuno Lucas
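
A minimal sketch of the check described above, assuming the config file has already been read into a buffer of known length. It skips only the UTF-8 BOM (EF BB BF) and leaves everything else untouched. The name skip_utf8_bom and its signature are hypothetical; this is not the code from BOM-fix.patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the actual coLinux code): return a pointer
 * past a leading UTF-8 BOM if one is present, otherwise return text
 * unchanged. len is the number of valid bytes in text.
 */
const char *skip_utf8_bom(const char *text, size_t len)
{
	/* Compare as unsigned bytes so the result does not depend on
	 * whether plain char is signed on the target compiler. */
	static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

	assert(text != NULL);
	if (len >= 3 && memcmp(text, bom, 3) == 0)
		return text + 3;
	return text;
}
```

The length check matters for very short or empty files, where reading text[1] or text[2] unconditionally would run past the end of the data.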

[coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Martin K. <ka...@po...> - 2004-09-22 06:24:55

>  > Also, take a look here:
>  > http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
>  >
>  > scroll down to the part that reads:
>  > It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
>  > as a signature to mark the beginning of a UTF-8 file. This practice
>  > should definitely not be used on POSIX systems for several reasons etc

Well, you have to read it on the Windows side, too :-) I mean, if you
run coLinux on Windows.
And, as already mentioned, it is the reader's problem to get at the content.

> The patch is intended to fix those "brain dead" editors that add a BOM
> to a UTF-8 encoded file, not to encode one ourselves.
> The XML library we use is a very simple one, and chokes on this.
Brain-dead editors just ignore the fact that an XML document written
on a big-endian system means something different than on a little-endian
one. So, if your Notepad on WinXP just ignores this fact, you're using
the M$ Ignorant Editor (note that Notepad can now do UTF-8).

Regards,
Martin

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-09-22 17:55:13

Martin Kanich, jumping for joy, wrote:
>> The patch is intended to fix those "brain dead" editors that add a BOM
>> to a UTF-8 encoded file, not to encode one ourselves.
>> The XML library we use is a very simple one, and chokes on this.
> 
> Brain-dead editors just ignore the fact that an XML document written
> on a big-endian system means something different than on a little-endian
> one. So, if your Notepad on WinXP just ignores this fact, you're using
> the M$ Ignorant Editor (note that Notepad can now do UTF-8).

I'm not sure I understood your mail, so I just wanted to add that a UTF-8
encoded file is always the same, on both big- and little-endian
systems.

That's why it's idiotic to add a BOM to an already platform-independent
file. The BOM is, of course, the same: "\xEF\xBB\xBF".

Regards,
~Nuno Lucas

[coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Sam L. <sa...@li...> - 2004-09-23 00:24:20

Nuno Lucas wrote:

> Martin Kanich, jumping for joy, wrote:
> 
>>> The patch is intended to fix those "brain dead" editors that add a BOM
>>> to a UTF-8 encoded file, not to encode one ourselves.
>>> The XML library we use is a very simple one, and chokes on this.
>>
>>
>> Brain-dead editors just ignore the fact that an XML document written
>> on a big-endian system means something different than on a little-endian
>> one. So, if your Notepad on WinXP just ignores this fact, you're using
>> the M$ Ignorant Editor (note that Notepad can now do UTF-8).
> 
> 
> I'm not sure I understood your mail, so I just wanted to add that a UTF-8
> encoded file is always the same, on both big- and little-endian
> systems.
> 
> That's why it's idiotic to add a BOM to an already platform-independent
> file. The BOM is, of course, the same: "\xEF\xBB\xBF".

Notepad doesn't add a BOM if you save the file as ANSI text.

If you save as UTF text it does add the BOM merely as an encoding 
indicator so that applications know how to decode some of the upper 
characters - as UTF8 instead of according to some default code page.

Sam

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Doctor B. <do...@gm...> - 2004-09-23 13:34:17

The purpose of the BOM is not to let the reader know the byte
ordering, but to let the reader know it is UTF8.  The XML 1.0 spec
tells us the encoding for UTF8 without a BOM will be recognized as
follows:

UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any
other 7-bit, 8-bit, or mixed-width encoding which ensures that the
characters of ASCII have their normal positions, width, and values;
the actual encoding declaration must be read to detect which of these
applies, but since all of these encodings use the same bit patterns
for the relevant ASCII characters, the encoding declaration itself may
be read reliably

The net effect is that if you do not have the BOM and you want to test
whether you can decode the XML file, you have to actually start parsing
and processing the encoding parameter, rather than just checking the
first few bytes.  Actually processing an encoding parameter is itself a
pain, since there is no universal standard for how the contents of the
string are converted into a locale name.  This varies by platform, and
sometimes even by revision of the platform.  Does that encoding map to a
locale of "English", "en_US", or "en"?  Consequently, correctly
implementing a parser that supports the encoding attribute often requires
installing an extra library, while correctly parsing the BOM requires
little more than a switch block.

                                       Bill
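
To illustrate how little code the "check the first few bytes" approach needs, here is a rough sketch of BOM detection. It uses a small lookup table rather than a literal switch block; the enum, struct, and function names are all made up for this example. The byte sequences are the standard Unicode BOM encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum bom_kind {
	BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
};

struct bom_entry {
	enum bom_kind kind;
	size_t len;
	unsigned char bytes[4];
};

/* Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
 * UTF-16 LE BOM (FF FE), so the longer patterns are tried first. */
static const struct bom_entry bom_table[] = {
	{ BOM_UTF32_BE, 4, { 0x00, 0x00, 0xFE, 0xFF } },
	{ BOM_UTF32_LE, 4, { 0xFF, 0xFE, 0x00, 0x00 } },
	{ BOM_UTF8,     3, { 0xEF, 0xBB, 0xBF } },
	{ BOM_UTF16_BE, 2, { 0xFE, 0xFF } },
	{ BOM_UTF16_LE, 2, { 0xFF, 0xFE } },
};

/* Hypothetical helper: report which BOM, if any, starts the buffer.
 * *bom_len receives the number of BOM bytes to skip (0 if none). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
	size_t i;

	assert(buf != NULL && bom_len != NULL);
	for (i = 0; i < sizeof bom_table / sizeof bom_table[0]; i++) {
		const struct bom_entry *e = &bom_table[i];

		if (len >= e->len && memcmp(buf, e->bytes, e->len) == 0) {
			*bom_len = e->len;
			return e->kind;
		}
	}
	*bom_len = 0;
	return BOM_NONE;
}
```

A caller that only supports UTF-8 could skip the BOM for BOM_UTF8, continue as before for BOM_NONE, and reject the file with a clear error message for every other result.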

On Thu, 23 Sep 2004 01:24:32 +0100, Sam Liddicott <sa...@li...> wrote:
> Nuno Lucas wrote:
> 
> > Martin Kanich, jumping for joy, wrote:
> >
> >>> The patch is intended to fix those "brain dead" editors that add a BOM
> >>> to a UTF-8 encoded file, not to encode one ourselves.
> >>> The XML library we use is a very simple one, and chokes on this.
> >>
> >>
> >> Brain-dead editors just ignore the fact that an XML document written
> >> on a big-endian system means something different than on a little-endian
> >> one. So, if your Notepad on WinXP just ignores this fact, you're using
> >> the M$ Ignorant Editor (note that Notepad can now do UTF-8).
> >
> >
> > I'm not sure I understood your mail, so I just wanted to add that a UTF-8
> > encoded file is always the same, on both big- and little-endian
> > systems.
> >
> > That's why it's idiotic to add a BOM to an already platform-independent
> > file. The BOM is, of course, the same: "\xEF\xBB\xBF".
> 
> Notepad doesn't add a BOM if you save the file as ANSI text.
> 
> If you save as UTF text it does add the BOM merely as an encoding
> indicator so that applications know how to decode some of the upper
> characters - as UTF8 instead of according to some default code page.
> 
> Sam

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-09-24 12:54:12

I don't think people understand the problem here.
All we want is to avoid having people complain that coLinux isn't
booting because of a mysterious error (can't load file).
To achieve this goal we have (at least) the following options:

1- Follow the XML specification:

1.a) Use another XML library that would give us a standard compliant
      implementation.
1.b) Patch the XML library ourselves.

2- Don't follow the XML specification:

2.a) Just handle the few issues users face with small correction code.
2.b) Stop using the XML library and use other formats (like basic .rc
      files or even .ini files).

IMHO, I would never go for 1.a), because I don't see why we should need
a 1MB library (I'm exaggerating, of course) just to read config files.

1.b) would be just dumb, as we would never reach full compliance
without very hard work, and would probably end up at 1.a) anyway.

2.a) was the intention of my patch: just make sure we don't get users
complaining, by ignoring the UTF-8 BOM. If users decide to use UCS-2,
UTF-16 or UTF-32, that's their problem, as they should know better.

2.b) would be the medium-term solution, but I don't see any reason to
modify the code just for this. If anyone feels it's a must, they are
always free to send a patch.

Regards,
~Nuno Lucas

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Dan A. <da...@co...> - 2004-10-01 06:44:03

On Fri, Sep 24, 2004 at 01:54:00PM +0100, Nuno Lucas wrote:
> I don't think people understand the problem here.
> All we want is to avoid having people complain that coLinux isn't
> booting because of a mysterious error (can't load file).
> To achieve this goal we have (at least) the following options:
> 
> 1- Follow the XML specification:
> 
> 1.a) Use another XML library that would give us a standard compliant
>      implementation.
> 1.b) Patch the XML library ourselves.

How about mailing the library's maintainer? Hopefully the next version
would still be light but with fewer bugs in it.

-- 
Dan Aloni
da...@co...

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Henry N. <Henry.Ne@Arcor.de> - 2004-10-01 17:24:14

Dan Aloni wrote:
> On Fri, Sep 24, 2004 at 01:54:00PM +0100, Nuno Lucas wrote:
> 
>>I don't think people understand the problem here.
>>All we want is to avoid having people complain that coLinux isn't
>>booting because of a mysterious error (can't load file).
>>To achieve this goal we have (at least) the following options:
>>
>>1- Follow the XML specification:
>>
>>1.a) Use another XML library that would give us a standard compliant
>>     implementation.
>>1.b) Patch the XML library ourselves.
> 
> 
> How about mailing the library's maintainer? Hopefully the next version
> would still be light but with fewer bugs in it.
> 

Why don't we use a standard XML library, linked statically?
I think that would work better than this small mxml.
http://mxml.sourceforge.net/ says it's only a "DOM oriented library".

See also a good comparison of BOM and DOM:
http://www.oracle.com/technology/pub/notes/technote_encodings.html

Currently we use UTF-8 without a BOM.
If we accept a BOM, we also need a reader for 2-byte characters.
See "What are some of the differences between the UTFs" in
http://www.unicode.org/unicode/faq/utf_bom.html

I don't have an XML editor, but I found that we can force the
encoding in the header:
  <?xml version='1.0' encoding='UTF-8' ?>
This should make all editors give us the right format. The UTF-8 BOM is EF BB BF.

Can anybody send me a UTF-16 file created with an XML editor?
(Please zip the file before sending, so I can see all the bytes!)

-- 
Henry Nestler

An XML reference, for German readers only:
http://www.sql-und-xml.de/xml-lernen/internationalisierung-unicode-sonderzeichen.html

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Nuno L. <lu...@nl...> - 2004-10-02 08:49:04

Dan Aloni, jumping for joy, wrote:
> How about mailing the library's maintainer? Hopefully the next version
> would still be light but with fewer bugs in it.

My idea would be to just drop XML files in favor of simple .ini/.cfg
files. The code needed to implement it is trivial enough (I have generic
.ini file reader code somewhere and don't mind providing it; the only
problem is that it's in C++).

Even if we use a fully XML-compliant library, we would need to check
the encoding of the host OS file names, and do appropriate conversions.

Different Windows versions use different encodings, depending on the
country. So, if we use Notepad to edit our config file, we are using
the local encoding of the OS (CP-1252 and local variants, in most
places).

On Linux it is the same, as we can have a full UTF-8 system (like RedHat,
if I remember right) and a file open operation will expect a UTF-8 path.

So, in the end, we need to have an XML library that will return us a path
encoded in the local OS "code page", or use something like Unicode,
which is really many standards (choose from UTF-8, UTF-16, UTF-32,
UCS-2, ...). But there are not many editors that do the conversions
right.

If we use simple .CFG files (I prefer this suffix in favor of .INI),
we can just specify they need to be pure ASCII (7-bit) or
encoded with the local OS (by using Notepad, for example).

Now we can open a file without problems, because if the user edited it
with Notepad (or vi), it will be encoded in the same way as the OS file
system.

The only difference here is that we can't exchange .cfg files safely
between systems anymore (as we could with XML), but that doesn't matter
to us.

Another disadvantage concerns XML processing, as it would be easier
to implement a config file editor when using an XML config file. But that
is more for complex config files (like Makefiles), and doesn't buy us
anything in the simple case.

Now we have one less dependency and "we are in control" ;)

I really think that for simple tasks like this, XML files are overkill,
and end up giving more headaches than their use justifies (no more end
tag and BOM problems).

If everyone agrees on this, I don't mind implementing it.
If C++ is undesirable for the job, I'll do it in C.

An example "config.cfg" file:

--------------------------------------------------------------
# NOTES:
#
# I'm starting from cobd1, because I also think this could be
# a good place to start reserving cobd0 for other things (like the
# fake partition table we could implement later).
#
# No need for a "\DosDevice\" prefix, as our loader is a bit smarter.
#
# Make file references relative to the config file location (so we can
# have a generic config file and simply copy it to another directory
# with other images with same name)
#
[Kernel]
bootparams = root=/dev/cobd0 ro
initrd     = initrd.gz
vmlinux    = vmlinux
memory     = 64

[Block Devices]
cobd1 = Gentoo-2gb.ext2
# To disable a device, just comment it ;)
#cobd2 = ../debian/Debian-1gb.ext2
cobd3 = \Device\Cdrom0
#cobd3 = /dev/cdrom          # Linux equivalent
cobd4 = \Device\HarddiskVolume7
#cobd4 = /dev/hda7           # Linux equivalent
cobd5 = swap_768MB.sw

[Network]
eth0  = CoLinux TAP
#eth1  = WinPCAP

[CoLinux TAP]
name    = Local Network 2
type    = tap
mac     = 00:FF:11:22:33:44
enabled = true

[WinPCAP]
name    = Local Network 3
type    = bridge
mac     = 00:FF:11:22:33:44
enabled = false
--------------------------------------------------------------

This is just a sample syntax. This is much more readable and easier to
edit than any XML file.

Waiting for your comments...

~Nuno Lucas
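
To give a feel for how small a reader for the sample format above could be, here is a rough line classifier in C. It assumes '#' comments only at the start of a line (as in the sample, where trailing '#' notes appear only on lines that are already commented out), section headers in square brackets, and "key = value" pairs split at the first '='. All names here (cfg_line, cfg_trim, cfg_parse_line) are hypothetical; this is not Nuno's C++ reader.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum cfg_line { CFG_BLANK, CFG_SECTION, CFG_PAIR, CFG_ERROR };

/* Strip leading and trailing whitespace in place; returns the new start. */
static char *cfg_trim(char *s)
{
	char *end;

	while (isspace((unsigned char)*s))
		s++;
	end = s + strlen(s);
	while (end > s && isspace((unsigned char)end[-1]))
		end--;
	*end = '\0';
	return s;
}

/*
 * Classify one line of the proposed .cfg format, modifying it in place.
 * CFG_SECTION: *name is the text between '[' and ']', *value is NULL.
 * CFG_PAIR:    *name is the key, *value is everything after the first '='.
 * CFG_BLANK:   empty line or '#' comment; both outputs are NULL.
 * CFG_ERROR:   anything else; the outputs should not be used.
 */
enum cfg_line cfg_parse_line(char *line, char **name, char **value)
{
	char *p, *sep;

	assert(line != NULL && name != NULL && value != NULL);
	*name = *value = NULL;

	p = cfg_trim(line);
	if (*p == '\0' || *p == '#')
		return CFG_BLANK;

	if (*p == '[') {
		sep = strchr(p, ']');
		if (sep == NULL || sep[1] != '\0')
			return CFG_ERROR;
		*sep = '\0';
		*name = cfg_trim(p + 1);
		return **name != '\0' ? CFG_SECTION : CFG_ERROR;
	}

	sep = strchr(p, '=');
	if (sep == NULL)
		return CFG_ERROR;
	*sep = '\0';
	*name = cfg_trim(p);
	*value = cfg_trim(sep + 1);
	return **name != '\0' ? CFG_PAIR : CFG_ERROR;
}
```

A caller would read the file line by line (for example with fgets), remember the most recent CFG_SECTION name, and store each CFG_PAIR under it. Splitting at the first '=' is what lets a value such as "root=/dev/cobd0 ro" pass through intact.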

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Dan A. <da...@co...> - 2004-10-02 09:00:23

On Sat, Oct 02, 2004 at 09:49:13AM +0100, Nuno Lucas wrote:
> Dan Aloni, jumping for joy, wrote:
> >How about mailing the library's maintainer? Hopefully the next version
> >would still be light but with fewer bugs in it.
> 
> My idea would be to just drop XML files in favor of simple .ini/.cfg
> files. The code needed to implement it is trivial enough (I have generic
> .ini file reader code somewhere and don't mind providing it; the only
> problem is that it's in C++).
> 
>[..]
> 
> This is just a sample syntax. This is much more readable and easier to
> edit than any XML file.

Well, if we are going to retreat from the XML configuration solution, we
should rather come up with something compatible with User Mode Linux, i.e.,
use no configuration file and pass all configuration using the command
line.

The reason is that User Mode Linux is the closest kind of virtual Linux
implementation to coLinux, and it already has a large established user
base. Even simple command line and interface compatibility can attract
a large number of User Mode Linux users and developers.

-- 
Dan Aloni
da...@co...

Re: [coLinux-devel] Re: [PATCH] Fix "clever" text editors that decide to add a BOM to the config file

From: Henry N. <Henry.Ne@Arcor.de> - 2004-10-03 14:33:05

Dan Aloni wrote:

> On Sat, Oct 02, 2004 at 09:49:13AM +0100, Nuno Lucas wrote:
> 
>>Dan Aloni, jumping for joy, wrote:
>>
>>>How about mailing the library's maintainer? Hopefully the next version
>>>would still be light but with fewer bugs in it.
>>
>>My idea would be to just drop XML files in favor of simple .ini/.cfg
>>files. The code needed to implement it is trivial enough (I have generic
>>.ini file reader code somewhere and don't mind providing it; the only
>>problem is that it's in C++).
>>
>>[..]
>>
>>This is just a sample syntax. This is much more readable and easier to
>>edit than any XML file.
> 
> 
> Well, if we are going to retreat from the XML configuration solution, we
> should rather come up with something compatible with User Mode Linux, i.e.,
> use no configuration file and pass all configuration using the command
> line.
> 
> The reason is that User Mode Linux is the closest kind of virtual Linux
> implementation to coLinux, and it already has a large established user
> base. Even simple command line and interface compatibility can attract
> a large number of User Mode Linux users and developers.
> 
> 

I think INI or CFG files are easier for everyone to read and edit, on
Linux and Windows alike, and simple to parse. In my case, if I click on an
XML file, my Windows shows only an empty file :-( and I must always use
"Open with ...".

Command-line options should always be able to override a config file option.

An INI file is very welcome.

-- 
Henry Nestler
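
One possible reading of "command line overrides config file", sketched in C. It assumes overrides are passed as plain key=value arguments, loosely in the spirit of User Mode Linux; that syntax, and the function name effective_value, are assumptions for illustration and not an existing coLinux interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical lookup: return the value for key, preferring a
 * "key=value" argument on the command line over cfg_value, the value
 * read from the config file (which may be NULL if the file has none).
 * If the key is given several times, the last occurrence wins.
 */
const char *effective_value(const char *key, int argc, char *const argv[],
			    const char *cfg_value)
{
	const char *result = cfg_value;
	size_t klen;
	int i;

	assert(key != NULL && argv != NULL);
	klen = strlen(key);
	for (i = 1; i < argc; i++) {	/* argv[0] is the program name */
		if (strncmp(argv[i], key, klen) == 0 && argv[i][klen] == '=')
			result = argv[i] + klen + 1;
	}
	return result;
}
```

With this shape the config file supplies the defaults, and something like "memory=128" on the command line would replace the "memory = 64" from the sample file for that run only.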