Thread: [Rest2web-develop] (pretty strange) encoding error with rest2web on a Mac
From: Ben <bi...@ma...> - 2008-11-22 19:21:35
Hello list,

I have a strange issue with file encoding. Actually, this problem only occurs on my brand new MacBook -- running MacOSX on an HFS+ filesystem -- but everything used to run fine with my previous OS, Linux. I've investigated as much as I could but I can't find a solution. Could you please tell me your opinion about it?

First I had some weird behavior with rest2web's encoding handling: the 'file' command reports a file as UTF-8 encoded, but the rest2web log guesses latin1.
I've added:
[uservalues]
__encoding__ = UTF-8
in my r2w and encoding: utf-8 in the files' restindex.
Just to let you know, my env says: LANG=en_GB.UTF-8.

Things are still a bit weird (rest2web apparently still guesses a wrong encoding) but kind of work. Except ONE last file, an indexfile in a directory.

Here is the error output r2w.py produces (called by a Makefile):

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<---
Processing indexfile.
[err] Traceback (most recent call last):
[err]   File "/Users/benoit/local/rest2web-0.5.1/r2w.py", line 170, in <module>
[err]     count = main(options, config)
[err]   File "/Users/benoit/local/rest2web-0.5.1/r2w.py", line 103, in main
[err]     return processor.walk()
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 473, in walk
[err]     errorcheck = self.execute_safely(self.buildsection)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 218, in execute_safely
[err]     val = function(*args, **keywargs)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 1401, in buildsection
[err]     self.sections, final_encoding, self.dir_as_url, target)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 1735, in handle_sections
[err]     page['page-description'], encoding)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restutils.py", line 240, in encode
[err]     return instring.encode(encoding)
[err] UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 138: ordinal not in range(256)
[err]
make: *** [build] Error 1
-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<---

And here is the faulty text:

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<---
$ cat src/articles/index.txt
..
restindex
link-title: Resources
encoding: utf-8
page-description:
List
/description
/restindex

========================================
List
========================================

test
-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<---

I first thought it could be from my template but, once again, other pages are successfully processed. My locales seem OK...

Any idea? I'm lost.
Thank you very much in advance.

-- Ben
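For readers following the traceback, here is a minimal sketch of how this kind of UnicodeEncodeError arises. It is not rest2web code: the sample string is made up, and the snippet is written for a current Python rather than the Python 2.5 used in the thread. It only shows that U+2019 (a typographic apostrophe) can be encoded to UTF-8 but not to Latin-1, which covers code points 0 to 255 only.

```python
# Illustrative only: a made-up string containing U+2019,
# the RIGHT SINGLE QUOTATION MARK named in the traceback.
text = u"It\u2019s a test"

# Encoding (text -> bytes) to UTF-8 works for any Unicode character.
utf8_bytes = text.encode("utf-8")

# Latin-1 only covers code points 0-255, so U+2019 cannot be encoded.
try:
    text.encode("latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc

print(repr(utf8_bytes))
print(latin1_error)
```

The error is raised while turning text into bytes, so it points at whichever output encoding was in effect at that moment, not at the bytes of the source file.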
From: Michael F. <fuz...@vo...> - 2008-11-25 12:01:53
Hello Ben,

Sorry for the late reply. That's certainly a very odd error (that I've not seen before).

Interestingly the error is in an encode operation (not a decode) - so it happens when rest2web is trying to write out a file rather than reading one in (so I don't *think* it is to do with guessing encoding).

Can you put the following in the restindex of your main index file and see if it helps:

output-encoding: utf8

Michael Foord

--
http://www.ironpythoninaction.com/
From: Ben <bi...@ma...> - 2008-11-25 17:46:39
On Tue, Nov 25, 2008 at 12:01 PM, Michael Foord <fuz...@vo...> wrote:

> Hello Ben,

Hi Michael, thanks a lot for your email and thanks a million times for rest2web!

> Sorry for the late reply.

No problem at all. I was just wondering if my email ever reached the list!

> Interestingly the error is in an encode operation (not a decode) - so it
> happens when rest2web is trying to write out a file rather than reading
> one in (so I don't *think* it is to do with guessing encoding).

OK, interesting and good to know! In fact I've discovered that as well by testing and testing all the files: they are in UTF-8 (a Python guru friend of mine has created a little script to test a file char by char :)). So the problem seems to be the output.

> Can you put the following in the restindex of your main index file and
> see if it helps:
>
> output-encoding: utf8

Actually, I've tried everything I could, including the directive you mention. Here is what is in my updated files:

encoding: utf-8
output-encoding: utf-8

in all the files. It doesn't help. I've tried to remove the page-description (because it contained non-ASCII chars).

By the way, a quick question: the error message is a bit obscure. Do you think there is a way to know exactly where it failed? The error message says something like:

[err] UnicodeError: Unable to decode input data. Tried the following encodings: 'utf-8'.
[err] (UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1690: invalid data)

It's not really clear. I would enjoy something like: "can't decode bytes in position 1689-1690: invalid data at "and Hello in Klingon is !@#@#$"

But I've been investigating a bit deeper and now I'm completely puzzled. The really weird thing is that I've copied, with `scp -r`, the directory with my faulty website and the rest2web archive (with r2w.py in it) to a Linux box (Debian stable), and I ran the thing with my Makefile. Guess what? It *works*. Perfectly. Seamlessly.

So the problem is probably not the files, and not rest2web. I'm wondering about Python, or HFS+ (?). I know it's weird, but ... I can't say anything else.

So here are my Python flavors:

$ /usr/bin/python --version
Python 2.5.1

and using the ports:

$ /opt/local/bin/python2.5 --version
Python 2.5.2

I've tried to use both by calling r2w.py directly (for the standard Python) or /opt/local/bin/python-2.5 r2w.py (for the Python from ports). I've installed docutils for the 2 separate systems. Both fail to build the website.

Well, I don't know what to say. I've read your blog, Michael, and I think you have a Mac too. Have you ever run into these issues?

Kind regards,

-- Ben
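On the question of finding exactly where decoding failed: a rough sketch of a checker that reports the byte offset together with the surrounding bytes is given below. The function names `find_invalid_utf8` and `check_file` are hypothetical (this is not the friend's script, nor part of rest2web or docutils), and the code assumes a current Python. It relies on the `start` and `end` attributes that Python's `UnicodeDecodeError` carries.

```python
def find_invalid_utf8(data, context=20):
    """Return None if data is valid UTF-8, else (start, end, nearby bytes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start / exc.end are byte offsets of the offending sequence.
        lo = max(exc.start - context, 0)
        return exc.start, exc.end, data[lo:exc.end + context]
    return None


def check_file(path, context=20):
    # Read raw bytes so that no implicit decoding happens before the check.
    with open(path, "rb") as handle:
        return find_invalid_utf8(handle.read(), context)


# Made-up sample data: one valid UTF-8 string, one Latin-1 byte string.
sample_ok = u"caf\u00e9".encode("utf-8")
sample_bad = u"caf\u00e9".encode("latin-1") + b" au lait"

print(find_invalid_utf8(sample_ok))
print(find_invalid_utf8(sample_bad))
```

Calling `check_file("src/articles/index.txt")` on each source file and each template would, under these assumptions, show the bytes around the reported position instead of only the offset.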
From: Benoit <be...@ma...> - 2008-12-02 22:13:18
Hello everybody,

I've been investigating a bit deeper, from a user's point of view, the problem I'm facing with this rest2web encoding issue on MacOSX.

I've created a single new file, using vim or TextWrangler on MacOSX. Here are the results.

1/ Latin1 file

- the file is encoded in Latin1 (confirmed by file(1)).
- in the file I've set:
  encoding: latin1
  output-encoding: utf-8

and when I run the Makefile, everything works fine (rest2web seems happy and the HTML output looks great in Firefox)

2/ A UTF-8 file

- I've converted exactly the same file with iconv(1).
  iconv -f latin1 -t utf8 myfile.txt > myfile.txt.utf8
  mv myfile.txt.utf8 myfile.txt

- First attempt, I *DON'T* change the restindex:
  encoding: latin1 ## yes the file is now UTF-8!
  output-encoding: utf-8

  => results: rest2web seems happy and runs well, but the HTML output is double-encoded in UTF-8 (e.g. é -> Ã©).

- Second attempt, I fix the restindex:
  encoding: utf-8
  output-encoding: utf-8

  => results: rest2web stops with an encoding related error message.

$ make
rest2web version 0.5.1

[ skipping lots of files which are OK! ]

Processing "articles" directory.
Reading "articles/myfile.txt".
[err] Traceback (most recent call last):
[err]   File "/Users/benoit/local/rest2web-0.5.1/r2w.py", line 170, in <module>
[err]     count = main(options, config)
[err]   File "/Users/benoit/local/rest2web-0.5.1/r2w.py", line 103, in main
[err]     return processor.walk()
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 457, in walk
[err]     errorcheck = self.execute_safely(ProcessFile)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 218, in execute_safely
[err]     val = function(*args, **keywargs)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 452, in ProcessFile
[err]     subdir=subdir)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restprocessor.py", line 950, in process
[err]     doctitle=doctitle)
[err]   File "/Users/benoit/local/rest2web-0.5.1/rest2web/restutils.py", line 183, in html_parts
[err]     writer_name='html', settings_overrides=overrides)
[err]   File "/Library/Python/2.5/site-packages/docutils/core.py", line 433, in publish_parts
[err]     enable_exit_status=enable_exit_status)
[err]   File "/Library/Python/2.5/site-packages/docutils/core.py", line 614, in publish_programmatically
[err]     output = pub.publish(enable_exit_status=enable_exit_status)
[err]   File "/Library/Python/2.5/site-packages/docutils/core.py", line 204, in publish
[err]     self.settings)
[err]   File "/Library/Python/2.5/site-packages/docutils/readers/__init__.py", line 68, in read
[err]     self.input = self.source.read()
[err]   File "/Library/Python/2.5/site-packages/docutils/io.py", line 357, in read
[err]     return self.decode(self.source)
[err]   File "/Library/Python/2.5/site-packages/docutils/io.py", line 124, in decode
[err]     error_details))
[err] UnicodeError: Unable to decode input data. Tried the following encodings: 'utf-8'.
[err] (UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1665-1666: invalid data)
[err]

Well, of course I could use a latin1 encoding, but I would enjoy UTF-8!

Does anyone have a clue?

Kind regards,

-- Ben
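The double encoding seen in the first attempt (a UTF-8 file declared as latin1) can be reproduced in a few lines. This is only a sketch with a one-character sample, written for a current Python, and the variable names are illustrative; it does not show what rest2web does internally.

```python
original = u"\u00e9"                          # the character e-acute
utf8_bytes = original.encode("utf-8")         # what is actually on disk

# Declaring the file as latin1 makes each UTF-8 byte a separate character.
misread = utf8_bytes.decode("latin-1")

# Writing that text out as UTF-8 yields the doubly encoded form.
double_encoded = misread.encode("utf-8")

# The damage is reversible as long as no byte was lost on the way.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(repr(misread))
print(repr(double_encoded))
print(repaired == original)
```

Because every byte value is a valid Latin-1 character, the wrong declaration never raises an error, which is consistent with rest2web running "happily" in that attempt.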
From: Michael F. <fuz...@vo...> - 2008-12-02 23:17:06
Hello Ben,

Very odd. Thanks for your investigation - I'll have to find the time to work out what is going wrong. Very odd.

Can you send the errant file (or an example one that causes the issue)?

All the best,

Michael Foord

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
From: G. M. <mi...@us...> - 2008-12-03 07:55:24
On 2.12.08, Benoit wrote:

> I've created a single new file, using vim or textwrangler on MacOSX.
...
> - I've converted the exactly same file with iconv(1).
>   iconv -f latin1 -t utf8 myfile.txt > myfile.txt.utf8
>   mv myfile.txt.utf8 myfile.txt
...
> - Second attempt, I fix the restindex:
>   encoding: utf-8
>   output-encoding: utf-8
> => results: rest2web stops with an encoding related error message.
...
> [err] error_details))
> [err] UnicodeError: Unable to decode input data. Tried the following
> encodings: 'utf-8'.
> [err] (UnicodeDecodeError: 'utf8' codec can't decode bytes in position
> 1665-1666: invalid data)
> [err]

There still seems to be an invalid character in your file. This might be a bug in iconv. Or some non-UTF-8 encoded char was "smuggled" in, or a byte was taken out after the conversion.

Maybe you can trim down your example file (keeping the bytes around 1665).

Do you get the same error with `rst2html myfile.txt`?

Can you open this file with a UTF-8 enabled text editor?

What does myfile.txt look like in a web browser if you set the page encoding manually to UTF-8?

In any case, we need an example file to investigate.

Günter
From: Ben <bi...@ma...> - 2008-12-03 08:52:12
Hello everyone,

First of all, thank you for your help and support! I really appreciate it.

> There seems to be still an invalid character in your file.

Yes, it looks like an invalid character, but there are other clues that it's not:
- as Michael pointed out, it was apparently a problem in writing the file, not reading it (encoding vs decoding)
- iconv is perfectly happy with all the files
- I've asked a Python guru friend to write a program to spot any non-UTF-8 character in any file, and his program, is_unicode.py, is perfectly happy with my files, char by char
- all the files are perfectly processed by rst2html.py
  (for i in `ls -1 *.txt` ; do rst2html.py $i > /dev/null ; done) => no error on stderr
- it simply works perfectly on Linux. Same files, same rest2web archive.

> May be you can trim down your example file (keeping the bytes around 1665).

I did, big time! And the files are really UTF-8.
However, do you know how to jump to byte 1665 with any text editor? I've tried with cat, wc, etc. and I simply didn't find it :< sorry.
What I did was remove all the paragraphs except the intro. It fails.

> Do you get the same error with `rst2html myfile.txt`?

Nope, it works. All of them.

> Can you open this file with an UTF-8 enabled text editor?

Yes: vim, emacs 22.2.50.1 with UTF-8 support, and TextWrangler with the explicit mention of UTF-8 encoding. I carefully inspected the files and removed any weird character (even those which are supposedly UTF-8, like a non-breaking space or fancy quotes).

> What does myfil.txt look like in a web browser if you set the
> page-encoding manually to UTF-8?

I've done that with Firefox, explicitly picking UTF-8 in View -> Character encoding. The reST source page looks great!

> In any way, we need an example file to investigate.

OK, I'm going to create an archive and post a link to it on this list.

Thank you again for your help.

-- Ben
From: G. M. <mi...@us...> - 2008-12-03 09:48:41
On 3.12.08, Ben wrote:

> > There seems to be still an invalid character in your file.

> Yes, it looks like an invalid character, but there are other clues that it's
> not
> - as Michael pointed out, it was a pb in writing the file, apparently, not
>   reading it (encoding vs decoding)
> - all the files are perfectly processed by rst2html.py
>   (for i in `ls -1 *.txt` ; do rst2html.py $i > /dev/null ; done) => no
>   error on stderr
> - it simply perfectly works on Linux. Same files, same rest2web archive.

You convinced me. Let us assume that there is no "strange" character in the rst input. rst2html works well, so there must be a problem with rest2web:

* is your template file UTF-8 clean?
* do you use any of the special replacement features of rest2web?
* what is the locale when you run rest2web?

> I did, big time! and the files are really utf8
> However, do you know how to jump to byte 1665 with any text editor?

You might need a binary editor... or read the file into a string and slice [1660:1670].

However, it might be worth looking at around character 1660 in the *exported* latin-1 file.

Günter
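The "read the file into a string and slice" suggestion can be sketched as follows. The helper names `bytes_around` and `file_bytes_around` are hypothetical, and the code targets a current Python; opening the file in binary mode is what makes the offsets byte offsets, as in the docutils error message.

```python
def bytes_around(data, offset, radius=5):
    """Return the bytes from offset - radius up to offset + radius."""
    start = max(offset - radius, 0)
    return data[start:offset + radius]


def file_bytes_around(path, offset, radius=5):
    # Binary mode: positions are counted in bytes, not decoded characters.
    with open(path, "rb") as handle:
        return bytes_around(handle.read(), offset, radius)


# Small self-contained demonstration on made-up data.
demo = b"0123456789"
print(repr(bytes_around(demo, 5, 2)))
```

With the defaults, `file_bytes_around("myfile.txt", 1665)` corresponds to the slice [1660:1670]; printing the result with `repr` shows any non-ASCII bytes as escapes such as `\xe9`, which makes a stray Latin-1 byte easy to spot.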