PyRtfLib / Bugs / #5 Unhandled RTF escapes sequences

#5 Unhandled RTF escapes sequences

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2007-10-10

Created: 2007-10-10

Creator: Chris Clark

Private: No

I've attached a demo RTF file that demonstrates some problems I've seen with real RTF files in the wild. It may be that these are separate bugs, in which case this report may need to be split into smaller bug reports.

* em/en dashes (\emdash, \endash) are not preserved/converted.
* optional hyphens (\-) cause loss of data (sometimes the word is just broken into 2 words).
* newlines in the source RTF in the middle of words are treated as newlines in the output.

See the attached RTF file. The problem occurs with both to-text and to-HTML conversion.
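As a rough illustration of the behaviour expected for the three cases above, here is a minimal sketch. It is not PyRtfLib code: `expected_text`, `REPLACEMENTS` and `_TOKEN` are hypothetical names made up for this example, it assumes Python 3, and it only understands the control words named in this report (any other control word is silently dropped, and groups/braces are not handled at all).

```python
import re

# Hypothetical mapping, covering only the control words from this report.
REPLACEMENTS = {
    "emdash": "\u2014",  # \emdash -> EM DASH
    "endash": "\u2013",  # \endash -> EN DASH
    "par": "\n",         # \par    -> paragraph break
}

# A control word (letters, optional numeric parameter, optional single
# delimiter space), or the \- control symbol, or a raw CR/LF character.
_TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\(-)|[\r\n]")


def expected_text(fragment):
    """Return the plain text expected for a small, brace-free RTF fragment."""
    def substitute(match):
        word, symbol = match.group(1), match.group(3)
        if symbol == "-":
            # Optional hyphen: only a hint for line breaking, so the
            # surrounding word must stay in one piece.
            return ""
        if word is not None:
            # Known control words are translated, unknown ones dropped.
            return REPLACEMENTS.get(word, "")
        # Raw CR/LF in the RTF source is not a line break in the document.
        return ""
    return _TOKEN.sub(substitute, fragment)
```

With this sketch, `expected_text("the an\\-gle.")` gives `"the angle."` and `expected_text("a wo\r\nrd.")` gives `"a word."`, which is the output the converters would ideally produce for the same input.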

"""Bug report:

* em/en dashes not preserved/converted. See "hello" "world" examples.
* optional hyphens cause loss of data (sometimes it just breaks the word
  into 2 words). See word "angle"
* newlines in source rtf in the middle of words are treated as newlines
  in output. See word "word"
* NUL at end of text - NOTE this may not be a real bug (rtf file was
  created in Windows WordPad, then manually edited, then a new empty
  rtf file created in WordPad and the updated one copy/pasted from
  one wordpad to the other)
"""

import rtf.Rtf2Html

def textwrap_string_variable(in_value, variable_name='wrapped_string', wrap_length=65):
    """Return Python source that assigns repr(in_value) to variable_name,
    wrapped with backslash line continuations roughly every wrap_length chars."""
    ## chop_string from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302069
    chop_string = lambda s,p: [ s[i:i+p] for i in range(0,len(s),p) ]

    repr_very_long_value = repr(in_value)
    ## replace single quotes with triple
    repr_very_long_value = "'''" + repr_very_long_value[1:-1] + "'''"

    result_sr = []
    result_sr.append('%s = \\\n' % variable_name)
    handle_long_lines=2
    for x in chop_string(repr_very_long_value, wrap_length):
        if x[-1] == '\\':
            # if last char is escape, potentially do something different (chop sooner...?)
            ## find first escape and then chop before it.
            if handle_long_lines == 1:
                # potentially VERY long line length
                result_sr.append(x)
            elif handle_long_lines == 2:
                # The longest a line could be is 2*wrap_length
                # however can end up with empty lines....
                slash_pos = x.find('\\')
                temp_x = x[:slash_pos]
                result_sr.append(temp_x)
                result_sr.append('\\\n')
                temp_x = x[slash_pos:]
                result_sr.append(temp_x)
            else:
                # NotImplemented is a constant, not an exception class
                raise NotImplementedError('handle_long_lines style %r' % handle_long_lines)

        if x[-1] != '\\':
            ## not pretty and is long but preserves data!
            result_sr.append(x)
            result_sr.append('\\\n')
    if result_sr[-1] == '\\\n':
        # remove trailing slash and newline
        del result_sr[-1]
    new_python_code = ''.join(result_sr)
    return new_python_code

"""
in_filename = 'test_doc.rtf'
#in_filename = 'test_doc_2.rtf'
file_ptr = open(in_filename, 'rb')
input_text = file_ptr.read()
print repr(input_text)
print textwrap_string_variable(input_text, 'input_text')
"""
input_text = \
'''{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033\\deflangfe1033{\
\
\\fonttbl{\\f0\\froman\\fprq2\\fcharset0 Times New Roman;}{\\f1\\f\
swiss\\fcharset0 Arial;}}\r\n{\\colortbl ;\\red0\\green0\\blue0;\
}\r\n{\\*\\generator Msftedit 5.41.15.1507;}\\viewkind4\\uc1\\par\
d\
\\qc\\cf1\\i\\f0\\fs52 Chapter One\\par\r\n\\pard\\cf0\\i0\\f1\\fs20 Hel\
lo world.\
\\par\r\n2nd line/para.\\par\r\n\\par\r\nhello\\emdash world\\par\r\n\
hello\\endash world\\par\r\nThis is a longis\
h line that may wrap, using optional hypen, the an\
\\-gle.\\par\r\nAnother really long line that wil get broken into \
two pieces in \
the middle of a wo\r\nrd.\\par\r\n\\par\r\n}\r\n\x00'''

# short test case to debug
#input_text = '{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033\nhello\\emdash world\\par\r\nhello\\endash world\\par\r\nThis is a longish line that may wrap, using optional hypen, the an\\-gle\\par\r\n\\par\r\n}\r\n\x00'

#input_text = rtf.Rtf2Html.getHtml(input_text)
#input_text = rtf.Rtf2Txt.getTxt(input_text)

test_function = rtf.Rtf2Html.getHtml
#test_function = rtf.Rtf2Txt.getTxt
input_text = test_function(input_text)

print '-'*65
print input_text
print '-'*65

Discussion

Chris Clark - 2007-10-10

test_doc.rtf - sample rtf file with multiple rtf escapes that do not convert as expected

test_doc.rtf


Chris Clark - 2007-10-10

summary: Unhandled RTF escapes equences --> Unhandled RTF escapes sequences