I've attached a demo RTF file that reproduces some problems I've seen with real RTF files in the wild. These may be separate bugs, in which case this report may need to be split into smaller ones.
* em/en dashes not preserved/converted.
* optional hyphens cause loss of data (sometimes the word is just broken into 2 words).
* newlines in the middle of words in the source RTF are treated as newlines in the output.
See the attached RTF file. The problems occur with both the to-text and to-HTML conversions.
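For reference, here is a rough sketch of the plain-text output I would expect for the three constructs listed above, based on my reading of the RTF spec (a space after a control word is a delimiter and is consumed, `\-` is an optional hyphen, and raw CR/LF in the source is ignored). `expected_text` and `CONTROL_WORDS` are hypothetical names made up for this report, not part of the rtf package, and the sketch is nowhere near a full RTF parser: it ignores groups and drops every control word it does not know. Dropping the optional hyphen entirely is also an assumption; emitting a soft hyphen (U+00AD) would be another reasonable choice.

```python
import re

# Hypothetical mapping for this report only: control word -> expected text.
CONTROL_WORDS = {
    'emdash': u'\u2014',  # em dash
    'endash': u'\u2013',  # en dash
    'par': u'\n',         # end of paragraph
}

# One pass over the source, matching in order:
#   1. a control word, its optional numeric parameter and one delimiting space
#   2. the optional-hyphen control symbol \-
#   3. raw CR/LF, which has no meaning in RTF source
TOKEN = re.compile(r'\\([a-z]+)(?:-?\d+)? ?|\\-|[\r\n]+')


def expected_text(rtf_fragment):
    """Return the text this report expects for a brace-free RTF fragment."""
    def replace(match):
        word = match.group(1)
        if word:
            # Unknown control words are dropped (assumption, for brevity).
            return CONTROL_WORDS.get(word, u'')
        # Optional hyphen or raw newline: contributes no text.
        return u''
    return TOKEN.sub(replace, rtf_fragment)


fragment = 'hello\\emdash world\\par\r\nan\\-gle\\par\r\nwo\r\nrd\\par\r\n'
print(repr(expected_text(fragment)))
```

With this fragment the sketch yields `hello`, an em dash and `world` on the first line, then `angle`, then `word`, each terminated by a single newline.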
"""Bug report:
* em/en dashes not preserved/converted. See "hello" "world" examples.
* optional hyphens cause loss of data (sometimes it just breaks the word
into 2 words). See word "angle"
* newlines in source rtf in the middle of words are treated as newlines
in output. See word "word"
* NUL at end of text - NOTE this may not be a real bug (rtf file was
created in Windows WordPad, then manually edited, then a new empty
rtf file created in WordPad and the updated one copy/pasted from
one wordpad to the other)
"""
import rtf.Rtf2Html

def textwrap_string_variable(in_value, variable_name='wrapped_string', wrap_length=65):
    """Return Python source that assigns repr(in_value) to variable_name,
    wrapped with backslash continuations at roughly wrap_length columns."""
    ## chop_string from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302069
    chop_string = lambda s,p: [ s[i:i+p] for i in range(0,len(s),p) ]
    repr_very_long_value = repr(in_value)
    ## replace single quotes with triple
    repr_very_long_value = "'''" + repr_very_long_value[1:-1] + "'''"
    result_sr = []
    result_sr.append('%s = \\\n' % variable_name)
    handle_long_lines=2
    for x in chop_string(repr_very_long_value, wrap_length):
        if x[-1] == '\\':
            # if last char is an escape, potentially do something different (chop sooner...?)
            ## find first escape and then chop before it.
            if handle_long_lines == 1:
                # potentially VERY long line length
                result_sr.append(x)
            elif handle_long_lines == 2:
                # The longest a line could be is 2*wrap_length
                # however can end up with empty lines....
                slash_pos = x.find('\\')
                temp_x = x[:slash_pos]
                result_sr.append(temp_x)
                result_sr.append('\\\n')
                temp_x = x[slash_pos:]
                result_sr.append(temp_x)
            else:
                raise NotImplementedError('handle_long_lines style %r' % handle_long_lines)
        else:
            ## not pretty and is long but preserves data!
            result_sr.append(x)
            result_sr.append('\\\n')
    if result_sr[-1] == '\\\n':
        # remove trailing slash and newline
        del result_sr[-1]
    new_python_code = ''.join(result_sr)
    return new_python_code
"""
in_filename = 'test_doc.rtf'
#in_filename = 'test_doc_2.rtf'
file_ptr = open(in_filename, 'rb')
input_text = file_ptr.read()
print repr(input_text)
print textwrap_string_variable(input_text, 'input_text')
"""
input_text = \
'''{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033\\deflangfe1033{\
\
\\fonttbl{\\f0\\froman\\fprq2\\fcharset0 Times New Roman;}{\\f1\\f\
swiss\\fcharset0 Arial;}}\r\n{\\colortbl ;\\red0\\green0\\blue0;\
}\r\n{\\*\\generator Msftedit 5.41.15.1507;}\\viewkind4\\uc1\\par\
d\
\\qc\\cf1\\i\\f0\\fs52 Chapter One\\par\r\n\\pard\\cf0\\i0\\f1\\fs20 Hel\
lo world.\
\\par\r\n2nd line/para.\\par\r\n\\par\r\nhello\\emdash world\\par\r\n\
hello\\endash world\\par\r\nThis is a longis\
h line that may wrap, using optional hypen, the an\
\\-gle.\\par\r\nAnother really long line that wil get broken into \
two pieces in \
the middle of a wo\r\nrd.\\par\r\n\\par\r\n}\r\n\x00'''
# short test case to debug
#input_text = '{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang1033\nhello\\emdash world\\par\r\nhello\\endash world\\par\r\nThis is a longish line that may wrap, using optional hypen, the an\\-gle\\par\r\n\\par\r\n}\r\n\x00'
#input_text = rtf.Rtf2Html.getHtml(input_text)
#input_text = rtf.Rtf2Txt.getTxt(input_text)
test_function = rtf.Rtf2Html.getHtml
#test_function = rtf.Rtf2Txt.getTxt
input_text = test_function(input_text)
print '-'*65
print input_text
print '-'*65
test_doc.rtf - sample rtf file with multiple rtf escapes that do not convert as expected