Re: [Lurker-users] How to prefer HTML if a message has both plain and HTML?
From: Kaz K. <ka...@ky...> - 2013-10-04 19:40:04
I have made more progress on this.

My Lurker installation now has a "html_filter" configuration option. Here you can specify a command to use to sanitize the HTML. If nothing is specified, there is no filtering.

I wrote a C utility to do the filtering, using GNU flex to scan, with some minimal parsing logic on top of that. (Most likely a huge reinvention of the wheel, but what the heck).

It has a hard-coded white list of HTML4 tags, so it will nicely take out all junk, while allowing most useful markup. (It needs some refinement: parsing the attributes accurately and allowing only whitelisted ones. Little by little I will improve things ...)

Here is a mailing list in action:

http://www.kylheku.com/lurker/list/ada-mp1.en.html

On 03.10.2013 11:49, Kaz Kylheku wrote:
> Hi Wesley,
>
> Thanks for taking the time to follow up to all my postings.
>
> I see what is going on in the code with regard to HTML; basically it is gutted!
>
> Doing this properly requires a filter which strips out dangerous tags, while keeping the safe subset of the markup.
>
> I am experimenting with a patch which just sticks in the raw HTML. Later I will add some filtering on it, and also treatment of cid: URL's for proper display of inline images.
>
> What I have so far is this:
>
> * I made the code which handles multipart/alternative messages to prefer the HTML part, rather than the plain part.
> * HTML is not stripped at all, or subject to any special handling. Instead, the raw UTF-8 is wrapped in an XML CDATA block.
> * In the UI XSL code, I put in a template rule that a mime body part matching the "text/html" content type is copied, but with escaping disabled.
>
> This mostly works the way I want, modulo safety, and handling of links that point to attachments.
>
> I may implement the HTML filtering via an external program that will be configurable in the lurker.conf.
>
> On 02.10.2013 06:47, Wesley W. Terpstra wrote:
> The reason lurker prefers the plain text is that the formatting is more likely to be correct. For security reasons, lurker strips all html tags from the html formatted mail. I don't want lurker to show embedded javascript or links or what-have-you to a user. So, I think the current default is the best option for most people.
>
> On Sat, Sep 28, 2013 at 4:54 AM, Kaz Kylheku <ka...@ky...> wrote:
> Hi all,
>
> I started using Lurker and right away noticed that when an e-mail is
> archived which is written in HTML, but for which the mail client also
> rendered a plain-text version, Lurker renders the plain text by default.
> You have to click on a link on the far right to view the HTML, which
> looks fine.
>
> Can we change the default, so that if both are available in the body,
> Lurker will prefer the HTML?
>
> The plain text versions can be ugly, and displaying them in a web
> archive defeats the point in an archiver which can deal with HTML and
> MIME!
>
> For example, my mail client renders links as footnotes. These then
> appear in the Lurker archive, instead of the straightforward HTML with
> its anchor elements.
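
For readers who want a feel for the tag-whitelist idea described in the top message, here is a rough Python sketch. It is not the C/flex utility from the post, whose source is not shown in this thread. The names `ALLOWED_TAGS`, `DROP_CONTENT`, `WhitelistFilter` and `sanitize_html` are made up for illustration, and the tag list is an assumed example rather than the utility's real whitelist. The sketch drops every attribute, which matches the post's remark that attribute whitelisting is still to be done.

```python
from html import escape
from html.parser import HTMLParser

# Assumed example whitelist; the real utility's list is not given in the thread.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "a",
                "ul", "ol", "li", "blockquote", "pre"}
# Void elements have no closing tag.
VOID_TAGS = {"br"}
# Elements whose text content is discarded along with the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags, with all attributes removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif (self.skip_depth == 0 and tag in ALLOWED_TAGS
              and tag not in VOID_TAGS):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def sanitize_html(text):
    """Return text with non-whitelisted markup stripped."""
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize_html('<p onclick="x()">Hi <b>there</b>'
                        '<script>alert(1)</script></p>'))
```

Because all attributes are removed, `<a>` elements lose their `href`; a fuller filter would need a per-tag attribute whitelist and URL scheme checks, which is the refinement the post mentions.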