
Cleaning HTML

  • MatthewJurgens - 2008-02-12

    I was recently working on another Perl project and, because it was being delivered over a WAN, needed to "compress" the HTML as much as possible to speed up the delivery of the pages. I developed my own two regexes that saved a significant amount of space. I looked for a Perl module to do this and found HTML::Clean and another one called HTML::Tidy (which was missing). I tried to install HTML::Clean but it never really worked (I can't recall why), so I enhanced my own routine using some of the ideas from HTML::Clean. I now have a simple function which takes HTML as input and outputs the same HTML, but with a lot of size reduction.

    I then thought that this might be a good idea for POPFile and changed HTML.pm to incorporate the HTML cleaning routine. This has made page delivery a little faster on a LAN, but over a slower link it is much faster. As an example, my history page with 100 entries used to be about 217k, and the cleaned version (delivering exactly the same page) is only 108k.
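
    Roughly, the idea is just to run the finished page through the cleaner before it is written back to the client, along these lines (only an illustration of the hook point, not my actual HTML.pm patch; the names are placeholders):

    # hypothetical hook: clean the rendered page just before it is sent
    sub send_page {
        my ( $self, $client, $html ) = @_;

        $html = clean_html( $html );   # strip the redundant white space

        print $client $html;
    }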

    I have patched HTML.pm in a way that makes it easy for me to re-patch when versions change, so it's probably not the best way to code it, but I can post it if this is of interest.

     
    • Texas Fett - 2008-02-12

      Part of the reason for all the white space in the HTML is to make it easy to read for editing.  Some of the rest is a side effect of our template system combining files.  Is 100k really that much of a saving anyway?  I think the slow part of loading the history page with that many entries is the browser rendering the large table.  Most people don't need to spend a lot of time in the POPFile UI once it is well trained, since it is pretty accurate.

      I don't think we want to totally "compress" the HTML, for the readability reason, but there may be some optimizations we can do if the "compressing" process doesn't slow things down much itself.  For most people currently (at least on Windows), POPFile runs on the local machine, so bandwidth is not as much of an issue as processor usage.  It would certainly be worth looking into.  Can you post some sample HTML from before and after?

       
    • Manni - 2008-02-12

      That's an interesting idea. Of course, any kind of compression saves bandwidth while at the same time imposing a cost on the CPU.

      Now that I think about it, what about gzip compression?

      Manni

       
    • MatthewJurgens - 2008-02-12

      The compression I am doing is not compression along the lines of gzip - the browser still needs to be able to read the code. The HTML source stays readable for editing, since the templates themselves do not change. Additionally, developers could turn off the call to the clean function and the output HTML would be just as readable as ever. The "compression" process is incredibly simple and takes milliseconds.

      Here's the clean sub:
      If you're worried about performance, use only the two commented-out "basic" steps, as they provide about 95% of the compression.

      sub clean_html {
          my ( $source_html ) = @_;

          # my 2 basic cleaning steps
          # $source_html =~ s/^\s*//mg;
          # $source_html =~ s/>\n/>/mg;

          # the enhanced steps are:

          $source_html =~ s,[\r\n]+,\n,sg; # carriage return/LF -> LF
          $source_html =~ s,\s+\n,\n,sg;   # empty line
          $source_html =~ s,\n\s+<,\n<,sg; # space before tag
          $source_html =~ s,\n\s+,\n ,sg;  # other spaces

          $source_html =~ s,>\n\s*<,><,sg; # LF/spaces between tags

          # remove space within tags: <center  > becomes <center>
          $source_html =~ s,\s+>,>,sg;
          $source_html =~ s,<\s+,<,sg;

          # join lines with a space at the beginning/end of the line
          # and a line that begins with a tag
          $source_html =~ s,>\n ,> ,sig;
          $source_html =~ s, \n<, <,sig;

          return $source_html;
      }
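
      A quick way to see the saving on any saved page is something like this (the file name is just an example):

      # hypothetical quick test: read a saved page, clean it, compare sizes
      my $html = do { local $/; open my $fh, '<', 'history.html' or die $!; <$fh> };
      my $cleaned = clean_html( $html );
      printf "before: %d bytes, after: %d bytes (%.0f%% of original)\n",
             length( $html ), length( $cleaned ),
             100 * length( $cleaned ) / length( $html );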

      For a short time, a before page and an after page are available (no CSS) at:
      http://www.mmcit.co.nz/popfile_before.html and
      http://www.mmcit.co.nz/popfile_after.html

      For me, these two pages show the speed advantage of the compression, since I am about a 250ms round trip away from them. The "after" page loads in about 60-70% of the time that the "before" page does.

      As I originally said, its major advantage is for web page delivery over slower networks.

       
      • Texas Fett - 2008-02-12

        The white space removal you are doing does help, but the gzip compression Manni was talking about is like the gzip/deflate option in Apache.  It would be invisible to humans since it would be decoded by the browser, I think, but it would reduce the page size transferred.  That would likely cause a tiny slowdown to compress the pages and then another tiny one when the browser decompresses them.  But if we keep the actual compression minimal, it should not add much work for the CPU.
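
        For what it's worth, the minimal version of that in Perl would look something like this, assuming the client advertises gzip support in its Accept-Encoding header (just a sketch with placeholder values, not tested against the POPFile UI code):

        use strict;
        use warnings;
        use IO::Compress::Gzip qw( gzip $GzipError );

        # $html stands in for the finished page; $accept_encoding would come
        # from the client's Accept-Encoding request header
        my $html            = '<html>' . ( '<p>hello</p>' x 200 ) . '</html>';
        my $accept_encoding = 'gzip, deflate';

        if ( $accept_encoding =~ /\bgzip\b/i ) {
            my $gzipped;
            gzip( \$html => \$gzipped ) or die "gzip failed: $GzipError";
            # the response would then be sent with "Content-Encoding: gzip"
            printf "plain: %d bytes, gzipped: %d bytes\n",
                   length( $html ), length( $gzipped );
        }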

        I will have to check the skins with a bunch of browsers to be sure the white space compression doesn't affect anything.  Sometimes removal of white space affects browser rendering.

        If we do use your method, we can also get rid of the HTML comments that aren't wrapping javascript code.  They are mostly there to tell developers where inserted template files start and end.  As long as developers can turn this option off, there is no reason to keep those comments in the output HTML, since the output is not meant to be very human readable anyway.
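
        A rough way to do that in the same regex style would be to split out the <script> blocks and only strip comments from everything else, something like this (untested sketch; note it would also drop IE conditional comments):

        # split out <script>...</script> blocks and only clean the other parts
        my @parts = split m{(<script\b.*?</script>)}si, $source_html;
        for my $part ( @parts ) {
            next if $part =~ m{^<script\b}i;   # leave javascript wrappers alone
            $part =~ s{<!--.*?-->}{}sg;        # drop plain HTML comments
        }
        $source_html = join '', @parts;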

         
