Willow Filtering Proxy Code
Brought to you by: davidredwaratah
WILLOW is a caching, content-filtering proxy/HTTP server written in Python.

IMPORTANT NOTE: Willow is designed to work best with current HTTP/1.1 web servers. Anything designed since 2006 should count as "current". As of April 2010, certain servers still use HTTP/1.0, which poses problems for Willow; these include Apple's software update server (http://swscan.apple.com) and the Edna MP3/OGG server, among likely others. Certain data exchanges may not fully work with these legacy servers, and their operators should be challenged to modernize their systems for increased efficiency. For reference, see
http://www2.research.att.com/~bala/papers/h0vh1.html

This backwards compatibility issue seems deeply ingrained in Willow's original design, and no efforts are currently planned to fully support HTTP/1.0 servers. As a workaround, configure known problem HTTP/1.0 sites to bypass the proxy. This can often be done in the client (e.g. the Firefox and Mac OS X proxy settings).

Willow Filtering Proxy

"He took some of the seed of your land and put it in fertile soil. He planted it like a willow by abundant water, and it sprouted and became a low, spreading vine. Its branches turned toward him, but its roots remained under it. So it became a vine and produced branches and put out leafy boughs."
-- THE Bible, Ezekiel 17:5-6

Willow is a content-filtering proxy server. It bears one similarity to the many other pieces of software available for web filtering: it is designed to filter web content. That, however, is where the similarities end. The differences between Willow and other solutions are significant, and these differences make Willow the first really usable internet filter.

* Expense: Willow is available free of charge. This is the complete, full-featured version of the software. There are no holdbacks or catches. In addition, any improvements or updates will be free of charge.

* Code: Willow is open source (under the GNU General Public License).
  The source code is available free of charge to anyone, and anyone is allowed to make any modification they wish (as long as they also release the source code). There are many reasons that we make our software open source. First, it lets users customize the code for their own use. If any of our software doesn't do exactly what you want, feel free to change it yourself - we don't mind. In fact, we would encourage you to do so because this will make our software better. We try very hard to make our software the best that it can possibly be. However, we know that there are many smart people out there, and the more input we have on our code, the better it is going to be. So, if you decide to have a go at our code, let us know. We will incorporate your ideas into our software and distribute the improvements to everyone who is using it.

* Filtering Algorithm: Commercial internet filters sell you a subscription to a list of bad sites. They attempt to keep this list up to date, although they don't actually let you see the list. With the massive increase in websites over the last several years, this model for web filtering has become obsolete. With over 2,000,000,000 sites on the web, it has become impossible to categorize all these sites and keep the list sufficiently up to date to accomplish effective filtering. Willow uses a Bayesian algorithm to classify pages on the fly, based upon previous pages that have already been classified. Willow comes with a set of pages that were classified for Woodland Hills School District in Pittsburgh, Pennsylvania. Sites that use Willow can start with these pages and add their own sets of good and bad pages, or they can start from scratch with pages that they classify themselves. Willow puts the control of the filtering algorithm with the users, not with a single corporation.

In addition to being the first web filter to really work, Willow was also designed to make life easy on network administrators.
To this end Willow supports the following:

* HTTPS tunneling
* response caching
* filtering based on any part of the request or response (domain, URL, headers, etc.)
* through-the-web management
* authentication to a Windows NT/2000 domain
* authentication through Unix password files

WARNING: The Bayesian filtering algorithm allows the filter to determine whether an item is "okay" or "bad" based upon previous "okay" and "bad" content that it has seen. To support the filter working "out of the box", the download files include both "okay" and "bad" content. The "bad" content is pornography. While the software will not show you the "bad" content, it is possible to browse through the directory it is in (since it isn't encrypted in any way). If you are offended by this, or it is illegal for you to view this, please do not download the files.
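
As a rough illustration of the Bayesian classification described under "Filtering Algorithm" above, the sketch below trains a small naive Bayes classifier on pages already labelled "okay" or "bad" and then classifies a new page. It is not Willow's actual code: the PageClassifier class, the tokenize helper, the add-one smoothing, and the sample training text are all assumptions made for the example, and it assumes Python 3.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words, digits and apostrophes only.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


class PageClassifier:
    """Illustrative naive Bayes classifier (not Willow's implementation)."""

    LABELS = ("okay", "bad")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.page_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one already-classified page under the given label."""
        if label not in self.LABELS:
            raise ValueError("unknown label: %r" % (label,))
        self.word_counts[label].update(tokenize(text))
        self.page_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        total_pages = sum(self.page_counts.values())
        prior = (self.page_counts[label] + 1) / (total_pages + len(self.LABELS))
        score = math.log(prior)
        total_words = sum(self.word_counts[label].values())
        for token in tokens:
            count = self.word_counts[label][token]  # 0 for unseen words
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        """Return the label with the higher score; ties go to "okay"."""
        tokens = tokenize(text)
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        vocab_size = max(len(vocab), 1)
        scores = {
            label: self._log_score(tokens, label, vocab_size)
            for label in self.LABELS
        }
        return max(self.LABELS, key=lambda label: scores[label])


# Made-up training pages, standing in for a site's own classified sets.
classifier = PageClassifier()
classifier.train("school homework math science library", "okay")
classifier.train("casino poker gambling jackpot bets", "bad")
print(classifier.classify("poker jackpot tonight"))
```

Working in log space avoids numeric underflow on long pages, and the add-one smoothing keeps a word that was never seen in training from forcing a probability of zero. A site adding its own good and bad pages would correspond to further calls to train() here.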