Willow Filtering Proxy Code
Brought to you by: davidredwaratah
WILLOW is a caching, content-filtering proxy/HTTP server written in Python.
IMPORTANT NOTE:
Willow is designed to interact most compatibly with current HTTP/1.1 web servers.
Anything designed since 2006 should fit the idea of "current".
As of April 2010, certain servers still use HTTP/1.0, which poses problems for
Willow - namely Apple's software update server (http://swscan.apple.com) and the
Edna MP3/OGG server, and likely others. Certain data exchanges may not fully work
with these legacy servers, and their operators should be challenged to modernize their
systems for increased efficiency.
For reference, see http://www2.research.att.com/~bala/papers/h0vh1.html
This backwards compatibility issue seems deeply ingrained in Willow's original design,
and no efforts are currently planned to fully support HTTP/1.0 servers.
As a workaround, configure known problem HTTP/1.0 sites to bypass the proxy. This can
often be done in the client (e.g. the Firefox and Mac OS X proxy settings).
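For clients that are scripts rather than browsers, the same bypass can be
sketched in Python. This is only an illustration and is not part of Willow:
the names BYPASS_HOSTS, should_bypass_proxy and opener_for, the
"*.legacy.example" pattern and the proxy address 127.0.0.1:8080 are all
assumptions. Only swscan.apple.com comes from the note above.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
import urllib.request

# Hosts assumed to speak HTTP/1.0 only (illustrative list; extend as needed).
BYPASS_HOSTS = ["swscan.apple.com", "*.legacy.example"]

def should_bypass_proxy(url, patterns=BYPASS_HOSTS):
    """Return True if the URL's host matches one of the bypass patterns."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)

def opener_for(url, proxy="http://127.0.0.1:8080"):
    """Build an opener that goes direct for bypassed hosts, else via the proxy."""
    # An empty mapping tells ProxyHandler to use no proxy at all.
    proxies = {} if should_bypass_proxy(url) else {"http": proxy}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

A caller would then use opener_for(url).open(url) in place of
urllib.request.urlopen(url), so that only the listed hosts skip the proxy.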
Willow Filtering Proxy
"He took some of the seed of your land and put it in fertile soil. He
planted it like a willow by abundant water, and it sprouted and became a
low, spreading vine. Its branches turned toward him, but its roots remained
under it. So it became a vine and produced branches and put out leafy
boughs."
-- THE Bible, Ezekiel 17:5-6
Willow is a content-filtering proxy server. It bears one similarity to the many
other pieces of software available for web filtering in that it is designed to
filter web content. That, however, is where the similarities end. The
differences between Willow and other solutions are significant, and these
differences make Willow the first really usable internet filter.
* Expense:
Willow is available free of charge. This is the complete, full-featured
version of the software. There are no holdbacks or catches. In addition,
any improvements or updates will be free of charge.
* Code:
Willow is open source (under the GNU General Public License). The source code is
available free of charge to anyone and anyone is allowed to make any
modification they wish (as long as they also release the source code).
There are many reasons that we make our software open source. First, it
makes the user able to customize the code for their own use. If any of our
software doesn't do exactly what you want, feel free to change it yourself
- we don't care. In fact, we would encourage you to do so because this
will make our software better. We try very hard to make our software the
best that it can possibly be. However, we know that there are many smart
people out there and the more input that we have on our code, the better
it is going to be. So, if you decide to have a go at our code, let us
know. We will incorporate your ideas into our software and distribute the
improvements to everyone that is using it.
* Filtering Algorithm:
Commercial internet filters sell you a subscription to a list
containing bad sites. They attempt to keep this list up to date, although
they don't actually let you see the list. With the massive increase in
websites over the last several years, this model for web filtering has
become obsolete. With over 2,000,000,000 sites on the web, it has become
impossible to categorize all these sites and keep this list sufficiently
up-to-date to accomplish effective filtering. Willow uses a Bayesian
algorithm to classify pages on the fly based upon previous pages that have
already been classified. Willow comes with a set of pages that were
classified for Woodland Hills School District in Pittsburgh, Pennsylvania.
Sites that use Willow can start with these pages and add their own sets of
good and bad pages, or they can start just from scratch with pages that
they classify. Willow puts the control of the filtering algorithm with the
users, not with a single corporation.
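The idea behind Bayesian classification can be sketched in a few lines of
Python. This is a generic naive Bayes classifier with add-one smoothing,
written for illustration; it is not Willow's actual code, and the class name
PageClassifier, the tokenizer and the "okay"/"bad" labels as dictionary keys
are assumptions.

```python
import math
import re
from collections import Counter

LABELS = ("okay", "bad")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class PageClassifier:
    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.page_counts = {label: 0 for label in LABELS}

    def train(self, text, label):
        """Record the words of one page that a person has already classified."""
        self.word_counts[label].update(tokenize(text))
        self.page_counts[label] += 1

    def score(self, text, label):
        """Log of P(label) times the product of P(word | label)."""
        counts = self.word_counts[label]
        vocabulary = set()
        for other in LABELS:
            vocabulary.update(self.word_counts[other])
        total_words = sum(counts.values())
        total_pages = sum(self.page_counts.values())
        # Add-one smoothing keeps unseen labels and words from scoring zero.
        log_prob = math.log((self.page_counts[label] + 1) / (total_pages + len(LABELS)))
        for word in tokenize(text):
            # The extra +1 reserves probability mass for never-seen words.
            log_prob += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
        return log_prob

    def classify(self, text):
        """Return the label with the higher score (ties go to 'okay')."""
        return max(LABELS, key=lambda label: self.score(text, label))
```

Training on more classified pages shifts the word counts, which is how a site
can adapt the filter to its own idea of good and bad content.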
In addition to being the first web filter to really work, Willow was also
designed to make life easy on network administrators. To this end Willow
supports the following:
* HTTPS tunneling
* response caching
* filtering based on any part of the request or response (domain, url, headers, etc.)
* through-the-web management
* authentication to a Windows NT/2000 domain
* authentication through Unix password files
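To give a feel for what filtering on any part of a request could look like,
here is a minimal Python sketch. It does not reflect Willow's real rule format
or API: the request dictionary layout, the (part, pattern, action) rule tuples
and the function names part_value and decide are all hypothetical.

```python
import re

def part_value(request, part):
    """Look up one part of a request dict: 'domain', 'url' or 'header:<Name>'."""
    if part.startswith("header:"):
        # Header names are matched exactly as given in this sketch.
        return request.get("headers", {}).get(part[len("header:"):], "")
    return request.get(part, "")

def decide(request, rules, default="allow"):
    """Return the action of the first rule whose pattern matches, else default."""
    for part, pattern, action in rules:
        if re.search(pattern, part_value(request, part), re.IGNORECASE):
            return action
    return default

# Example rules (made up): block one domain and one user agent.
RULES = [
    ("domain", r"(^|\.)ads\.example$", "block"),
    ("header:User-Agent", r"badbot", "block"),
]
```

With these rules, decide({"domain": "ads.example", "url": "http://ads.example/x"}, RULES)
gives "block", while a request that matches no rule falls through to "allow".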
WARNING: The Bayesian Filtering algorithm the filter uses allows it to determine
whether or not an item is "okay" or "bad" based upon previous "okay" and "bad"
content that it has seen. To support the filter working "out of the box" there
is "okay" and "bad" content in the download files. The "bad" content is
pornography. While the software will not show you the "bad" content, it is
possible to browse through the directory it is in (since it isn't encrypted in
any way). If you are offended by this, or if it is illegal for you to view this,
please do not download the files.