[Vrspace-dev] caching proxy AKA VRSpace wget
From: Josip A. <jo...@vr...> - 2004-09-06 08:40:50
Rob Meyers wrote:
> Just tried to get the new gates from cvs, doesn't seem to be any changes
> since I last checked things in, please commit.

Well, couldn't pay my bills so the ISP turned me off :) Now I need to comment out portions of code to compile... coming. Anyway, I'll describe the functionality; hope to get some input.

To enable it, just set the HTTP proxy in your web browser to localhost port 8081 and that's it. Then what happens: the proxy fetches each and every requested document and stores it at the (hardcoded) path <vrspace install dir>/cache/<protocol>/<server>/<path>, e.g. cache/http/www.vrspace.org/index.html, for later reference. On subsequent requests it does not attempt to retrieve the content from the original location for some time, 24h by default.

This is _not_ HTTP compliant: the spec says proxies are required to take Expires and If-Modified-Since into account. But maybe 20% of content is useful information, while the rest is ads and other BS (according to eddie's measurement). Ads typically have Expires set to 15 minutes, and browsers reload them again and again until we close the window, meaning we waste bandwidth even while we're out to lunch... This bandwidth-saving strategy is borrowed from the squid setup used in my LinProxy distro ;) But it can lead to trouble. The usual way out is to hit the reload button, though I'm not sure we can use that approach for our purposes: I don't want to reload all the VRML, I want only, say, one specific avatar. Sure, one way is to set the timeout low, say 15 minutes, but then there's little use for the cache. Well, open issues.

The proxy also stores all the HTTP metadata. It should store it to the DB; so far it only keeps it in memory. Most important is Content-Type, but the proxy needs them all: Last-Modified, Expires, etc.

And there's more to it. During download, HTML and XML (WML, OpenOffice, ...) documents are parsed; sure, we should parse VRML too. You wanna search, you gotta see what's inside.
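The cache layout and the 24h freshness rule described above can be sketched roughly like this (class and method names are mine for illustration, not the actual VRSpace code; the "index.html" default for directory URLs is my assumption):

```java
import java.io.File;
import java.net.URL;

/**
 * Sketch of the described cache: a URL maps to
 * <install dir>/cache/<protocol>/<server>/<path>, and a cached copy
 * counts as fresh for a fixed window (24h by default), deliberately
 * ignoring the Expires header. Illustrative only.
 */
public class CacheSketch {
    static final long MAX_AGE_MS = 24L * 60 * 60 * 1000; // default 24h

    /** Map a URL to its on-disk cache location. */
    public static String cachePath(String installDir, URL url) {
        String path = url.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html"; // hypothetical default document name
        }
        return installDir + File.separator + "cache"
             + File.separator + url.getProtocol()
             + File.separator + url.getHost()
             + path.replace('/', File.separatorChar);
    }

    /** Fresh = cached copy younger than MAX_AGE_MS; no Expires check. */
    public static boolean isFresh(File cached, long now) {
        return cached.exists() && now - cached.lastModified() < MAX_AGE_MS;
    }

    public static void main(String[] args) throws Exception {
        URL u = new URL("http://www.vrspace.org/index.html");
        System.out.println(cachePath("/opt/vrspace", u));
        // on Unix: /opt/vrspace/cache/http/www.vrspace.org/index.html
    }
}
```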
So far I look only for external references, with the intent to check broken links, maybe download textures and pics along with the document, etc. Actually, I hoped to rewrite all the URLs to refer to locally stored copies rather than the originals. This would allow me to open docs directly from the filesystem. But this leads to trouble: the browser then requests the local textures and pics instead of the original ones. So I dropped the idea, meaning there's dead code inside. But this code aims at wget functionality... dunno, let's leave it there for a while, maybe we'll need it.

OK, parsing continues. This part will wait for the NeuroGrid implementation: the document title will be stored to the NG DB as <url> IS_RELATED_TO <keyword>, the document body will (optionally) go in as <url> CONTAINS <keyword>, etc. So when you search the net, you first search all the stuff you've ever seen, then ask Google. Well, whom you ask depends on where you usually find what you're looking for, but that's another long story.

And some implementation notes. A new package, org.vrspace.vfs - virtual file system. I don't like it much, but org.vrspace.server got too crowded :) This parsing stuff requires the com.arthurdo.parser package, so you won't be able to compile it unless you copy com.arthurdo from src/chisel to src/main. Don't like that either... Furthermore, NG uses yet another HTML parser, by Somik Raha. A mess :))) In short - how to parse html? Regards...
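For what it's worth, one answer to "how to parse html?" that needs no third-party package at all is the callback parser shipped with the JDK. This is not what com.arthurdo.parser or Somik Raha's parser do, just a self-contained sketch of collecting external references (href/src attributes) the way the proxy's link extraction is described above:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/**
 * Collects external references (href/src) from an HTML document using
 * the JDK's built-in HTMLEditorKit callback parser. Illustrative sketch.
 */
public class LinkExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> refs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        Object href = a.getAttribute(HTML.Attribute.HREF);
        Object src = a.getAttribute(HTML.Attribute.SRC);
        if (href != null) refs.add(href.toString());
        if (src != null) refs.add(src.toString());
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        handleStartTag(t, a, pos); // empty tags like <img> arrive here
    }

    public static List<String> extract(String html) throws Exception {
        LinkExtractor cb = new LinkExtractor();
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cb.refs;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"world.wrl\">world</a>"
                    + "<img src=\"texture.jpg\"></body></html>";
        System.out.println(extract(html));
    }
}
```

It's forgiving of broken markup, which matters for real-world pages, though it's no help for VRML; that would still need its own parser.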