[Vrspace-dev] caching proxy AKA VRSpace wget
From: Josip A. <jo...@vr...> - 2004-09-06 08:40:50
Rob Meyers wrote:
> Just tried to get the new gates from cvs, doesn't seem to be any changes
> since I last checked things in, please commit.

Well, couldn't pay my bills so the ISP turned me off :) Now I need to comment out portions of code to compile... coming. Anyway, I'll describe the functionality; hope to get some input.

To enable it, just set the HTTP proxy in your web browser to localhost port 8081 and that's it. Then what happens: the proxy fetches each and every requested document and stores it at the (hardcoded) path <vrspace install dir>/cache/<protocol>/<server>/<path>, e.g. cache/http/www.vrspace.org/index.html, for later reference. On subsequent requests it does not attempt to retrieve the content from the original location for some time, 24h by default.

This is _not_ HTTP compliant: the spec says proxies are required to take Expires and If-Modified-Since into account. But maybe 20% of content is useful information, while the rest is ads and other BS (according to eddie's measurement). Ads typically have Expires set to 15 minutes, and browsers reload them again and again until we close the window, meaning we waste bandwidth even while we're out to lunch... This bandwidth-saving strategy is borrowed from the squid setup used in my LinProxy distro ;) But it can lead to trouble. The usual way out is to hit the reload button, though I'm not sure we can use that approach for our purposes: I don't want to reload all the VRML, I want only, say, one specific avatar. Sure, one way is to set the timeout low, say 15 minutes, but then there's little use for the cache. Well, open issues.

The proxy also stores all the HTTP metadata. It should store it to the DB; so far it only keeps it in memory. Most important is Content-Type, but the proxy needs them all: Last-Modified, Expires, etc.

And there's more to it. During download, HTML and XML (WML, OpenOffice, ...) documents are parsed; sure, we should parse VRML too. You wanna search, you gotta see what's inside.
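The cache layout and the 24h freshness rule described above can be sketched roughly like this (class and method names are mine for illustration, not the actual VRSpace code; the "index.html" default for directory URLs is my assumption):

```java
import java.io.File;
import java.net.URL;

/**
 * Sketch of the described cache: a URL maps to
 * <install dir>/cache/<protocol>/<server>/<path>, and a cached copy
 * counts as fresh for a fixed window (24h by default), deliberately
 * ignoring the Expires header. Illustrative only.
 */
public class CacheSketch {
    static final long MAX_AGE_MS = 24L * 60 * 60 * 1000; // default 24h

    /** Map a URL to its on-disk cache location. */
    public static String cachePath(String installDir, URL url) {
        String path = url.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html"; // hypothetical default document name
        }
        return installDir + File.separator + "cache"
             + File.separator + url.getProtocol()
             + File.separator + url.getHost()
             + path.replace('/', File.separatorChar);
    }

    /** Fresh = cached copy younger than MAX_AGE_MS; no Expires check. */
    public static boolean isFresh(File cached, long now) {
        return cached.exists() && now - cached.lastModified() < MAX_AGE_MS;
    }

    public static void main(String[] args) throws Exception {
        URL u = new URL("http://www.vrspace.org/index.html");
        System.out.println(cachePath("/opt/vrspace", u));
        // on Unix: /opt/vrspace/cache/http/www.vrspace.org/index.html
    }
}
```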
So far I look only for external references, with the intent to check broken links, maybe download textures and pics along with the document, etc. Actually, I hoped to rewrite all the URLs to refer to locally stored copies rather than the originals. This would allow me to open docs directly from the filesystem. But this leads to trouble: the browser then requests the local textures and pics instead of the original ones. So I dropped the idea, meaning there's dead code inside. But this code aims at wget functionality... dunno, let's leave it there for a while, maybe we'll need it.

OK, parsing continues. This part will wait for the NeuroGrid implementation: the document title will be stored to the NG DB as <url> IS_RELATED_TO <keyword>, the document body will (optionally) go in as <url> CONTAINS <keyword>, etc. So when you search the net, you first search all the stuff you've ever seen, then ask Google. Well, whom you ask depends on where you usually find what you're looking for, but that's another long story.

And some implementation notes. A new package, org.vrspace.vfs - virtual file system. I don't like it much, but org.vrspace.server got too crowded :) This parsing stuff requires the com.arthurdo.parser package, so you won't be able to compile it unless you copy com.arthurdo from src/chisel to src/main. Don't like that either... Furthermore, NG uses yet another HTML parser, by Somik Raha. A mess :))) In short - how to parse html? Regards...
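For what it's worth, one answer to "how to parse html?" that needs no third-party package at all is the callback parser shipped with the JDK. This is not what com.arthurdo.parser or Somik Raha's parser do, just a self-contained sketch of collecting external references (href/src attributes) the way the proxy's link extraction is described above:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/**
 * Collects external references (href/src) from an HTML document using
 * the JDK's built-in HTMLEditorKit callback parser. Illustrative sketch.
 */
public class LinkExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> refs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        Object href = a.getAttribute(HTML.Attribute.HREF);
        Object src = a.getAttribute(HTML.Attribute.SRC);
        if (href != null) refs.add(href.toString());
        if (src != null) refs.add(src.toString());
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        handleStartTag(t, a, pos); // empty tags like <img> arrive here
    }

    public static List<String> extract(String html) throws Exception {
        LinkExtractor cb = new LinkExtractor();
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cb.refs;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"world.wrl\">world</a>"
                    + "<img src=\"texture.jpg\"></body></html>";
        System.out.println(extract(html));
    }
}
```

It's forgiving of broken markup, which matters for real-world pages, though it's no help for VRML; that would still need its own parser.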