Thread: [Dproxy-devel] dproxy 1.x
From: jeroen <pe...@ch...> - 2000-02-09 18:58:49
Andi,

I just downloaded the 1.x source tree, looks nice (although I did see a few gotos in dproxy.c... not a real problem, but dangerous). I haven't looked at it really closely yet, but I saw a few things about IPv6; is that working already?

I have been thinking about the problems you mentioned with refreshing, but I think they are solved if a config option 'max_entries' is added. This would cause the oldest entry to be deleted when the cache gets too big; if an entry is frequently used, it will reappear at the end of the cache the next time it gets looked up. (A sketch of this idea is below.)

Everybody:

About the memory cache and the other things I saw in the last few mails, I agree with Matty that we should be careful not to bloat dproxy too much. I started using dproxy because it was nice and small and didn't need much config (I know... I started the config thing). I think we should leave all the little extras to our 'big brothers' like bind or dent, and concentrate on what dproxy was originally made for: small LANs.

I don't think we should change the deny file to a dbase format; it would also bloat dproxy. Maybe it is an idea to add a keyword search to it? E.g. '*xxx' in the deny file would cause all domains containing 'xxx' to get rejected. (Just my opinion.)

Grtz,
Jeroen
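A minimal sketch of the 'max_entries' idea, assuming a one-entry-per-line cache file where fresh lookups are re-appended at the end, so the oldest entries sit at the top; trim_cache and MAX_ENTRIES are illustrative names, not actual dproxy code:

    #include <stdio.h>

    #define MAX_ENTRIES 500   /* would come from dproxy's config file */

    /* Keep only the newest MAX_ENTRIES lines of the cache file. */
    void trim_cache(const char *path)
    {
        char line[1024], tmp[1024];
        long count = 0, skip;
        FILE *in = fopen(path, "r");
        FILE *out;

        if (!in)
            return;
        while (fgets(line, sizeof(line), in))   /* first pass: count entries */
            count++;
        if (count > MAX_ENTRIES) {
            snprintf(tmp, sizeof(tmp), "%s.tmp", path);
            out = fopen(tmp, "w");
            if (out) {
                skip = count - MAX_ENTRIES;
                rewind(in);
                while (fgets(line, sizeof(line), in))
                    if (skip-- <= 0)            /* drop the oldest lines */
                        fputs(line, out);
                fclose(out);
                rename(tmp, path);
            }
        }
        fclose(in);
    }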
From: Andreas H. <hof...@in...> - 2000-02-09 20:11:16
Hi,

jeroen wrote:
> I just downloaded the 1.x source tree, looks nice (although I did see a
> few gotos in dproxy.c... not a real problem, but dangerous). I haven't
> looked at it really closely yet, but I saw a few things about IPv6; is
> that working already?

Nope, I need to read the RFCs first. Those points where you saw something about IPv6 are some sort of 'fixme' markers for the future.

> I have been thinking about the problems you mentioned with refreshing,
> but I think they are solved if a config option 'max_entries' is added.
> This would cause the oldest entry to be deleted when the cache gets too
> big; if an entry is frequently used, it will reappear at the end of the
> cache the next time it gets looked up.

FIFO is not the best cache strategy one can think of, but it would work.

There is another problem with that 'refresh' thing: consider that you contact a large number of sites one day, and your 'CACHE_PURGE_TIME' is set to n days. On the nth day, you do not connect to the net. If you then connect on day n+1, all entries are refreshed in one run. The situation gets even worse if you do not connect to the net for a longer period (say, holidays). Once all the hosts have been looked up in that single run, they will have the same creation time, so they will always be refreshed together...

IMO those problems cannot be solved with an on-disk cache as long as it has no fixed record structure, but a fixed record structure would mean the cache wastes much space on disk and is no longer human readable. We could solve the refresh problems with a memory cache...

> About the memory cache and the other things I saw in the last few mails,
> I agree with Matty that we should be careful not to bloat dproxy too
> much. ... I think we should leave all the little extras to our 'big
> brothers' like bind or dent, and concentrate on what dproxy was
> originally made for: small LANs.

Sure, but some features are useful even for smaller networks; e.g. there is a small network monitor, 'tkined', that could use HINFO records, and if you want a central mail server for your net, MX records are very useful. Also, 'small' depends on your point of view: if you have 3-4 hosts, you may think 100 hosts make a big network; if you are the admin of a net with some 1000 hosts, 100 hosts are a small network. IMO 100 hosts is a number that fits most of the applications for dproxy: home networks, a university department, a workstation pool and small companies.

> I don't think we should change the deny file to a dbase format; it would
> also bloat dproxy.

Are you talking about the 'hosts' file, the cache, or dproxy.deny?

> Maybe it is an idea to add a keyword search to it? E.g. '*xxx' in the
> deny file would cause all domains containing 'xxx' to get rejected.

That's already in: dproxy tries to match the strings from 'dproxy.deny' against the end of a hostname, so a query for 'a.b.c' will be blocked by a 'dproxy.deny' line like 'b.c'. Btw., this would also block 'x.yb.c', so all domain names in the deny file should have a dot as their first char. (A small sketch of this matching is below.)

Ciao
Andreas
--
Andreas Hofmeister (see http://www.informatik.uni-freiburg.de/~hofmeist)
"I can do it in Quake, why not in real life ?" (Lan Solaris (Illiad))
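A small sketch of the suffix matching Andreas describes; deny_matches is an illustrative name, not the actual function in dproxy.c:

    #include <string.h>

    /* Return 1 if 'name' ends with the deny-file entry 'pattern'.
     * With the pattern ".b.c", "a.b.c" matches but "x.yb.c" does not,
     * hence the advice that deny entries start with a dot. */
    int deny_matches(const char *name, const char *pattern)
    {
        size_t nlen = strlen(name);
        size_t plen = strlen(pattern);

        if (plen > nlen)
            return 0;
        return strcmp(name + nlen - plen, pattern) == 0;
    }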
From: jeroen <pe...@ch...> - 2000-02-09 21:19:00
Andreas Hofmeister wrote:
> > Maybe it is an idea to add a keyword search to it? E.g. *xxx in the
> > deny file would cause all domains containing 'xxx' to get rejected.
>
> That's already in: dproxy tries to match the strings from 'dproxy.deny'
> against the end of a hostname, so a query for 'a.b.c' will be blocked by
> a 'dproxy.deny' line like 'b.c'. Btw., this would also block 'x.yb.c',
> so all domain names in the deny file should have a dot as their first char.

I know that's in, I wrote that part. But if you have 'hundreds' of sites you want to block, keyword matching might be a little more effective at reducing the size of this file. (I think a firewall rule might be a better place for this kind of thing, though.)

Grtz,
Jeroen
From: Andreas H. <hof...@in...> - 2000-02-10 02:32:11
jeroen wrote:
> Andreas Hofmeister wrote:
> > > Maybe it is an idea to add a keyword search to it? E.g. *xxx in the
> > > deny file would cause all domains containing 'xxx' to get rejected.
> <snip>
> ... if you have 'hundreds' of sites you want to block, keyword matching
> might be a little more effective at reducing the size of this file.

You are thinking of some sort of regex. Mhhh, the only problem with this is that one must be extremely careful with them; you might remember that AOL drama: they tried to block newsgroups like alt.sex etc., but they also blocked some groups about breast cancer, AIDS and such...

There are some functions in the libc for regex comparison we could use. The only thing about regexes is that they are expensive, either in terms of memory or speed. To be efficient, a regular expression must be precompiled. We cannot re-read and recompile them for every query we get, and of course not for every one of some hundred sites to block.

Maybe we could implement that regex stuff like this: every line that starts with a special marker, say '@' (not part of any domain name), is regarded as a regex. On startup, dproxy reads the block file once, checks for regexes and compiles them. In normal operation, dproxy simply ignores every line starting with a '@'. (A sketch using the POSIX regex functions is below.)

> (I think a firewall rule might be a better place for this kind of
> thing.)

Both are different aspects of blocking: a FW rule can block the data transfer to a site, while a blocking rule in the name server can hide the existence of a site.

BTW, your refresh things are not yet in dproxy-1.x. Please wait a little before you start to implement this. I am doing some experiments with 'late forks', which make a mem cache possible. This will allow us to do a refresh implementation without the problems I mentioned.

Also note that the semantics of the time field in the cache file have changed! We now get and use the real TTL from the upstream DNS. 'cache_purge_time' has a very different meaning now (it is simply the default TTL sent to the clients; the param should be renamed). There are two additional parameters that would make sense with this, 'cache_min_ttl' and 'cache_max_ttl', but neither is implemented yet. The first is for sites that send unbelievably short TTLs (e.g. netscape.com says 1h), the second is for a little bit more security.

Ciao
Andreas
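A sketch of the '@'-marker scheme using the POSIX regex functions from libc (regcomp/regexec); the file handling and names are illustrative, not code from the 1.x tree:

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_REGEX 64

    static regex_t deny_regex[MAX_REGEX];   /* compiled once at startup */
    static int     n_regex = 0;

    /* On startup: compile every '@'-prefixed line of dproxy.deny. */
    void load_deny_regex(const char *path)
    {
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f)
            return;
        while (fgets(line, sizeof(line), f) && n_regex < MAX_REGEX) {
            if (line[0] != '@')     /* plain suffix rule, handled elsewhere */
                continue;
            line[strcspn(line, "\r\n")] = '\0';
            if (regcomp(&deny_regex[n_regex], line + 1,
                        REG_EXTENDED | REG_NOSUB | REG_ICASE) == 0)
                n_regex++;
        }
        fclose(f);
    }

    /* Per query: cheap, because the expressions are already compiled. */
    int deny_regex_match(const char *name)
    {
        int i;
        for (i = 0; i < n_regex; i++)
            if (regexec(&deny_regex[i], name, 0, NULL, 0) == 0)
                return 1;
        return 0;
    }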
From: <qu...@ba...> - 2000-02-10 00:25:04
On Wed, 9 Feb 2000, Andreas Hofmeister wrote:
<SNIP>
> FIFO is not the best cache strategy one can think of, but it would work.

I suggest a reference count in the dproxy file, i.e.:

    name ip timeout refcount anyotherstuff

(Setting the file up this way means we can work with older dproxy cache files; support backward compatibility, I say. :)

Each time a site is accessed, we increment its reference count. When we come to purge, we purge entries using the TTL as the primary and the refcount as the secondary check. If we just used the refcount, a site that was only just entered (so refcount == 1) would be purged straight away, which is not what we want. (A sketch of this ordering is below.)

> There is another problem with that 'refresh' thing: consider that you
> contact a large number of sites one day, and your 'CACHE_PURGE_TIME' is
> set to n days. On the nth day, you do not connect to the net. If you
> then connect on day n+1, all entries are refreshed in one run. The
> situation gets even worse if you do not connect to the net for a longer
> period (say, holidays).
> Once all the hosts have been looked up in that single run, they will have
> the same creation time, so they will always be refreshed together...

How about basing the cache purge time on how long the person was online? I.e., keep an indicator in the last/first line of proxy.cache. This indicator could be used for the time, so TTL entries would be purged as they expire based on the online time. This would, however, mean that the TTL entry isn't used how it was initially intended.

> IMO those problems cannot be solved with an on-disk cache as long as it
> has no fixed record structure, but a fixed record structure would mean
> the cache wastes much space on disk and is no longer human readable. We
> could solve the refresh problems with a memory cache...

I think it is possible on disk and in a human-readable format. I do think that dproxy should have a limited memory cache for the most-used entries. A purge could write the memory cache to disk, purge it, then re-read the new most-used entries. If a server relies heavily on the DNS for local machine names, why should we have to go to disk each time?

Cheers,
    Benjamin
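A sketch of the purge ordering described above (TTL as the primary key, refcount as the secondary), with an illustrative record for a "name ip timeout refcount" line:

    #include <stdlib.h>

    struct cache_rec {
        char name[128];
        char ip[16];
        long expires;    /* the 'timeout' field: absolute expiry time */
        int  refcount;   /* incremented on every lookup */
    };

    /* Sort so the best purge candidates come first: oldest expiry, and
     * among equally old entries, the least-referenced ones. */
    int purge_cmp(const void *a, const void *b)
    {
        const struct cache_rec *x = a, *y = b;

        if (x->expires != y->expires)
            return (x->expires < y->expires) ? -1 : 1;
        return x->refcount - y->refcount;
    }

    /* Usage: qsort(recs, n, sizeof(struct cache_rec), purge_cmp);
     * then drop records from the front until the cache fits again. */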
From: Andreas H. <hof...@in...> - 2000-02-10 02:31:58
qu...@ba... wrote:
> On Wed, 9 Feb 2000, Andreas Hofmeister wrote:
> <SNIP>
> > FIFO is not the best cache strategy one can think of, but it would work.
>
> I suggest a reference count in the dproxy file ...

OK, I would suggest the following memory cache implementation, with or without refresh:

1. The mem cache is a linked list of entries.
2. Every time we search and find an entry in the mem cache, we move it to the front of the list.
2a. (refresh) If an entry was referenced, we increase a reference count.
3. If we cannot find the entry in the mem cache, we search for it in the disk cache and the dhcp file (in that order); if we find it there, we add it to the mem cache as the first element. If there are too many entries in the mem cache, we remove the last element from the list.
4. We do not check for the entry on the DNS server! If we get an entry from the DNS, that entry is only added to the disk cache. Only if it is referenced again will it get into memory.
5a. (no refresh) If an entry in the cache expires, it is removed.
5b. (refresh) On each purge run on the mem cache, we decrease the reference count. If an entry is expired and the reference count is greater than 0, we put that entry into a 'refresh' list. Names in the refresh list are refreshed in a separate process and written into the disk cache.
6. We do not refresh entries in the disk cache.
7. After startup, the memory cache is empty.

Hardware people sometimes call this a 'write-through 1st-level cache'. IMO this implementation will work well if some conditions are met:

- The cache is 'large enough': all heavily used entries can be kept in memory.
- We don't have linear access patterns that are longer than the cache size; this would lead to 'cache thrashing'.
- Refresh with this implementation will work as long as the system (and dproxy) is started and connected at least once in every 'cache_min_time' interval.
- The TTL of a mem cache entry and a disk cache entry are always the same.

(A sketch of the move-to-front list handling is below.)

Detail alternatives:

- Instead of a reference count, one could use the LRU time.
- Instead of starting with an empty cache, one could fill up the cache with the last 'mem_cache_size' entries from the disk cache.

Ciao
Andreas
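A sketch of the move-to-front handling in steps 2 and 3; the entry layout and names are illustrative, not dproxy code:

    #include <stdlib.h>
    #include <string.h>

    struct mentry {
        char name[128];
        char ip[16];
        struct mentry *prev, *next;
    };

    static struct mentry *head, *tail;
    static int n_entries, max_entries = 100;  /* 'mem_cache_size' */

    static void unlink_entry(struct mentry *e)
    {
        if (e->prev) e->prev->next = e->next; else head = e->next;
        if (e->next) e->next->prev = e->prev; else tail = e->prev;
    }

    static void push_front(struct mentry *e)
    {
        e->prev = NULL;
        e->next = head;
        if (head) head->prev = e; else tail = e;
        head = e;
    }

    /* Step 2: a hit moves the entry to the front of the list. */
    struct mentry *mem_cache_find(const char *name)
    {
        struct mentry *e;
        for (e = head; e; e = e->next)
            if (strcmp(e->name, name) == 0) {
                unlink_entry(e);
                push_front(e);
                return e;
            }
        return NULL;
    }

    /* Step 3: a disk/dhcp hit is added at the front; if the cache is
     * full, the last (least recently used) element is dropped. */
    void mem_cache_add(const char *name, const char *ip)
    {
        struct mentry *e = calloc(1, sizeof(*e));
        if (!e)
            return;
        strncpy(e->name, name, sizeof(e->name) - 1);
        strncpy(e->ip, ip, sizeof(e->ip) - 1);
        push_front(e);
        if (++n_entries > max_entries) {
            struct mentry *last = tail;
            unlink_entry(last);
            free(last);
            n_entries--;
        }
    }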
From: Benjamin C. <clo...@lu...> - 2000-02-10 03:15:57
On Thu, 10 Feb 2000, Andreas Hofmeister wrote:
> OK, I would suggest the following memory cache implementation, with or
> without refresh

I like the idea!

> 2. Every time we search and find an entry in the mem cache, we move it
> to the front of the list.
> 2a. (refresh) If an entry was referenced, we increase a reference count.

We are still going to have the problem of sharing the list between processes. I suggest that the parent searches the ll, then the disk, and returns the appropriate entry if found. Otherwise we fork off a child to do the DNS lookup, since it may time out. If we already have the ip/name, the delay shouldn't be that noticeable, so I think this would work well. (A sketch of this dispatch is below.)

> 3. If we cannot find the entry in the mem cache, we search for it in the
> disk cache and the dhcp file (in that order); if we find it there, we
> add it to the mem cache as the first element. If there are too many
> entries in the mem cache, we remove the last element from the list.

If we find the entry in the disk cache and move it to memory, we should probably also remove it from the disk cache or mark it invalid there. We may end up with data inconsistency otherwise.

> 6. We do not refresh entries in the disk cache.

What happens to expired entries in the disk cache? On the next reference we move them to the ll and increment the refcount, but these may be way out of date. I think if we get an entry from the disk cache and the TTL has expired, we should refresh it and then add it to the ll.

> Hardware people sometimes call this a 'write-through 1st-level cache'.
> IMO this implementation will work well if some conditions are met:
>
> - The cache is 'large enough': all heavily used entries can be kept in
> memory.

I think this is already somewhat done. The reference count will force new entries into the ll, so the ll will grow with the demand and shrink as the demand drops (refcount == 0).

> - The TTL of a mem cache entry and a disk cache entry are always the
> same.

Removing the entry from the disk cache solves this problem. We may just want to mark it invalid (TTL = -1?) and have the purge routine rewrite the file when we go off the net.

> Detail alternatives:
>
> - Instead of a reference count, one could use the LRU time.
> - Instead of starting with an empty cache, one could fill up the cache
> with the last 'mem_cache_size' entries from the disk cache.

Reloading the whole mem cache may be a little pointless. I think we should just reload the entries with a reference count > x, where x is a defined value.

Oh, who wants to write what?

Cheers,
    Benjamin

PS: sorry, emails to the address quakevr@bandi... will bounce - firewalled
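A sketch of the dispatch Benjamin describes, where the parent answers from the caches and only forks for the (possibly blocking) upstream DNS query; the query/centry types and all helpers here are assumed interfaces, not dproxy's real functions:

    #include <unistd.h>

    struct query;    /* the client's request, left opaque here */
    struct centry;   /* a cache entry, left opaque here */

    /* assumed helpers: */
    struct centry *mem_cache_find(const char *name);
    struct centry *disk_cache_find(const char *name);
    struct centry *dns_forward(struct query *q);
    void disk_cache_add(struct centry *e);
    void send_reply(struct query *q, struct centry *e);
    const char *query_name(struct query *q);

    void handle_query(struct query *q)
    {
        struct centry *e;
        const char *name = query_name(q);

        if ((e = mem_cache_find(name)) || (e = disk_cache_find(name))) {
            send_reply(q, e);      /* fast path: the parent answers directly */
            return;
        }
        if (fork() == 0) {         /* slow path: a child talks to the DNS */
            e = dns_forward(q);    /* may block until the upstream answers */
            if (e) {
                disk_cache_add(e); /* the answer goes to the disk cache */
                send_reply(q, e);
            }
            _exit(0);
        }                          /* the parent returns to the main loop */
    }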
From: Andreas H. <hof...@in...> - 2000-02-10 17:58:56
Benjamin Close wrote:
> Otherwise we fork off a child to do the DNS lookup, since it may time
> out. If we already have the ip/name, the delay shouldn't be that
> noticeable, so I think this would work well.

I just moved the fork out of the main loop up to 'handle_query' to see what happens. I need to change a few more little things to move the fork into the handler function. I see no problem here.

> If we find the entry in the disk cache and move it to memory, we should
> probably also remove it from the disk cache or mark it invalid there. We
> may end up with data inconsistency otherwise.

I think it is not necessary to mark entries on disk, because we already store the expiration time on disk. We should ignore expired entries in the disk cache instead.

> > 6. We do not refresh entries in the disk cache.
>
> What happens to expired entries in the disk cache? ...

If we ignore expired entries, we can rely on the usual mechanisms here:

- If an entry is expired in mem, it is also expired on disk.
- If we need to look up an entry that is neither in memory nor on disk, we will get it from the DNS and write it back into the disk cache.

Remember we still have the disk cache purge thing, so the only time interval in which an inconsistency problem could occur is between the expiration time of an entry and the next disk cache purge.

> > Hardware people sometimes call this a 'write-through 1st-level cache'.
> > IMO this implementation will work well if some conditions are met:
> >
> > - The cache is 'large enough': all heavily used entries can be kept in
> > memory.
>
> I think this is already somewhat done. The reference count will force new
> entries into the ll, so the ll will grow with the demand and shrink as
> the demand drops (refcount == 0).

Good idea, but we would need an LRU time instead of a refcount. Or we use a combination of refcount and LRU time like this:

- Every time an entry is referenced, we increment the refcount.
- Every time the LRU time is < now - cache_expiration_time, we decrement the refcount and set the LRU time to now.
- If the refcount drops to 0, we remove the entry.

(A sketch of this aging rule follows after this message.)

> > - The TTL of a mem cache entry and a disk cache entry are always the
> > same.
>
> Removing the entry from the disk cache solves this problem. We may just
> want to mark it invalid (TTL = -1?) and have the purge routine rewrite
> the file when we go off the net.

No problem, because the expiration time is stored on disk.

> > - Instead of starting with an empty cache, one could fill up the cache
> > with the last 'mem_cache_size' entries from the disk cache.
>
> Reloading the whole mem cache may be a little pointless. I think we
> should just reload the entries with a reference count > x, where x is a
> defined value.

It is maybe useful for that refresh stuff; we can leave this decision open and start with an empty mem cache for now.

> Oh, who wants to write what?

Roadmap suggestion:

--- Job #1 ---

I just added a new data structure for cache entries, including some methods on them (read, write, dup, free). I will check that in tonight. Next I want to rewrite the disk cache stuff. The cache->hostentry conversion needs to be rewritten to use both the mem and the disk cache, and should go somewhere else (xhostentry.c?).

The mem cache and the disk cache should have the same/similar API, something like:

    void init();
    cache_data_t * XX_find_name(char * type, char * name);
    cache_data_t * XX_find_next_name(char * type, char * name);
    cache_data_t * XX_find_data(char * type, char * data);
    cache_data_t * XX_find_next_data(char * type, char * data);
    int XX_add(cache_data_t *);
    void purge();

A caller must not expect a returned pointer to be stable for longer than until the next call to one of those functions. The functions need some state information (a fd or a list entry).

--- Job #2 ---

We need a LL implementation. As we may want to re-use it (refresh list ...), it should be in a separate module. I would suggest an API like this:

    typedef struct {...} list_t;
    typedef struct {
        list_t * list;
        ll_entry_t * pNext;
        cache_data_t * data;
    } ll_entry_t;

    /* constructor, an add method and a destructor */
    list_t * ll_new();
    int ll_add(list_t *, cache_data_t *);
    void ll_delete(list_t *);

    /* an iterator */
    ll_entry_t * ll_first(list_t *);
    ll_entry_t * ll_next(ll_entry_t *);

    /* element destructor */
    void ll_remove(ll_entry_t *);

    /* and finally a move method */
    void ll_to_front(ll_entry_t *);

We need to remove entries from the middle of the list, so we should maybe implement this as a double linked list. Some of the functions are trivial and could be implemented as macros (or inlines, but some compilers cannot use inline). The LL implementation does not depend on the cache rewrite stuff, so parallel work is possible.

--- Job #3 ---

When the LL stuff is ready, one can implement the mem cache stuff.

Ciao
Andreas
--
Andreas Hofmeister (see http://www.informatik.uni-freiburg.de/~hofmeist)
"I can do it in Quake, why not in real life ?" (Lan Solaris (Illiad))
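A sketch of the refcount/LRU combination described above, with an illustrative entry struct; cache_expiration_time would come from the config:

    #include <time.h>

    struct centry {
        int    refcount;
        time_t lru;      /* last time the entry was referenced or aged */
        /* ... name, ip, ttl ... */
    };

    /* On every reference: bump the count and the LRU time. */
    void touch(struct centry *e)
    {
        e->refcount++;
        e->lru = time(NULL);
    }

    /* On every purge run: age entries that have gone unused for a full
     * cache_expiration_time interval. Returns 1 if the entry should be
     * removed (refcount dropped to 0). */
    int age_entry(struct centry *e, time_t now, time_t cache_expiration_time)
    {
        if (e->lru < now - cache_expiration_time) {
            e->refcount--;
            e->lru = now;
        }
        return e->refcount <= 0;
    }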
From: Benjamin C. <li...@se...> - 2000-02-11 04:20:00
On Thu, 10 Feb 2000, Andreas Hofmeister wrote:
<SNIP>
> combination of refcount and LRU time like this:
>
> - Every time an entry is referenced, we increment the refcount.
> - Every time the LRU time is < now - cache_expiration_time, we decrement
> the refcount and set the LRU time to now.
> - If the refcount drops to 0, we remove the entry.

I like it!

> --- Job #1 ---
...disk cache stuff

> --- Job #2 ---
> We need a LL implementation. As we may want to re-use it (refresh list
> ...), it should be in a separate module. I would suggest an API like this:

I'll do this. I enjoy lls.

> typedef struct {...} list_t;
> typedef struct { list_t * list; ll_entry_t * pNext; cache_data_t * data; }
> ll_entry_t;

The cache_data_t could cause problems if the implementation of the refresh list requires extra fields (e.g. a delay to recheck the host later if it's down). Guess we can cross that hurdle when we get there, though.

> We need to remove entries from the middle of the list, so we should
> maybe implement this as a double linked list. Some of the functions are
> trivial and could be implemented as macros (or inlines, but some
> compilers cannot use inline).

Macros are the safer option; most compilers like them. I think we can get away with a single linked list: after all, we have to go from the front to the end on our search anyway, so why not keep a pointer to the previous entry as we search? (A sketch of this is below.)

I take it the ll will only be searched by the initial instance of dproxy, hence avoiding locks/memory sharing. If not, let me know.

Cheers,
--
 * Benjamin Close
 * Be...@se...
 * Web Page: http://users.senet.com.au/~benjsc
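A sketch of the single-linked-list variant Benjamin suggests: track the previous node while searching, so an entry can be unlinked and moved to the front without back pointers; the types are illustrative:

    #include <string.h>

    struct node {
        char name[128];
        struct node *next;
        /* ... cache data ... */
    };

    struct node *find_and_move_to_front(struct node **head, const char *name)
    {
        struct node *prev = NULL, *cur = *head;

        while (cur && strcmp(cur->name, name) != 0) {
            prev = cur;              /* remember the previous entry ... */
            cur = cur->next;
        }
        if (cur && prev) {           /* found, and not already at the front */
            prev->next = cur->next;  /* ... so we can unlink without a
                                      * back pointer ... */
            cur->next = *head;       /* ... and relink at the head */
            *head = cur;
        }
        return cur;
    }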