From: Chris H. <ha...@de...> - 2005-02-02 22:06:11
On Monday 31 Jan 2005 22:10, Jonathan Koren wrote:
> Then don't use atime at all. Use a served-time that only apt-proxy knows
> about. When it sends the file it updates it. When it goes to clean up
> the cache, it checks for it, and fills it with the mtime if it doesn't
> exist. If it can't get an mtime, it uses the current time.

The mtime is not really all that useful. There are lots of packages that
may be quite old, e.g. all of Woody, where it is more important to know
whether clients are actively downloading the package than the time since
it was created. Or perhaps you were thinking of setting the mtime to the
time when ap wrote the file, instead of to the timestamp of the file
itself?

> For whatever reason, recycling worked on one system, and not on the other.
> If it worked on both systems, I wouldn't have even noticed it. But I did
> notice it, and what I saw, I didn't like.

My guess would be that this is one of the database permissions bugs that
took several attempts to fix. Those uploads were only to unstable, and
life in unstable is not guaranteed to be perfect, I'm afraid.

> The way I understand it, the database needs to be updated so that the
> cleanup will work. This means you have three processes running: a server
> that fetches files, stores files in the cache, and sends files to clients
> while updating the file's last access time in a database; a cleaner that
> periodically removes old files from the cache by comparing the current
> time to each file's last access time stored in the database; and finally
> a recycler that periodically looks through the cache and makes sure all
> files in the cache are in the database.
>
> There are two processes that search through the cache and compare each
> file to the database. That's redundant. Furthermore, the recycler runs
> constantly, and if all goes well finds nothing the vast majority of the
> time, whereas the cleaner runs only periodically (or at least it
> should). Finally, the recycler runs excruciatingly slowly. It shouldn't
> take 16 hours to update that small a database, or any database for that
> matter.

There aren't multiple processes - it is effectively co-operative
multitasking using the twisted framework. I've not looked at that code
closely enough to know all the details; you need to actually read the
code to find out what is going on. For example, the recycler reads just
a small part of the cache and then sleeps for a second before moving on.
I think that was supposed to help spread the load (since it is not easy
to 'nice' I/O in the same way as you can CPU time), but as you have
noted it doesn't notice very quickly when large numbers of files are
copied into the cache. However, since recycling just records the time
the file entered the cache, it doesn't matter if it takes a while before
apt-proxy notices. The database entry is not needed when serving the
file, only when doing cache cleaning.
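For illustration, here is a minimal sketch of that co-operative pattern.
It is not apt-proxy's actual code: CACHE_DIR, BATCH_SIZE and the
in-memory `database` dict are placeholders for the example, not real
apt-proxy identifiers. The point is only the scheduling: note a small
batch of cache files, then hand control back to the twisted reactor for
a second so client requests are not starved.

    # Sketch of a co-operative cache recycler using the twisted reactor.
    import os
    import time
    from twisted.internet import reactor

    CACHE_DIR = "/var/cache/apt-proxy"   # assumed cache location
    BATCH_SIZE = 10                      # files examined per pass
    database = {}                        # path -> time the file entered the cache

    def walk_cache():
        """Yield every file path under the cache, one at a time."""
        for dirpath, _dirnames, filenames in os.walk(CACHE_DIR):
            for name in filenames:
                yield os.path.join(dirpath, name)

    def recycle(walker=None):
        """Record a small batch of files, then resume one second later."""
        if walker is None:
            walker = walk_cache()
        for _ in range(BATCH_SIZE):
            try:
                path = next(walker)
            except StopIteration:
                # Whole cache scanned; start a fresh scan after a longer pause.
                reactor.callLater(60, recycle)
                return
            # Only note files the database has never seen; the entry is not
            # needed for serving, only for later cache cleaning.
            database.setdefault(path, time.time())
        reactor.callLater(1, recycle, walker)

    reactor.callWhenRunning(recycle)
    reactor.run()

On a real system the database would be persistent rather than an
in-memory dict, but the scheduling pattern is the point here.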
> Now what I REALLY don't like is that an arcane command is needed to update
> a database that already has an automatic process to keep the database
> updated. That is beyond dumb. It implies that the automatic process
> doesn't work, or at least not well enough to be trusted; and if that's the
> case, what's the point of having the automatic process in the first place?
> Not only does apt-proxy-import redo the recycler's job, but it also tries
> to do the server's job by checking if the file to be copied is actually
> new or not. And when it tries to do that job, it fails miserably because
> for some reason it can't find backends that the server finds just fine.

apt-proxy-import was written to do a different job, and we're ending up
trying to fit a square peg into a round hole. a-p-i is designed to
automatically find the right place for files in the mirror without
assuming that the files to import are already in the right hierarchy.
Say you have been running apt updates from a machine that does not use
ap. You will end up with a directory full of .debs in
/var/cache/apt/archives, and a-p-i was written to take these files and
import them into the right backend and subdirectory in the cache. This
means a-p-i either needs to find the package listed in one of the
Packages files, or it needs to guess based on pattern matching. It
sounds like some of this went wrong in your case. I was not aware of it
not working, and indeed it works well enough for me here. Perhaps you
could provide some examples of the problem you are experiencing and
detailed logs of a-p-i's behaviour, thanks.

> If apt-proxy-import was simply cp, then the recycler would eventually find
> the new file and update the database. When the file is requested, the
> server would check if it's new or not, and do the right thing accordingly.
> apt-proxy-import is solving a problem that doesn't exist.

Well, a problem you do not have in this case, because you already have
the correct directory layout.

> So yeah, I don't like apt-proxy-import at all.

Fine, just don't use it and use cp instead.

> Shouldn't the server process handle that automatically? Doesn't the
> server process already handle that automatically? If it didn't download a
> complete file (which it can detect by comparing the number of bytes
> received with the number of bytes expected), then delete the partial file
> and try again.

(There are wishlist bugs open asking for download resume, which makes a
lot of sense for large packages.)

> (This can trivially be extended to check that the server didn't just
> receive any old bytes, but the correct bytes.) After n tries, fail so the
> client can request the next file. Next time the client requests the file,
> it won't be in the database, so the process starts all over again. You
> should never cache a broken file.

It would be helpful, though, to cache a partially downloaded file. But
that is orthogonal to the database - the file can be downloaded to
<filename>.partial and then renamed if the download is successful (see
the sketch further down).

> If you're talking about files getting damaged that are already on the
> disk, I don't think that's going to happen very often, and if so, the disk
> damage is probably far more extensive than just a couple of files limited
> to the apt-proxy cache. If the damage is extensive, that's the job for a
> disk recovery tool.

I do agree that it's not likely to happen very often. The problem that
needs to be solved is what happens if the file does happen to be damaged
and clients request it. With normal http sources, apt re-requests the
file and you get a new copy. But with ap, the same file is served again
without ap realising that there is a problem, and the clients never get
a new copy. So it is important for ap to have a way to know that it
needs to re-fetch a file from the backend, otherwise you get complaints
from users of the cache who cannot correct the problem.
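To tie the <filename>.partial idea and "never cache a broken file"
together, here is a minimal sketch - again not apt-proxy's actual
implementation. store_in_cache(), expected_size and expected_md5 are
hypothetical stand-ins for whatever the backend response really
provides.

    # Write to a .partial file; only rename it into place if it is complete
    # (and, when a checksum is known, correct).
    import hashlib
    import os

    def store_in_cache(final_path, data, expected_size, expected_md5=None):
        """Return True if the file was verified and moved into the cache."""
        partial = final_path + ".partial"
        with open(partial, "wb") as f:
            f.write(data)

        ok = os.path.getsize(partial) == expected_size
        if ok and expected_md5 is not None:
            with open(partial, "rb") as f:
                ok = hashlib.md5(f.read()).hexdigest() == expected_md5

        if ok:
            os.rename(partial, final_path)   # atomic on the same filesystem
            return True
        os.remove(partial)                   # never leave a broken file cached
        return False

A False return here would be the point at which ap goes back to the
backend for a fresh copy instead of serving the damaged one.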
> Are you proposing to compare a checksum every time a file is served?
> I think that's a lot of work for something that just isn't going to
> happen. If on the off chance a file in the cache did manage to become
> corrupted, the sysadmin would check the logs, see the error report (e.g.
> "unexpected end of file") and simply delete the broken file from the
> cache, which would cause apt-proxy to get a new copy of the file. If that
> file is also corrupted, then something much more serious is wrong, and
> apt-proxy couldn't possibly fix it. (i.e. the remote source file is
> corrupted, or the disk is dying)

I don't want to have to involve the sysadmin. They might not be
available, and ap has enough information to work out the problem for
itself. Oh, and that assumes there is a competent sysadmin available who
realises that this is a problem with the local ap and doesn't assume it
is a problem with the backend server itself.

Chris