From: Jonathan D. <jd...@wu...> - 2005-03-05 02:28:01
Good reply!

The distribution system has two aspects: "politics" and performance. Let me explain.

The idea was this: the initial development would create a standalone audioscrobbler-like clone. Anyone could download the software and set up a music tracking web site for themselves, their friends, whoever. They can customize it if they want. Submit patches. Improve it. From here, my idea was to link together the independent systems into one large statistical pool.

What this means is that it's not being built from the top down, but instead from the bottom up. If we did it this way, we would be "stuck" with trying to regroup it into the distributed system.

Now, I'm not saying it has to go down that way. But that's the way I thought it up. So let's talk about it, ignoring the pretenses under which this all happened. The rest of my reply is inline below.

On Mar 4, 2005, at 12:33 AM, Mr.Deep wrote:

> So, what exactly is the limiting factor to AudioScrobler? Is it the
> the processing time required? or the traffic (submissions, or viewing
> data)? A better understanding of the limitations would help me
> understand what we're trying to fix, and what we must avoid. I think
> that the most difficult task is processing the data, but I'm not sure
> if we also have to take submission / viewing traffic into
> consideration. What exactly do we hope to gain by spreading the work
> (whatever it may be) over multiple servers?

I can really only guess what AS's problems are, based on what they've done to fix them. They've moved processing into the background or into memory (using memcached), which suggests that generating things on the fly was a problem. So that's too many queries on viewing. By reducing the number of queries, they probably have an easier time doing INSERTs as well. The INSERTs are also put into memory (so it's not the connections either) and form a queue for the "cruncher", whose sole responsibility is to save the statistics.
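
As a rough illustration of that queue-plus-cruncher arrangement (a sketch only, not AudioScrobbler's actual code): submissions are pushed onto an in-memory queue and a single background worker is the only thing that writes statistics. All names here (`submit`, `crunch`, `play_counts`, `run_demo`) are hypothetical, and a plain in-process queue and a `Counter` stand in for memcached and the statistics tables.

```python
import queue
import threading
from collections import Counter

# Hypothetical stand-ins: an in-process queue instead of memcached,
# and a Counter instead of the real statistics tables.
submissions = queue.Queue()
play_counts = Counter()


def submit(user, artist, track):
    """Accept a submission without touching the statistics store."""
    submissions.put((user, artist, track))


def crunch():
    """Background 'cruncher': the only code that writes statistics."""
    while True:
        item = submissions.get()
        if item is None:  # sentinel value: stop the worker
            break
        _user, artist, track = item
        play_counts[(artist, track)] += 1


def run_demo():
    """Queue a few plays, let the cruncher drain them, return the totals."""
    worker = threading.Thread(target=crunch)
    worker.start()
    submit("jd", "Artist A", "Song 1")
    submit("deep", "Artist A", "Song 1")
    submit("jd", "Artist B", "Song 2")
    submissions.put(None)  # tell the cruncher to finish
    worker.join()
    return dict(play_counts)
```

Because only the cruncher writes, page views never have to wait behind a burst of submissions; the cost is that the statistics lag slightly behind what has been submitted.
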
So basically it's all processing and not traffic at all. But the processing load comes from both submissions and viewing. You typically can't SELECT and INSERT on the same table at the same time (because you need to lock the table).

> What you have to say about non-inter-user statistics and each cluster
> generating its own stats sounds good. I'm just wondering what exactly
> is going to happen when the statistics are generated. Is it going to
> be "look at everything" get totals, etc etc, or will the statistics
> generating be incremental?

I imagine it will initially be "look at everything" but we may evolve towards incremental updating. We could just jump to incremental if it's not too hard. I suppose it really depends on what the statistic in question is. Obviously if it's "in the past week" it will be "look at everything in the past week." Of course the majority of the data is about song totals and things.... I don't know, I think I need to think more about this.

> Concerning not aggregating data until a user requests it, the only
> problem I see with this is: what will the user experience be? If
> someone clicks on the song for the first time, how long will it take
> to get the information they requested?

I'm not really sure. It really depends on who receives the request. If it's a cluster, then it could display local results immediately (as well as cached results if available) and ask the central server, or all the other servers, for an update if necessary. If it's the central server, I suppose it wouldn't have anything, but it shouldn't take very long to fetch results from the servers. I imagine we could throw up a "Please wait" screen while it fetches the data.

> I think we definitely need to decide if we are going to have some sort
> of weekly rollover type thing like AS has, because if we do, each
> server could assign ids as it wishes, then go through some sort of
> reconciliation phase. Otherwise, we could make use of the hashing
> system mentioned in the other email.
> But we really should make this
> decision as it will probably effect other things too.

The hashing system is a cool idea, but it doesn't generate very good keys. 32-byte keys (or more - I looked at some other "better" hashing algorithms) are pretty long - that's the equivalent of eight 4-byte integers. That would probably be hard to index (*tries to remember how btrees work* - well, I think it's only a matter of comparing the data). Regardless, it might be the best idea going forward. A reconciliation phase could also work, though.

Alright, that's all for now.

--JD