Re: [Openscrobbler-devel] Distributing the system

Good reply!

The distribution system has two aspects: "politics" and performance. 
Let me explain.

The idea was this: the initial development would create a standalone 
audioscrobbler-like clone. Anyone could download the software and set up 
a music tracking web site for themselves, their friends, whoever. They 
can customize it if they want. Submit patches. Improve it.

From here, my idea was to link the independent systems together into 
one large statistical pool. That means it would be built from the 
bottom up rather than the top down. If we did it this way, we would be 
"stuck" with trying to regroup the standalone sites into the 
distributed system afterwards.

Now, I'm not saying it has to go down that way. But that's the way I 
thought it up. So let's talk about setting aside the premises this all 
started from. The rest of my reply is inline below.

On Mar 4, 2005, at 12:33 AM, Mr.Deep wrote:

> So, what exactly is the limiting factor to AudioScrobbler? Is it the 
> processing time required? or the traffic (submissions, or viewing 
> data)? A better understanding of the limitations would help me 
> understand what we're trying to fix, and what we must avoid.  I think 
> that the most difficult task is processing the data, but I'm not sure 
> if we also have to take submission / viewing traffic into 
> consideration.  What exactly do we hope to gain by spreading the work 
> (whatever it may be) over multiple servers?

I can really only guess what AS's problems are. I can only guess based 
on what they've done to fix it. They've moved processing into the 
background or into memory (using memcached), which means that 
generating things on the fly was a problem. So that's too many queries 
on viewing.  By reducing the number of queries, they probably have an 
easier time doing INSERTs as well. The INSERTs are also put into memory 
(so it's not the connections either) and form a queue for the 
"cruncher", whose sole responsibility is to save the statistics.

So basically it's all processing and not traffic as such. But the 
processing load comes from both submissions and viewing. You 
typically can't SELECT and INSERT at the same time (with table-level 
locking, the INSERT has to lock the table).
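
To make the queue-plus-cruncher arrangement concrete, here is a minimal 
Python sketch. It is only my guess at the shape of such a design, not 
AS's actual code: the names cruncher and crunch_all are made up, and a 
Counter stands in for the real statistics tables.

```python
import queue
import threading
from collections import Counter

_STOP = object()  # sentinel that tells the cruncher to shut down


def cruncher(pending, totals):
    """Background worker: the only code that writes statistics."""
    while True:
        item = pending.get()
        if item is _STOP:
            break
        artist, track = item
        totals[(artist, track)] += 1  # stands in for the real INSERT/UPDATE


def crunch_all(submissions):
    """Queue every submission in memory and let one cruncher drain the queue."""
    pending = queue.Queue()
    totals = Counter()
    worker = threading.Thread(target=cruncher, args=(pending, totals))
    worker.start()
    for submission in submissions:
        pending.put(submission)  # cheap: no table lock on the request path
    pending.put(_STOP)
    worker.join()
    return totals
```

With a single writer, submissions never compete with each other for the 
table; they only compete with viewers, which is where caching helps.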

> What you have to say about non-inter-user statistics and each cluster 
> generating its own stats sounds good.  I'm just wondering what exactly 
> is going to happen when the statistics are generated.  Is it going to 
> be "look at everything" get totals, etc etc, or will the statistics 
> generating be incremental?

I imagine it will initially be "look at everything" but we may evolve 
towards incremental updating. We could just jump to incremental if it's 
not too hard.

I suppose it really depends on what the statistic in question is. 
Obviously if it's "in the past week" it will be "look at everything in 
the past week." Of course the majority of the data is about song totals 
and the like... I don't know, I think I need to think more about this.
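
Here is a rough Python sketch of the three flavours mentioned above: a 
full recount, an incremental update, and a "past week" recount. The 
function names and the (artist, track, timestamp) tuple layout are 
assumptions for illustration only.

```python
from collections import Counter

WEEK_SECONDS = 7 * 24 * 3600


def recount_all(plays):
    """'Look at everything': rebuild totals from the full play log."""
    totals = Counter()
    for artist, track, _timestamp in plays:
        totals[(artist, track)] += 1
    return totals


def apply_increment(totals, new_plays):
    """Incremental: fold only the new plays into the existing totals."""
    for artist, track, _timestamp in new_plays:
        totals[(artist, track)] += 1
    return totals


def recount_window(plays, now, window=WEEK_SECONDS):
    """Windowed statistic: look at everything, but only inside the window."""
    cutoff = now - window
    return Counter((a, t) for a, t, ts in plays if ts > cutoff)
```

Lifetime totals can be updated incrementally because nothing ever 
leaves them. A sliding window is harder since old plays have to drop 
out again, which is why the sketch simply recounts it.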

> Concerning not aggregating data until a user requests it, the only 
> problem I see with this is: what will the user experience be? If 
> someone clicks on the song for the first time, how long will it take 
> to get the information they requested?

I'm not really sure. It really depends on who receives the request. If 
it's a cluster, then it could display local results immediately (as well 
as cached results if available) and ask the central server, or all the 
other servers, for an update if necessary. If it's the central server, I 
suppose it wouldn't have anything, but it shouldn't take very long to 
fetch results from the servers. I imagine we could throw up a "Please 
wait" screen while it fetches the data.
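
The cluster case could look something like the Python sketch below. It 
is only an illustration: song_total, the cache dictionary and the peer 
callables are hypothetical names, and a real version would fetch over 
the network instead of calling functions.

```python
def song_total(song, local_counts, cache, peers=None):
    """Return the best known play count for `song` on this cluster.

    local_counts: plays recorded by this cluster itself
    cache: last known sum of remote plays per song, updated in place
    peers: optional list of callables, each returning one remote count
    """
    if peers is not None:
        # Refresh path: ask every other server and remember the answer.
        cache[song] = sum(fetch(song) for fetch in peers)
    # Fast path: local plays plus whatever remote total we last cached.
    return local_counts.get(song, 0) + cache.get(song, 0)
```

Called without peers, it answers immediately from local and cached 
data; called with peers, it does the slower refresh that the "Please 
wait" screen would cover.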

> I think we definitely need to decide if we are going to have some sort 
> of weekly rollover type thing like AS has, because if we do, each 
> server could assign ids as it wishes, then go through some sort of 
> reconciliation phase.   Otherwise, we could make use of the hashing 
> system mentioned in the other email.  But we really should make this 
> decision as it will probably affect other things too.

The hashing system is a cool idea, but it doesn't generate very good 
keys. 32-byte keys (or more - I looked at some other "better" hashing 
algorithms) are pretty long - that's the equivalent of 8 integers. 
That would probably be hard to index (*tries to remember how btrees work* - 
well, I think it's only a matter of comparing the data). Regardless, it 
might be the best idea going forward. A reconciliation phase could also 
work, though.
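
For a feel of the key sizes, here is a small Python sketch. It assumes 
the hash is MD5, whose hex digest happens to be 32 characters; I'm not 
certain that is the scheme from the other email. The normalization and 
the function names are also just assumptions.

```python
import hashlib


def _digest(artist, track):
    # Normalize so every server derives the same key for the same song.
    normalized = f"{artist.strip().lower()}\x00{track.strip().lower()}"
    return hashlib.md5(normalized.encode("utf-8"))


def song_key_hex(artist, track):
    """32-character hex key: easy to read, long to index."""
    return _digest(artist, track).hexdigest()


def song_key_raw(artist, track):
    """The same hash as 16 raw bytes, half the size of the hex form."""
    return _digest(artist, track).digest()


def song_key_int64(artist, track):
    """First 8 bytes as one unsigned 64-bit integer (more collision-prone)."""
    return int.from_bytes(song_key_raw(artist, track)[:8], "big")
```

Storing the raw bytes or a truncated integer shortens the key, and 
since every server computes the same value independently, no 
reconciliation of ids would be needed. The trade-off is that truncating 
raises the chance of two songs colliding.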

Alright, that's all for now.

--JD