Re: [Openscrobbler-devel] Distributing the system

Continuing my inline-whoring....

On Mar 4, 2005, at 11:56 PM, Mr.Deep wrote:

> I think it would be better to develop it as the central 
> server/clusters system that we have been discussing, as you mentioned, 
> building an AS-like clone may end up just making things harder on us 
> because we'll have to put a significant amount of effort into 
> regrouping it into a distributed system.  I guess the bad part of 
> going straight to the central server/clusters system is that it will 
> take longer, right?

Yea, it could take a really long time (especially at the current pace) 
to get there. It would of course make sense to avoid as much 
re-development as possible, but I think it reasonable to assume that 
we're not going to jump from 0% to 100% - we're going to need a way to 
get there, and that probably involves a "standalone" cluster-ish system 
in the shorter term.

> I finally took a look at the docs, and I am still having difficulty 
> figuring out exactly what sort of db interaction is going to be taking 
> place when a song play is submitted, and when a view [misc data] 
> request is received.  I *think* it is better from a db design 
> standpoint to simply insert the fact that a song is played when it is 
> (and I think that's what the song_data table is for), but I think we 
> would be able to provide a faster overall experience to the users if 
> we were to include play counts with every song, artist, album, etc, 
> and update them with every submission.  I think it would be worth it 
> to have faster statistic browsing at the cost of slower submission 
> processing.  I think I'm pretty much suggesting that we keep a 
> submission queue / cruncher, and hope to have faster / simpler queries 
> for viewing statistics.  Are we already planning on doing something 
> like this (updating total playcounts) and I'm just not seeing it being 
> mentioned? Is it really stupid for some reason that I don't 
> understand? Are we doing anything to improve upon AS beyond turning it 
> into a distributed system? (and is this even one of the project 
> goals?, does it need to be?)

The database at the moment is a direct result of the ERD and is 
neither final nor optimized.  I also did it before I came up with any 
solutions for the distributed system.

At the moment, there is also no easy place to put the "cached" data in 
the DB. (We call it caching, even though it's not in RAM or anything.) 
For example:
- Total song count [for all users] is easy. Just put it in songs.
- Song count per user is not. There's no "user-songs" table. (Yet)

Only saving aggregated statistics makes you lose granularity (basically 
you lose the "time" element); for instance, you can't say what happened 
in the past week, unless you capture that specifically. We're having 
the same issue at work trying to create a stats package for our game - 
balancing lots of details with performance, as well as storage.
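
To make the trade-off concrete, here is a rough sketch of keeping both: 
the raw play log and the "cached" counters. It assumes SQLite purely for 
illustration, and the column names (played_at, play_count) and the 
user_songs table are hypothetical; only songs and song_data exist in the 
current schema, and their real columns may differ.

```python
import sqlite3


def open_db():
    """Create an in-memory database with an assumed, simplified schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE songs (
            id INTEGER PRIMARY KEY,
            title TEXT,
            play_count INTEGER NOT NULL DEFAULT 0  -- cached global total
        );
        -- Raw log: one row per play, keeps the "time" element.
        CREATE TABLE song_data (
            user_id INTEGER, song_id INTEGER, played_at INTEGER
        );
        -- Hypothetical per-user counter table (does not exist yet).
        CREATE TABLE user_songs (
            user_id INTEGER, song_id INTEGER,
            play_count INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (user_id, song_id)
        );
    """)
    return db


def submit_play(db, user_id, song_id, played_at):
    """Log the play and bump both cached counters in one transaction."""
    with db:
        db.execute(
            "INSERT INTO song_data (user_id, song_id, played_at) "
            "VALUES (?, ?, ?)",
            (user_id, song_id, played_at),
        )
        db.execute(
            "UPDATE songs SET play_count = play_count + 1 WHERE id = ?",
            (song_id,),
        )
        # Make sure the counter row exists, then increment it.
        db.execute(
            "INSERT OR IGNORE INTO user_songs (user_id, song_id, play_count) "
            "VALUES (?, ?, 0)",
            (user_id, song_id),
        )
        db.execute(
            "UPDATE user_songs SET play_count = play_count + 1 "
            "WHERE user_id = ? AND song_id = ?",
            (user_id, song_id),
        )


def plays_between(db, song_id, start, end):
    """Time-windowed count; only answerable from the raw log."""
    row = db.execute(
        "SELECT COUNT(*) FROM song_data "
        "WHERE song_id = ? AND played_at >= ? AND played_at < ?",
        (song_id, start, end),
    ).fetchone()
    return row[0]
```

Submission gets slower (four statements instead of one), totals become 
a single-row read, and "what happened in the past week" still works 
because the raw rows are kept.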

Yes, at the moment this is all in "song_data" which does the job just 
great, just not too quickly. My hope was to escape this problem by:
- clustering
- caching data to memory or disk - use memcached and/or store generated 
profile data somewhere.
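
The caching idea could look roughly like the sketch below. It uses a 
plain in-process dictionary with an expiry time as a stand-in for 
memcached, so nothing here is the memcached API; TTLCache, get_profile 
and build_profile are all made-up names for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory cache with expiry; a stand-in for memcached."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def get_profile(cache, user_id, build_profile):
    """Return cached profile data, regenerating it only on a miss."""
    key = "profile:%d" % user_id
    profile = cache.get(key)
    if profile is None:
        # The expensive part: crunching song_data for this user.
        profile = build_profile(user_id)
        cache.set(key, profile)
    return profile
```

The expiry time is the knob: a short one gives fresher stats, a long one 
spares the database.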

As far as goals... no it doesn't need to be distributed (or, depending 
on your point of view, aggregated). This is a "would be cool" factor 
that would help bring all users together. It stemmed from the fact that 
it would be awesome if all the music tracking sites could be networked 
in a way so that there could be a "definitive" aggregation. Having 5, 
10, or 100 little sites with all their own statistics would be 
inherently bad. (For example: LiveJournal. You want everyone to have 
their journal at LJ so that you don't have to hop around the web, etc.) 
There is a definite advantage to having lots of people on one system. 
By aggregating the pieces, you create one system where there were 
previously many.

Maybe this is why I always think of it as "bringing together the 
clusters": part of my idea was that even a site like Audioscrobbler, 
which does not run Openscrobbler, could possibly contribute to the 
global statistics. If there was an API that could be implemented for 
any system, then even this would be possible.

But I digress.

The more important goal in the shorter-term is/was to get an 
open-source listener tracking system that is geared to providing a 
smaller number of users a larger number of features (compared to 
Audioscrobbler). I want to see the return of time played, 
weekly/monthly/etc stats, and stats that update more often than 
whenever-they-feel-like-it. I also want to see albums! My real 
motivation for doing this is more out of user frustration than geek 
pride.

> The "Please Wait ..." screen should be fine, it would definitely 
> be better than just having the page take forever.

Yea, that'd be bad. I think the other option is to display something 
like:
"Global statistics have not yet been generated for this 
<song,album,artist>. Your request has been added to the queue and will 
be processed shortly. Please check back in a few minutes."
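
A minimal sketch of how that could behave, assuming a single in-process 
queue and a separate cruncher that drains it. StatsQueue, request and 
process_one are hypothetical names, not anything in the codebase.

```python
from collections import deque

PENDING_MESSAGE = (
    "Global statistics have not yet been generated for this {kind}. "
    "Your request has been added to the queue and will be processed "
    "shortly. Please check back in a few minutes."
)


class StatsQueue:
    """Serve generated stats if present, otherwise queue the request."""

    def __init__(self):
        self._ready = {}         # (kind, item_id) -> generated stats
        self._pending = deque()  # requests waiting for the cruncher
        self._queued = set()     # avoids queueing the same item twice

    def request(self, kind, item_id):
        key = (kind, item_id)
        if key in self._ready:
            return self._ready[key]
        if key not in self._queued:
            self._queued.add(key)
            self._pending.append(key)
        return PENDING_MESSAGE.format(kind=kind)

    def process_one(self, compute):
        """Cruncher step: generate stats for the oldest queued request."""
        if not self._pending:
            return False
        key = self._pending.popleft()
        self._ready[key] = compute(*key)
        self._queued.discard(key)
        return True
```

The page returns immediately either way; the slow query only ever runs 
in the cruncher.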

> Concerning the ids we're going to be assigning, what if the central 
> server just had some sort of id mapping table, so it would know that 
> song 13 on server a = song 337 on server b?

Not sure. Need more thought on how IDs will be used and how critical it 
is that things get "aligned."
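
For the sake of discussion, one possible shape for such a mapping table 
is sketched below. This is only a guess at the idea, not a decided 
design: IdMap, register and same_song are hypothetical, and it dodges 
the hard part (deciding that two songs really are the same) by taking 
that as an input.

```python
class IdMap:
    """Central table mapping (server, local_id) pairs to one global id."""

    def __init__(self):
        self._to_global = {}
        self._next_global = 1

    def register(self, server, local_id, same_as=None):
        """Record a local id; same_as links it to an already known song.

        same_as must be a (server, local_id) pair registered earlier,
        otherwise a KeyError is raised.
        """
        key = (server, local_id)
        if key in self._to_global:
            return self._to_global[key]
        if same_as is not None:
            global_id = self._to_global[same_as]
        else:
            global_id = self._next_global
            self._next_global += 1
        self._to_global[key] = global_id
        return global_id

    def same_song(self, a, b):
        """True if both (server, local_id) pairs map to one global id."""
        ga = self._to_global.get(a)
        gb = self._to_global.get(b)
        return ga is not None and ga == gb
```

With this, register("a", 13) followed by 
register("b", 337, same_as=("a", 13)) records that song 13 on server a 
is song 337 on server b.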

--JD