From: Jonathan D. <jd...@wu...> - 2005-07-14 19:39:40
I'm thinking that in the short term, RoR may get us going - if for no other reason but to use RoR. In time, pieces may be ported to other things. As I understand it, RoR is used by Basecamp (www.basecamp.com) and other 37signals projects (37signals.com) - in fact, 37signals developed RoR. I'm not sure how much Basecamp is used and what kind of infrastructure is behind it.

On Jul 14, 2005, at 3:12 PM, Mr. Deep wrote:

> Based on minor bits of research, it would appear that RoR, while fantastic for quick / agile development, has not yet been really tested for large-scale things as we hope this will one day be.
>
> FastCGI looks good though, so does mod_ruby. Ultimately, you understand the intricacies of the performance dilemma far better than I do, so you should make the decision. Personally, I'm pro RoR.
>
> On 7/13/05, Jonathan Dance <jd...@wu...> wrote:
> Any thoughts on doing OpenScrobbler in Ruby on Rails (RoR)? It might give us new incentive to work on it, and it means we could probably start faster (the framework I was working on in PHP is, basically, RoR).
>
> Downside: RoR is not as widely available, and is generally slower and would require at least some system software optimization (i.e. mod_ruby and/or FastCGI for Ruby or something); also, don't know about availability of things like memcache libraries for RoR.
>
> --JD
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the 'Do More With Dual!' webinar happening July 14 at 8am PDT/11am EDT. We invite you to explore the latest in dual core and dual graphics technology at this free one hour event hosted by HP, AMD, and NVIDIA. To register visit http://www.hp.com/go/dualwebinar
> _______________________________________________
> Openscrobbler-devel mailing list
> Ope...@li...
> https://lists.sourceforge.net/lists/listinfo/openscrobbler-devel
>
> --
> - deep
From: Mr. D. <mr...@gm...> - 2005-07-14 19:13:26
Based on minor bits of research, it would appear that RoR, while fantastic for quick / agile development, has not yet been really tested for large-scale things as we hope this will one day be.

FastCGI looks good though, so does mod_ruby. Ultimately, you understand the intricacies of the performance dilemma far better than I do, so you should make the decision. Personally, I'm pro RoR.

On 7/13/05, Jonathan Dance <jd...@wu...> wrote:

> Any thoughts on doing OpenScrobbler in Ruby on Rails (RoR)? It might give us new incentive to work on it, and it means we could probably start faster (the framework I was working on in PHP is, basically, RoR).
>
> Downside: RoR is not as widely available, and is generally slower and would require at least some system software optimization (i.e. mod_ruby and/or FastCGI for Ruby or something); also, don't know about availability of things like memcache libraries for RoR.
>
> --JD

--
- deep
From: Jonathan D. <jd...@wu...> - 2005-07-13 17:19:00
Any thoughts on doing OpenScrobbler in Ruby on Rails (RoR)? It might give us new incentive to work on it, and it means we could probably start faster (the framework I was working on in PHP is, basically, RoR).

Downside: RoR is not as widely available, and is generally slower and would require at least some system software optimization (i.e. mod_ruby and/or FastCGI for Ruby or something); also, don't know about availability of things like memcache libraries for RoR.

--JD
From: Jonathan D. <jd...@wu...> - 2005-05-05 04:06:47
I was wondering what was wrong with this horrible math. Hexadecimal is 0-15, true, but that's not 16 bits... it's 4 bits! Duh! 32 * 4 bits = 16 bytes, or 4 ints. I think this is a reasonable key size... my guess is that this is the best way forward for keys. Alternately, we could use SHA-1, which is 20 bytes, further reducing the chance of a collision.

On Mar 3, 2005, at 12:12 AM, Jonathan Dance wrote:

> Err, it's 32 * 16 bits, or 64 bytes. This makes the collision chance slightly bigger but you get the idea.
>
> --JD
>
> On Mar 3, 2005, at 12:10 AM, Jonathan Dance wrote:
>
>>> Another issue is unique IDs. Assuming we store songs/albums/artists in a database, how will the clusters have the same IDs as the central database (or, every other database)? The first inclination is to store this on the central server and have the clusters download this information. When a new song is submitted to a cluster, it tells the central server.
>>
>> Just thought of this: IDs could be based on a hashing algorithm, like MD5. MD5 is 34 * 16 bits = 34 * 2 bytes = 68 bytes. That's a pretty long ID but it has a 3.38 x 10^-21 chance of a collision.
>>
>> --JD
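The digest-size arithmetic settled in the thread above (MD5 hex is 32 characters of 4 bits each, so 16 bytes; SHA-1 is 20 bytes) can be checked directly, along with a rough birthday-bound collision estimate. A minimal sketch using Python's standard `hashlib`; the song-string format is a made-up example, not the project's actual key scheme:

```python
import hashlib

# MD5 digests are 16 bytes (128 bits); hex-encoded, that is 32
# characters of 4 bits each -- the "32 * 4 bits = 16 bytes" above.
md5_id = hashlib.md5(b"Artist - Song Title").hexdigest()
assert len(md5_id) == 32
assert hashlib.md5().digest_size == 16

# SHA-1 digests are 20 bytes (160 bits), further reducing collision odds.
sha1_id = hashlib.sha1(b"Artist - Song Title").hexdigest()
assert len(sha1_id) == 40
assert hashlib.sha1().digest_size == 20

def collision_estimate(n_items: int, bits: int) -> float:
    """Birthday-bound approximation: with n items hashed into a
    b-bit space, collision probability is roughly n^2 / 2^(b+1)."""
    return n_items ** 2 / 2 ** (bits + 1)

# Even a billion distinct songs barely dent a 128-bit space.
p = collision_estimate(10 ** 9, 128)
assert p < 1e-18
```

Either digest could be stored raw (16 or 20 bytes) rather than hex-encoded, halving the key width in the database.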
From: Mr. D. <mr...@gm...> - 2005-03-13 00:18:28
Ok, I'm all set on the svn stuff, let the development begin! Almighty Openscrobbler dev lead, assign a task and I shall complete it.

- deep

On Mar 12, 2005, at 1:04 AM, Jonathan Dance wrote:

> Hi All,
>
> I have the following development news. Note that we aren't really ready to start "real" development but I figured I'd get the ball rolling.
>
> Our subversion repository has been up for some time. It is accessible at http://wuputah.com:81/svn/openscrobbler. The repository is now world-readable, but you will need an account to commit changes. When you need an account, please contact me.
>
> To learn how to use Subversion, I have attached a document I wrote up a while ago at Agora. The URLs are wrong but the idea is the same. Also, don't use the links to the software; go to the web sites and get the latest version.
>
> PHP5 is running on our Apache2 server on Port 81 (the same port as access to subversion). An automatically updating checked-out copy of the code (a so-called "sandbox") is available for testing at http://openscrobbler.wuputah.com:81. What this means is when you commit a change to the SVN repository, the web site above automatically updates itself with the latest code.
>
> Which brings me to this point. I would prefer if all code that gets committed is "parsable." (Basically, it shouldn't contain parse errors.) In compiled languages, this would mean you would compile it and make sure it doesn't "break the build." There is no way to do this in PHP but to test it some other way. Now, getting PHP5/MySQL/Apache running on your local computer is a minor feat; it's possible, yes, but not easy. I would not recommend it because, in addition, you will have issues keeping your copy of the MySQL database in sync with the development one.
>
> An alternative is to do the development on the server. You check out a copy on the server, set it up as necessary, make it web accessible, and start working. You could then use Transmit + SubEthaEdit (or your favorite FTP client/text editor) to edit the files. Alternately, you could download the whole project and use Transmit to synchronize the folders (this way you can browse the project locally, without being slowed up by Transmit) as necessary. Either way, once you have tested your changes, you can commit them on the server. Be aware:
> - If you choose to synchronize, be careful when doing "svn update" - you should synchronize your local copy UP before updating, and synchronize DOWN after updating, else your local copy will suck and everyone will call you a persnickety flibbertigibbet.
> - Also, I highly recommend getting Transmit 3 if you're doing any work like this; the column view is fantastic.
>
> That's all for now.
>
> So say we all,
>
> --JD
>
> <svn-help.html>
From: Jonathan D. <jd...@wu...> - 2005-03-12 22:45:18
On second thought, you don't need to worry about being a persnickety flibbertigibbet (phew!). SVN is smart enough to not allow you to be so foolish. My recommendation is this, if you want to work on it locally:

- Get SVN installed on your local computer.
- Check out (or update) the project. Work on it. Etc.
- Upload it to the server (or synchronize it) to a web-accessible location. You could choose to ignore the .svn folder when synchronizing.
- Test your changes, fix your bugs, test your bug fixes, rinse, repeat.
- When you're done, update, then check it in from your local computer. You can update anytime on your local machine.

Basically you just use the server as a test environment, but do all your work locally.

--JD

On Mar 12, 2005, at 1:04 AM, Jonathan Dance wrote:

> An alternative is to do the development on the server. You check out a copy on the server, set it up as necessary, make it web accessible, and start working. You could then use Transmit + SubEthaEdit (or your favorite FTP client/text editor) to edit the files. Alternately, you could download the whole project and use Transmit to synchronize the folders (this way you can browse the project locally, without being slowed up by Transmit) as necessary. Either way, once you have tested your changes, you can commit them on the server. Be aware:
> - If you choose to synchronize, be careful when doing "svn update" - you should synchronize your local copy UP before updating, and synchronize DOWN after updating, else your local copy will suck and everyone will call you a persnickety flibbertigibbet.
> - Also, I highly recommend getting Transmit 3 if you're doing any work like this; the column view is fantastic.
From: Mr. D. <mr...@gm...> - 2005-03-12 22:10:54
Thanks for the clarifications, sorry for the delay in reply time.

On Mar 5, 2005, at 1:08 AM, Jonathan Dance wrote:

> Continuing my inline-whoring....
>
> On Mar 4, 2005, at 11:56 PM, Mr. Deep wrote:
>
>> I think it would be better to develop it as the central server/clusters system that we have been discussing; as you mentioned, building an AS-like clone may end up just making things harder on us because we'll have to put a significant amount of effort into regrouping it into a distributed system. I guess the bad part of going straight to the central server/clusters system is that it will take longer, right?
>
> Yea, it could take a really long time (especially at the current pace) to get there. It would of course make sense to avoid as much re-development as possible, but I think it reasonable to assume that we're not going to jump from 0% to 100% - we're going to need a way to get there, and that probably involves a "standalone" cluster-ish system in the shorter term.
>
>> I finally took a look at the docs, and I am still having difficulty figuring out exactly what sort of db interaction is going to be taking place when a song play is submitted, and when a view [misc data] request is received. I *think* it is better from a db design standpoint to simply insert the fact that a song is played when it is (and I think that's what the song_data table is for), but I think we would be able to provide a faster overall experience to the users if we were to include play counts with every song, artist, album, etc., and update them with every submission. I think it would be worth it to have faster statistic browsing at the cost of slower submission processing. I'm pretty much suggesting that we keep a submission queue / cruncher, and hope to have faster / simpler queries for viewing statistics. Are we already planning on doing something like this (updating total playcounts) and I'm just not seeing it being mentioned? Is it really stupid for some reason that I don't understand? Are we doing anything to improve upon AS beyond turning it into a distributed system? (And is this even one of the project goals? Does it need to be?)
>
> The database at the moment is a result of the ERD and is not final nor optimized. I also did it before I came up with any solutions for the distributed system.
>
> At the moment, there is also no easy place to put the "cached" data in the DB. (We call it caching, even though it's not in RAM or anything.) For example:
> - Total song count [for all users] is easy. Just put it in songs.
> - Song count per user is not. There's no "user-songs" table. (Yet)
>
> Only saving aggregated statistics makes you lose granularity (basically you lose the "time" element); for instance, you can't say what happened in the past week, unless you capture that specifically. We're having the same issue at work trying to create a stats package for our game - balancing lots of details with performance, as well as storage.
>
> Yes, at the moment this is all in "song_data" which does the job just great, just not too quickly. My hope was to escape this problem by:
> - clustering
> - caching data to memory or disk - use memcached and/or store generated profile data somewhere.
>
> As far as goals... no, it doesn't need to be distributed (or, depending on your point of view, aggregated). This is a "would be cool" factor that would help bring all users together. It stemmed from the fact that it would be awesome if all the music tracking sites could be networked in a way so that there could be a "definitive" aggregation. Having 5, 10, or 100 little sites with all their own statistics would be inherently bad. (For example: LiveJournal. You want everyone to have their journal at LJ so that you don't have to hop around the web, etc.) There is a definite advantage to having lots of people on one system. By aggregating the pieces, you create one system where there was previously many.
>
> Maybe this is why I always think of it as "bringing together the clusters" - because part of my idea was that even a site like Audioscrobbler, which does not run Openscrobbler, could possibly contribute to the global statistics. If there was an API that could be implemented for any system, then even this would be possible.
>
> But I digress.
>
> The more important goal in the shorter term is/was to get an open-source listener tracking system that is geared to providing a smaller number of users a larger number of features (compared to Audioscrobbler). I want to see the return of time played, weekly/monthly/etc. stats, and stats that update more often than whenever-they-feel-like-it. I also want to see albums! My real motivation for doing this is more out of user frustration than geek pride.
>
>> The "Please Wait ..." screen should be fine; it would definitely be better than just having the page take forever.
>
> Yea, that'd be bad. I think the other option is to display something like: "Global statistics have not yet been generated for this <song,album,artist>. Your request has been added to the queue and will be processed shortly. Please check back in a few minutes."
>
>> Concerning the ids we're going to be assigning, what if the central server just had some sort of id mapping table, so it would know that song 13 on server a = song 337 on server b?
>
> Not sure. Need more thought on how IDs will be used and how critical it is that things get "aligned."
>
> --JD
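The trade-off being debated here - keeping every play as a raw `song_data` row (granular, slow to query) versus maintaining pre-aggregated play counts updated on each submission (fast to view, slower to submit) - can be sketched in a few lines. This is an illustration only; the variable names are hypothetical and this is not the actual Openscrobbler schema:

```python
from collections import defaultdict

# Raw event log: every submission is appended, preserving the time
# dimension so "past week" style queries remain possible.
song_data = []

# Pre-aggregated counters: updated on every submission so that
# statistics pages become cheap lookups instead of full table scans.
play_counts = defaultdict(int)

def submit(user: str, song: str, timestamp: int) -> None:
    song_data.append((user, song, timestamp))  # granular history
    play_counts[song] += 1                     # cached aggregate

submit("alice", "song-13", 1110000000)
submit("bob", "song-13", 1110000100)
submit("alice", "song-42", 1110000200)

# Viewing total plays is now O(1) per song instead of a scan.
assert play_counts["song-13"] == 2
assert play_counts["song-42"] == 1
# The raw log is still there for time-windowed statistics.
assert len(song_data) == 3
```

This is exactly the "only saving aggregated statistics loses granularity" point: the counters alone cannot answer "what happened last week," which is why the raw log is kept alongside them.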
From: Jonathan D. <jd...@wu...> - 2005-03-12 06:04:58
SVN Directions
Reference
Subversion Path
You will use the following Subversion path when checking things out:
http://dev.agorastudios.com:8080/svn/agora/project
Where project is the directory you want to check out. Examples (right
now) are magenta/chips or magenta/growthlink.
You will also need a repository username and password. If you do not
have one of these, please contact JD.
Repository Browsing
You can browse the repository by pointing your web browser to
http://dev.agorastudios.com:8080/svn/agora/. You will be asked for your
repository username and password.
Terminology
Check out
Checks out a working copy to your local computer.
Update
Updates your checked out copy to the latest revision
Add
Adds new files to be part of the repository. Your file is not
uploaded until you commit.
Commit
Commits your changes to the repository.
How to use a command line version
Checkout
svn checkout http://dev.agorastudios.com:8080/svn/agora/magenta/chips
svn co ...
Update
svn update
svn up
Add
svn add filename
Commit
svn commit -m "Explanation of what you're committing"
svn ci ...
Mac OS X Instructions
Command line version
1. Install Fink (Direct link to latest version as of 4/7/04
<http://aleron.dl.sourceforge.net/sourceforge/fink/Fink-0.7.0-Installer.dmg>).
2. In the next step, the commands will ask for a password; you need
to enter your local account password for OS X.
3. Open a terminal, type sudo apt-get update, and press enter. You
need to enter your local account password. Type sudo apt-get
install svn-client. Press yes to any prompts.
4. At this point you are ready to use the command line version. You
can install the Subversion Finder plugin if you wish, which follows.
Finder plugin
Please note that this is pre-release software, and may cause
permanent and irreparable damage to your system. Please take
appropriate measures to safeguard data.
1. Do steps above to get command line version.
2. In a Terminal, type cd /usr/local/bin, press enter. Type sudo ln
-s /sw/bin/svn, and press enter. It will ask for a password; enter
your OS X account password. You can now close the Terminal.
3. Get the Finder plugin from http://homepage.mac.com/pavicich/.
4. Follow the installation instructions at
http://svn.red-bean.com/scplugin/trunk/INSTALLATION.txt. ~ refers
to your home directory.
Using the Finder plugin
Right click or ctrl-click any folder. You should see a Subversion
context menu. From here you can access most of Subversion's
functionality. The plugin should be considered alpha so you may run into
troubles with certain operations.
Windows Instructions
Installation: Shell Extension
/Note: You do not need the command line tool to use the shell extension./
The name of the shell extension is TortoiseSVN. You can download the
latest version from http://tortoisesvn.tigris.org/download.html.
Using the shell extension
Right click anywhere in the Windows file system and you will see a
Subversion context menu, as well as quick access to common commands.
Using these you can access all of the necessary Subversion commands.
Installation: Command line tool
You can get the command line tools at
http://subversion.tigris.org/servlets/ProjectDocumentList?folderID=91.
Current recommended download is
http://subversion.tigris.org/files/documents/15/12170/svn-1.0.1-setup-2.exe.
From: Jonathan D. <jd...@wu...> - 2005-03-05 06:08:34
Continuing my inline-whoring....

On Mar 4, 2005, at 11:56 PM, Mr. Deep wrote:

> I think it would be better to develop it as the central server/clusters system that we have been discussing; as you mentioned, building an AS-like clone may end up just making things harder on us because we'll have to put a significant amount of effort into regrouping it into a distributed system. I guess the bad part of going straight to the central server/clusters system is that it will take longer, right?

Yea, it could take a really long time (especially at the current pace) to get there. It would of course make sense to avoid as much re-development as possible, but I think it reasonable to assume that we're not going to jump from 0% to 100% - we're going to need a way to get there, and that probably involves a "standalone" cluster-ish system in the shorter term.

> I finally took a look at the docs, and I am still having difficulty figuring out exactly what sort of db interaction is going to be taking place when a song play is submitted, and when a view [misc data] request is received. I *think* it is better from a db design standpoint to simply insert the fact that a song is played when it is (and I think that's what the song_data table is for), but I think we would be able to provide a faster overall experience to the users if we were to include play counts with every song, artist, album, etc., and update them with every submission. I think it would be worth it to have faster statistic browsing at the cost of slower submission processing. I'm pretty much suggesting that we keep a submission queue / cruncher, and hope to have faster / simpler queries for viewing statistics. Are we already planning on doing something like this (updating total playcounts) and I'm just not seeing it being mentioned? Is it really stupid for some reason that I don't understand? Are we doing anything to improve upon AS beyond turning it into a distributed system? (And is this even one of the project goals? Does it need to be?)

The database at the moment is a result of the ERD and is not final nor optimized. I also did it before I came up with any solutions for the distributed system.

At the moment, there is also no easy place to put the "cached" data in the DB. (We call it caching, even though it's not in RAM or anything.) For example:
- Total song count [for all users] is easy. Just put it in songs.
- Song count per user is not. There's no "user-songs" table. (Yet)

Only saving aggregated statistics makes you lose granularity (basically you lose the "time" element); for instance, you can't say what happened in the past week, unless you capture that specifically. We're having the same issue at work trying to create a stats package for our game - balancing lots of details with performance, as well as storage.

Yes, at the moment this is all in "song_data" which does the job just great, just not too quickly. My hope was to escape this problem by:
- clustering
- caching data to memory or disk - use memcached and/or store generated profile data somewhere.

As far as goals... no, it doesn't need to be distributed (or, depending on your point of view, aggregated). This is a "would be cool" factor that would help bring all users together. It stemmed from the fact that it would be awesome if all the music tracking sites could be networked in a way so that there could be a "definitive" aggregation. Having 5, 10, or 100 little sites with all their own statistics would be inherently bad. (For example: LiveJournal. You want everyone to have their journal at LJ so that you don't have to hop around the web, etc.) There is a definite advantage to having lots of people on one system. By aggregating the pieces, you create one system where there was previously many.

Maybe this is why I always think of it as "bringing together the clusters" - because part of my idea was that even a site like Audioscrobbler, which does not run Openscrobbler, could possibly contribute to the global statistics. If there was an API that could be implemented for any system, then even this would be possible.

But I digress.

The more important goal in the shorter term is/was to get an open-source listener tracking system that is geared to providing a smaller number of users a larger number of features (compared to Audioscrobbler). I want to see the return of time played, weekly/monthly/etc. stats, and stats that update more often than whenever-they-feel-like-it. I also want to see albums! My real motivation for doing this is more out of user frustration than geek pride.

> The "Please Wait ..." screen should be fine; it would definitely be better than just having the page take forever.

Yea, that'd be bad. I think the other option is to display something like: "Global statistics have not yet been generated for this <song,album,artist>. Your request has been added to the queue and will be processed shortly. Please check back in a few minutes."

> Concerning the ids we're going to be assigning, what if the central server just had some sort of id mapping table, so it would know that song 13 on server a = song 337 on server b?

Not sure. Need more thought on how IDs will be used and how critical it is that things get "aligned."

--JD
From: Mr. D. <mr...@gm...> - 2005-03-05 04:57:17
I think it would be better to develop it as the central server/clusters system that we have been discussing; as you mentioned, building an AS-like clone may end up just making things harder on us because we'll have to put a significant amount of effort into regrouping it into a distributed system. I guess the bad part of going straight to the central server/clusters system is that it will take longer, right?

I finally took a look at the docs, and I am still having difficulty figuring out exactly what sort of db interaction is going to be taking place when a song play is submitted, and when a view [misc data] request is received. I *think* it is better from a db design standpoint to simply insert the fact that a song is played when it is (and I think that's what the song_data table is for), but I think we would be able to provide a faster overall experience to the users if we were to include play counts with every song, artist, album, etc., and update them with every submission. I think it would be worth it to have faster statistic browsing at the cost of slower submission processing. I'm pretty much suggesting that we keep a submission queue / cruncher, and hope to have faster / simpler queries for viewing statistics. Are we already planning on doing something like this (updating total playcounts) and I'm just not seeing it being mentioned? Is it really stupid for some reason that I don't understand? Are we doing anything to improve upon AS beyond turning it into a distributed system? (And is this even one of the project goals? Does it need to be?)

The "Please Wait ..." screen should be fine; it would definitely be better than just having the page take forever.

Concerning the ids we're going to be assigning, what if the central server just had some sort of id mapping table, so it would know that song 13 on server a = song 337 on server b?

- deep

On Mar 4, 2005, at 9:27 PM, Jonathan Dance wrote:

> Good reply!
>
> The distribution system has two aspects: "politics" and performance. Let me explain.
>
> The idea was this: the initial development would create a standalone audioscrobbler-like clone. Anyone could download the software and set up a music tracking web site for themselves, their friends, whoever. They can customize it if they want. Submit patches. Improve it.
>
> From here, my idea was to link together the independent systems into one large statistical pool. What this means is that it's not being built from the top down, but instead the bottom up. If we did it this way, we would be "stuck" with trying to regroup it into the distributed system.
>
> Now, I'm not saying it has to go down that way. But that's the way I thought it up. So let's talk about ignoring the pretenses under which this all happened. The rest of my reply is inline below.
>
> On Mar 4, 2005, at 12:33 AM, Mr. Deep wrote:
>
>> So, what exactly is the limiting factor to AudioScrobbler? Is it the processing time required? Or the traffic (submissions, or viewing data)? A better understanding of the limitations would help me understand what we're trying to fix, and what we must avoid. I think that the most difficult task is processing the data, but I'm not sure if we also have to take submission / viewing traffic into consideration. What exactly do we hope to gain by spreading the work (whatever it may be) over multiple servers?
>
> I can really only guess what AS's problems are, based on what they've done to fix it. They've moved processing into the background or into memory (using memcached), which means that generating things on the fly was a problem. So that's too many queries on viewing. By reducing the number of queries, they probably have an easier time doing INSERTs as well. The INSERTs are also put into memory (so it's not the connections either) and are a queue for the "cruncher" whose sole responsibility is to save the statistics.
>
> So basically it's all processing and not traffic at all. But the processing load comes from both submissions and viewing. You typically can't SELECT and INSERT at the same time (because you need to lock the table).
>
>> What you have to say about non-inter-user statistics and each cluster generating its own stats sounds good. I'm just wondering what exactly is going to happen when the statistics are generated. Is it going to be "look at everything," get totals, etc., or will the statistics generation be incremental?
>
> I imagine it will initially be "look at everything" but we may evolve towards incremental updating. We could just jump to incremental if it's not too hard.
>
> I suppose it really depends what the statistic in question is. Obviously if it's "in the past week" it will be "look at everything in the past week." Of course the majority of data is regarding song totals and things.... I don't know, I think I need to think more about this.
>
>> Concerning not aggregating data until a user requests it, the only problem I see with this is: what will the user experience be? If someone clicks on the song for the first time, how long will it take to get the information they requested?
>
> I'm not really sure. It really depends on who receives the request. If it's a cluster, then it could display local results immediately (as well as cached results if available) and ask the central server, or all the other servers, for an update if necessary. If it's the central server, I suppose it wouldn't have anything, but it shouldn't take very long to fetch results from the servers. I imagine we could throw up a "Please wait" screen while it fetches the data.
>
>> I think we definitely need to decide if we are going to have some sort of weekly rollover type thing like AS has, because if we do, each server could assign ids as it wishes, then go through some sort of reconciliation phase. Otherwise, we could make use of the hashing system mentioned in the other email. But we really should make this decision as it will probably affect other things too.
>
> The hashing system is a cool idea, but it doesn't generate very good keys. 32-byte (or more - I looked at some other "better" hashing algorithms) keys are pretty long - that's the equivalent of 8 integers. That would probably be hard to index (*tries to remember how btrees work* - well, I think it's only a matter of comparing the data). Regardless, it might be the best idea going forward. A reconciliation phase could also work, though.
>
> Alright, that's all for now.
>
> --JD
From: Mr. D. <mr...@gm...> - 2005-03-05 04:17:11
|
Hm, I guess that the cluster-servers also serving pages makes good sense. So the central server will handle inter-cluster statistics (top artists, etc.), and the clusters will handle user-level statistics. On Mar 4, 2005, at 12:44 AM, Jonathan Dance wrote: > On Mar 4, 2005, at 12:33 AM, Mr. Deep wrote: > >> My initial reaction to the replication issue was to simply have pairs >> of servers, such that they would constantly update one another in >> case either of them went down, but I decided that was really lame and >> we need something better. >> >> I think that we need incremental backups (of either raw data or final >> statistics) sent to the main server; these should be done on a >> regular basis (weekly?) or possibly be done somewhat continuously. >> Also, every client should have multiple servers that it tries to >> submit data to in case the main one fails. > > The biggest difference from my thoughts here is this is a lot more > data than I was imagining a central server ever having. > >> In general, I think I really need to read the lovely docs that you've >> created before I go babbling on like an idiot (specifically, the >> chicken w/head cut off variety). > > Well... these don't exist. The docs in SVN mostly deal with the code > itself: guidelines, how it works, etc etc. > >> My main thought at the moment has to do with each client-server >> generating an incremental update that is sent to the main server, >> which then applies it; this incremental update tells the main server >> how to update itself to reflect the new submissions that the >> client-server has received. > > As before... I think the difference we're seeing is that in my head, > "clusters" would not only receive/record submissions, but also serve > web pages. If each user is assigned to a cluster, then they can get > all their detailed, personalized stats from that server. Aggregated > stats can be obtained from a central source and served, or the central > source can serve them. 
> > There seem to be three paths: > - The central server would be completely transparent and is purely in > the background. No user would use it directly. Requests are served > from clusters. > - The central server does some of the requests, while "clusters" do > others. > - The central server handles all requests > >> The main server would store tables, etc. in a way that would make >> accessing the data as fast as possible. The rate at which these >> incremental updates are sent would be determined by the # of >> individual song submissions that have been received. |
|
From: Jonathan D. <jd...@wu...> - 2005-03-05 04:02:03
|
To quote www.audioscrobbler.com : > Welcome to Audioscrobbler > Audioscrobbler builds a profile of your musical taste using a plugin > for your media player (Winamp, iTunes, XMMS etc..). Plugins send the > name of every song you play to the Audioscrobbler server, which > updates your musical profile with the new song. Every person with a > plugin has their own page on this site which shows their listening > statistics. The system automatically matches you to people with a > similar music taste, and generates personalised recommendations. On Mar 4, 2005, at 10:56 PM, Peter Golbus wrote: > ok, I mostly understand what you guys are saying, except that I've > never used audio-scrobbler. What precisely does it do? |
|
From: Peter G. <pg...@gm...> - 2005-03-05 03:57:08
|
ok, I mostly understand what you guys are saying, except that I've never used audio-scrobbler. What precisely does it do? > The idea was this: the initial development would create a standalone > audioscrobbler-like clone. Anyone could download the software and set up > a music tracking web site for themselves, their friends, whoever. They > can customize it if they want. Submit patches. Improve it. > |
|
From: Jonathan D. <jd...@wu...> - 2005-03-05 02:28:01
|
Good reply! The distribution system has two aspects: "politics" and performance. Let me explain. The idea was this: the initial development would create a standalone audioscrobbler-like clone. Anyone could download the software and set up a music tracking web site for themselves, their friends, whoever. They can customize it if they want. Submit patches. Improve it. From here, my idea was to link together the independent systems into one large statistical pool. What this means is that it's not being built from the top down, but instead the bottom up. If we did it this way, we would be "stuck" with trying to regroup it into the distributed system. Now, I'm not saying it has to go down that way. But that's the way I thought it up. So let's talk, ignoring the pretenses under which this all happened. The rest of my reply is inline below. On Mar 4, 2005, at 12:33 AM, Mr. Deep wrote: > So, what exactly is the limiting factor to AudioScrobbler? Is it > the processing time required? or the traffic (submissions, or viewing > data)? A better understanding of the limitations would help me > understand what we're trying to fix, and what we must avoid. I think > that the most difficult task is processing the data, but I'm not sure > if we also have to take submission / viewing traffic into > consideration. What exactly do we hope to gain by spreading the work > (whatever it may be) over multiple servers? I can really only guess what AS's problems are. I can only guess based on what they've done to fix it. They've moved processing into the background or into memory (using memcached), which means that generating things on the fly was a problem. So that's too many queries on viewing. By reducing the number of queries, they probably have an easier time doing INSERTs as well. The INSERTs are also put into memory (so it's not the connections either) and are a queue for the "cruncher" whose sole responsibility is to save the statistics. 
So basically it's all processing and not traffic at all. But the processing traffic comes from both submissions and viewing. You typically can't SELECT and INSERT at the same time (because you need to lock the table). > What you have to say about non-inter-user statistics and each cluster > generating its own stats sounds good. I'm just wondering what exactly > is going to happen when the statistics are generated. Is it going to > be "look at everything" get totals, etc etc, or will the statistics > generating be incremental? I imagine it will initially be "look at everything" but we may evolve towards incremental updating. We could just jump to incremental if it's not too hard. I suppose it really depends on what the statistic in question is. Obviously if it's "in the past week" it will be "look at everything in the past week." Of course the majority of data is regarding song totals and things.... I don't know, I think I need to think more about this. > Concerning not aggregating data until a user requests it, the only > problem I see with this is: what will the user experience be? If > someone clicks on the song for the first time, how long will it take > to get the information they requested? I'm not really sure. It really depends on who receives the request. If it's a cluster, then it could display local results immediately (as well as cached results if available) and ask the central server, or all the other servers, for an update if necessary. If it's the central server, I suppose it wouldn't have anything, but it shouldn't take very long to fetch results from the servers. I imagine we could throw up a "Please wait" screen while it fetches the data. > I think we definitely need to decide if we are going to have some sort > of weekly rollover type thing like AS has, because if we do, each > server could assign ids as it wishes, then go through some sort of > reconciliation phase. Otherwise, we could make use of the hashing > system mentioned in the other email. 
But we really should make this > decision as it will probably affect other things too. The hashing system is a cool idea, but it doesn't generate very good keys. 32 byte (or more - I looked at some other "better" hashing algorithms) keys are pretty long - that's the equivalent of 8 integers. That'd probably be hard to index (*tries to remember how btrees work* - well I think it's only a matter of comparing the data). Regardless, it might be the best idea going forward. A reconciliation phase could also work, though. Alright, that's all for now. --JD |
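The "cruncher" arrangement described in this exchange (submissions queued in memory, with one background job writing out the statistics so readers never contend with per-play INSERTs) could be sketched roughly like this. This is an illustrative Python sketch, not OpenScrobbler or Audioscrobbler code; all names (`submit`, `crunch`, `play_counts`) are invented for the example.

```python
from collections import deque

# Submissions go into an in-memory queue (Audioscrobbler reportedly used
# memcached for this); a single background "cruncher" drains the queue and
# applies the statistics in one batch pass.

submission_queue = deque()   # stands in for the memcached queue
play_counts = {}             # stands in for the statistics table

def submit(user, artist, song):
    """Called on every submission: cheap, takes no table locks."""
    submission_queue.append((user, artist, song))

def crunch():
    """The 'cruncher': drains the queue and updates stats in one pass,
    so SELECTs only ever contend with this one batch writer."""
    processed = 0
    while submission_queue:
        user, artist, song = submission_queue.popleft()
        key = (artist, song)
        play_counts[key] = play_counts.get(key, 0) + 1
        processed += 1
    return processed

submit("alice", "Radiohead", "Idioteque")
submit("bob", "Radiohead", "Idioteque")
assert crunch() == 2
assert play_counts[("Radiohead", "Idioteque")] == 2
```

The point of the design choice is that viewing pages read from `play_counts` (or a cache of it) while only `crunch` ever writes, which sidesteps the SELECT-vs-INSERT contention mentioned above.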
|
From: Jonathan D. <jd...@wu...> - 2005-03-04 05:44:18
|
On Mar 4, 2005, at 12:33 AM, Mr. Deep wrote: > My initial reaction to the replication issue was to simply have pairs > of servers, such that they would constantly update one another in case > either of them went down, but I decided that was really lame and we > need something better. > > I think that we need incremental backups (of either raw data or final > statistics) sent to the main server; these should be done on a regular > basis (weekly?) or possibly be done somewhat continuously. Also, > every client should have multiple servers that it tries to submit data > to in case the main one fails. The biggest difference from my thoughts here is this is a lot more data than I was imagining a central server ever having. > In general, I think I really need to read the lovely docs that you've > created before I go babbling on like an idiot (specifically, the > chicken w/head cut off variety). Well... these don't exist. The docs in SVN mostly deal with the code itself: guidelines, how it works, etc etc. > My main thought at the moment has to do with each client-server > generating an incremental update that is sent to the main server, which > then applies it; this incremental update tells the main server how to > update itself to reflect the new submissions that the client-server > has received. As before... I think the difference we're seeing is that in my head, "clusters" would not only receive/record submissions, but also serve web pages. If each user is assigned to a cluster, then they can get all their detailed, personalized stats from that server. Aggregated stats can be obtained from a central source and served, or the central source can serve them. There seem to be three paths: - The central server would be completely transparent and is purely in the background. No user would use it directly. Requests are served from clusters. - The central server does some of the requests, while "clusters" do others. 
- The central server handles all requests > The main server would store tables, etc. in a way that would make > accessing the data as fast as possible. The rate at which these > incremental updates are sent would be determined by the # of > individual song submissions that have been received. |
|
From: Mr. D. <mr...@gm...> - 2005-03-04 05:34:45
|
Err, replied to wrong place ... On Mar 4, 2005, at 12:33 AM, Mr. Deep wrote: > My initial reaction to the replication issue was to simply have pairs > of servers, such that they would constantly update one another in case > either of them went down, but I decided that was really lame and we > need something better. > > I think that we need incremental backups (of either raw data or final > statistics) sent to the main server; these should be done on a regular > basis (weekly?) or possibly be done somewhat continuously. Also, > every client should have multiple servers that it tries to submit data > to in case the main one fails. > > In general, I think I really need to read the lovely docs that you've > created before I go babbling on like an idiot (specifically, the > chicken w/head cut off variety). My main thought at the moment has to > do with each client-server generating an incremental update that is > sent to the main server, which then applies it; this incremental update > tells the main server how to update itself to reflect the new > submissions that the client-server has received. The main server > would store tables, etc. in a way that would make accessing the data as > fast as possible. The rate at which these incremental updates are sent > would be determined by the # of individual song submissions that have > been received. > > On Mar 3, 2005, at 12:05 AM, Jonathan Dance wrote: > >> Another serious issue we face is that of replication. This really has >> two facets: >> >> Backups: what if a server has a hardware failure, or, for one reason >> or another, disappears forever? It would be best if the user data is >> (also) stored somewhere else. >> >> Temporary failure: what if a server goes down temporarily? Ideally, >> the system should be able to handle this situation as well. >> >> I don't have many good answers for these. Some possible answers >> (which do not necessarily cover both facets): >> - Each server sends us a backup of itself every X time period. 
>> - Each server replicates itself (somehow) to another server. In case >> the first server fails, the second server starts serving the users of >> the first server, in addition to its own. If the first server comes >> back, the second server stops serving those users. If the server >> never comes back, the second server moves some users somewhere else. >> (This is basically some kind of dynamic cluster-to-cluster >> user-handling system, where each system pushes users around as >> necessary. Also, the data is always in at least two places.) >> >> Lots of fun stuff to think about! >> >> ------------------- >> There was a significant typo in my last e-mail: >> >> This perfect for "clustering" where each server is responsible for >> any number of servers. => This is perfect for "clustering" where each >> server is responsible for any number of users. >> ------------------- >> >> --JD > |
|
From: Mr. D. <mr...@gm...> - 2005-03-04 05:34:06
|
So, what exactly is the limiting factor to AudioScrobbler? Is it the processing time required? or the traffic (submissions, or viewing data)? A better understanding of the limitations would help me understand what we're trying to fix, and what we must avoid. I think that the most difficult task is processing the data, but I'm not sure if we also have to take submission / viewing traffic into consideration. What exactly do we hope to gain by spreading the work (whatever it may be) over multiple servers? What you have to say about non-inter-user statistics and each cluster generating its own stats sounds good. I'm just wondering what exactly is going to happen when the statistics are generated. Is it going to be "look at everything" get totals, etc etc, or will the statistics generating be incremental? Concerning not aggregating data until a user requests it, the only problem I see with this is: what will the user experience be? If someone clicks on the song for the first time, how long will it take to get the information they requested? I think we definitely need to decide if we are going to have some sort of weekly rollover type thing like AS has, because if we do, each server could assign ids as it wishes, then go through some sort of reconciliation phase. Otherwise, we could make use of the hashing system mentioned in the other email. But we really should make this decision as it will probably affect other things too. - Deep On Mar 2, 2005, at 11:41 PM, Jonathan Dance wrote: > So I have some fairly concrete thoughts about how to distribute the > system over the Internet. It's not "peer-to-peer" yet but we'll see. > > First, it may not be necessary for users to be directly aware of the > multi-server atmosphere. Usernames could include the server the user > is assigned to or the central server could know which server holds > each user. > > My observation is the majority of statistics are not inter-user. They > are about one user at a time. 
This is perfect for "clustering" where each > server is responsible for any number of users. What remains is the > aggregated stats. I believe the solution for this is for each cluster > to generate its own aggregate stats - this is generally the "hard work." > The central server then takes the results from those aggregate stats > and combines them into a central aggregation. > > This still presents a problem, though. First, this is a lot of data. > The initial stuff like "top artists" and "top users" is easy. What is > not: every artist has top songs and top users. There are thousands of > artists. Every song has top users. There are TONS of songs. And this > assumes we're "only" copying the Audioscrobbler feature set. > > Another idea is to not aggregate something until a user requests it, > and then cache it and only re-aggregate it at most once a week. > Assuming a very large number of songs will never be requested, this > could save a lot. Plus it would distribute the requests to the > clusters more slowly. > > Another issue is unique IDs. Assuming we store songs/albums/artists in > a database, how will the clusters have the same IDs as the central > database (or, every other database)? The first inclination is to store > this on the central server and have the clusters download this > information. When a new song is submitted to a cluster, it tells the > central server. > > Obviously this isn't very "peer-to-peer," it's really coordinated > internet clustering. There's still a lot for the central server to do, > and I believe it needs more thought. > > --JD > |
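The aggregate-on-demand idea quoted in this exchange (don't compute a song's stats until someone asks, then cache the result for up to a week) could be sketched as follows. This is a hypothetical Python illustration; `get_stats`, `compute_stats`, and the cache layout are all invented names, not part of any actual OpenScrobbler design.

```python
import time

# Cache entries map a stats key to (computed_at, value). A request is
# served from cache if the entry is younger than a week; otherwise the
# expensive aggregation runs and the result is cached.

WEEK = 7 * 24 * 3600
_cache = {}  # key -> (timestamp, value)

def get_stats(key, compute_stats, now=None):
    """Return cached stats for `key`, recomputing at most once a week."""
    now = time.time() if now is None else now
    entry = _cache.get(key)
    if entry is not None and now - entry[0] < WEEK:
        return entry[1]             # fresh enough: serve the cached copy
    value = compute_stats(key)      # first request (or stale): do the work
    _cache[key] = (now, value)
    return value

calls = []
def expensive(key):
    """Stand-in for the real aggregation query."""
    calls.append(key)
    return {"top_users": ["alice"], "plays": 42}

get_stats("song:123", expensive, now=0)
get_stats("song:123", expensive, now=3600)      # served from cache
assert len(calls) == 1
get_stats("song:123", expensive, now=WEEK + 1)  # stale: recomputed
assert len(calls) == 2
```

This also answers the user-experience question above in code terms: only the very first visitor to a song's page pays the aggregation cost; everyone else within the week gets a cache hit.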
|
From: Jonathan D. <jd...@wu...> - 2005-03-03 05:12:48
|
Err, it's 32 * 16 bits, or 64 bytes. This makes the collision chance slightly bigger but you get the idea. --JD On Mar 3, 2005, at 12:10 AM, Jonathan Dance wrote: >> Another issue is unique IDs. Assuming we store songs/albums/artists >> in a database, how will the clusters have the same IDs as the central >> database (or, every other database)? The first inclination is to >> store this on the central server and have the clusters download this >> information. When a new song is submitted to a cluster, it tells the >> central server. > > Just thought of this: > IDs could be based on a hashing algorithm, like MD5. MD5 is 34 * 16 > bits = 34 * 2 bytes = 68 bytes. That's a pretty long ID but it has a > 3.38 x 10^-21 chance of a collision. > > --JD > |
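The hash-based ID idea above can be sketched in a few lines of Python: every server derives the same ID for a song from its metadata, with no central coordination. For reference, MD5 actually produces a 128-bit (16-byte) digest, conventionally written as 32 hexadecimal characters (32 bytes if stored as ASCII text). The normalization step below is an added assumption, not part of the original proposal.

```python
import hashlib

# Derive a deterministic song ID from metadata so independent clusters
# agree on IDs without asking a central server. Normalizing case and
# whitespace first (an assumption here) keeps trivially different
# submissions from producing different IDs.

def song_id(artist, title):
    normalized = f"{artist.strip().lower()}\x00{title.strip().lower()}"
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Any two clusters compute the same key independently:
assert song_id("Radiohead", "Idioteque") == song_id("  radiohead", "IDIOTEQUE ")
assert len(song_id("a", "b")) == 32   # 128-bit digest as 32 hex chars
```

Whether 16 raw bytes (or 32 hex bytes) indexes acceptably in a B-tree, compared to a 4-byte integer, is exactly the trade-off debated earlier in the thread.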
|
From: Jonathan D. <jd...@wu...> - 2005-03-03 05:05:31
|
Another serious issue we face is that of replication. This really has two facets: Backups: what if a server has a hardware failure, or, for one reason or another, disappears forever? It would be best if the user data is (also) stored somewhere else. Temporary failure: what if a server goes down temporarily? Ideally, the system should be able to handle this situation as well. I don't have many good answers for these. Some possible answers (which do not necessarily cover both facets): - Each server sends us a backup of itself every X time period. - Each server replicates itself (somehow) to another server. In case the first server fails, the second server starts serving the users of the first server, in addition to its own. If the first server comes back, the second server stops serving those users. If the server never comes back, the second server moves some users somewhere else. (This is basically some kind of dynamic cluster-to-cluster user-handling system, where each system pushes users around as necessary. Also, the data is always in at least two places.) Lots of fun stuff to think about! ------------------- There was a significant typo in my last e-mail: This perfect for "clustering" where each server is responsible for any number of servers. => This is perfect for "clustering" where each server is responsible for any number of users. ------------------- --JD |
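The "temporary failure" facet above pairs naturally with the client-side suggestion elsewhere in this thread that every client should have multiple servers to try. A minimal sketch of that fallback logic, in Python with invented names (`submit_with_failover`, `send`, and the hostnames are all hypothetical):

```python
# Try each server in order until one accepts the submission; only give up
# when every server has failed. send() is a stand-in for the real HTTP
# submission call.

def submit_with_failover(servers, payload, send):
    """Return the server that accepted the data, or raise if none did."""
    last_error = None
    for server in servers:
        try:
            send(server, payload)
            return server
        except ConnectionError as exc:
            last_error = exc     # remember why, then try the next server
    raise RuntimeError(f"all servers failed: {last_error}")

def flaky_send(server, payload):
    """Simulates the primary being down temporarily."""
    if server == "primary.example.org":
        raise ConnectionError("primary is down")

used = submit_with_failover(
    ["primary.example.org", "backup.example.org"], {"song": "x"}, flaky_send)
assert used == "backup.example.org"
```

This covers only the submission side of temporary failure; serving a downed cluster's users from its replica, as proposed above, is the harder half.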
|
From: Jonathan D. <jd...@wu...> - 2005-03-03 04:41:13
|
So I have some fairly concrete thoughts about how to distribute the system over the Internet. It's not "peer-to-peer" yet but we'll see. First, it may not be necessary for users to be directly aware of the multi-server atmosphere. Usernames could include the server the user is assigned to or the central server could know which server holds each user. My observation is the majority of statistics are not inter-user. They are about one user at a time. This is perfect for "clustering" where each server is responsible for any number of users. What remains is the aggregated stats. I believe the solution for this is for each cluster to generate its own aggregate stats - this is generally the "hard work." The central server then takes the results from those aggregate stats and combines them into a central aggregation. This still presents a problem, though. First, this is a lot of data. The initial stuff like "top artists" and "top users" is easy. What is not: every artist has top songs and top users. There are thousands of artists. Every song has top users. There are TONS of songs. And this assumes we're "only" copying the Audioscrobbler feature set. Another idea is to not aggregate something until a user requests it, and then cache it and only re-aggregate it at most once a week. Assuming a very large number of songs will never be requested, this could save a lot. Plus it would distribute the requests to the clusters more slowly. Another issue is unique IDs. Assuming we store songs/albums/artists in a database, how will the clusters have the same IDs as the central database (or, every other database)? The first inclination is to store this on the central server and have the clusters download this information. When a new song is submitted to a cluster, it tells the central server. Obviously this isn't very "peer-to-peer," it's really coordinated internet clustering. There's still a lot for the central server to do, and I believe it needs more thought. --JD |
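The "each cluster aggregates, the central server merges" scheme above can be sketched compactly: the central server never touches raw submissions, only each cluster's (already small) summary. A hypothetical Python illustration; the report format is an invented assumption.

```python
from collections import Counter

# Each cluster reports its own pre-aggregated play counts; the central
# server just sums the reports and ranks the result. The heavy work
# (scanning raw submissions) stays on the clusters.

def merge_cluster_stats(cluster_reports, top_n=3):
    """cluster_reports: list of {artist: play_count} dicts, one per cluster.
    Returns the combined top-N as (artist, total_plays) pairs."""
    total = Counter()
    for report in cluster_reports:
        total.update(report)          # element-wise sum of the counts
    return total.most_common(top_n)

cluster_a = {"Radiohead": 120, "Autechre": 80}
cluster_b = {"Radiohead": 30, "Boards of Canada": 95}
assert merge_cluster_stats([cluster_a, cluster_b]) == [
    ("Radiohead", 150), ("Boards of Canada", 95), ("Autechre", 80)]
```

The catch raised in the mail still applies: this merge is cheap for "top artists" overall, but doing it for every artist's top songs and every song's top users multiplies the number of reports enormously.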