From: Marcus H. <mar...@ta...> - 2009-02-17 13:07:47
Hi,

We are implementing a site-visitor similarity engine that matches the unique visitors of each site against those of every other site. We have tried to do this with MySQL and MonetDB, using something like the following (pseudocode):

    foreach (source in sites)
        sourceUV = select count(distinct uid) from UniqueSiteVisitorSample where site = $source
        foreach (target in sites)
            targetUV = select count(distinct uid) from UniqueSiteVisitorSample where site = $target
            totalUV = select count(distinct uid) from UniqueSiteVisitorSample where site in ($source, $target)
            calcAndStoreSimilarity(source, target, sourceUV, targetUV, totalUV)

The problem is that we have 40 000 sites in our network, each of which will be compared against every other one: 40 000 x 39 999, or about 1.6 billion comparisons. We need to compare at least a few hundred site pairs per second, and ideally a few thousand per second, to get the job done in a reasonable time (30-40 days). At the current rate we will be done in 5 years :)

So my questions are:

* Do you think LucidDB could help by storing the underlying data structure in a way that speeds up the queries above?
* If not, do you know of any other storage engine that handles this kind of matrix-like storage?

Current DDL (some columns removed):

    CREATE TABLE UniqueSiteVisitorSample (
        uid bigint NOT NULL,
        site int NOT NULL,
        PRIMARY KEY (uid, site)
    );

We will be doing more and more of this kind of cluster analysis, which involves comparing each item to every other item, and we cannot be the first in the universe to do this, right?

Kindly,
//Marcus Herou

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
mar...@ta...
http://www.tailsweep.com/
http://blogg.tailsweep.com/
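For illustration only: the nested loop in the pseudocode issues three queries per site pair, but the same three numbers can in principle be derived from one set-based pass, since totalUV = sourceUV + targetUV - (visitors seen on both sites). The sketch below is not something from the original post. It assumes the DDL shown above, uses Python's built-in sqlite3 as a stand-in engine, and the function name `pairwise_overlap` is hypothetical. Pairs with no shared visitor are simply absent from the result, and for those totalUV is just the sum of the two per-site counts.

```python
import sqlite3

def pairwise_overlap(rows):
    """rows: iterable of (uid, site) tuples.

    Returns {(site_a, site_b): (a_uv, b_uv, total_uv)} for every pair
    site_a < site_b that shares at least one visitor.
    """
    con = sqlite3.connect(":memory:")
    # Same shape as the DDL in the post (columns removed there are not guessed here).
    con.execute(
        "CREATE TABLE UniqueSiteVisitorSample ("
        " uid bigint NOT NULL, site int NOT NULL, PRIMARY KEY (uid, site))"
    )
    con.executemany("INSERT INTO UniqueSiteVisitorSample VALUES (?, ?)", rows)

    # The primary key (uid, site) guarantees one row per visitor per site,
    # so COUNT(*) here equals count(distinct uid).
    query = """
        WITH per_site AS (
            SELECT site, COUNT(*) AS uv
            FROM UniqueSiteVisitorSample
            GROUP BY site
        ),
        shared AS (
            -- Self-join on uid: one row per visitor per pair of sites visited.
            SELECT a.site AS site_a, b.site AS site_b, COUNT(*) AS both_uv
            FROM UniqueSiteVisitorSample a
            JOIN UniqueSiteVisitorSample b
              ON a.uid = b.uid AND a.site < b.site
            GROUP BY a.site, b.site
        )
        SELECT s.site_a, s.site_b, pa.uv, pb.uv,
               pa.uv + pb.uv - s.both_uv AS total_uv
        FROM shared s
        JOIN per_site pa ON pa.site = s.site_a
        JOIN per_site pb ON pb.site = s.site_b
    """
    result = {(a, b): (ua, ub, tot) for a, b, ua, ub, tot in con.execute(query)}
    con.close()
    return result
```

Whether this is actually faster at 40 000 sites depends on the data: the self-join produces one row per visitor per pair of sites that visitor has seen, so a few visitors with very many sites can make it expensive.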