From: Neil W. <neilw@ActiveState.com> - 2002-08-19 19:12:59
Brian Ingerson [19/08/02 11:13 -0700]:
> > I generated an enormous list of words, about 10MB, and I wanted to
> > dump the data structure using YAML. Unfortunately, the current
> > YAML.pm dumps to a string, and then writes to a file, which consumed
> > all available memory (200MB) and then crashed. An iterative version
> > would have finished merrily.
>
> I totally agree with you here. And now we have a good failing test! If I
> write an iterative interface, will you test it out for me? I'd love to get
> your feedback as well.

Okay. Deal.

> BTW, If you can share the dataset, I'm sure Steve and Why would like to use
> it as well. Heck, can you post the data somewhere?

The data currently is a *large* Storable file, which isn't any good to
anyone, unfortunately. The script which generates it is also tightly
integrated with PerlMx's support libraries, which aren't public.

The format I invented for myself is very simple, and somewhat yamlish:

    you,see: good=1 bad=0 prob=0.01

Here's what the structure expands to in memory:

    $stats = {
        good => {
            'you,see' => 15,
        },
        bad => {
            'you,see' => 145,
        },
        prob => {
            'you,see' => 0.8267,
        },
    }

This means if you see the words "you" followed by "see", the email has an
82% likelihood of being spam (based on my collection of spam). Of course,
you have to weigh all the other word pairs too. I call "word pairs"
"chains", since they're basically Markov Chains.

[ During the training stage, I can specify the size of the chains to use.
  Size 1 means to simply count the frequency of every word and use it.
  Size 2 (shown here) does the frequency of word pairs. Increasing the
  size means your data set gets much bigger :) ]

The size 2 results I currently have result in 407325 chains, which means
each hash (good, bad, and prob) has 407325 entries in it.

The file containing lots of rules is here [16MB].
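http://ttul.org/~nwatkiss/misc/rules

This might just work (not tested):

    my %stats;
    while (my $line = <>) {
        # Split on colons and whitespace: the chain first, then key=value parts.
        my ($chain, @rest) = split /[:\s]+/, $line;
        for my $part (@rest) {
            my ($key, $value) = split /=/, $part;
            # Same layout as $stats above: $stats{good}{'you,see'} and so on.
            $stats{$key}{$chain} = $value;
        }
    }

Have fun!
Neil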
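
For this particular structure, the iterative dump discussed in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions: dump_stats is a hypothetical helper, not part of YAML.pm, and it writes the simple one-line-per-chain format described in this message rather than real YAML. The point is that each line goes straight to the filehandle, so the full dump is never built up as one string in memory.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of YAML.pm): write one line per chain
# directly to a filehandle instead of building the whole dump as a string.
sub dump_stats {
    my ($stats, $fh) = @_;
    # Assumption: every chain has an entry under 'good'.
    # Keys are sorted only to make the output order stable.
    for my $chain (sort keys %{ $stats->{good} }) {
        my @parts = map  { "$_=$stats->{$_}{$chain}" }
                    grep { exists $stats->{$_}{$chain} } qw(good bad prob);
        print {$fh} "${chain}: @parts\n";
    }
}

my $stats = {
    good => { 'you,see' => 15 },
    bad  => { 'you,see' => 145 },
    prob => { 'you,see' => 0.8267 },
};

dump_stats($stats, \*STDOUT);
```

Sorting the keys still builds a list of all chain names, which is far smaller than the full dump. If even that is too much, iterating with each would avoid it at the cost of an unpredictable line order.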