From: Brian I. <in...@tt...> - 2002-08-19 18:13:49
On 19/08/02 10:46 -0700, Neil Watkiss wrote:
> Hi Steve,
>
> I had a bad experience with YAML(.pm) this weekend, so I'm going to argue
> with you just 'cause I'm feeling nasty. :)
>
> Steve Howell [19/08/02 12:10 -0400]:
> > Well, I am probably a little extreme in my philosophy, but I believe
> > libraries should avoid "convenience" interfaces and focus on providing
> > only the essential building blocks. If somebody does not have my
> > yaml.dump(), then they have to write their own equivalent
> > implementation of what has taken Clark and me 234 lines of Python so
> > far. On the other hand, if somebody did not have access to my
> > dumpFile(), then they would have to write their own 5 lines of Python,
> > but they would get complete control over the interface. I've already
> > suggested one variation on dumpFile() -- interspersing comments between
> > documents -- and I am sure there are other variations that I haven't
> > anticipated.
>
> This weekend I wrote a very interesting spam-checking backend for PerlMx (a
> sendmail milter engine, http://www.ActiveState.com/PerlMx). The current
> backend is based on SpamAssassin and has hundreds of rules, basically
> regular expressions. Mine is based on an excellent article posted on
> Slashdot last week: http://www.paulgraham.com/spam.html.
>
> The basic idea is that you start with a large corpus of "good" and "bad"
> email and count the frequency of every word in both. This lets you
> calculate the probability that any given word appears in a "bad" email.
> You can then use this technique to predict whether an incoming email is
> "good" or "bad", and assign a probability to the prediction.
>
> I generated an enormous list of words, about 10MB, and I wanted to dump
> the data structure using YAML. Unfortunately, the current YAML.pm dumps to
> a string and then writes that string to a file, which consumed all
> available memory (200MB) and then crashed. An iterative version would have
> finished merrily.

I totally agree with you here.
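For readers who haven't seen Graham's article: the word-frequency idea Neil describes can be sketched in a few lines of Python. This is a hypothetical illustration, not Neil's actual PerlMx backend; the function names `word_probs` and `spam_score` are made up for the example, and the probability combination is a simplified naive-Bayes product rather than Graham's exact formula.

```python
# Sketch of the word-frequency spam idea: count word occurrences in
# "good" and "bad" corpora, then estimate the probability that a
# message containing a given word is spam. (Hypothetical names; a
# simplified version of the scheme in Graham's article.)
from collections import Counter

def word_probs(good_msgs, bad_msgs):
    """Map each word to an estimate of P(spam | word)."""
    good = Counter(w for msg in good_msgs for w in msg.lower().split())
    bad = Counter(w for msg in bad_msgs for w in msg.lower().split())
    ngood, nbad = max(len(good_msgs), 1), max(len(bad_msgs), 1)
    probs = {}
    for w in set(good) | set(bad):
        g = good[w] / ngood   # relative frequency in good mail
        b = bad[w] / nbad     # relative frequency in bad mail
        probs[w] = b / (g + b)
    return probs

def spam_score(msg, probs):
    """Combine per-word probabilities with a naive-Bayes product."""
    p_spam = p_ham = 1.0
    for w in msg.lower().split():
        if w in probs:
            p_spam *= probs[w]
            p_ham *= 1.0 - probs[w]
    total = p_spam + p_ham
    return p_spam / total if total else 0.5
```

Note that the `probs` dictionary is exactly the kind of structure Neil is talking about: one entry per distinct word (or word pair) in the corpus, which grows to tens of megabytes on real mail archives.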
And now we have a good failing test! If I write an iterative interface, will
you test it out for me? I'd love to get your feedback as well. BTW, if you
can share the dataset, I'm sure Steve and Why would like to use it too. Heck,
can you post the data somewhere?

> Conclusion? YAML libraries (in any language) need to offer much more than
> simple load/dump. At the very least, they must provide iterative or
> callback APIs, so I can dump the structure to a file a little bit at a
> time. I definitely don't want to consume 200MB to print out some debugging
> information.

Obviously, load/dump is the most important thing to start with. But as Neil
points out, it only works with small datasets. I think the next most
important API is load_next/dump_next, where each document needs to be
completely in memory, but not the entire stream. This is a little bit of
work for a rather large benefit.

Much later, we can look at truly streaming APIs. These get tricky due to
aliasing: you almost *need* to have an entire document in memory to know
where the aliases are. A truly streaming application filter could just use
the parser and emitter and never load the document at all, but it would need
to keep track of aliases itself and preserve them on output.

> My next optimization was to use word pairs, not single words, to guess
> whether an email was bad. This time my dataset was over 50MB, and I didn't
> even bother with YAML. In fact, I haven't yet found a *single* dumper of
> *any* kind on CPAN that can handle dumping this data structure[1]. People
> writing Perl modules are obviously leaning towards "laziness" as the only
> virtue.
>
> So I wrote my own. It's not YAML, but it showed me what I needed to see,
> and I fixed some bugs. Now I'm happy... but not as happy as I'd have been
> if YAML had solved my problem for me :)

Let's fix that!

Cheers,
Brian
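P.S. The load_next/dump_next shape discussed above could look roughly like this. This is a hypothetical sketch, not YAML.pm's or any real library's API: `dump_document` and `load_document` stand in for whatever serializes a single document, and the stream is split naively on `--- ` separator lines (so it would mis-split a document whose content contains such a line). The point is that only one document is ever held in memory, so a 200MB stream never exists as a single string.

```python
# Sketch of a document-at-a-time multi-document stream interface
# (hypothetical; dump_document/load_document are stand-ins for a
# real per-document serializer and parser).

def dump_stream(documents, fh, dump_document):
    """Write an iterable of documents to fh, one document at a time."""
    for doc in documents:
        fh.write("--- ")            # YAML-style document separator
        fh.write(dump_document(doc))
        fh.write("\n")

def load_stream(fh, load_document):
    """Yield documents one at a time from a multi-document stream.

    Naive split on '--- ' lines; a real parser would track document
    boundaries properly.
    """
    buf = []
    for line in fh:
        if line.startswith("--- ") and buf:
            yield load_document("".join(buf))
            buf = []
        buf.append(line)
    if buf:
        yield load_document("".join(buf))
```

Because `load_stream` is a generator and `dump_stream` takes any iterable, a filter can read, transform, and re-emit a huge stream with only one document resident at a time.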