From: Brian I. <in...@tt...> - 2002-08-17 22:06:18
|
Where to start? YAML.pm has a dead simple interface for dumping memory as a YAML serialization and for doing the reverse. It's like this:

    use YAML;
    $yaml_string = Dump(@object_refs);
    @object_refs = Load($yaml_string);

Even though this is simple, I think there's a lot to say about it.

First is the names 'dump' and 'load'. The main reason I chose these is because the spec talks about the concepts of a Dumper and a Loader. I actually used to use Store() instead of Dump(), but Store() is now deprecated. I also chose Dump() because it is in line with the other Perl serializing modules:

    Module         Serialize   Deserialize
    ----------------------------------------
    Data::Dumper   Dumper      eval
    Data::Dump     dump        eval
    Data::Denter   Indent      Undent
    Data::DumpXML  dump_xml    $p->parsefile
    FreezeThaw     freeze      thaw
    Storable       freeze      thaw
    YAML           Dump        Load

I would be happy if we could at least all agree to support 'dump' and 'load' in our interfaces. I think it will give new users a good point of reference. The actual specifics of the calls will probably be different from language to language. I'll describe how mine work in a moment.

I export Load() and Dump() by default. This is typical in Perl because it allows the smallest syntax for the simplest case. For instance, a one-liner to dump the symbol table can look like this:

    perl -MYAML -e 'print Dump *::'

In some languages it might be considered rude to export symbols by default. In Perl, it's ok as long as you don't go overboard. A cautious user can defeat the exporting with:

    use YAML();

which specifies an empty export list.

I also use title case (Dump instead of dump) for the exported things. This is simply a matter of style. I save lowercase for OO calls like:

    YAML->new->dump(@objects);

Now I'll talk a bit about how the calls work. In Perl, you have list and scalar context. Dump() returns a single YAML string in either case.
Load() returns all objects in the YAML stream in list context, and the last object in the stream in scalar context. I'm actually not satisfied with Load() in scalar context. I think I might change it to return the next object in the stream. Or maybe an iterator. But that's talk for another day...

These calls only deal with YAML as a string, not as a file or a filehandle. I'd like to start talks about a clean way of doing that. Currently I support:

    LoadFile(filename);
    DumpFile(filename, @objects);

Load and Dump also do operations in one shot. There is not yet an iterative interface for adding to or parsing from a stream. In other words, they are atomic at the stream level. I think that's actually best for the basic interface. But I'd like to add loading and dumping at the document level.

One point: it is important to remember that YAML documents in a stream cannot know anything about each other. They have separate tab policies, anchor/aliasing, etc. Be sure to reset your configurations between documents.

That's all I have for now.

Cheers, Brian |
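[Editor's note: Brian's symmetric Dump/Load pair can be made concrete with a deliberately tiny sketch in modern Python. The function bodies below are mine, not YAML.pm's: they handle only flat string-to-string mappings, but they show the contract he describes -- Dump takes any number of objects and returns one stream string, Load gives all the documents back.]

```python
def Dump(*objects):
    """Serialize each flat mapping as its own '---'-delimited document.

    Toy emitter: only handles flat string-to-string mappings.
    """
    out = []
    for obj in objects:
        out.append('---')
        for key, value in obj.items():
            out.append('%s: %s' % (key, value))
    return '\n'.join(out) + '\n'

def Load(stream_text):
    """Inverse of Dump: return a list with one mapping per document."""
    docs = []
    for line in stream_text.splitlines():
        if line == '---':
            docs.append({})          # each '---' starts a fresh document
        elif line:
            key, _, value = line.partition(': ')
            docs[-1][key] = value
    return docs

objects = [{'name': 'brian'}, {'name': 'why'}]
assert Load(Dump(*objects)) == objects
```

The round trip `Load(Dump(*objects)) == objects` is the symmetry property; a real implementation layers quoting, nesting, and aliasing on top of the same shape.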
From: why t. l. s. <yam...@wh...> - 2002-08-18 07:04:22
|
Show and tell, then? Or truth or dare?

Brian Ingerson (in...@tt...) wrote:
> use YAML;
> $yaml_string = Dump(@object_refs);
> @object_refs = Load($yaml_string);

YAML.rb has load() as well. I could add a dump() method quite easily, as it will just run the to_yaml method of the object passed to it. Here's your code in Perl's much prettier younger sister ;) ..

    require 'yaml'
    yaml_string = object.to_yaml
    object = YAML::load( yaml_string )

> I would be happy if we could at least all agree to support 'dump' and
> 'load' in our interfaces. I think it will give new users a good point of
> reference. The actual specifics of the calls will probably be different
> from language to language. I'll describe how mine work in a moment.

I can appreciate the background on 'dump' and 'load' and I think it's a good convention. I see good reason for some identical semantics between implementations.

> I'm actually not satisfied with Load() in scalar context. I think I might
> change it to return the next object in the stream. Or maybe an iterator. But
> that's talk for another day...

YAML.rb has three methods for loading YAML, all of which I am quite fond of. The first is the simple YAML::load illustrated above. The second is a YAML::load_document call, which loads an entire stream at once into an object. This is useful if you want to load the entire stream, alter it, and spit it back out, keeping many of the conventions found in the original YAML file:

    require 'yaml'
    ydoc = YAML::load_document( File.open( 'EMPLOYEES.yml' ) )
    ydoc.add( { 'name' => 'Why', 'salary' => 1.0/00 } )
    File.open( 'EMPLOYEES.yml', 'w' ).write( ydoc.emit )

Perhaps 'emit' could be exchanged for 'dump' if it's decided. I'm not totally in love with the load_document method, but it's convenient.

The last method, the iterator, is great. The only problem is that it handles document separators and pauses identically. But it will read from a stream (TCPSocket, File, etc.) and run the Proc for each document.
    require 'yaml'
    YAML::Parser.new.parse_documents( File.open( 'EMPLOYEES.yml' ) ) { |ydoc|
        puts "Employee found: " + ydoc['name']
    }

I'd like to do more with the above and I'm sure I will when I get more real usage from it. All of my loading methods can take a String or any other IO object.

> One point; it is important to remember that YAML documents in a stream
> cannot know anything about each other. They have separate tab policies,
> anchor/aliasing, etc. Be sure to reset your configurations between
> documents.

Good point. I think YAML.rb is on about that, but I need to double check.

Another thing: much of what I've done in YAML.rb is based on the spec's description of the native model, generic model, and serial model. I've noticed that information takes up much of the spec, too. Would that be more useful in an implementor's document? It's very useful, but seems geared toward implementors rather than users.

_why |
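[Editor's note: the block-based parse_documents iterator _why shows has a close analogue in a Python callback. This is a toy sketch in modern Python -- the 'key: value' parsing stands in for a real YAML parser -- not YAML.rb's actual implementation.]

```python
import io

def parse_documents(fileobj, callback):
    """Run `callback(doc)` once per document in a '---'-separated stream.

    Each document here is a flat 'key: value' mapping -- a toy stand-in
    for a real YAML parser, mirroring YAML.rb's block-based iterator.
    Lines are consumed lazily, so only one document is buffered at a time.
    """
    def load_one(lines):
        doc = {}
        for line in lines:
            key, _, value = line.partition(':')
            doc[key.strip()] = value.strip()
        return doc

    current = []
    for line in fileobj:
        line = line.rstrip('\n')
        if line == '---':
            if current:
                callback(load_one(current))
            current = []
        else:
            current.append(line)
    if current:                   # final document has no trailing '---'
        callback(load_one(current))

stream = io.StringIO("---\nname: why\n---\nname: brian\n")
parse_documents(stream, lambda ydoc: print("Employee found:", ydoc['name']))
```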
From: Steve H. <sh...@zi...> - 2002-08-18 18:05:08
|
From: "Brian Ingerson" <in...@tt...>
> [...]
>
> I would be happy if we could at least all agree to support 'dump' and
> 'load' in our interfaces. I think it will give new users a good point of
> reference. The actual specifics of the calls will probably be different
> from language to language. I'll describe how mine work in a moment.
> [...]

Well, I've now used YAML in three different languages--Perl, Python, and Ruby. There is enough consistency between the implementations that you feel like you're on familiar ground, but you also get the sense that yaml.pm is Perlish, that yaml.py is Pythonic, and that yaml.rb is too damn clever and well thought out to ever catch on in the mainstream.

Here is how the Python interface works for loading:

    import yaml
    readme = yaml.loadFile("README")

The variable "readme" is a Python iterator, which returns a data structure for each YAML document in the README file. An iterator basically is an object that supports a next() method:

    first_doc_in_readme = readme.next()
    second_doc_in_readme = readme.next()

Like all Python iterators, readme throws a StopIteration exception when there are no more documents.

If Python iterators make you feel queasy, there is no reason to fear. Iterators actually make your life easier:

1) You can turn an iterator into a list:

    readme = yaml.loadFile("README")
    readme_docs = list(readme)  # calls iterator repeatedly to build a list

2) Not surprisingly, you can iterate with an iterator:

    readme = yaml.loadFile("README")
    for doc in readme:
        print doc

Of course, you don't have to load from a file. Quoting directly from demo.py in the PyYaml distribution, you can use the load method as follows:

    testData = \
    """
    program: PyYaml
    author: Steve Howell
    ---
    shopping list:
        - apple
        - banana
    todo:
        - eat more fruit:
            - especially bananas!
            - good for you
        - write a better demo
    """

    print "YAML INSIDE YOUR PROGRAM"
    for x in yaml.load(testData):
        print repr(x)
    print "\n\n"

Of course, there are times when your YAML only includes one document.

    testdata = \
    """
    name: steve
    language: python
    """
    info = yaml.load(testdata).next()

I suppose having to use the next() method here is a bit clunky, but right now I leave it up to the user to wrap this.

Both yaml.load and yaml.loadFile take an optional extra argument for resolving classes. See my 8/16/2002 post "new PyYaml--better support for serializing objects" for more detail on that.

Then, there's dumping. Right now there's only one method in the interface:

    print yaml.dump({'foo': [1,2,3]})

Pass in the data structure, get back a string. Soon there will be a richer interface for dumping, including a dumpFile method, and optional arguments, but the simple "dump" is all you have for now.

Cheers, Steve |
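[Editor's note: for readers unfamiliar with the iterator protocol Steve is describing, here is a minimal self-contained class in modern Python, where the next() method is spelled __next__. The class name and the toy '---' splitting are mine; only the protocol -- next() per document, StopIteration at the end, compatibility with list() and for-loops -- matches what yaml.loadFile returns.]

```python
class DocIterator:
    """Minimal iterator over '---'-separated documents in a string.

    Demonstrates the protocol Steve describes: each call to next()
    returns one document, StopIteration signals the end, and because
    it is a real Python iterator it also works with list() and for.
    """
    def __init__(self, text):
        # toy splitting: a real loader would parse lazily
        self._docs = [d.strip() for d in text.split('---') if d.strip()]
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):           # spelled plain next() in 2002 Python
        if self._pos >= len(self._docs):
            raise StopIteration
        doc = self._docs[self._pos]
        self._pos += 1
        return doc

readme = DocIterator("---\ndoc one\n---\ndoc two\n")
print(list(readme))
```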
From: Brian I. <in...@tt...> - 2002-08-18 22:25:20
|
On 18/08/02 14:04 -0400, Steve Howell wrote:
> From: "Brian Ingerson" <in...@tt...>
> > [...]
> > I would be happy if we could at least all agree to support 'dump' and
> > 'load' in our interfaces. [...]
>
> Well, I've now used YAML in three different languages--Perl, Python, and Ruby.
> There is enough consistency between the implementations that you feel like
> you're on familiar ground, but you also get the sense that yaml.pm is Perlish,
> that yaml.py is Pythonic, and that yaml.rb is too damn clever and well thought
> out to ever catch on in the mainstream.
>
> Here is how the Python interface works for loading:

It's interesting how you two both have fancy loaders and simple dumpers. There's something asymmetrical about that. I like:

    stream = dump(list_of_objects)
    list_of_objects = load(stream)

Nice symmetry. I guess the Perl language makes that more natural since it supports lists. I can say for sure that Perl helped make the decision that YAML documents would be completely independent of each other.

I guess I'd want to have that same symmetry on an iterative interface:

    stream = yaml->new
    document = stream->load_next
    stream->dump_next(document)

> [... Python loading examples snipped ...]
>
> Then, there's dumping. Right now there's only one method in the interface:
>
>     print yaml.dump({'foo': [1,2,3]})

How do you dump multiple documents? Does dump produce a YAML separator by default?

> Pass in the data structure, get back a string. Soon there will be a richer
> interface for dumping, including a dumpFile method, and optional arguments, but
> the simple "dump" is all you have for now.

Gotta go have fun. More later.

Cheers, Brian |
From: Steve H. <sh...@zi...> - 2002-08-18 23:35:49
|
Brian Ingerson wrote:
> [...]
> It's interesting how you two both have fancy loaders and simple dumpers.

Well, I think Why and I have both spent more time on the loader than the dumper, for whatever reason. As we turn our attention to the dumpers, I am sure we will find ways to cruft up the dumper interface. ;)

There is some inherent asymmetry between the loader and dumper interfaces, though. With a loader, you really need the yaml library to manage the stream for you, because it needs to detect headers. With a dumper, though, the application layer can just as easily append to a file as the yaml layer can.

> How do you dump multiple documents? Does dump produce a YAML separator by
> default?

Yes, dump produces a separator by default, and you can keep appending new documents to a file. |
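[Editor's note: Steve's answer -- dump emits its own '---', so the application just appends documents -- can be sketched like this in modern Python. dump_doc is a hypothetical name, and the emitter handles only flat mappings.]

```python
import io

def dump_doc(fileobj, data):
    """Write one flat mapping as a document, leading '---' included."""
    fileobj.write('---\n')
    for key, value in data.items():
        fileobj.write('%s: %s\n' % (key, value))

f = io.StringIO()
dump_doc(f, {'name': 'steve'})
dump_doc(f, {'name': 'why'})   # appending is all it takes to grow the stream
print(f.getvalue())
```

Because every document carries its own separator, there is no stream-level state to manage: the application can intersperse its own comments, write to any file handle, or stop and resume at will.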
From: why t. l. s. <yam...@wh...> - 2002-08-19 05:53:22
|
Brian Ingerson (in...@tt...) wrote:
> It's interesting how you two both have fancy loaders and simple dumpers.
> There's something asymmetrical about that. I like:
>
>     stream = dump(list_of_objects)
>     list_of_objects = load(stream)
>
> Nice symmetry. I guess the Perl language makes that more natural since it
> supports lists. I can say for sure that Perl helped make the decision that
> YAML documents would be completely independent of each other.
>
> I guess I'd want to have that same symmetry on an iterative interface.

My feelings exactly. I would like to work toward symmetry with each of the three techniques for Ruby:

1. Simple single-document, single-object dump'n'load:

    stream = YAML::dump( obj )
    obj = YAML::load( stream )

2. Multi-document dump'n'load:

    # where 'ydoc' is a YAML::Document object
    stream = ydoc.dump
    ydoc = YAML::load_document( stream )

3. Iterative is currently read-only, but I'd like to make it so you could iterate through each object and provide a means for providing an output stream, so that any object returned from the block would be written to the output. In other words, to halve all the salaries in the employee list:

    emplist = YAML::Stream.new( File.open( 'EMPLOYEES.yml' ) )
    emplist2 = YAML::Stream.new( File.open( 'EMPLOYEES.yml~', 'w' ) )
    emplist.each_output_to( emplist2 ) { |e|
        e['salary'] /= 2
    }

I have some problems with my Document class, as it's actually a class that contains many YAML documents. The problem is that I think most people think of a Stream as some form of IO (thanks to streaming audio or the like) and I don't want to confuse users. Any ideas?

_why |
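[Editor's note: _why's each_output_to filter has a natural shape in Python too: read documents from one stream, transform each, and write it to another, with only one document in memory at a time. Everything below is a toy sketch -- flat 'key: value' documents, integer salaries -- not YAML.rb's API.]

```python
import io

def filter_stream(infile, outfile, transform):
    """Read '---'-separated docs, apply transform, write them back out.

    A toy Python analogue of YAML.rb's proposed each_output_to. The
    parsing handles only flat 'key: value' lines, coercing values to
    int where possible.
    """
    def load_one(lines):
        doc = {}
        for line in lines:
            key, _, value = line.partition(': ')
            try:
                doc[key] = int(value)
            except ValueError:
                doc[key] = value
        return doc

    def dump_one(doc):
        outfile.write('---\n')
        for key, value in doc.items():
            outfile.write('%s: %s\n' % (key, value))

    current = []
    for line in infile:
        line = line.rstrip('\n')
        if line == '---':
            if current:
                dump_one(transform(load_one(current)))
            current = []
        else:
            current.append(line)
    if current:
        dump_one(transform(load_one(current)))

def halve(emp):
    emp['salary'] //= 2      # integer halving, for the toy example
    return emp

src = io.StringIO("---\nname: why\nsalary: 100\n")
dst = io.StringIO()
filter_stream(src, dst, halve)
print(dst.getvalue())
```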
From: Steve H. <sh...@zi...> - 2002-08-19 13:55:10
|
From: "why the lucky stiff" <yam...@wh...>
> Brian Ingerson (in...@tt...) wrote:
> > [...]
> > I guess I'd want to have that same symmetry on an iterative interface.
>
> My feelings exactly. I would like to work toward symmetry with each of
> the three techniques for Ruby:
>
> 1. Simple single-document, single-object dump'n'load:
>
>     stream = YAML::dump( obj )
>     obj = YAML::load( stream )

I like making the simple case simple, so this makes sense. Instead of saying "stream" in your example, though, I would say "str," as in string data.

    str = yaml.dump(obj)
    obj = yaml.load(str)

Here would be the proposed interface in more context:

    print yaml.dump({'name': 'steve'})

    data = \
    """
    name: steve
    phone: 555
    """
    print yaml.load(data)['phone']  # should print 555

> 2. Multi-document dump'n'load:
>
>     # where 'ydoc' is a YAML::Document object
>     stream = ydoc.dump
>     ydoc = YAML::load_document( stream )

LOADING

Let's talk about the loading case first. If you have a file with multiple YAML documents, then you want yaml to return some kind of an object that will let you get one document at a time. (In Python this would happen to be an iterator, but it really just needs to be an object that supports a next() method.) I think you are basically getting a loader back.

    str = \
    """
    ---
    name: why
    port: ruby
    ---
    name: brian
    port: perl
    """
    loader = yaml.loader(str)
    the_stiff = loader.next()
    ingy = loader.next()

    # or...
    loader = yaml.loader(str)
    yaml_porters = list(yaml.loader(str))

    # or...
    for yaml_porter in yaml.loader(str):
        pass

    # or...
    loader = yaml.fileLoader("yaml_porters.txt")
    for yaml_porter in loader:
        print yaml_porter['name']

Of course, load() and loadFile() are just convenience methods, so you might implement them like this in your yaml implementation:

    def load(str):
        return loader(str).next()

    def loadFile(str):
        return fileLoader(str).next()

DUMPING

As for dumping, I stand by my original comment, which is that yaml doesn't need to manage the writing of multiple documents. It's better handled at the application layer. Start with a simple method:

    str = yaml.dump(['apple', 'banana', 'carrot', 'dog'])
    assertEquals(str, """
    ---
    - apple
    - banana
    - carrot
    - dog
    """)

Then, let the application do its own I/O. For example, this little Python program writes 3 yaml docs to stdout, interspersing its own comments:

    import yaml

    porters = [
        ('steve', {'name': 'showell', 'lang': 'python'}),
        ('rolf',  {'name': 'Veen',    'lang': 'java'}),
        ('clark', {'name': 'cce',     'lang': 'python'})
    ]

    for (name, data) in porters:
        print "# YAML output for %s" % name
        print yaml.dump(data)

Here is the code adapted to use a file for output:

    f = open('porters.txt', 'w')
    for (name, data) in porters:
        f.write("# YAML output for %s\n" % name)
        f.write(yaml.dump(data))
    f.close()

Keeping the dump() interface simple gives the application developer ultimate flexibility.

> 3. Iterative is currently read-only but I'd like to make it so
> you could iterate through each object and provide a means
> for providing an output stream so that any object returned
> from the block would be written to the output. In other
> words, to halve all the salaries in the employee list:
>
>     emplist = YAML::Stream.new( File.open( 'EMPLOYEES.yml' ) )
>     emplist2 = YAML::Stream.new( File.open( 'EMPLOYEES.yml~' ) )
>     emplist.each_output_to( emplist2 ) { |e|
>         e['salary'] /= 2
>     }

I think your example makes sense in Ruby, but I would do it like this in Python:

    f = open('pissed_employees.yml', 'w')
    f.write('# All these employees got their salaries cut in half.\n')
    for emp in yaml.fileLoader('employees.yml'):
        emp['salary'] /= 2
        f.write(yaml.dump(emp))
    f.close()

Cheers, Steve |
From: why t. l. s. <yam...@wh...> - 2002-08-19 15:39:50
|
Steve Howell (sh...@zi...) wrote:
>
> Let's talk about the loading case first. If you have a file with multiple YAML
> documents, then you want yaml to return some kind of an object that will let you
> get one document at a time. (In Python this would happen to be an iterator, but
> it really just needs to be an object that supports a next() method.) I think
> you are basically getting a loader back.
>
>     str = \
>     """
>     ---
>     name: why
>     port: ruby
>     ---
>     name: brian
>     port: perl
>     """
>     loader = yaml.loader(str)
>     the_stiff = loader.next()
>     ingy = loader.next()

Tell me how this would work in Python. Is the stream loaded entirely into memory? Or does it read a single document before polling for more?

> DUMPING
>
> As for dumping, I stand by my original comment, which is that yaml doesn't need to
> manage the writing of multiple documents. It's better handled at the
> application layer. Start with a simple method:

Don't you think that the interface you described above needs a dumper?

    str = \
    """
    ---
    name: why
    port: ruby
    ---
    name: brian
    port: perl
    """
    loader = yaml.loader(str)
    loader.add( { 'name': 'steve', 'port': 'python' } )
    print loader.dump()

My reasoning for the above idiom is that the Loader class (or whatever class you use to store many documents) can store configuration information about the stream. The indentation level can be saved, the version number, the type of foldings used. Since YAML is meant to be read and edited by hand, perhaps allowing this information to be reused in the dump could help improve the readability and friendliness of generated documents.

I personally find it convenient, but I don't know if it's just a personal ideal rather than a common need that we should have in all implementations. Probably. :P

> I think your example makes sense in Ruby, but I would do it like this in Python:
>
>     f = open('pissed_employees.yml', 'w')
>     f.write('# All these employees got their salaries cut in half.\n')
>     for emp in yaml.fileLoader('employees.yml'):
>         emp['salary'] /= 2
>         f.write(yaml.dump(emp))
>     f.close()

Cool. I totally see your point on having a single dumper. I have a single dumper. But I can see convenience in having some classes that provide a single interface to building a YAML stream, either by parsing or dumping.

Check ya later,

_why |
From: Steve H. <sh...@zi...> - 2002-08-19 16:11:02
|
From: "why the lucky stiff" <yam...@wh...>
> Steve Howell (sh...@zi...) wrote:
> >     loader = yaml.loader(str)
> >     the_stiff = loader.next()
> >     ingy = loader.next()
>
> Tell me how this would work in Python. Is the stream loaded entirely into
> memory? Or does it read a single document before polling for more?

Currently, the string is read entirely into memory up front, but only to split the lines. The actual parsing of each document does not occur until the calls to loader.next().

> Don't you think that the interface you described above needs a dumper?
>
>     loader = yaml.loader(str)
>     loader.add( { 'name': 'steve', 'port': 'python' } )
>     print loader.dump()
>
> My reasoning for the above idiom is that the Loader class (or whatever
> class you use to store many documents) can store configuration information
> about the stream. The indentation level can be saved, the version number,
> the type of foldings used. Since YAML is meant to be read and edited by
> hand, perhaps allowing this information to be reused in the dump could
> help improve the readability and friendliness of generated documents.
>
> I personally find it convenient, but I don't know if it's just a personal
> ideal rather than a common need that we should have in all implementations.
> Probably. :P

Well, yeah, I agree with everything you say here. I just haven't gotten there yet with my implementation. In particular, I'm way behind the Perl and Ruby implementations in terms of being able to customize the dump formatting.

> Cool. I totally see your point on having a single dumper. I have a single
> dumper. But I can see convenience in having some classes that provide a
> single interface to building a YAML stream, either by parsing or dumping.

Well, I am probably a little extreme in my philosophy, but I believe libraries should avoid "convenience" interfaces and focus on providing only the essential building blocks. If somebody does not have my yaml.dump(), then they have to write their own equivalent implementation of what has taken Clark and me 234 lines of Python so far. On the other hand, if somebody did not have access to my dumpFile(), then they would have to write their own 5 lines of Python, but they would get complete control over the interface. I've already suggested one variation on dumpFile()--interspersing comments between documents--and I am sure there are other variations that I haven't anticipated.

My history with using libraries is that I resist them for two reasons. Either they're too difficult to install and deploy, or they force me into a paradigm that I don't have control over. I am not saying that would be the case with YAML, but that's where my prejudices come from.

Cheers, Steve |
From: Neil W. <neilw@ActiveState.com> - 2002-08-19 17:49:27
|
Hi Steve,

I had a bad experience with YAML(.pm) this weekend, so I'm going to argue with you just 'cause I'm feeling nasty. :)

Steve Howell [19/08/02 12:10 -0400]:
> Well, I am probably a little extreme in my philosophy, but I believe
> libraries should avoid "convenience" interfaces and focus on providing
> only the essential building blocks. If somebody does not have my
> yaml.dump(), then they have to write their own equivalent
> implementation of what has taken Clark and me 234 lines of Python so
> far. On the other hand, if somebody did not have access to my
> dumpFile(), then they have to write their own 5 lines of Python, but
> they get complete control over the interface. I've already suggested
> one variation on dumpFile()--interspersing comments between
> documents--and I am sure there are other variations that I haven't
> anticipated.

This weekend I wrote a very interesting spam-checking backend for PerlMx (a sendmail milter engine, http://www.ActiveState.com/PerlMx). The current backend is based on SpamAssassin, and has hundreds of rules, basically regular expressions. Mine is based on an excellent article posted on slashdot last week: http://www.paulgraham.com/spam.html.

The basic idea is you start with a large corpus of "good" and "bad" email, and you search for the frequency of every word in both. This allows you to calculate the probability that any word is contained in a "bad" email. You can use this technique to predict whether an incoming email is "good" or "bad", and you can assign a probability to it.

I generated an enormous list of words, about 10MB, and I wanted to dump the data structure using YAML. Unfortunately, the current YAML.pm dumps to a string, and then writes to a file, which consumed all available memory (200MB) and then crashed. An iterative version would have finished merrily.

Conclusion? YAML libraries (in any language) need to offer much more than simple load/dump. At the very least, they must provide iterative or callback APIs, so I can dump the structure to a file a little bit at a time. I definitely don't want to consume 200MB to print out some debugging information.

My next optimization was to use word pairs, not single words, to guess whether an email was bad. This time my dataset was over 50MB, and I didn't even bother with YAML. In fact, I haven't yet found a *single* dumper of *any* kind on CPAN that can handle dumping this data structure[1]. People writing Perl modules are obviously leaning towards "laziness" as the only virtue.

So I wrote my own. It's not YAML, but it showed me what I needed to see, and I fixed some bugs. Now I'm happy... but not as happy as I'd have been if YAML had solved my problem for me :)

> My history with using libraries is that I resist them for two reasons.
> Either they're too difficult to install and deploy, or they force me
> into a paradigm that I don't have control over. I am not saying that
> would be the case with YAML, but that's where my prejudices come from.

Mmmm. You're writing the library, so you can expose both a simple API that won't scare off the do-it-yourself-ers, and the more complicated one needed for "power users" like me.

Later,
Neil

--
Footnotes:
[1] Actually Storable could handle it -- but it takes 180MB RAM and 22 minutes, and it's a binary format anyway. :( |
From: Brian I. <in...@tt...> - 2002-08-19 18:13:49
|
On 19/08/02 10:46 -0700, Neil Watkiss wrote:
> Hi Steve,
>
> I had a bad experience with YAML(.pm) this weekend, so I'm going to argue
> with you just 'cause I'm feeling nasty. :)
>
> [...]
>
> I generated an enormous list of words, about 10MB, and I wanted to dump the
> data structure using YAML. Unfortunately, the current YAML.pm dumps to a
> string, and then writes to a file, which consumed all available memory
> (200MB) and then crashed. An iterative version would have finished merrily.

I totally agree with you here. And now we have a good failing test! If I write an iterative interface, will you test it out for me? I'd love to get your feedback as well.

BTW, if you can share the dataset, I'm sure Steve and Why would like to use it as well. Heck, can you post the data somewhere?

> Conclusion? YAML libraries (in any language) need to offer much more than
> simple load/dump. At the very least, they must provide iterative or callback
> APIs, so I can dump the structure to a file a little bit at a time. I
> definitely don't want to consume 200MB to print out some debugging
> information.

Obviously, the load/dump is most important to start with. But as Neil points out, it only works with small datasets.

I think the next most important API is load_next/dump_next, where each document needs to be completely in memory, but not the entire stream. This is a little bit of work for a rather large benefit.

Much later, we can look at truly streaming APIs. These get tricky due to aliasing. You almost *need* to have an entire document in memory to know where the aliases are. A truly streaming application filter could just use the parser and emitter and never load the document at all. But it would need to keep track of aliases itself, and preserve them on output.

> My next optimization was to use word pairs, not single words, to guess
> whether an email was bad. This time my dataset was over 50MB, and I didn't
> even bother with YAML. In fact, I haven't yet found a *single* dumper of
> *any* kind on CPAN that can handle dumping this data structure[1].
>
> So I wrote my own. It's not YAML, but it showed me what I needed to see, and
> I fixed some bugs. Now I'm happy... but not as happy as I'd have been if YAML
> had solved my problem for me :)

Let's fix that!

Cheers, Brian |
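[Editor's note: Brian's load_next/dump_next idea -- whole document in memory, never the whole stream -- might look like the following sketch in modern Python. The method names come from his pseudocode; the class name and toy flat-mapping format are mine.]

```python
import io

class YamlStream:
    """Document-at-a-time access to a stream, per Brian's sketch.

    dump_next writes one document and forgets it, so memory use is
    bounded by the largest single document, not the whole stream.
    load_next returns the next document's raw text, or None at EOF.
    (Toy code: documents are flat 'key: value' text, not real YAML.)
    """
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def dump_next(self, doc):
        self.fileobj.write('---\n')
        for key, value in doc.items():
            self.fileobj.write('%s: %s\n' % (key, value))

    def load_next(self):
        lines = []
        for line in self.fileobj:
            line = line.rstrip('\n')
            if line == '---' and lines:
                break             # next document's separator reached
            if line != '---':
                lines.append(line)
        return '\n'.join(lines) if lines else None

f = io.StringIO()
stream = YamlStream(f)
for i in range(3):
    stream.dump_next({'doc': i})   # one document at a time, then forgotten
print(f.getvalue())
```

As Brian notes, a truly streaming API would also have to track anchors and aliases across events; this per-document version sidesteps that, since each document is complete when it is handled.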
From: Neil W. <neilw@ActiveState.com> - 2002-08-19 19:12:59
|
Brian Ingerson [19/08/02 11:13 -0700]: > > I generated an enormous list of words, about 10MB, and I wanted to > > dump the data structure using YAML. Unfortunately, the current > > YAML.pm dumps to a string, and then writes to a file, which consumed > > all available memory (200MB) and then crashed. An iterative version > > would have finished merrily. > > I totally agree with you here. And now we have a good failing test! If I > write an iterative interface, will you test it out for me? I'd love to get > your feedback as well. Okay. Deal. > BTW, If you can share the dataset, I'm sure Steve and Why would like to use > it as well. Heck, can you post the data somewhere? The data currently is a *large* Storable file, which isn't any good to anyone, unfortunately. The script which generates it is also tightly integrated with PerlMx's support libraries, which aren't public. The format I invented for myself is very simple, and somewhat yamlish: you,see: good=1 bad=0 prob=0.01 Here's what the structure expands to in memory: $stats = { good => { 'you,see' => 15, }, bad => { 'you,see' => 145, }, prob => { 'you,see' => 0.8267, }, } This means if you see the words "you" followed by "see", the email has an 82% likelihood of being spam (based on my collection of spam). Of course, you have to weigh all the other word pairs too. I call "word pairs" "chains", since they're basically Markov Chains. [ During the training stage, I can specify the size of the chains to use. Size 1 means to simply count the frequency of every word and use it. Size 2 (shown here) does the frequency of word pairs. Increasing the size means your data set gets much bigger :) ] The size 2 results I currently have result in 407325 chains, which means each hash (good, bad, and prob) has 407325 entries in it. The file containing lots of rules is here [16MB]. 
http://ttul.org/~nwatkiss/misc/rules This might just work (not tested): my %stats; while (my $line = <>) { my ($chain, @rest) = split /[:\s]+/, $line; for my $part (@rest) { my ($key, $value) = split /=/, $part; $stats{$key}{$chain} = $value; } } Have fun! Neil |
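A rough Python translation of the same parsing idea, shown against a single sample line (like the Perl above, this is a sketch that has not been run against the real rules file):

```python
# Parse lines of the form "chain: key1=v1 key2=v2 ..." into
# stats[key][chain] = value, mirroring the Perl sketch above.
stats = {}
sample = ["you,see: good=15 bad=145 prob=0.8267"]
for line in sample:
    chain, rest = line.split(":", 1)       # chain is the word pair
    for part in rest.split():              # each part looks like key=value
        key, value = part.split("=")
        stats.setdefault(key, {})[chain] = float(value)
# stats == {'good': {'you,see': 15.0}, 'bad': {'you,see': 145.0},
#           'prob': {'you,see': 0.8267}}
```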
From: Steve H. <sh...@zi...> - 2002-08-19 20:27:53
|
From: "Neil Watkiss" <neilw@ActiveState.com> > [...] > Conclusion? YAML libraries (in any language) need to offer much more than > simple load/dump. At the very least, they must provide iterative or callback > APIs, so I can dump the structure to a file a little bit at a time. I > definitely don't want to consume 200MB to print out some debugging > information. > Just to clarify, you had enough memory to hold your data structure, but not enough memory for YAML to hold its intermediate strings while it was dumping? Any benchmark on what the ratio of memory usage was there? Or, were you literally trying to emit YAML while building the data structure itself? If that's the case, then you have a tricky scenario indeed. FWIW the Python implementation isn't too far from being able to do what you want. Mostly at Clark's insistence (thank you, Clark), the Python library basically does a totally sequential emit, although it doesn't currently emit directly to a stream; it instead emits to an array of tokens that it concatenates at the end. This code would be easy to refactor. Ingy made a good point about aliases--they really do force you to process your entire data structure up front. If you have data that you know has no self-recursive loops, and if you don't care about compressing duplicate data structures, then a yaml library should allow you to suppress aliases, at your own risk, of course. This is on my todo list for PyYaml. Until I hear otherwise, I'm assuming that most Python users are not dealing with huge data sets, but are rather using it for small config files, debugging medium-sized data structures, etc. When I start getting power users, I will start optimizing more for performance. Actually, someone like Neil could probably make good use of the YamlDump class in the PyYaml distribution. The YamlDump class takes an argument for the indentation level, so you could hand-emit YAML for the top part of your tree, and then call into the library to emit the branches. 
A great feature of YAML is that YAML fits nicely inside other YAML, but you do need for your library to let you control the indent level if you're gonna hand-embed the sub-YAML. > > > My history with using libraries is that I resist them for two reasons. > > Either they're too difficult to install and deploy, or they force me > > into a paradigm that I don't have control over. I am not saying that > > would be the case with YAML, but that's where my prejudices come from. > > Mmmm. You're writing the library, so you can expose both a simple API that > won't scare off the do-it-yourself-ers, and the more complicated one needed > for "power users" like me. > Okay, point well taken. I am starting to reconsider. Cheers, Steve |
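A sketch of the hand-embedding idea: because indented YAML nests cleanly inside YAML, you can hand-write the top of the tree and delegate a branch to an emitter that accepts a starting indent. The `emit_branch()` helper here is a stand-in, not PyYaml's real YamlDump API:

```python
# Hand-emit the top of the document, let a library-style helper emit a
# branch at a chosen indent level, then resume hand-written output.

def emit_branch(mapping, indent):
    pad = " " * indent
    return "".join("%s%s: %s\n" % (pad, k, v) for k, v in mapping.items())

doc = "---\n"
doc += "config:\n"
doc += emit_branch({"host": "localhost", "port": 80}, indent=4)
doc += "notes: hand-written part\n"
# doc is now one well-formed YAML document
```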
From: Neil W. <neilw@ActiveState.com> - 2002-08-19 20:42:23
|
Steve Howell [19/08/02 16:27 -0400]: > Just to clarify, you had enough memory to hold your data structure, but not > enough memory for YAML to hold its intermediate strings while it was dumping? > Any benchmark on what the ratio of memory usage was there? Or, were you > literally trying to emit YAML while building the data structure itself? If > that's the case, then you have a tricky scenario indeed. The former. Loading the data structure into memory takes 65MB. I tried the following dumpers: Data::Denter Data::Dump Data::Dumper Data::DumpXML YAML Data::Dumper was the first, and it died with an "Out of memory!" error at 250MB vmsize. I killed off the rest when they got to 200MB with no output seen. One thing I can try is suppressing aliases, since I know there are none. Later, Neil |
From: Rolf V. <rol...@he...> - 2002-08-19 16:28:25
|
why the lucky stiff wrote: > My reasoning for the above idiom is that the Loader class (or whatever > class you use to store many documents) can store configuration information > about the stream. The indentation level can be saved, the version number, > the type of foldings used. Since YAML is meant to be read and edited by > hand, perhaps allowing this information to be reused in the dump could > help improve the readability and friendliness of generated documents. A question I ask myself also. Do we need or is it convenient to have a class that holds more information about the YAML stream than the pure data structure? At the moment, the connection between the stream and the loader in my case is the lexer or event generator and it does not supply this information. Not that it isn't possible to generate. But then you need a parallel structure with the additional information. Maybe in this case it is better to use a specific nested node structure that can hold not only the data but also other attributes. Not sure about this. Cheers. Rolf. |
From: Brian I. <in...@tt...> - 2002-08-19 18:27:42
|
On 19/08/02 18:23 +0200, Rolf Veen wrote: > why the lucky stiff wrote: > > > My reasoning for the above idiom is that the Loader class (or whatever > > class you use to store many documents) can store configuration information > > about the stream. The indentation level can be saved, the version number, > > the type of foldings used. Since YAML is meant to be read and edited by > > hand, perhaps allowing this information to be reused in the dump could > > help improve the readability and friendliness of generated documents. > > A question I ask myself also. Do we need or is it convenient to have > a class that holds more information about the YAML stream than the pure > data structure? At the moment, the connection between the stream and > the loader in my case is the lexer or event generator and it does not > supply this information. Not that it isn't possible to generate. But > then you need a parallel structure with the additional information. > Maybe in this case it is better to use a specific nested node structure > that can hold not only the data but also other attributes. Rolf, YAML.pm uses a technique (which Clark informed me is) called shadowing. Using this technique, you have a master index that maps each node in a graph to another special "shadow" node. The shadow can contain information like mapping key order, YAML type family, indentation level to use, etc. I leverage this technique for all kinds of benefit in YAML.pm: - Formatting: An application can give hints on how it wants a node to be dumped. The great benefit of shadowing is that it doesn't disturb the original in-memory graph. - Transformation: To dump an opaque object, its class must support a yaml_dump method that returns a shadow. - Round Tripping: In the next release of YAML a user will have the option to shadow all mappings at load time. Then when they dump the resulting graph, the original key order (and possibly other attributes) will be preserved. 
One thing I like about this technique is that I only need to use it when and where I like. If I have a huge graph and only want to shadow one node, the rest of the graph does not pay a performance penalty. So it's not 'all or nothing'. Cheers, Brian |
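The shadowing technique described above can be sketched in Python with a side index keyed on node identity (hypothetical helpers, not YAML.pm's implementation): the original graph is never disturbed, and only the nodes you actually shadow pay any cost.

```python
# A "shadow" index: id(node) -> dict of dump-time hints. The hints live
# beside the graph, never inside it.

shadows = {}

def shadow(node, **hints):
    """Attach dump-time hints (key order, type family, indent...) to a node."""
    shadows.setdefault(id(node), {}).update(hints)

def hints_for(node):
    return shadows.get(id(node), {})

config = {"b": 2, "a": 1}
graph = {"config": config, "big_blob": list(range(1000))}

shadow(config, key_order=["a", "b"])   # only this one node is shadowed

# hints_for(config) == {'key_order': ['a', 'b']}; hints_for(graph) == {}
```

Note the caveat with id()-based indexing: the shadowed nodes must stay alive for as long as the index is used, or a recycled id could alias a new object.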
From: Brian I. <in...@tt...> - 2002-08-19 17:40:34
|
On 19/08/02 09:54 -0400, Steve Howell wrote: > From: "why the lucky stiff" <yam...@wh...> > > Brian Ingerson (in...@tt...) wrote: > > > It's interesting how you two both have fancy loaders and simple dumpers. > > > There's something asymmetrical about that. I like: > > > > > > stream = dump(list_of_objects) > > > list_of_objects = load(stream) > > > > > > Nice symmetry. I guess the Perl language makes that more natural since it > > > supports lists. I can say for sure that Perl helped make the decision that > > > YAML documents would be completely independent of each other. > > > > > > I guess I'd want to have that same symmetry on an iterative interface. > > > > My feelings exactly. I would like to work toward symmetry with each of > > the three techniques for Ruby: > > > > 1. Simple single-document, single-object dump'n'load: > > > > stream = YAML::dump( obj ) > > obj = YAML::load( stream ) > > I like making the simple case simple, so this makes sense. Instead of saying > "stream" in your example, though, I would say "str," as in string data. > > str = yaml.dump(obj) > obj = yaml.load(str) Well, I was talking in terms of a YAML "stream", not a Unix stream. A YAML stream is zero or more YAML documents. Instead of "str", I would say "doc", since your interface seems to be handling a single document. yaml_document = YAML::dump(single_object) single_object = YAML::load(yaml_document) Perl can use this particular interface a bit more powerfully because it is list oriented: yaml_stream = YAML::dump(object_list) object_list = YAML::load(yaml_stream) Cheers, Brian |
From: Steve H. <sh...@zi...> - 2002-08-19 18:27:47
|
From: "Brian Ingerson" <in...@tt...> > > Perl can use this particular interface a bit more powerfully because it > is list oriented: > > yaml_stream = YAML::dump(object_list) > object_list = YAML::load(yaml_stream) > This was actually a source of confusion, not convenience, for me in using the yaml.pm interface, although I blame Perl culture, not yaml.pm, for it. How do you distinguish these cases when using YAML::dump? 1) You have a list of objects that should be written as separate documents. 2) You have an object that happens to be a list, but which should be written as a single document. |
From: Brian I. <in...@tt...> - 2002-08-19 18:38:41
|
On 19/08/02 14:27 -0400, Steve Howell wrote: > From: "Brian Ingerson" <in...@tt...> > > > > Perl can use this particular interface a bit more powerfully because it > > is list oriented: > > > > yaml_stream = YAML::dump(object_list) > > object_list = YAML::load(yaml_stream) > > > > This was actually a source of confusion, not convenience, for me in using the > yaml.pm interface, although I blame Perl culture, not yaml.pm, for it. > > How do you distinguish these cases when using YAML::dump? > > 1) You have a list of objects that should be written as separate documents. > 2) You have an object that happens to be a list, but which should be written as > a single document. That's the beauty of Perl lists: they're not a data structure, they're a series of zero or more disjoint objects. I leverage this in YAML, because in YAML (not coincidentally) a stream is a series of zero or more disjoint documents. This doesn't apply as well to Python, because to emulate Perl list behaviour, you need to use a data structure, namely an array (which to make matters ultimately confusing, Python calls a "list"). I'm not sure if Ruby has a Perl-list concept. Basically a list is really just a direct mapping of the call stack. Cheers, Brian |
From: Steve H. <sh...@zi...> - 2002-08-19 19:31:25
|
From: "Brian Ingerson" <in...@tt...> > On 19/08/02 14:27 -0400, Steve Howell wrote: > > From: "Brian Ingerson" <in...@tt...> > > > > > > Perl can use this particular interface a bit more powerfully because it > > > is list oriented: > > > > > > yaml_stream = YAML::dump(object_list) > > > object_list = YAML::load(yaml_stream) > > > > > > > This was actually a source of confusion, not convenience, for me in using the > > yaml.pm interface, although I blame Perl culture, not yaml.pm, for it. > > > > How do you distinguish these case when using YAML::dump? > > > > 1) You have a list of objects that should be written as separate documents. > > 2) You have an object that happens to be a list, but which should be written as > > a single document. > > That's the beauty of Perl lists: they're not a data structure, they're a > series of zero or more disjoint objects. I leverage this in YAML, > because in YAML (not coincidentally) a stream is a series of zero or more > disjoint documents. You didn't really answer my question, but I guess you're saying that the answer is #1. It's coming back to me now, though. If you have a Perl array and you want to dump it (use case #2), you pass in a reference to the array, not the array itself, which makes sense. How do you intend to add additional arguments to the dump method? Where would the list of thingies to dump end, and where would the formatting arguments start? It would be pretty ambiguous, and I think once you add more arguments to dump(), you're back to my original conundrum. > This doesn't apply as well to Python, because to emulate Perl list behaviour, > you need to use a data structure, namely an array, (which to make matters > ultimately confusing, Python calls a "list"). > My gosh, the only thing worse than YAML FUD is Python FUD. There is almost a direct correspondence between Perl lists and Python tuples, and likewise between Perl arrays and Python lists. 
Python's lists are actual objects, though, which is perhaps why Guido didn't call them arrays, although I agree it's confusing to call them lists. After you upgrade from Python 0.0003 to the latest version, try running this simple program, which gives a flavor for how tuples and lists are handled in Python: def hello(*people): for person in people: print "hello " + str(person) hello('ingy', 'steve') folks = ['neil', 'why', 'clark'] folks.sort() hello(folks) Cheers, Steve |
From: Brian I. <in...@tt...> - 2002-08-19 20:19:29
|
On 19/08/02 15:30 -0400, Steve Howell wrote: > From: "Brian Ingerson" <in...@tt...> > > On 19/08/02 14:27 -0400, Steve Howell wrote: > > > From: "Brian Ingerson" <in...@tt...> > > > > > > > > Perl can use this particular interface a bit more powerfully because it > > > > is list oriented: > > > > > > > > yaml_stream = YAML::dump(object_list) > > > > object_list = YAML::load(yaml_stream) > > > > > > > > > > This was actually a source of confusion, not convenience, for me in using > the > > > yaml.pm interface, although I blame Perl culture, not yaml.pm, for it. > > > > > > How do you distinguish these case when using YAML::dump? > > > > > > 1) You have a list of objects that should be written as separate documents. > > > 2) You have an object that happens to be a list, but which should be written > as > > > a single document. > > > > That's the beauty of Perl lists: they're not a data structure, they're a > > series of zero or more disjoint objects. I leverage this in YAML, > > because in YAML (not coincidentally) a stream is a series of zero or more > > disjoint documents. > > You didn't really answer my question, but I guess you're saying that the answer > is #1. It's coming back to me now, though. If you have a Perl array and you > want to dump it (use case #2), you pass in a reference to the array, not the > array itself, which makes sense. OK. The answer is that #2 doesn't apply to Perl. Lists aren't objects or anything else as far as YAML.pm is concerned. > > How do you intend to add additional arguments to the dump method? Where would > the list of thingies to dump end, and where would the formatting arguments > start? It would be pretty ambiguous, and I think once you add more arguments to > dump(), you're back to my original conundrum. > > > This doesn't apply as well to Python, because to emulate Perl list behaviour, > > you need to use a data structure, namely an array, (which to make matters > > ultimately confusing, Python calls a "list"). 
> > > > My gosh, the only thing worse than YAML FUD is Python FUD. There is > almost a direct correspondence between Perl lists and Python tuples, > and likewise between Perl arrays and Python lists. Python's lists are > actual objects, though, which is perhaps why Guido didn't call them > arrays, although I agree it's confusing to call them lists. No FUD intended. I didn't expect the knee jerk reaction. I wasn't bashing Python. Merely pointing out what I perceived to be the semantic differences between the two languages, and how I used lists in my API. I thought that tuples were immutable Python-lists. If Python people really think of them as Perl-lists, then you could use this API: tuple = load(stream) stream = dump(tuple) > After you upgrade from Python 0.0003 to the latest version, try running this > simple program, which gives a flavor for how tuples and lists are handled in > Python: You can drop the snideness. If this is going to turn into a flamewar, then let's forget the discussion. Sorry if I seemed to be on a Pro-Perl rant. I wasn't. I was just pointing out the list<=>stream<=>list API, that didn't seem to be showing up in other implementations. Cheers, Brian |
From: Steve H. <sh...@zi...> - 2002-08-19 21:48:10
|
From: "Brian Ingerson" <in...@tt...> > No FUD intended. I didn't expect the knee jerk reaction. I wasn't bashing > Python. Merely pointing out what I perceived to be the semantic differences > between the two languages, and how I used lists in my API. > Sorry if my defense of Python went a little overboard. I don't think there's anything wrong with pointing out differences between the languages to explain YAML concepts, but your characterizations of Python seemed a bit imprecise. > I thought that tuples were immutable Python-lists. If Python people > really think of them as Perl-lists, then you could use this API: > > tuple = load(stream) > stream = dump(tuple) > I think all of this discussion of list/tuple/array mutable/immutable semantics confuses the issue. Basically, I think it comes down to Perl having "wantarray" and Python not having it (to my knowledge). Python can't have a method change its behavior according to the lvalue of the caller (again, to my knowledge), because that just wouldn't be Pythonic. Perl can overload its dump() method to return either one of the documents, or all the documents, because it has wantarray. PyYaml solves the problem in a completely different way. It returns an iterator, and if you only want the first item, well, just call the iterator only once--it's only five extra characters. If you want to save those five characters, then you need to have a second method, because really, you're asking for a second behavior. Like Perl, Python doesn't make you pay for the unwanted items in the list. ~o/ You say laughter and I say lawfter, You say after and I say awfter; Laughter, lawfter, after, awfter - Let's call the whole thing off! / > > After you upgrade from Python 0.0003 to the latest version, try running this > > simple program, which gives a flavor for how tuples and lists are handled in > > Python: > Okay, that comment was DEFINITELY uncalled for on my part. The key thing is that we agree that YAML rocks! 
~o/ So, if you go for oysters and I go for ersters, I'll order oysters and cancel the ersters. For we know we Need each other, so we Better call the calling off off. Let's call the whole thing off! / Cheers, Steve |
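The iterator-returning load() described above can be sketched like this (toy "---" splitter, not PyYaml's real parser): callers who want every document iterate the whole thing; callers who want only the first call next() once.

```python
# A generator-based load: documents are produced lazily, so you never pay
# for the ones you don't consume.

def load_iter(stream):
    for chunk in stream.split("---"):
        chunk = chunk.strip()
        if chunk:
            yield chunk            # a real loader would build objects here

stream = "--- one\n--- two\n--- three\n"
all_docs = list(load_iter(stream))     # ['one', 'two', 'three']
first = next(load_iter(stream))        # 'one'
```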
From: Brian I. <in...@tt...> - 2002-08-20 04:22:04
|
On 19/08/02 15:30 -0400, Steve Howell wrote: > From: "Brian Ingerson" <in...@tt...> > > On 19/08/02 14:27 -0400, Steve Howell wrote: > > > From: "Brian Ingerson" <in...@tt...> > > > > > > > > Perl can use this particular interface a bit more powerfully because it > > > > is list oriented: > > > > > > > > yaml_stream = YAML::dump(object_list) > > > > object_list = YAML::load(yaml_stream) > > > > > > > > > > This was actually a source of confusion, not convenience, for me > > > in using the yaml.pm interface, although I blame Perl culture, not > > > yaml.pm, for it. > > > > > > How do you distinguish these cases when using YAML::dump? > > > > > > 1) You have a list of objects that should be written as separate > > > documents. > > > 2) You have an object that happens to be a list, but which should > > > be written as a single document. > > > > That's the beauty of Perl lists: they're not a data structure, > > they're a series of zero or more disjoint objects. I leverage this > > in YAML, because in YAML (not coincidentally) a stream is a series > > of zero or more disjoint documents. > > You didn't really answer my question, but I guess you're saying that > the answer is #1. It's coming back to me now, though. If you have a > Perl array and you want to dump it (use case #2), you pass in a > reference to the array, not the array itself, which makes sense. Right. YAML can't (and shouldn't) be able to do Perl arrays and hashes. It can only deal with (Perl)refs to those. This is due to the screwed up and convoluted semantics of Perl data structures. (See, I'm not defending Perl at all here.) Perl sucks in this sense because it listifies its data structures when they get put on the call stack. BUT... If one were to (and I've always done this) call hashrefs "hashes" and arrayrefs "arrays", then everything works out semantically correct for YAML, like it would in a sane language like Python or Ruby or hopefully Perl6. 
A Perl hash can be thought of as a Perl hashref that is stuck into the Perl symbol table. So it can use a '%' in front of it, and work in the builtins that manipulate hashes and have its crazy listify behaviour. Perl lists never exist as a data structure in memory. They are merely a concept relating to the call stack. This concept can be exploited nicely in YAML, since I don't need to serialize the listness. It seems incorrect usage to me to use tuples (which (I think) are a data structure), since people will want to preserve the tupleness. --- I just bounced this off of Jon Prettyman, the hacker whose house I'm staying with. He knows Perl a bit, but mostly does Python. I think this all boils down to the fact that in Perl you have arbitrarily variable length argument lists, but not in Python. I simply point this out to show that we'll have different APIs for some things. > How do you intend to add additional arguments to the dump method? > Where would the list of thingies to dump end, and where would the > formatting arguments start? It would be pretty ambiguous, and I think > once you add more arguments to dump(), you're back to my original > conundrum. OK. My interface for extra parameters is to specify them separately. In the simplest case, I use global variables. In the more complicated case I use an OO interface. $YAML::Indent = 1; YAML::Dump($foo); vs YAML->new->Indent(1)->dump($foo); This is the interface that Data::Dumper uses. It's familiar to Perl programmers. In a sense, the global variables are like attributes of the global object. Just like Ruby. Well, perhaps that's a bit of a stretch. Anyway, to answer your conundrum, with Dump(), what you pass in is what you get. That's the way I always wanted it. It gives me the symmetry of: $foo = Dump(@bar); @bar = Load($foo); Remember, I'm not serializing @bar here. Merely the nodes of its listification. Cheers, Brian |
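A Python analog of the list<=>stream<=>list symmetry above, using hypothetical toy functions (repr/eval stands in for real YAML emission and loading): dump() takes any number of documents via *args, and load() returns them as a list, so load(dump(*docs)) round-trips a series of disjoint documents.

```python
# dump() accepts zero or more disjoint documents; load() returns the same
# series back as a list.

def dump(*documents):
    return "".join("--- %r\n" % (d,) for d in documents)

def load(stream):
    return [eval(line[4:]) for line in stream.splitlines()
            if line.startswith("--- ")]

s = dump({"x": 1}, [2, 3], "z")
# load(s) == [{'x': 1}, [2, 3], 'z']
```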
From: Steve H. <sh...@zi...> - 2002-08-20 13:18:50
|
From: "Brian Ingerson" <in...@tt...> > > Right. YAML can't (and shouldn't) be able to do Perl arrays and hashes. It > can only deal with (Perl)refs to those. This is due to the screwed up and > convuluted semantics of Perl data structures. (See, I'm not defending Perl at > all here.) Perl sucks in this sense because it listifies its data structures > when they get put on the call stack. > > BUT... > > I one were to (and I've always done this) call hashrefs "hashes" and > arrayrefs "arrays", then everything works out semantically correct for YAML, > like it would in a sane language like Python or Ruby or hopefully Perl6. A > Perl hash can be thought of as a Perl hashref that is stuck into the Perl > symbol table. So it can use a '%' in front of it, and work in the builtins > that manipulate hashes and have its crazy listify behaviour. Ok, makes sense. > Perl lists never exist as a data structure in memory. They are merely a > concept relating to the call stack. This concept can be exploited nicely in > YAML, since I don't need to serialize the listness. It seems incorrect usage > to me to use tuples (which (I think) are a data structure), since people will > want to preserve the tupleness. > > --- > > I just bounced this off of Jon Prettyman, the hacker whose house I'm staying > with. He knows Perl a bit, but mostly does Python. > > I think this all boils down to the fact that in Perl you have arbitrarily > variable length argument lists, but not in Python. I simply point this out to > show that we'll have different APIs for some things. > Sorry to be pedantic, but Python does have arbitrarily variable length argument lists: def greet(greeting, *people): for person in people: print greeting + ' ' + str(person) greet('hello', 'ingy', 'steve') greet('aloha', 'jeff', 'guido', 'alex') The only substantive difference between Perl and Python in terms of calling sequence, that I see in terms of YAML anyway, is that Perl has wantarray, and Python doesn't. 
This is the only place where we'll be forced into different APIs. This variable argument list discussion is just a big red herring. If you were to forgo the use of wantarray in Perl, then I think we could have virtually identical APIs for yaml. On the other hand, I have also said that I like how our Perl library is Perlish, and our Python library is Pythonic, etc., so you could make a good argument for wantarray and I'd accept that. Consistency and hobgoblins... > > How do you intend to add additional arguments to the dump method? > > Where would the list of thingies to dump end, and where would the > > formatting arguments start? It would be pretty ambiguous, and I think > > once you add more arguments to dump(), you're back to my original > > conundrum. > > OK. My interface for extra parameters is to specify them separately. In the > simplest case, I use global variables. In the more complicated case I use > an OO interface. > > $YAML::Indent = 1; > YAML::Dump($foo); > > vs > > YAML->new->Indent(1)->dump($foo); > > This is the interface that Data::Dumper uses. It's familiar to Perl > programmers. In a sense, the global variables are like attributes of the > global object. Just like Ruby. Well, perhaps that's a bit of a stretch. > Yep, but global objects do defeat reentrancy, correct? I can guess you'll say YAGNI, but what if under some future version of Perl folks want to use YAML in two different threads? Won't the formatting variables conflict? Or suppose you're doing a YAML dump of a large structure, and then one of your objects has a to_yaml() method, and then, to help you debug the to_yaml() method, you make another call to YAML in there? Won't the inner call mess up the args for the outer call? If I want to avoid global variables in Python's dump protocol, I think the code below is my best option, which is inspired by your comments above: def _dump(options, *documents): # ... 
class Options: def __init__(self, indent=2): self.indent = indent def setIndent(self, indent): self.indent = indent return self def dump(self, *documents): return _dump(self, *documents) def dump(*documents): return _dump(Options(), *documents) Then you'd say: print Options().setIndent(4).dump({'foo': 'bar'}) print Options().setIndent(3).dump('Steve', [3, 4, 5], {'foo': 'bar'}) # three different documents print dump(['apple', 'banana', 'carrot']) # normal formatting So, I guess the way out of my conundrum was the same way you get out of it in Perl. So, despite all the flames (which were intended to be lighthearted, but they never read that way, I guess), we got something useful out of this discussion. :) On a related issue, Dave Kulhman has submitted a patch for PyYaml that adds a dumpToFile() method to the emitter. Minor issue here, but I think Dave and I both prefer dumpToFile() to dumpFile(), because that extra preposition makes it clear that the file's the target, not the source. |
From: Ned K. <ne...@bi...> - 2002-08-20 15:44:45
|
On Tuesday 20 August 2002 06:18 am, Steve Howell wrote: > Sorry to be pedantic, but Python does have arbitrarily variable > length argument lists: > > def greet(greeting, *people): > for person in people: > print greeting + ' ' + str(person) If it matters, Ruby is identical here (that is, you can arrange to have trailing arguments passed as an array). And since Ruby has a rather nice threading system, it should (and I believe _why has done this) not use any globals. -- Ned Konz http://bike-nomad.com GPG key ID: BEEA7EFE |