From: Joe H. <jh...@oo...> - 2004-01-21 18:44:49
This is a necessarily long post about the path to an open-source replacement for IDL and Matlab. While I have tried to be fair to those who have contributed much more than I have, I have also tried to be direct about what I see as some fairly fundamental problems in the way we're going about this. I've given it some section titles so you can navigate, but I hope that you will read the whole thing before posting a reply. I fear that this will offend some people, but please know that I value all your efforts, and offense is not my intent.

THE PAST VS. NOW

While there is significant and dedicated effort going into numeric/numarray/scipy, it's becoming clear that we are not progressing quickly toward a replacement for IDL and Matlab. I have great respect for all those contributing to the code base, but I think the present discussion indicates some deep problems. If we don't identify those problems (easy) and solve them (harder, but not impossible), we will continue not to have the solution so many people want.

To be convinced that we are doing something wrong at a fundamental level, consider that Python was the clear choice for a replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on interactive data analysis environments. That was over 7 years ago. When people asked at that conference, "what does Python need to replace IDL or Matlab", the answer was clearly "stable interfaces to basic numerics and plotting; then we can build it from there following the open-source model". Work on both these problems was already well underway then.

Now, both the numerical and plotting development efforts have branched. There is still no stable base upon which to build. There aren't even packages for popular OSs that people can install and play with. The problem is not that we don't know how to do numerics or graphics; if anything, we know these things too well.
In 1996, if anyone had told us that in 2004 there would be no ready-to-go replacement system because of a factor of 4 in small array creation overhead (on computers that ran 100x as fast as those then available) or the lack of interactive editing of plots at video speeds, the response would not have been pretty. How would you have felt?

THE PROBLEM

We are not following the open-source development model. Rather, we pay lip service to it. Open source's development mantra is "release early, release often". This means release to the public, for use, a package that has core capability and reasonably-defined interfaces. Release it in a way that as many people as possible will get it, install it, use it for real work, and contribute to it. Make the main focus of the core development team the evaluation and inclusion of contributions from others. Develop a common vision for the program, and use that vision to make decisions and keep efforts focused. Include contributing developers in decision making, but do make decisions and move on from them.

Instead, there are no packages for general distribution. The basic interfaces are unstable, and the choice among them is not even being publicly debated (save for the past 3 days). The core developers seem to spend most of their time developing, mostly out of view of the potential user base. I am asked probably twice a week by different fellow astronomers when an open-source replacement for IDL will be available. They are mostly unaware that this effort even exists. However, this indicates that there are at least hundreds of potential contributors of application code in astronomy alone, as I am far from knowing everyone. The current efforts look rather more like the GNU project than Linux. I'm sorry if that hurts, but it is true.

I know that Perry's group at STScI and the fine folks at Enthought will say they have to work on what they are being paid to work on.
Both groups should consider the long-term cost, in dollars, of spending those development dollars 100% on coding, rather than 50% on coding and 50% on outreach and intake. Linus himself has written only a small fraction of the Linux kernel, and almost none of the applications, yet in much less than 7 years Linux became a viable operating system, something much bigger than what we are attempting here. He couldn't have done that himself, for any amount of money. We all know this.

THE PATH

Here is what I suggest:

1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations". If numeric and numarray are in agreement on these issues, then we can move on, and debate performance and features later.

2. We should identify what we need out of the core plotting capability. Again, not "chaco vs. pyxis", but the list of requirements (as an astronomer, I very much like Perry's list).

3. We should collect or implement a very minimal version of the feature set, and document it well enough that others like us can do simple but real tasks to try it out, without reading source code. That documentation should include lists of things that still need to be done.

4. We should release a stand-alone version of the whole thing in the formats most likely to be installed by users on the four most popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2. Tarballs and CVS checkouts are right out. We have seen that nobody in the real world installs them. To be most portable and robust, it would make sense to include the Python interpreter, named such that it does not stomp on versions of Python in the released operating systems. Static linking likewise solves a host of problems and greatly reduces the number of package variants we will have to maintain.

5. We should advertise and advocate the result at conferences and elsewhere, being sure to label it what it is: a first-cut effort designed to do a few things well and serve as a platform for building on. We should also solicit and encourage people either to work on the included TODO lists or to contribute applications. One item on the TODO list should be code converters from IDL and Matlab to Python, and compatibility libraries.

6. We should then all continue to participate in the discussions and development efforts that appeal to us. We should keep in mind that evaluating and incorporating code that comes in is in the long run much more efficient than writing the universe ourselves.

7. We should cut and package new releases frequently, at least once every six months. It is better to delay a wanted feature by one release than to hold a release for a wanted feature. The mountain is climbed in small steps.

The open-source model is successful because it follows closely something that has worked for a long time: the scientific method, with its community contributions, peer review, open discussion, and progress mainly in small steps. Once basic capability is out there, we can twiddle with how to improve things behind the scenes.

IS SCIPY THE WAY?

The recipe above sounds a lot like SciPy. SciPy began as a way to integrate the necessary add-ons to numeric for real work. It was supposed to test, document, and distribute everything together. I am aware that there are people who use it, but the numbers are small and they seem to be tightly connected to Enthought for support and application development. Enthought's focus seems to be on servicing its paying customers rather than on moving SciPy development along, and I fear they are building an installed customer base on interfaces that were not intended to be stable.

So, I will raise the question: is SciPy the way?
Rather than forking the plotting and numerical efforts from what SciPy is doing, should we not be creating a new effort to do what SciPy has so far not delivered? These are not rhetorical or leading questions. I don't know enough about the motivations, intentions, and resources of the folks at Enthought (and elsewhere) to know the answer. I do think that such a fork will occur unless SciPy's approach changes substantially. The way to decide is for us all to discuss the question openly on these lists, and for those willing to participate and contribute effort to declare so openly. I think all that is needed, either to help SciPy or replace it, is some leadership in the direction outlined above.

I would be interested in hearing, perhaps from the folks at Enthought, alternative points of view. Why are there no packages for popular OSs for SciPy 0.2? Why are releases so infrequent? If the folks running the show at scipy.org disagree with many others on these lists, then perhaps those others would like to roll their own. Or, perhaps stable/testing/unstable releases of the whole package are in order.

HOW TO CONTRIBUTE?

Judging by the number of PhDs in sigs, there are a lot of researchers on this list. I'm one, and I know that our time for doing core development or providing the aforementioned leadership is very limited, if not zero. Later we will be in a much better position to contribute application software. However, there is a way we can contribute to the core effort even if we are not paid, and that is to put budget items in grant and project proposals to support the work of others. Those others could be either our own employees or subcontractors at places like Enthought or STScI. A handful of contributors would be all we'd need to support someone to produce OS packages and tutorial documentation (the stuff core developers find boring) for two releases a year.

--jh--