From: Steve Y. <st...@go...> - 2010-01-14 21:22:27
On Thu, Jan 14, 2010 at 6:33 AM, Anders Olme <and...@gm...> wrote:
> On Tue, Jan 12, 2010 at 04:42:54PM -0800, Steve Yegge wrote:
> > I've got a patch to submit for everyone's consideration. It's a static
> > analyzer and indexer for Python source, built atop the Jython antlr
> > parser. It was primarily designed to produce an index that can support
> > IDE functionality, but it could potentially also be used to provide
> > type information to the interpreter for optimization purposes.
>
> Sorry for a complete newbie question, but what is an indexer and how does
> the output look?

No worries. It's actually a nontrivial concept that's difficult to explain
without providing some context. In a nutshell, it's a compiler for Python
code, but it emits metadata rather than generated code. While that answer
may be good enough for you as-is, I'll go ahead and elaborate for posterity.

As you no doubt know, compilers perform a set of standard transformations:

* scanning/lexing to get tokens
* parsing to turn the tokens into a syntax tree
* creating a symbol table to represent identifier scopes and other info
* building a type representation of the program (sometimes using inference)
* performing semantic checks such as used-before-assigned or incorrect-type
* code generation
* optimizations that can happen at every step of the way

This is all typically done statically, without actually running the code,
which means that compilation inevitably creates a model that is merely an
approximation of the runtime semantics. The accuracy and completeness of
this approximation can be improved by changing the language (e.g. Haskell >
Java/C++ > Python in terms of support for static analysis) or by putting
more work into the compiler (e.g. gcc does a better job than a C compiler
produced by a student as a term project).
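To make the first few phases concrete, here's a rough sketch using CPython's
stdlib `tokenize` and `ast` modules. This is illustrative only -- the
submitted indexer is built on the Jython antlr parser, not these modules --
and the toy "symbol table" records just name-to-line bindings:

```python
# Illustrative front-end phases: lex, parse, then a tiny symbol table.
# (Not the indexer's actual code; just the stdlib equivalents.)
import ast
import io
import tokenize

source = "def greet(name):\n    return 'hello ' + name\n"

# 1. scanning/lexing: source text -> token stream
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]

# 2. parsing: tokens -> abstract syntax tree
tree = ast.parse(source)

# 3. a (very small) symbol table: record where names are bound
bindings = {}
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        bindings[node.name] = node.lineno
    elif isinstance(node, ast.arg):
        bindings[node.arg] = node.lineno

print(tokens[:4])   # ['def', 'greet', '(', 'name']
print(bindings)     # {'greet': 1, 'name': 1}
```

A real indexer would of course go on to build scopes, infer types, and run
semantic checks over this tree rather than stopping at bindings.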
There are many tools in the compiler family which generate simple static
models sufficient (only) for their own needs -- for instance, linters,
syntax highlighters, pretty printers, documentation generators, even classic
code-generating compilers. This implies that lots of tools developers are
reinventing the wheel just well enough for their own tools.

But IDEs like Eclipse or Visual Studio are also in the compiler family.
While they began as aggregators of other compiler tools, these days they
actually create the richest general-purpose static models of the code,
because they need to manipulate and surface a superset of the information
from all the standalone tools. Hence, IDE backends are special-purpose
compilers that do their best never to throw anything away, which
differentiates them from all other compiler-like tools, including compilers
themselves.

The upside of this is that you run the indexer and get a general-purpose
programming model and API which together allow you to query and manipulate
the code as a set of data structures. At that point you can do anything you
like: build a global symbol autocompletion index, or write some refactoring
functions, or display fancy views of the class tree or type graph --
anything the model permits. An IDE backend should in theory be able to
replace the static model for any other compiler-like tool, which is nice: no
more wheel-reinventing necessary, at least provided the IDE actually exposes
its indexer in headless mode, which a few of them do today.

The downside is that it's fat and slow. Code produces a -lot- of metadata,
and rich IDEs have to do lots of multi-level caching tricks to avoid blowing
out memory or consuming all of your CPU, or both. The indexer I've submitted
has had a fair bit of tuning to try to minimize its memory use, but it still
struggles with projects with, say, 5000 files on a 32-bit JVM given -Xmx3000
or so. So it could use more work in this space.
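Here's a hypothetical miniature of "the index as a queryable model": map
every name to its definition and reference sites, which is roughly the shape
of data a completion or cross-reference tool needs. (The real indexer
records far richer metadata than this; the example below is my own sketch,
not code from the patch.)

```python
# Toy symbol index: name -> {definition lines, reference lines}.
# Hypothetical miniature of the model an IDE backend would expose.
import ast
from collections import defaultdict

source = """\
x = 1
def f(y):
    return x + y
z = f(x)
"""

index = defaultdict(lambda: {"defs": [], "refs": []})
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Name):
        kind = "defs" if isinstance(node.ctx, ast.Store) else "refs"
        index[node.id][kind].append(node.lineno)
    elif isinstance(node, ast.FunctionDef):
        index[node.name]["defs"].append(node.lineno)

# The sort of query a cross-referencer or completion engine might issue:
print(sorted(index["x"]["refs"]))   # lines where x is read: [3, 4]
print(index["f"]["defs"])           # line where f is defined: [2]
```

Everything downstream -- autocompletion, refactoring, class-tree views --
is just different queries and transformations over a structure like this.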
For now, we simply partition our code base into many projects, analyze them
all in parallel, and merge the results into a large serialized version of
the data structures produced by the analyzer shards -- a.k.a. "the index".

The version of the indexer I've submitted here does not have a serialized
representation. Internally at Google we have a language-neutral
representation shared by all our language analyzers, but it's
work-in-progress and relies heavily on Google infrastructure, so it's not
yet appropriate for open-sourcing. At some point we hope to get there, but
for now, the open-source Python indexer is suitable for small- to
medium-sized projects.

Serializing the index will be the next big evolutionary step, since if done
properly, it can help the indexer scale to larger projects while
simultaneously increasing the expressive power of the underlying model. I
suspect that a relational database is likely to be the best choice, but
defining the schema will be a tricky and ongoing task. Indexers like this
one need to keep growing to be maximally useful -- at the very least they
need to support the evolution of the language itself, but they also need to
refine their static models to support the ever-increasing demands of the
tools and their users. (As one concrete example, improving the indexer's
internal model of Python's type system will yield better results across the
board, e.g. picking up identifier references that it currently misses, but
this comes at the cost of increasing both the time spent during analysis and
the amount of information stored in the index.)

The demo app I provided builds an index and then "walks" it using a visitor
interface to produce a simple static-HTML view of the indexed code, with
hyperlinks between identifiers similar to those produced by the Linux
Cross-Referencer. But the indexer produces a great deal more metadata than
is used by the demo.
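As a hedged sketch of that shard-and-merge plus relational-serialization
idea, here's what merging two analyzer shards into a SQLite "index" could
look like. The schema, shard format, and symbol names below are all
hypothetical, invented for illustration -- the actual serialized
representation is exactly the open design question described above:

```python
# Sketch: merge per-shard symbol tables into one relational index.
# Schema and shard data are hypothetical, not from the patch.
import sqlite3

# Pretend output of two analyzer shards: (symbol, file, line) tuples.
shard_a = [("Frob.spin", "frob.py", 10), ("util.clamp", "util.py", 3)]
shard_b = [("Frob.stop", "frob.py", 22)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE symbols (name TEXT, file TEXT, line INTEGER)")
for shard in (shard_a, shard_b):
    db.executemany("INSERT INTO symbols VALUES (?, ?, ?)", shard)

# Once merged, the index is queryable like any database -- e.g. a
# global-autocompletion query for names starting with "Frob.":
rows = db.execute(
    "SELECT name FROM symbols WHERE name LIKE 'Frob.%' ORDER BY name"
).fetchall()
print(rows)   # [('Frob.spin',), ('Frob.stop',)]
```

A real schema would need many more tables (scopes, types, references,
cross-file edges), which is why defining it is the tricky, ongoing part.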
I'm hoping that some tools-inclined devs will jump in and start using the
indexer to power their own tools efforts, making improvements as needed.
With luck I may find time to write more demo apps, since it's easier to
appreciate the indexer's value when you can visualize the information it
provides. In the meantime, Python is a pretty big language at Google, and
indexing all the code should improve life for our Python developers, so
we'll keep hacking on the indexer and sending periodic updates. (This will
of course be easier if it's in the svn repository. :)

Hope that helps!

-steve

> Best Regards Anders Olme
>
> _______________________________________________
> Jython-dev mailing list
> Jyt...@li...
> https://lists.sourceforge.net/lists/listinfo/jython-dev