On Thu, Jan 14, 2010 at 6:33 AM, Anders Olme <anders.olme@gmail.com> wrote:
On Tue, Jan 12, 2010 at 04:42:54PM -0800, Steve Yegge wrote:
> I've got a patch to submit for everyone's consideration.  It's a static
> analyzer and indexer for Python
> source, built atop the Jython antlr parser.  It was primarily designed to
> produce an index that can
> support IDE functionality, but it could potentially also be used to provide
> type information to the
> interpreter for optimization purposes.
Sorry for a complete newbie question, but what is an indexer and how does
the output look?

No worries.  It's actually a nontrivial concept that's difficult to explain without
providing some context.

In a nutshell, it's a compiler for Python code, but it emits metadata rather than
generated code.  While that answer may be good enough for you as-is, I'll go
ahead and elaborate for posterity.

As you no doubt know, compilers perform a set of standard transformations:

  * scanning/lexing to get tokens
  * parsing to turn the tokens into a syntax tree
  * creating a symbol table to represent identifier scopes and other info
  * building a type representation of the program (sometimes using inference)
  * performing semantic checks such as used-before-assigned or incorrect-type
  * code generation
  * optimizations that can happen at every step of the way
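
The first few of those stages can be seen in miniature with Python's own standard
library (just a sketch for illustration; the sample `source` string is made up and
this is not the indexer's actual pipeline):

```python
import ast
import io
import symtable
import tokenize

source = "def greet(name):\n    return 'Hello, ' + name\n"

# Scanning/lexing: raw text -> token stream
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [t.string for t in tokens if t.type == tokenize.NAME]
print(names)    # identifiers and keywords as the lexer sees them

# Parsing: token stream -> syntax tree
tree = ast.parse(source)
print(type(tree.body[0]).__name__)    # FunctionDef

# Symbol table: identifier scopes at module level
table = symtable.symtable(source, "<demo>", "exec")
print([s.get_name() for s in table.get_symbols()])    # ['greet']
```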

This is all typically done statically, without actually running the code, which means
that compilation inevitably creates a model that is merely an approximation of the
runtime semantics.  The accuracy and completeness of this approximation can be
improved by changing the language (e.g. Haskell > Java/C++ > Python in terms of
support for static analysis) or by putting more work into the compiler (e.g. gcc does
a better job than a C compiler produced by a student as a term project).
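
As a tiny illustration of why any static model of Python is only an approximation
(the class and names here are invented for the example):

```python
class Config:
    pass

def enabled(cfg):
    # No static analysis of this function alone can prove that 'debug'
    # exists -- the attribute is created dynamically at runtime, below.
    return cfg.debug

c = Config()
setattr(c, "debug", True)    # attribute added outside any class body
print(enabled(c))            # True at runtime; invisible to a naive static model
```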

There are many tools in the compiler family which generate simple static models
sufficient (only) for their own needs -- for instance, linters, syntax highlighters, pretty
printers, documentation generators, even classic code-generating compilers.  This
implies that many tool developers are reinventing the wheel, each just well enough for
their own tools.

But IDEs like Eclipse or Visual Studio are also in the compiler family.  While they
began as aggregators of other compiler tools, these days they actually create the
richest general-purpose static models of the code, because they need to manipulate
and surface a superset of the information from all the standalone tools.  Hence, IDE
backends are special-purpose compilers that do their best never to throw anything
away, which differentiates them from all other compiler-like tools, including compilers
themselves.

The upside of this is that you run the indexer and get a general-purpose programming
model and API which together allow you to query and manipulate the code as a set
of data structures.  At that point you can do anything you like:  build a global symbol
autocompletion index, or write some refactoring functions, or display fancy views of
the class tree or type graph -- anything the model permits.  An IDE backend should
in theory be able to replace the static model for any other compiler-like tool, which
is nice:  no more wheel-reinventing necessary, at least provided the IDE actually
exposes its indexer in headless mode, which a few of them do today.
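
For instance, the global symbol autocompletion index mentioned above can be sketched
as a sorted name list with prefix queries (the class and method names here are
hypothetical, not the patch's actual API):

```python
import bisect

class SymbolIndex:
    """Toy global symbol index: names kept sorted, so autocompletion
    becomes a binary search plus a short prefix scan."""

    def __init__(self):
        self._names = []    # sorted symbol names
        self._info = {}     # name -> (file, line)

    def add(self, name, filename, line):
        if name not in self._info:
            bisect.insort(self._names, name)
        self._info[name] = (filename, line)

    def complete(self, prefix):
        """Return all indexed names starting with `prefix`."""
        i = bisect.bisect_left(self._names, prefix)
        out = []
        while i < len(self._names) and self._names[i].startswith(prefix):
            out.append(self._names[i])
            i += 1
        return out

idx = SymbolIndex()
idx.add("parse_module", "indexer.py", 10)
idx.add("parse_expr", "indexer.py", 55)
idx.add("resolve", "binder.py", 3)
print(idx.complete("parse"))    # ['parse_expr', 'parse_module']
```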

The downside is that it's fat and slow.  Code produces a -lot- of metadata, and rich
IDEs have to do lots of multi-level caching tricks to avoid blowing out memory or
consuming all of your CPU, or both.  The indexer I've submitted has had a fair bit
of tuning to try to minimize its memory use, but it still struggles with projects with,
say, 5000 files on a 32-bit JVM given around -Xmx3000m.  So it could use more work
in this space.  For now, we simply partition our code base into many projects,
analyze them all in parallel, and merge the results into a large serialized version of
the data structures produced by the analyzer shards -- a.k.a. "the index".
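
The partition-and-merge scheme can be sketched like so (using in-memory source
strings as stand-in "files" and a trivial per-shard analysis; the real shards are
whole projects analyzed by the actual indexer):

```python
import ast
from concurrent.futures import ThreadPoolExecutor

def analyze_shard(shard):
    """Stand-in analyzer: map each top-level def/class in the shard's
    files to its defining file."""
    out = {}
    for filename, source in shard.items():
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                out[node.name] = filename
    return out

def build_index(shards):
    """Analyze the shards in parallel and merge the results: 'the index'."""
    index = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(analyze_shard, shards):
            index.update(partial)
    return index

shards = [
    {"app.py": "def main():\n    pass\n"},
    {"models.py": "class User:\n    pass\n"},
]
print(build_index(shards))    # {'main': 'app.py', 'User': 'models.py'}
```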

The version of the indexer I've submitted here does not have a serialized representation.
Internally at Google we have a language-neutral representation shared by all our
language analyzers, but it's work-in-progress and relies heavily on Google infrastructure,
so it's not yet appropriate for open-sourcing.  At some point we hope to get there, but
for now, the open-source Python indexer is suitable for small- to medium-sized projects.

Serializing the index will be the next big evolutionary step, since if done properly, it
can help the indexer scale to larger projects while simultaneously increasing the
expressive power of the underlying model.  I suspect that a relational database is
likely to be the best choice, but defining the schema will be a tricky and ongoing task.
Indexers like this one need to keep growing to be maximally useful -- at the very least
they need to support the evolution of the language itself, but they also need to refine
their static models to support the ever-increasing demands of the tools and their users.
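
To make the relational-database idea concrete, here is one minimal, hypothetical
schema sketched with sqlite3 -- a real schema would need far more tables (scopes,
types, a full reference graph) and is exactly the tricky part:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symbols (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        kind TEXT NOT NULL,      -- 'function', 'class', 'module', ...
        file TEXT NOT NULL,
        line INTEGER NOT NULL
    );
    CREATE TABLE refs (          -- one row per use site of a symbol
        symbol_id INTEGER NOT NULL REFERENCES symbols(id),
        file TEXT NOT NULL,
        line INTEGER NOT NULL
    );
    CREATE INDEX symbols_by_name ON symbols(name);
""")
conn.execute(
    "INSERT INTO symbols (name, kind, file, line) VALUES (?, ?, ?, ?)",
    ("greet", "function", "demo.py", 1))
row = conn.execute(
    "SELECT file, line FROM symbols WHERE name = ?", ("greet",)).fetchone()
print(row)    # ('demo.py', 1)
```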

(As one concrete example, improving the indexer's internal model of Python's type
system will yield better results across the board, e.g. picking up identifier references
that it currently misses, but this comes at the cost of increasing both the time spent
during analysis and the amount of information stored in the index.)

The demo app I provided builds an index and then "walks" it using a visitor interface
to produce a simple static-HTML view of the indexed code, with hyperlinks between
identifiers similar to those produced by the Linux Cross-Referencer.  But the indexer
produces a great deal more metadata than is used by the demo.  I'm hoping that
some tools-inclined devs will jump in and start using the indexer to power their own
tools efforts, making improvements as needed.  With luck I may find time to write
more demo apps, since it's easier to appreciate the indexer's value when you can
visualize the information it provides.
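
A cut-down version of that walk might look like this (the visitor interface and
names here are invented for illustration, not the demo app's actual API):

```python
import html

class HtmlLinker:
    """Toy index walker: renders each identifier occurrence, linking
    use sites back to their definition anchors, LXR-style."""

    def __init__(self, definitions):
        self.definitions = definitions    # name -> 'file.html#line' anchor
        self.parts = []

    def visit_identifier(self, name):
        target = self.definitions.get(name)
        text = html.escape(name)
        if target:
            self.parts.append('<a href="%s">%s</a>' % (target, text))
        else:
            self.parts.append(text)    # unresolved names stay plain text

    def result(self):
        return " ".join(self.parts)

walker = HtmlLinker({"greet": "demo.html#L1"})
walker.visit_identifier("greet")
walker.visit_identifier("unknown_name")
print(walker.result())
# <a href="demo.html#L1">greet</a> unknown_name
```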

In the meantime, Python is a pretty big language at Google, and indexing all the code
should improve life for our Python developers, so we'll keep hacking on the indexer and
sending periodic updates.  (This will of course be easier if it's in the svn repository. :)

Hope that helps!


Jython-dev mailing list