From: Steve Y. <st...@go...> - 2010-01-14 21:22:27
On Thu, Jan 14, 2010 at 6:33 AM, Anders Olme <and...@gm...> wrote:
> On Tue, Jan 12, 2010 at 04:42:54PM -0800, Steve Yegge wrote:
> > I've got a patch to submit for everyone's consideration. It's a static
> > analyzer and indexer for Python source, built atop the Jython antlr
> > parser. It was primarily designed to produce an index that can support
> > IDE functionality, but it could potentially also be used to provide
> > type information to the interpreter for optimization purposes.
>
> Sorry for a complete newbie question, but what is an indexer and how does
> the output look?

No worries. It's actually a nontrivial concept that's difficult to explain
without providing some context. In a nutshell, it's a compiler for Python
code, but it emits metadata rather than generated code. While that answer
may be good enough for you as-is, I'll go ahead and elaborate for posterity.

As you no doubt know, compilers perform a set of standard transformations:

* scanning/lexing to get tokens
* parsing to turn the tokens into a syntax tree
* creating a symbol table to represent identifier scopes and other info
* building a type representation of the program (sometimes using inference)
* performing semantic checks such as used-before-assigned or incorrect-type
* code generation
* optimizations that can happen at every step of the way

This is all typically done statically, without actually running the code,
which means that compilation inevitably creates a model that is merely an
approximation of the runtime semantics. The accuracy and completeness of
this approximation can be improved by changing the language (e.g. Haskell >
Java/C++ > Python in terms of support for static analysis) or by putting
more work into the compiler (e.g. gcc does a better job than a C compiler
produced by a student as a term project).
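To make the first few phases concrete, here's a rough sketch using CPython's
stdlib `tokenize` and `ast` modules. This is illustrative only -- the
submitted indexer is built on the Jython antlr parser, not these modules --
and the toy "symbol table" records just name-to-line bindings:

```python
# Illustrative front-end phases: lex, parse, then a tiny symbol table.
# (Not the indexer's actual code; just the stdlib equivalents.)
import ast
import io
import tokenize

source = "def greet(name):\n    return 'hello ' + name\n"

# 1. scanning/lexing: source text -> token stream
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]

# 2. parsing: tokens -> abstract syntax tree
tree = ast.parse(source)

# 3. a (very small) symbol table: record where names are bound
bindings = {}
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        bindings[node.name] = node.lineno
    elif isinstance(node, ast.arg):
        bindings[node.arg] = node.lineno

print(tokens[:4])   # ['def', 'greet', '(', 'name']
print(bindings)     # {'greet': 1, 'name': 1}
```

A real indexer would of course go on to build scopes, infer types, and run
semantic checks over this tree rather than stopping at bindings.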
There are many tools in the compiler family which generate simple static
models sufficient (only) for their own needs -- for instance, linters,
syntax highlighters, pretty printers, documentation generators, even classic
code-generating compilers. This implies that lots of tools developers are
reinventing the wheel just well enough for their own tools.

But IDEs like Eclipse or Visual Studio are also in the compiler family.
While they began as aggregators of other compiler tools, these days they
actually create the richest general-purpose static models of the code,
because they need to manipulate and surface a superset of the information
from all the standalone tools. Hence, IDE backends are special-purpose
compilers that do their best never to throw anything away, which
differentiates them from all other compiler-like tools, including compilers
themselves.

The upside of this is that you run the indexer and get a general-purpose
programming model and API which together allow you to query and manipulate
the code as a set of data structures. At that point you can do anything you
like: build a global symbol autocompletion index, or write some refactoring
functions, or display fancy views of the class tree or type graph --
anything the model permits. An IDE backend should in theory be able to
replace the static model for any other compiler-like tool, which is nice: no
more wheel-reinventing necessary, at least provided the IDE actually exposes
its indexer in headless mode, which a few of them do today.

The downside is that it's fat and slow. Code produces a -lot- of metadata,
and rich IDEs have to do lots of multi-level caching tricks to avoid blowing
out memory or consuming all of your CPU, or both. The indexer I've submitted
has had a fair bit of tuning to try to minimize its memory use, but it still
struggles with projects with, say, 5000 files on a 32-bit JVM given -Xmx3000
or so. So it could use more work in this space.
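Here's a hypothetical miniature of "the index as a queryable model": map
every name to its definition and reference sites, which is roughly the shape
of data a completion or cross-reference tool needs. (The real indexer
records far richer metadata than this; the example below is my own sketch,
not code from the patch.)

```python
# Toy symbol index: name -> {definition lines, reference lines}.
# Hypothetical miniature of the model an IDE backend would expose.
import ast
from collections import defaultdict

source = """\
x = 1
def f(y):
    return x + y
z = f(x)
"""

index = defaultdict(lambda: {"defs": [], "refs": []})
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Name):
        kind = "defs" if isinstance(node.ctx, ast.Store) else "refs"
        index[node.id][kind].append(node.lineno)
    elif isinstance(node, ast.FunctionDef):
        index[node.name]["defs"].append(node.lineno)

# The sort of query a cross-referencer or completion engine might issue:
print(sorted(index["x"]["refs"]))   # lines where x is read: [3, 4]
print(index["f"]["defs"])           # line where f is defined: [2]
```

Everything downstream -- autocompletion, refactoring, class-tree views --
is just different queries and transformations over a structure like this.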
For now, we simply partition our code base into many projects, analyze them
all in parallel, and merge the results into a large serialized version of
the data structures produced by the analyzer shards -- a.k.a. "the index".

The version of the indexer I've submitted here does not have a serialized
representation. Internally at Google we have a language-neutral
representation shared by all our language analyzers, but it's
work-in-progress and relies heavily on Google infrastructure, so it's not
yet appropriate for open-sourcing. At some point we hope to get there, but
for now, the open-source Python indexer is suitable for small- to
medium-sized projects.

Serializing the index will be the next big evolutionary step, since if done
properly, it can help the indexer scale to larger projects while
simultaneously increasing the expressive power of the underlying model. I
suspect that a relational database is likely to be the best choice, but
defining the schema will be a tricky and ongoing task. Indexers like this
one need to keep growing to be maximally useful -- at the very least they
need to support the evolution of the language itself, but they also need to
refine their static models to support the ever-increasing demands of the
tools and their users. (As one concrete example, improving the indexer's
internal model of Python's type system will yield better results across the
board, e.g. picking up identifier references that it currently misses, but
this comes at the cost of increasing both the time spent during analysis and
the amount of information stored in the index.)

The demo app I provided builds an index and then "walks" it using a visitor
interface to produce a simple static-HTML view of the indexed code, with
hyperlinks between identifiers similar to those produced by the Linux
Cross-Referencer. But the indexer produces a great deal more metadata than
is used by the demo.
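As a hedged sketch of that shard-and-merge plus relational-serialization
idea, here's what merging two analyzer shards into a SQLite "index" could
look like. The schema, shard format, and symbol names below are all
hypothetical, invented for illustration -- the actual serialized
representation is exactly the open design question described above:

```python
# Sketch: merge per-shard symbol tables into one relational index.
# Schema and shard data are hypothetical, not from the patch.
import sqlite3

# Pretend output of two analyzer shards: (symbol, file, line) tuples.
shard_a = [("Frob.spin", "frob.py", 10), ("util.clamp", "util.py", 3)]
shard_b = [("Frob.stop", "frob.py", 22)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE symbols (name TEXT, file TEXT, line INTEGER)")
for shard in (shard_a, shard_b):
    db.executemany("INSERT INTO symbols VALUES (?, ?, ?)", shard)

# Once merged, the index is queryable like any database -- e.g. a
# global-autocompletion query for names starting with "Frob.":
rows = db.execute(
    "SELECT name FROM symbols WHERE name LIKE 'Frob.%' ORDER BY name"
).fetchall()
print(rows)   # [('Frob.spin',), ('Frob.stop',)]
```

A real schema would need many more tables (scopes, types, references,
cross-file edges), which is why defining it is the tricky, ongoing part.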
I'm hoping that some tools-inclined devs will jump in and start using the
indexer to power their own tools efforts, making improvements as needed.
With luck I may find time to write more demo apps, since it's easier to
appreciate the indexer's value when you can visualize the information it
provides. In the meantime, Python is a pretty big language at Google, and
indexing all the code should improve life for our Python developers, so
we'll keep hacking on the indexer and sending periodic updates. (This will
of course be easier if it's in the svn repository. :)

Hope that helps!

-steve

> Best Regards Anders Olme
>
> _______________________________________________
> Jython-dev mailing list
> Jyt...@li...
> https://lists.sourceforge.net/lists/listinfo/jython-dev