Re: A universal ADT mapper and sorter?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

On 08/02/2011, at 1:20 PM, Alan Silverstein wrote:

> Drifting off topic but what the heck...

Lol :)

Its not off topic because we're proposing to rework the Judy build system :)

BTW: I'm not proposing to use something wizz bang for this, Make
should do just fine: Judy is basically just a bunch of C files that
need to be compiled.

>> Data flow is not a new idea, it's a subset of the REAL idea:  category
>> theory.
> 
> OK, I'll have to read up on that...

Category Theory (CT) is *the* theory of abstraction.

It basically starts off by considering sets and functions,
in programming we call these types and functions,
in category theory we call them objects and arrows.

The key idea is to abstract away the elements of the sets
and talk about "structure" entirely in terms of the properties
of the functions.

For example we can explain "The function f:X->Y is 1-1", which is
a set element style definition, in a categorical setting like this:

for all g1, g2:  U -> X,     g1 .f =  g2.f implies g1 = g2

This is easy to understand. Suppose f wasn't 1-1. Then you might
have g1(x)=a and g2(x)=b, but f(a)=y and f(b)=y, so g1 and g2 
can be different functions, but you can't tell because f "removes
the different outputs by mapping them to the same value".

This can't happen if f is 1-1, if g1 and g2 are different there is some
value x for which g1(x) != g2 (x), and f must map these to two distinct
values, so g1 . f !=  g2 .f.

now the point here is to examine the formula again: we have defined
"f is 1-1" without mentioning elements.

In other words, the definition is *abstract*: written entirely in terms
of functions, ignoring the sets and their values.

In fact, we can throw out the sets entirely, replace X with the function

	identity: X -> X

Anyhow, the relation to "data flow" is clear: the arrows are "channels
down which data flows, possibly being modified along the way" :)

And the key properties can be understood in the abstract without
caring about the actual data types being processed.

> 
>> Build systems must be driven bottom up.  They're intrinsically
>> imperative NOT functional.
> 
> Would you elaborate on the difference?  Do you mean the difference
> between declarative (what) and functional (how)?

No, I mean build systems aren't declarative, quite the opposite.
They action based. Imperative, procedural, whatever.
Functional/declarative models are all wrong.

If you look at make, the *rules* are imperative: do this, do that.

The declarative part .. the goal dependency stuff, doesn't work.

Just for starters, many programs don't have a single output.
This screws the basic concept up immediately.

Some programs have side effects (no outputs as such) and some
have multiple outputs.

So you immediately have to add hacks to make, phonies and
proxy files, just to make it work at all ..  and thats doing really
basic stuff.

> 
>> When you change a source file, that should trigger rebuilding the
>> system based on what depends on the source file you changed.  This is
>> completely the reverse of target driven building.
> 
> Yes -- and no -- at least as I think of it.  Viewing a build system as
> an acyclical graph, it's a static (at any one point in time)

we agree this is a gross oversimplification .. we will temporarily accept
this to avoid confusion, but clearly it isn't so.

Consider say "doxygen" .. surely, there are a fixed set of inputs,
but who knows what the *.html files it generates are???

> set of
> relationships between sources (files that have no arrows into them
> within the build system, even if derived say from a version control
> system) and constructed targets (some of which are deliverable, others
> of which are intermediate, but that doesn't matter here).  Given some
> form of specification of these relationships -- sources, targets, rules,
> dependencies/conditions -- then any time a source changes, all dependees
> must be at least revisited if NOT updated/reconstructed, whether you
> consider this to be targets-backwards or sources-forward.

Yes .. given this information. The problem is how to get it.
Specifying dependencies is intrinsically unreliable and entirely unnecessary,
provided you can capture outputs.

This is hard to understand but true. Consider a set of programs
(compilers, document generators, linkers, etc etc). And some source files.

Lets assume we know which programs to apply to which files.

What order should we apply then programs in? When do we need to apply them?

The answer is not what you'd expect.

You can apply the programs IN ANY ORDER!!

Surprised? Its true! It doesn't make any difference. Here is the build algorithm:

Apply programs to files in any order.
Step2:
  Do it again.
  If the results are the same, you're done. 
 Otherwise repeat from step 2.

This is the fixpoint algorithm: just keep trying until the build is stable.

To make this work you need to be able to monitor the *outputs* of programs:
you need to see every file that is touched or created, so you can check
when the build is completed (reached a fix point).

Well now, you can say "But that is horribly inefficient!!!"

Yes yes, you'd be right. So optimise it. Specify dependency
information AS A HINT.

Now perhaps you see the point. This system does NOT require
dependency information to work. Only to work efficiently.
in particular it doesn't require all the dependency information
and it still works if the dependency information is wrong.

So here, the fundamentals are:

(a) a set of build steps
(b) output monitoring

The "dependencies" are relevant only for optimisation.
That's not unimportant!! But the point is, the system, at the core,
is driven bottom up, and consists of a set of *actions*:
there's nothing declarative in the core. The dependency relations
are useful for performance, they're not of any conceptual significance.

Certainly dependencies *exist*. Certainly there is workflow.
But it isn't necessary to specify it, nor to even get it right!

Interscript **literally** works this way! It repeatedly does actions
until nothing changes (or a limit is reached, usually 2 passes is the limit :)

> 
> By the way, that elaborate chip design system I mentioned had a neat
> feature, where you could say "check to see if the target actually
> changed as a result of the reapplication of the rule" and if not, don't
> touch it, don't even change its modify time, meaning all downstream
> targets (dependees of it) don't need rebuilding.

Interscript does this automatically for all outputs, for that reason.
Interscript itself doesn't care, since it compares the contents of 
files. But to do that, saves every output to a temporary first,
so it is easy to just abandon the temporary if there is no change.
[Actually it is more efficient to read/compare until there's a difference,
then switch to write mode .. but I didn't implement that]

>> You CANNOT specify goal driven building effectively, because it is not
>> possible to get the dependencies right.  This is a plain fact of
>> reality.
> 
> Can you please elaborate on that?

Sure: tell me the names of all the files generated by a run of doxygen.
[Doxygen is a C++ code documentation system]

You can't. It makes them up using some hidden formula, they're just
a set of web pages, all it cares about is that file1 correctly references
file 2. All you know is the name of the "index page".

but you cannot do a build depending on the output of doxygen by just
examining the index page because it is likely to be unchanged even
when the pages describing your functions have major changes in them.

>  Again if I imagine the DFD describing
> a collection of source and constructed files, and their rules and
> dependencies, it doesn't seem to matter much which way you look at the
> arrows, it's the results that count.

Yes, but the problem is you cannot specify the graph: its too hard.
And it isn't necessary.

> An even worse problem, usually not well understood, is a multi-rule
> target.  This is when several rules contribute to a single repository
> (such as a message catalog), blurring the state of that target for its
> dependees.  

Basically, in set theory and category theory we have things
called products. Aka "Cartesian product" or "tuple" or
even "struct" in C.

What you're saying is that handling NORMAL arrows from
products to products is hard:

	a,b,c -> d,e,f

and that's my point. This is trivial basic stuff.

Part of the problem is that people **incorrectly** think that a
graph is like this:

 file -- action --> file

This is completely wrong! Its the other way around!

The actions are not arrows, they're the points!

The resulting data structure is called a Multi-Graph.

--- x.c -->[ C compiler ] --> x.o

Note carefully: the arrows are files. The C compiler is a point.
(black box, chip, whatever you want).

I have a whole book on this, but its hard to get (by RFC Walters).

> I further divide these into robust and fragile multi-rule
> targets.  A robust one can be partially updated correctly at any time
> (like revising some database entries), but a fragile multi-rule target
> must be wholly rebuilt (running multiple input rules) when any
> dependency demands it.  In the worst case there's an ordering
> requirement upon the rules (the file must be built in the right order)
> which is difficult to correctly represent in a "static" DFD.  Wise
> designers avoid creating constructing files that are fragile multi-rule
> targets, if at all possible.

Yes, that sounds interesting. Basically some products can be built
one component at a time and some are built "all at once".

> I think multi-rule targets arise naturally but mistakenly from
> old-school thinking where files and file systems were expensive,

Yes. The whole Unix FSH (File system tree) is archaic.
The idea of putting all the *.h files in one directory and the *.o files
(or *.a files) in another is absurd. But it was done in the old days
for performance.

No modern systems use this. I think Sun pioneered the right way:
one directory for each product (in "opt").  Apple has these, calls them
frameworks. On unix systems we usually pervert /usr/local/lib.
After all not all software is even C !

> so we
> lumped similar things into common files (a kind of not-really database),
> sometimes with an associated "registry" (index) of some type.  I'm more
> in favor of what I call "self-registry", like how /etc/rc.d works (if I
> recall right).  You drop files/scripts into a "known location" and their
> mere presence (when found) acts as the registry, plus you can easily
> update every file separately from others.

Indeed. Which is why build systems have to be driven bottom up.
So you can drop new sources into the right place and just expect
them to be built.

You can't do that with targets, because they're generated and you
don't get to "drop" them anywhere :)

> I dispute your
> assertion that "some systems require recursion."  

You can't dispute it, LaTeX requires it, and there's no way around it.

> I would assert that you have a design flaw in your package.  Correct
> building demands "full disclosure" to the build control system, in
> whatever language.  

You misunderstand: interscript IS the build system. It doesn't need
any disclosure. That's the point. It can generate code or documentation
or indeed do ANY process at all without disclosure.

It uses "discovery" instead.

> All files must be listed; hidden temporary or
> intermediate files not explicitly stated are accidents waiting to
> happen.  

It isn't possible. See doxygen example. 

> Your example of (presumably) unpredictable deliverable targets
> is even worse.  It might be expedient for the programmer to just "write
> the list as a smart rule," but I think it's bad design.  It makes it
> impossible to "manifest" the customer deliverable package in a
> predictable and auditable way.  (I have a lot of experience dealing with
> CPE = current product engineering...)

Yes, it does. You have re-think your quality control systems to handle
this.

> I understand WHY programmers like to operate this way.  It's clunky to
> have to "redundantly" state information to various parts of the
> engineering system.

Yes, it is clunky .. and only practical for simple systems.

With more complex systems, it's a liability. That's the point here:
if your build system *depends* on the replicated dependency
information, then it can fail silently.

If it doesn't it can't. So it is better if it doesn't :)

> So being a clever programmer, hell I'll just write a script/program that
> embodies some arcane app-specific knowledge about how to create targets
> from sources, based on "discovery"...
> 
> Believe me I've see all kinds of half-assed (well-intended but still
> hackish) packages put together around these kinds of issues, with no
> overall understanding of what it means to deliver maintainable,
> updateable, removable packages to customers.

I share your concerns. But don't knock discovery as such: like everything,
not all systems are reliable!

Almost everyone writing Ocaml programs uses Ocamldep to generate
dependencies (Ocaml requires files be compiled in dependency order).

Even in a 20-30 file program it is almost impossible to maintain
the order by hand. If you have to do that, it becomes an obstacle
to refactoring.

> 
> I don't think the answer is to punt and say, "my targets are
> auto-generated."  A better answer is, "I have an easy way to specify
> exactly what I'm expecting within and as output from the build system,
> and to check that I got what I expected."  

I do that: what I expect is that the regression tests all pass :)

> 
>> Yes.  And the way to do that sophisticated stuff requires a REAL
>> programming language like Python.  Trying to do this with micky mouse
>> crap like Make cannot possibly work.
> 
> Uh, you dismiss it too quickly.  

No, I discarded it after 20 years struggling to understand how
it works and failing all the time to see any connection between
what it does and what general tools do: it works marginally well
for C and that's about all.

>> Fbuild is a caching build system.  It caches the results of various
>> operations (compiling stuff, etc) and knows when the caches are not up
>> to date.  So rebuilding is the same as building, except the caching
>> allow skipping some parts of the build because the dependencies tell
>> fbuild the results will be the same.
> 
> Cool, that's the right concept.

Yes, but it's nothing like make. It has ONLY build rules, there are no
dependencies. (there is dependency generation in some of the subrules,
for example to build an ocaml program).

Rather you just give functions like:

	link(cc("x.c", cc("y.c"), result="aa.out")

which just like the "make" rules.. no dependencies specified.

But it doesn't do the compiles and links every time, it caches the results
of each function call.

Yes there are dependencies, but they're "discovered" because we know
when you run cc('x.c') that that function call depends on file "x.c".

The cc function puts a digest of the file into a database. next time it is
called, if the digest is the same, it does nothing. (Actually, it returns
the digest of "x.o" from the data base).

>> Fbuild captures dependencies automatically, you not only don't have to
>> specify them ..  you CANNOT specify them.
> 
> Caution, you appear to be headed down the same path as (now what was the
> name again of Rational Software's kernel-incestuous over-the-top version
> control and build package?)  You couldn't swat a fly in that system
> without first getting a doctoral thesis!

I can't swat a fly with "make" so I'm no worse off :)

> How do you let people specify unusual dependencies that aren't as simple
> as compile this-to-that?

In felix there is a directory called "buildsystem" which contains all the Felix
specific rules:

~/felix>ls buildsystem/*.py
buildsystem/__init__.py                 buildsystem/flx_stdlib.py
buildsystem/bindings.py                 buildsystem/iscr.py
buildsystem/demux.py                    buildsystem/judy.py  <<<-------------------------
buildsystem/dist.py                     buildsystem/mk_daemon.py
buildsystem/dypgen.py                   buildsystem/ocs.py
buildsystem/faio.py                     buildsystem/post_config.py
buildsystem/flx.py                      buildsystem/re2.py
buildsystem/flx_async.py                buildsystem/sex.py
buildsystem/flx_compiler.py             buildsystem/show_build_config.py
buildsystem/flx_drivers.py              buildsystem/speed.py
buildsystem/flx_exceptions.py           buildsystem/sqlite3.py
buildsystem/flx_gc.py                   buildsystem/timeout.py
buildsystem/flx_glob.py                 buildsystem/tools.py
buildsystem/flx_pthread.py              buildsystem/tre.py
buildsystem/flx_rtl.py                  buildsystem/version.py

Each of of these files contains some special rules for building something,
part of Felix, or a third party library. I marked one of some interest .. :)
Here it is:

########################
import fbuild
from fbuild.functools import call
from fbuild.path import Path
from fbuild.record import Record

import buildsystem

# ------------------------------------------------------------------------------

def build_runtime(phase):
    path = Path('src/judy')

    buildsystem.copy_hpps_to_rtl(phase.ctx, path / 'Judy.h')

    dst = 'lib/rtl/flx_judy'
    srcs = [
        path / 'JudyCommon/JudyMalloc.c',
        path / 'Judy1/JUDY1_Judy1ByCount.c',
        path / 'Judy1/JUDY1_Judy1Cascade.c',
        path / 'Judy1/JUDY1_Judy1Count.c',
        path / 'Judy1/JUDY1_Judy1CreateBranch.c',
        path / 'Judy1/JUDY1_Judy1Decascade.c',
        path / 'Judy1/JUDY1_Judy1First.c',
        path / 'Judy1/JUDY1_Judy1FreeArray.c',
        path / 'Judy1/JUDY1_Judy1InsertBranch.c',
        path / 'Judy1/JUDY1_Judy1MallocIF.c',
        path / 'Judy1/JUDY1_Judy1MemActive.c',
        path / 'Judy1/JUDY1_Judy1MemUsed.c',
        path / 'Judy1/JUDY1_Judy1SetArray.c',
        path / 'Judy1/JUDY1_Judy1Set.c',
        path / 'Judy1/JUDY1_Judy1Tables.c',
        path / 'Judy1/JUDY1_Judy1Unset.c',
        path / 'Judy1/JUDY1_Judy1Next.c',
        path / 'Judy1/JUDY1_Judy1NextEmpty.c',
        path / 'Judy1/JUDY1_Judy1Prev.c',
        path / 'Judy1/JUDY1_Judy1PrevEmpty.c',
        path / 'Judy1/JUDY1_Judy1Test.c',
        path / 'Judy1/JUDY1_j__udy1Test.c',
        path / 'JudyL/JUDYL_JudyLByCount.c',
        path / 'JudyL/JUDYL_JudyLCascade.c',
        path / 'JudyL/JUDYL_JudyLCount.c',
        path / 'JudyL/JUDYL_JudyLCreateBranch.c',
        path / 'JudyL/JUDYL_JudyLDecascade.c',
        path / 'JudyL/JUDYL_JudyLDel.c',
        path / 'JudyL/JUDYL_JudyLFirst.c',
        path / 'JudyL/JUDYL_JudyLFreeArray.c',
        path / 'JudyL/JUDYL_JudyLInsArray.c',
        path / 'JudyL/JUDYL_JudyLIns.c',
        path / 'JudyL/JUDYL_JudyLInsertBranch.c',
        path / 'JudyL/JUDYL_JudyLMemActive.c',
        path / 'JudyL/JUDYL_JudyLMemUsed.c',
        path / 'JudyL/JUDYL_JudyLMallocIF.c',
        path / 'JudyL/JUDYL_JudyLTables.c',
        path / 'JudyL/JUDYL_JudyLNext.c',
        path / 'JudyL/JUDYL_JudyLNextEmpty.c',
        path / 'JudyL/JUDYL_JudyLPrev.c',
        path / 'JudyL/JUDYL_JudyLPrevEmpty.c',
        path / 'JudyL/JUDYL_JudyLGet.c',
        path / 'JudyL/JUDYL_j__udyLGet.c',
        path / 'JudySL/JudySL.c',
        path / 'JudyHS/JudyHS.c',
    ]

    includes = [
        path,
        path / 'JudyCommon',
        path / 'Judy1',
        path / 'JudyL',
        path / 'JudySL',
        path / 'JudyHS',
    ]

    types = call('fbuild.builders.c.std.config_types',
        phase.ctx, phase.c.shared)

    macros = ['BUILD_JUDY']
    if types['void*']['size'] == 8:
        macros.append('JU_64BIT')
    else:
        macros.append('JU_32BIT')

    return Record(
        static=buildsystem.build_c_static_lib(phase, dst, srcs,
            includes=includes,
            macros=macros),
        shared=buildsystem.build_c_shared_lib(phase, dst, srcs,
            includes=includes,
            macros=macros))

def build_flx(phase):
    return buildsystem.copy_flxs_to_lib(phase.ctx,
        Path('src/judy/*.flx').glob())
###################

this is actually quite "make like" in that the source files are all specified.
notice though, there's no mention of *.o files or *.a or *.so or whatever.

It's hard to see, but the build parts and the whole thing return "cache"
of the process such that the system know when to rebuild it.
I have to specify the inputs, and parameters to the build process,
but usually not any outputs.

The top level of the build system is not fbuild. it is a file: fbuildroot.py in
Felix.

fbuild is just a LIBRARY of tools which help specifying a build system.
Erick is constantly adding new functionality to that to support more
compilers etc.

the most important part of that library is the bit which registers and caches
functions, using Python marshalling features, Sqlite3 as the database,
and uses RPC (remote procedure calls) in there somewhere as well.
It's pretty complex. It works pretty well though!

--
john skaller
sk...@us...