AO readme.txt - documenting the ao code.

Contact: pwaksman@yahoo.com

related documentation: cmndDoc.txt, outline.txt, AODiagrams.doc

Note: the source files whose names start with "ao" are part of a self-contained DLL. All other files are part of a testbed that exercises the DLL.


This code has to do with three related topics:
- the structure of attribute-object diagrams ("AODiagrams") which define the nature of "data" 
- enumerating and counting patterns in the data
- a mechanism, called "data equilibrium", for making predictions and extrapolating from the data.

We also discuss odds and ends: utilities included with the code.


----------
AODiagrams
----------
Example: we have a population of students taking classes and getting letter grades A,B,C,D,F. We put all the data into a table with one row per student and a column for each course; something like this:


StudentName   English    Math    SocialStudies   PhysEd
------------  -------    ----    -------------   ------
student1        B         B          A             C
Student2        B         A          C             D
Student3        C         C          D             B
Student4        A         C          D             B
etc....

The goal is to understand correlations, to see to what extent one grade could be predicted from another or, more subtly, the extent to which one grade could be predicted from several others - given this data. 

In this example the "objects" are students and the "attributes" are the different courses. The individual letter grades are called "traits". In probability terms, such attributes are called "discrete-valued random variables", the objects are called "samples", and a collection of samples is called a "population". There is no requirement that the random variables use the same range of values. In general, AODiagrams are part of the structure for the statistical prediction of one attribute given knowledge of others.

AODiagrams can be represented visually (see AODiagramFigure1.GIF) and are of interest as a more general diagram than the Venn Diagram, or the Tree Diagram. They are of value in representing states of complex state machines. They are the basic structure defining a "design for experiment" and they are also one way of presenting finite projective geometries (where the objects are points and the attributes are families of parallel lines). 

Here, for this code, the primary interest of AODiagrams is that they represent the set of attributes and traits as a two-dimensional 'surface' through which the objects zig-zag. It is the geometry of this 2D surface, the force functions defined on it, and how force propagates within the surface that are of primary interest. The metaphor is that traits are points on a surface and that forces are communicated from one part of the surface to another via the objects that zig-zag past: the more objects connecting two 'points', the stronger the propagation of force. The metaphor continues with pairs and tuples of traits defining a subtler modulation on the strength of the connections, so that multivariate prediction is done by adding an external force to the surface and letting the surface come to equilibrium.


IMPLEMENTATION DETAIL
---------------------
The AODiagram is also used as an indexing mechanism where loops take the form

   for all attributes i
   {
      for all traits j of attribute[i]
      {
          do something( i,j );
      }
   }

When working with a particular piece of data, this diagram indexing is a constant and the shape of all basic loops remains the same. Often we need to make a first-pass read of the data in order to determine this AODiagram structure [the structure is called the AOSHAPE in the code].
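As an illustrative sketch only (Python rather than the C of the ao DLL; the names ROWS, ATTRIBUTES, and build_shape are hypothetical, not the DLL's API), the first-pass read over the student table above might look like:

```python
# Illustrative first pass: discover the AODiagram shape -- the
# attributes and, for each, the set of traits actually seen.

ATTRIBUTES = ["English", "Math", "SocialStudies", "PhysEd"]
ROWS = [
    ["B", "B", "A", "C"],   # student1
    ["B", "A", "C", "D"],   # Student2
    ["C", "C", "D", "B"],   # Student3
    ["A", "C", "D", "B"],   # Student4
]

def build_shape(rows, n_attributes):
    """Return, for each attribute i, the sorted list of traits seen."""
    shape = [set() for _ in range(n_attributes)]
    for row in rows:
        for i, trait in enumerate(row):
            shape[i].add(trait)
    return [sorted(traits) for traits in shape]

shape = build_shape(ROWS, len(ATTRIBUTES))

# The basic loop shape from above: for all attributes i,
# for all traits j of attribute[i].
for i, traits in enumerate(shape):
    for j, trait in enumerate(traits):
        pass  # do_something(i, j)
```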

--------------------------------
Enumerating Patterns in the Data
--------------------------------
A 'pattern' is a subset of the attributes, with a trait selected for each. Hence getting an A in English and a D in PhysEd is a pattern that can be represented as A**D, where "*" is used as a placeholder indicating a "don't care" trait value. Note that this pattern (A**D) does not occur in the above data, but the pattern *CDB does occur twice: we say the multiplicity of *CDB is 2. An object "satisfies" a pattern, and the multiplicity of a pattern counts how many objects satisfy it.

With typical real data, not all patterns occur with the same multiplicity. Hence partial patterns are predictive. In the above data, for example, the pattern *CD* always occurs as part of *CDB, so *CD* predicts the final trait as 'B'.
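A minimal sketch of pattern multiplicity on the student table above (Python for brevity; multiplicity is a hypothetical name, not the DLL's counting routine):

```python
# Multiplicity of a wildcard pattern: '*' is a don't-care slot; an
# object satisfies a pattern when every non-'*' slot matches its trait.

ROWS = [
    ["B", "B", "A", "C"],   # the student table from above
    ["B", "A", "C", "D"],
    ["C", "C", "D", "B"],
    ["A", "C", "D", "B"],
]

def multiplicity(pattern, rows):
    return sum(
        all(p == "*" or p == t for p, t in zip(pattern, row))
        for row in rows
    )

# From the text: A**D does not occur, *CDB occurs twice.
print(multiplicity("A**D", ROWS))   # 0
print(multiplicity("*CDB", ROWS))   # 2
```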

<DIGRESSION> It is interesting to note that, in any piece of data, it is possible to sort the patterns by their multiplicities. It is intriguing, though not relevant to understanding this code, that as the multiplicity increases the number of patterns with that multiplicity usually decreases in real data. In other words, high multiplicity is rare amongst patterns. That the decrease follows some kind of "power law" was first observed by Pareto. Failure of the power law is a way to detect fake data. It is a bit confusing that there is both a population of objects and a population of patterns, with some kind of duality principle between them; but we do not consider the population of patterns any further.</DIGRESSION>

Challenge:
Given a piece of data with thousands of records, the task of listing all the patterns is a computational challenge. Even with only a few dozen attributes, complete enumeration rapidly becomes impossible no matter how much memory or CPU speed is devoted to it. One of the achievements of the ao code is that it implements a thinning mechanism to discard patterns of low multiplicity (which are the commonest). The tradeoffs here between computational speed and prediction accuracy have never been explored.
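The thinning idea can be sketched as a level-wise enumeration that never extends a pattern once its multiplicity falls below a threshold. Since a pattern can never be more frequent than any of its sub-patterns, the whole branch below a discarded pattern is safe to drop. This is an assumption-laden illustration in Python, not the DLL's actual mechanism:

```python
# Level-wise pattern enumeration with threshold thinning (sketch).

ROWS = [
    ["B", "B", "A", "C"],
    ["B", "A", "C", "D"],
    ["C", "C", "D", "B"],
    ["A", "C", "D", "B"],
]

def mult(pat, rows):
    return sum(all(p == "*" or p == t for p, t in zip(pat, row))
               for row in rows)

def frequent_patterns(rows, min_mult):
    n = len(rows[0])
    # seed with patterns fixing exactly one attribute
    level = {tuple(row[i] if k == i else "*" for k in range(n))
             for i in range(n) for row in rows}
    kept = []
    while level:
        nxt = set()
        for pat in sorted(level):
            if mult(pat, rows) < min_mult:
                continue              # thinning: drop, never extend
            kept.append(pat)
            last = max(i for i, p in enumerate(pat) if p != "*")
            # extend by fixing one more attribute to the right
            for i in range(last + 1, n):
                for row in rows:
                    q = list(pat)
                    q[i] = row[i]
                    nxt.add(tuple(q))
        level = nxt
    return kept

patterns = frequent_patterns(ROWS, 2)
```

On the four-row table this keeps only the patterns of multiplicity at least 2, including *CDB.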


Dimensionality:
One of the things addressed by the ao code is the dimensionality of the problem. If we want to predict one attribute, how many other attributes do we need? It turns out that errors occur when too few OR TOO MANY attributes are used.

<DIGRESSION> For example, there are said to be 13 "FICO score" numbers used to define a person's credit rating. A loan applicant may or may not default on a particular type of loan, and the formula that calculates the loan default risk probably does not use the correct dimension.</DIGRESSION>

IMPLEMENTATION DETAIL
---------------------
We assume objects are counted and each has an index. We represent one object index as a bit set to 1 at this index position within a stream of bits. To do this for large index values we need a long stream of bits. For example 00100000000000000 represents the third object out of 17. A set of objects is represented by a set of bits and this allows simple set operations like AND and OR to be done using respectively intersection and union of bits in the bitstream. If we have N objects then we need bitstreams of size N bits. Possibly there is also a quick counting function to say how many bits are set in a given bitstream.
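The bitstream idea can be sketched with Python's arbitrary-precision integers standing in for the bit arrays (an assumption for illustration; the DLL presumably uses raw bit buffers):

```python
# Bitstreams modeled as integers: bit q set means object q is in the
# set.  Set intersection/union become & and |, and the "quick counting
# function" is a population count.

def singleton(q):
    """Bitstream containing only object q (0-based index)."""
    return 1 << q

def popcount(bits):
    """How many bits are set, i.e. how many objects in the set."""
    return bin(bits).count("1")

a = singleton(0) | singleton(2) | singleton(5)   # objects {0, 2, 5}
b = singleton(2) | singleton(5) | singleton(7)   # objects {2, 5, 7}

shared = a & b                 # intersection: objects {2, 5}
print(popcount(shared))        # 2
```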

Basic Algorithm:
For a given piece of data, after we know the shape of the AODiagram, we allocate one bitstream for each trait in the diagram. So each trait indexed by (i,j) will have its own bitstream that starts out all zeros.

Now, read through the data (a second time). For each object with index q, having trait (i,j), set bit q of the (i,j)-th bitstream to 1. When we are done with this operation, the bitstream at each trait has a record of all the objects with that trait. Moreover, we can AND bitstreams from different trait collections to find out which objects, and how many, are shared by all traits in the collection.
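A sketch of this second pass, again using Python integers as bitstreams (keying by trait value rather than the code's (i,j) index pair, purely for readability):

```python
# Second pass: one bitstream per trait; set bit q for every object q
# having that trait.  ANDing streams then counts shared objects.

ROWS = [
    ["B", "B", "A", "C"],   # object q = 0
    ["B", "A", "C", "D"],   # object q = 1
    ["C", "C", "D", "B"],   # object q = 2
    ["A", "C", "D", "B"],   # object q = 3
]

streams = {}
for q, row in enumerate(ROWS):
    for i, trait in enumerate(row):
        streams[(i, trait)] = streams.get((i, trait), 0) | (1 << q)

# Which objects have Math=C AND SocialStudies=D (attributes 1 and 2)?
both = streams[(1, "C")] & streams[(2, "D")]
print(bin(both).count("1"))    # 2 -- the *CD* pattern occurs twice
```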

IMPLEMENTATION DETAIL - Content Addressable Memory
--------------------------------------------------
If that were all we were doing it would still be pretty slow. Given a pattern like A**C*******E****** we could AND together three bitstreams and count the remaining set bits, but that takes time. It would be nice if, instead, we could store that count somewhere that could be accessed directly by the content values 'A', 'C', 'E' at the appropriate slots. The ao code includes an elaborate scheme for doing just that. The algorithm for creating the scheme is called "trace()" and is best documented by the code itself. In practice it works as a large collection of AODiagram-indexed pointers to other AODiagram-indexed pointers: a huge linked-list structure, but linked using the AODiagram indexing structure. Once the object counts are stored in the structure, the original data and objects can be discarded; they play no role in prediction.
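The actual trace() structure is best read in the code. As a rough functional analogue only, one can precompute every pattern's count into a dictionary keyed by the pattern itself, so a query is a direct lookup rather than a scan:

```python
# Content-addressed counts (sketch, not the real trace() structure):
# each object satisfies 2^n patterns -- each slot is either its trait
# or '*' -- and we total them in a dict keyed by the pattern.

from itertools import product

ROWS = [
    ["B", "B", "A", "C"],
    ["B", "A", "C", "D"],
    ["C", "C", "D", "B"],
    ["A", "C", "D", "B"],
]

counts = {}
for row in ROWS:
    for pat in product(*[(t, "*") for t in row]):
        counts[pat] = counts.get(pat, 0) + 1

# Direct content-addressed lookups; no scan of the original data:
print(counts[("*", "C", "D", "*")])          # 2
print(counts.get(("A", "*", "*", "D"), 0))   # 0
```

Each object contributes 2^n patterns (n = number of attributes), which is exactly the combinatorial blow-up that makes the thinning mechanism above necessary for real data.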


IMPLEMENTATION DETAIL - Conditional probabilities
-------------------------------------------------
Hence the ao code provides a rapid means of answering the question: how many objects satisfy this pattern? This provides the basis of quick calculation of conditional probabilities. The probability of event X given observed E1,E2,... is (taking some liberties with placement of "*"):

prob(X | E1&E2&...) = multiplicity of (X*E1*E2*...)
                    / multiplicity of (**E1*E2*...);
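Worked through on the student table above (a sketch; multiplicity here is hypothetical shorthand for the ao count lookup):

```python
# The conditional-probability formula on the student table:
# prob(PhysEd=B | Math=C, SocialStudies=D)
#   = multiplicity(*CDB) / multiplicity(*CD*).

ROWS = [
    ["B", "B", "A", "C"],
    ["B", "A", "C", "D"],
    ["C", "C", "D", "B"],
    ["A", "C", "D", "B"],
]

def multiplicity(pattern, rows):
    return sum(all(p == "*" or p == t for p, t in zip(pattern, row))
               for row in rows)

# multiplicity(*CDB) = 2 and multiplicity(*CD*) = 2, so the
# probability is 1.0 -- *CD* fully predicts the final trait B.
p = multiplicity("*CDB", ROWS) / multiplicity("*CD*", ROWS)
print(p)   # 1.0
```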


----------------
Data Equilibrium
----------------
The simplest example of data equilibrium is to define a coefficient of "influence" between two traits (i1,j1) and (i2,j2) based on the conditional probability of one given the other. The question is: if we add a force of 1.0 at (i1,j1), how much of the force should propagate to (i2,j2)? The answer needs to be in proportion to the conditional probability between the two. However, we must then consider that the force CONTINUES TO PROPAGATE beyond (i2,j2) through that trait's influence on others. In the end, we can only accommodate the external force if we let the entire diagram of forces come to equilibrium following an equation of the form
	
			E = C*E + F

where F is the external force, E is the sought-after equilibrium state, and C is a matrix of coefficients defined using the trait-to-trait conditional probabilities. The equation can be solved iteratively using Picard iteration, or in closed form as E = (I - C)^-1 * F when I - C is invertible.
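A sketch of Picard iteration on a toy 2x2 system (the matrix C and force F here are made-up numbers, chosen so that C is a contraction and the iteration converges):

```python
# Picard iteration for E = C*E + F: start from E = F and repeatedly
# substitute, E_next = C*E + F.  Toy coefficients for illustration.

C = [[0.0, 0.3],
     [0.2, 0.0]]
F = [1.0, 0.0]

def step(C, E, F):
    """One Picard iterate: E_next = C*E + F."""
    return [sum(C[i][k] * E[k] for k in range(len(E))) + F[i]
            for i in range(len(F))]

E = F[:]                  # start from the external force itself
for _ in range(100):
    E = step(C, E, F)

# Closed-form check for this 2x2: E = (I - C)^-1 F
#   = (1/0.94) * [1.0, 0.2]  ~  [1.0638, 0.2128]
print(E)
```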

Since the matrix of coefficients only looks at pairs of attributes, this is considered the first-order equilibrium equation. However, the content-addressable memory scheme allows for use of conditional probabilities that depend on triples of attributes or higher-order tuples of attributes. How are these to be included in the calculation? The details are in the recursive GetExcitation() function, where forces on some traits are amplified as changes to the matrix C in the first-order equation. Here, fixed ways of doing things start to get vague. There are so many alternatives. Should we be using "a priori" or "a posteriori" estimates? What kind of dampening is needed so the iterative method converges? What are the absolute (as opposed to relative) meanings of the numerical results?


-------------
ODDS AND ENDS
-------------
The code contains a command interpreter and a list of commands with usage. It turns out that Chi-Squared and ranking are so much fun that I included them in the package. You can do a lot with Chi-Squared when you have fast counting of pattern multiplicities.

Once you get the command interpreter running, type "?" for a list of commands or
"-command- ?" for a command's usage.

See cmnddoc.txt and read the code.


Comment on data files:
There are some CSV files I found around the web. The vaccination side-effect data "VARS" is of interest. Each usable *.csv file comes paired with a *.ft file which defines the data format as a list of types. Read the code. But for example, some commands are:

READHEAD irish.ft
READDATA irish.csv
PATMEM 0
etc....

Source: README.txt, updated 2010-11-30