Name | Modified | Size | Downloads / Week |
---|---|---|---|
readme.txt | 2012-09-25 | 3.7 kB | |
Totals: 1 Item | | 3.7 kB | 0 |
Hierarchical and Functional Approaches to Data Analysis

Spent some time over the weekend thinking about some ideas I had last year about how to characterize and manipulate classes of data which exhibit multiple "dimensions" of relationship. For example, data points can share a hierarchical relationship to a set of parent data points while also sharing a non-linear, "functional" relationship based on the common technique which was used to produce them. That functional relationship ties together data sets which, based on their hierarchical position alone, would be disparate.

So I wrote a Node class in Python which is a first, crude attempt at combining these two approaches, tied together with a (deceptively!) trivial Creator class. Essentially, every new node is created from two things: the data stored in ParentNode.content (or, minimally, data manually inserted by the programmer, as with file names and whatnot) *AND* a Creator object. The most important component of a Creator object is a function which performs some operation on ParentNode.content, for example, searching ParentNode.content for matches to a regular expression and returning them.

When the Creator object is called by ParentNode.newchild(), the ParentNode adds all child nodes generated by Creator.transform() to the dictionary entry for the unique identifier assigned to the Creator object. Additionally, every Creator object stores its own list of Node objects (really, just the pointers to the Node objects, since this is Python). This makes it possible to search at once for all nodes which were generated by a given Creator object, regardless of the hierarchical relationship between the nodes (i.e., regardless of whether they were generated from data in the same file, and so on).

It was interesting to work through an exercise in this: generating a link diagram of each file's function definitions and function calls using this method. Out comes a list of the functions called and defined in each file.
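To make the arrangement above concrete, here is a minimal sketch of how a Node and a Creator could fit together. It is not the code shipped in this project: the constructor signatures, the `children` dictionary, the `uid` counter, the `regex_creator` helper, the two regexes, and the sample file contents are all assumptions made for illustration. Only the names Node, Creator, content, newchild(), transform(), FuncDef and FuncCall come from the description.

```python
import itertools
import re


class Creator:
    """Pairs a unique id with a function that turns parent content into child content."""

    _ids = itertools.count()  # assumed: a plain counter is enough for unique ids

    def __init__(self, name, func):
        self.uid = next(Creator._ids)
        self.name = name
        self.func = func    # callable: content -> iterable of new contents
        self.nodes = []     # every Node this Creator produced, in any subtree

    def transform(self, parent):
        """Apply func to parent.content and wrap each result in a child Node."""
        children = [Node(item, parent=parent, creator=self)
                    for item in self.func(parent.content)]
        self.nodes.extend(children)
        return children


class Node:
    def __init__(self, content, parent=None, creator=None):
        self.content = content
        self.parent = parent
        self.creator = creator
        self.children = {}  # Creator uid -> list of child Nodes

    def newchild(self, creator):
        """Run creator on this node and file the results under the creator's uid."""
        made = creator.transform(self)
        self.children.setdefault(creator.uid, []).extend(made)
        return made


def regex_creator(name, pattern):
    """Hypothetical helper: a Creator whose function returns all regex matches."""
    compiled = re.compile(pattern, re.MULTILINE)
    return Creator(name, compiled.findall)


# Deliberately crude patterns for Python source; illustrative only.
func_def = regex_creator("FuncDef", r"^\s*def\s+(\w+)\s*\(")
func_call = regex_creator("FuncCall", r"(?<!def )\b(\w+)\s*\(")

# File nodes are inserted by hand, as with file names in the description.
files = {
    "a.py": Node("def load(path):\n    return parse(read(path))\n"),
    "b.py": Node("def parse(text):\n    return text.split()\n"),
}

for node in files.values():
    node.newchild(func_def)
    node.newchild(func_call)

# Hierarchical view: what each file defines and calls.
for fname, node in files.items():
    defs = [c.content for c in node.children[func_def.uid]]
    calls = [c.content for c in node.children[func_call.uid]]
    print(fname, "defines", defs, "calls", calls)

# Functional view: everything one Creator produced, regardless of file.
all_defs = [n.content for n in func_def.nodes]
print("all definitions:", all_defs)
```

The hierarchical lookup goes through `node.children[creator.uid]`, while the functional lookup goes through `creator.nodes`. The same Node objects are reachable both ways, which is the point of the design.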
It would be easy to extend this to an arbitrary number of source code files using the recursive version of a FindFile function included in this project. If you were interested, you could also expand it to include regexes for class definition and instantiation. But I'm more interested in alternative viewpoints on how this approach to handling and searching through data can be applied to other problems, hopefully with similarly easy-to-code solutions.

What's also interesting is that when using newchildren() from findancestor(SetFileName) for FuncDef and FuncCall, ChildNode.content is *another node* and NOT actual data. I was going back and forth about whether I needed a separate Link class in which to instantiate links (i.e., relationships between nodes). I suspect that I will end up defining a link as the relationship between any parent and child where the content of at least one node is a pointer to another node...

Anyway, it's nice to see that these ideas from a year ago have finally sharpened to the point where I can start playing with them on the computer. Writing a program this way twists my brain in interesting ways. It is hard to think about how to tell the computer to organize data like this, but I think the above is actually an extremely efficient way to generate link relationships between caller and callee.

It will be interesting to expand on this later to apply it to network traffic analysis and, even more interestingly, to incorporate some automated heuristics: what if we use an evolutionary model where all Creator objects are applied to all Node objects? Which combinations of node + creator will generate the most relevant data, and so on?

Interested in what people think. Message me through SourceForge if you have ideas about where this can go.