Tutorial

Dr Francois Petitjean

Welcome to the wiki page!

On this page, I give a tutorial on how to use the Chordalysis software.

Here is the step-by-step procedure that I will walk through.

0. Note about new version

This tutorial was written for the main method of the 'Run.java' file. Chordalysis now uses a GUI via the class 'RunGUIProof', so Step 4 has to be replaced by the following:

Open a Windows/Linux/MacOS terminal and navigate (cd command) to the folder in which you extracted Chordalysis.jar. Then simply type:

java -jar -Xmx1g Chordalysis.jar

1. Download

The download part is pretty easy. Just download the .zip here:


2. Extraction

Not the hardest step either: just extract the zip wherever you want (:

3. Get a dataset

Chordalysis analyses data, so... you need a dataset.

The requirements for the file containing the dataset are:

  • it has to be a CSV file (one that you can open with Weka)
  • it has to be discrete (no numerical variables); if some (or all) of the variables are numerical, you need to discretize your dataset first (using equal-frequency binning, for instance)
  • every variable should take more than one value in the data: for example, if one of your variables is gender, then you cannot have females only (as would be the case for a study about pregnant women). This is because a variable with a single value would just be correlated to everything, which wouldn't be useful information. If you have such variables, just delete the corresponding columns from the dataset.
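
The last two requirements can be checked before launching Chordalysis. Here is a minimal Python sketch that does so, assuming the CSV has a header row with the variable names. The function names (check_columns, _is_number) and the tiny sample dataset are made up for the illustration; they are not part of Chordalysis.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_columns(csv_text):
    """Return (single_valued, numeric_looking) column names.

    Assumes the first row of the CSV is a header.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    single_valued, numeric_looking = [], []
    for i, name in enumerate(header):
        values = {row[i] for row in data}
        if len(values) <= 1:
            single_valued.append(name)
        if values and all(_is_number(v) for v in values):
            numeric_looking.append(name)
    return single_valued, numeric_looking


# Hypothetical toy dataset: 'gender' has one value, 'age' is numerical.
sample = "gender,age,smoker\nF,31,yes\nF,45,no\nF,27,no\n"
single_valued, numeric_looking = check_columns(sample)
print("single-valued columns:", single_valued)
print("numeric-looking columns:", numeric_looking)
```

The numeric test is only a heuristic: a discrete variable coded as 0/1 will also be flagged, so look at the flagged columns before changing anything.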

You can have a look at the mushroom dataset for an example of a correctly formatted CSV.
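
For numerical variables, the equal-frequency discretization mentioned in the requirements could look like the sketch below. This is one simple way to do it and not something Chordalysis does for you. The function name equal_frequency_bins, the "bin" labels and the example ages are my own choices for the illustration.

```python
from bisect import bisect_right


def equal_frequency_bins(values, n_bins):
    """Map each numerical value to a label 'bin0', 'bin1', ...

    Cut points are taken from the sorted values so that each bin
    receives roughly the same number of values. Identical values
    always end up in the same bin.
    """
    ordered = sorted(values)
    cuts = [ordered[k * len(ordered) // n_bins] for k in range(1, n_bins)]
    return ["bin" + str(bisect_right(cuts, v)) for v in values]


# Hypothetical numerical column.
ages = [31, 45, 27, 62, 38, 50]
labels = equal_frequency_bins(ages, 3)
print(labels)
```

Because identical values share a bin, the bins are only approximately equal in size when the data contains many ties.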

4. Launch the software on the dataset

Open a Windows/Linux/MacOS terminal and navigate (cd command) to the folder in which you extracted Chordalysis.jar. Then, if you simply type:

java -jar Chordalysis.jar

the terminal will tell you how to use the program:

petitjean@xx:~$ java -jar Chordalysis.jar 
Usage:  java -Xmx1g -jar Chordalysis.jar dataFile pvalue imageOutputFile useGUI?
Example:    java -Xmx1g -jar Chordalysis.jar dataset.csv 0.05 graph.png false

Note:   '1g' means that you authorize 1GB of memory. 
Note:   It should be adjusted depending upon the size of your data set (mostly required to load the data set).

So, if you saved the file mushroom.csv in the same folder as Chordalysis.jar, you can call the program with:

java -Xmx1g -jar Chordalysis.jar mushroom.csv 0.05 mush-graph.png false
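
If you want to run Chordalysis on several datasets or p-values, it can help to build the command from a script. The sketch below only assembles the arguments in the order given by the usage message. The function name chordalysis_command and its parameters are hypothetical, and passing "true" for useGUI is my assumption (the usage message only shows "false").

```python
def chordalysis_command(data_file, p_value, image_file,
                        use_gui=False, memory="1g"):
    """Build the argument list: dataFile pvalue imageOutputFile useGUI?"""
    return [
        "java", "-Xmx" + memory, "-jar", "Chordalysis.jar",
        data_file, str(p_value), image_file,
        "true" if use_gui else "false",  # "true" is an assumption
    ]


cmd = chordalysis_command("mushroom.csv", 0.05, "mush-graph.png")
print(" ".join(cmd))

# To actually launch it (needs Java and Chordalysis.jar in the current folder):
# import subprocess
# subprocess.run(cmd, check=True)
```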

5. Analyse the results

5.1 Text results in the terminal

After launching the previous command, you should see something like this in the terminal:

The model selected is: (selected in 1701ms)
[cap-shape cap-surface bruises? ring-type class]
[cap-shape bruises? gill-spacing ring-type class]
[cap-shape bruises? gill-spacing stalk-root class]
[bruises? gill-spacing veil-color ring-type class]
[bruises? gill-spacing gill-color veil-color class]
[cap-shape gill-spacing ring-type spore-print-color class]
[bruises? gill-spacing veil-color habitat class]
[bruises? gill-size veil-color habitat class]
[bruises? odor gill-size veil-color class]
[bruises? gill-size stalk-surface-below-ring veil-color class]
[bruises? stalk-surface-above-ring stalk-surface-below-ring veil-color class]
[bruises? gill-size stalk-surface-below-ring ring-number class]
[cap-color bruises? gill-size stalk-surface-below-ring ring-number]
[cap-color bruises? gill-size stalk-shape ring-number]
[bruises? gill-spacing veil-color population habitat]
[bruises? gill-attachment stalk-shape ring-number]
[cap-shape bruises? stalk-color-above-ring class]
[cap-shape bruises? stalk-color-below-ring class]
[veil-type]

This is the selected model, written in the classical log-linear model notation.
Let's take an example: [cap-shape cap-surface bruises? ring-type class] means that there is a 5-way correlation between the variables cap-shape, cap-surface, bruises?, ring-type and class.
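
If you want to reuse the text output, each bracketed line can be read as one list of variables. Here is a small Python sketch of such a parser, assuming the output looks like the listing above and that variable names contain no spaces. The function name parse_model is made up for the illustration.

```python
def parse_model(text):
    """Turn the bracketed lines of the terminal output into lists of names."""
    cliques = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            cliques.append(line[1:-1].split())
    return cliques


# A few lines copied from the output above.
output = """The model selected is: (selected in 1701ms)
[cap-shape cap-surface bruises? ring-type class]
[bruises? gill-attachment stalk-shape ring-number]
[veil-type]"""

cliques = parse_model(output)
for clique in cliques:
    print(f"{len(clique)}-way: {', '.join(clique)}")
```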

5.2 Graphical model

It turns out that the models Chordalysis discovers are all graphical. This means that the previous model, for instance, can be represented as a graph: the nodes are the variables, and every multi-way correlation corresponds to a clique in the graph (a set of fully connected variables).
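
To make the link between cliques and the graph concrete, here is a minimal Python sketch that lists the edges implied by a few cliques: every pair of variables inside the same clique is connected. This only illustrates the idea and is not how Chordalysis draws its picture. The function name clique_edges is hypothetical, and the cliques are copied from the text output above.

```python
from itertools import combinations


def clique_edges(cliques):
    """Return the set of undirected edges implied by the cliques.

    Each edge is a pair of variable names in alphabetical order,
    so an edge shared by two cliques is only counted once.
    """
    edges = set()
    for clique in cliques:
        for a, b in combinations(sorted(clique), 2):
            edges.add((a, b))
    return edges


cliques = [
    ["cap-shape", "bruises?", "stalk-color-above-ring", "class"],
    ["cap-shape", "bruises?", "stalk-color-below-ring", "class"],
    ["veil-type"],
]
edges = clique_edges(cliques)
for a, b in sorted(edges):
    print(a, "--", b)
```

A clique with a single variable, such as [veil-type], produces no edge: that variable is an isolated node in the graph.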

In the previous command, we asked for the graphical representation of the model to be saved as a file named mush-graph.png. You should now have this file in the same folder:
[Image: Model selected for the Mushroom dataset]


For any questions/problems/feedback, do not hesitate to contact me at petitjean(at)tiny-clues(dot)eu