From: Matthias Ekman <matthias.ekman@gm...>  2012-08-29 09:03:12

Hi,

I was trying to visualize the tree structure obtained from ward, but I don't quite understand the data format of the children_ attribute. The documentation reads:

    children_ : array-like, shape = [n_nodes, 2]
        List of the children of each nodes. Leaves of the tree do not appear.

So it is not a "left-child right-sibling representation", right? Are there any pointers to that specific format or, even better, does anyone have some advice on how to visualize the tree with ``scipy.cluster.hierarchy.dendrogram`` or ``graphviz``?

As a second, but slightly related question: is it possible to use ward on an n_features x n_features matrix (e.g. an adjacency matrix)? It works, but I wasn't sure whether these results can be considered meaningful.

Help is much appreciated. Thanks in advance,
Matthias
From: Gael Varoquaux <gael.varoquaux@no...>  2012-08-29 13:05:21

On Wed, Aug 29, 2012 at 11:03:01AM +0200, Matthias Ekman wrote:
> I was trying to visualize the tree structure obtained from ward, but I
> don't quite understand the data format of the children_ attribute.
> The documentation reads:
> children_ array-like, shape = [n_nodes, 2] List of the children of
> each nodes. Leaves of the tree do not appear.
> So it is not a "left-child right-sibling representation", right?

I am not sure, this is not a term that I am familiar with. Keep in mind that Ward gives a binary tree, so it would be more a "left-child right-child representation".

This matrix simply lists the pairs of children for each node, where a node is denoted as an integer index. It does not include the terminal nodes (original samples) as they have no children.

> Are there any pointers to that specific format or even better does
> anyone have some advice on how to visualize the tree with
> ``scipy.cluster.hierarchy.dendrogram`` or ``graphviz``?

I couldn't figure out the structure that scipy.cluster.hierarchy.dendrogram uses. That said, it should be possible to adapt our representation to something usable by dendrogram, and I'd love to merge in an example showing how to do this.

> As a second, but slightly related question, is it possible to use the
> ward on a n_features x n_features matrix (e.g. an adjacency matrix)?
> It works, but I wasn't sure whether these results can be considered as
> meaningful.

Ward does not work on adjacency matrices because it is specific to the Euclidean distance. Other hierarchical clustering methods such as complete linkage would work. Complete linkage is not implemented in the scikit, but it only requires a simple modification to the code doing Ward. I need to find time to do it (TM).

HTH,

Gael
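The description of children_ above (one row per non-leaf node, each row holding the integer indices of its two children) can be illustrated with a small sketch. It assumes a numbering convention that the message does not spell out: indices below n_samples are the original samples, and row i describes node n_samples + i. The toy array and the helper names leaves_under and describe_tree are hypothetical, made up for illustration only.

```python
# Hypothetical toy tree: 4 samples (leaves 0..3) and the 3 merges that a
# binary tree over them needs. The values are invented for illustration.
n_samples = 4
children = [
    (0, 1),  # row 0 -> node 4, merging leaves 0 and 1
    (2, 3),  # row 1 -> node 5, merging leaves 2 and 3
    (4, 5),  # row 2 -> node 6 (root), merging nodes 4 and 5
]


def leaves_under(node, children, n_samples):
    """Return the sorted original sample indices found below `node`."""
    if node < n_samples:
        # Assumed convention: small indices are the original samples.
        return [node]
    left, right = children[node - n_samples]
    return sorted(
        leaves_under(left, children, n_samples)
        + leaves_under(right, children, n_samples)
    )


def describe_tree(children, n_samples):
    """Build one text line per non-leaf node of the tree."""
    lines = []
    for row, (left, right) in enumerate(children):
        node = n_samples + row
        leaves = leaves_under(node, children, n_samples)
        lines.append(f"node {node}: children ({left}, {right}), leaves {leaves}")
    return lines


for line in describe_tree(children, n_samples):
    print(line)
```

Under that assumed convention, walking the rows in order is enough to recover which samples end up in each internal node.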
From: Matthias Ekman <matthias.ekman@gm...>  2012-08-31 09:46:44

>> So it is not a "left-child right-sibling representation", right?
>
> I am not sure, this is not a term that I am familiar with. Keep in mind
> that Ward gives a binary tree, so it would be more a "left-child right-child
> representation".
>
> This matrix simply lists the pairs of children for each node, where a node
> is denoted as an integer index. It does not include the terminal nodes
> (original samples) as they have no children.

Thanks Gael for clearing that up.

>> Are there any pointers to that specific format or even better does
>> anyone have some advice on how to visualize the tree with
>> ``scipy.cluster.hierarchy.dendrogram`` or ``graphviz``?
>
> I couldn't figure out the structure that
> scipy.cluster.hierarchy.dendrogram uses. That said, it should be possible
> to adapt our representation to something usable by dendrogram, and I'd
> love to merge in an example showing how to do this.

The scipy dendrogram requires the linkage format as returned by ``scipy.cluster.hierarchy.linkage``:

"A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster."

I was hoping to use an existing dendrogram function to plot the tree, as they usually offer some other handy features, such as truncating the leaves. But as the sklearn ward does not return the distances, I don't see how the ``children_`` format can be converted to the (n-1, 4) linkage format. I'll make sure to post a link if I come across a good solution for the binary ward tree.

>> As a second, but slightly related question, is it possible to use the
>> ward on a n_features x n_features matrix (e.g. an adjacency matrix)?
>> It works, but I wasn't sure whether these results can be considered as
>> meaningful.
>
> Ward does not work on adjacency matrices because it is specific to the
> Euclidean distance. Other hierarchical clustering methods such as
> complete linkage would work.

Right, of course, that makes perfect sense.

Thanks again,
Matthias

> _______________________________________________
> Scikit-learn-general mailing list
> Scikitlearngeneral@...
> https://lists.sourceforge.net/lists/listinfo/scikitlearngeneral
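To make the quoted linkage format concrete, here is a minimal sketch that reads a hand-written linkage matrix row by row. The matrix Z below is not the output of any real clustering run: its distances are invented, and the helper name linkage_members is hypothetical.

```python
import numpy as np

# Hand-written linkage matrix for n = 4 observations.
# Columns: child index, child index, distance, number of observations.
# The distances are invented for illustration.
Z = np.array(
    [
        [0, 1, 0.5, 2],  # iteration 0: clusters 0 and 1 form cluster 4
        [2, 3, 0.8, 2],  # iteration 1: clusters 2 and 3 form cluster 5
        [4, 5, 2.0, 4],  # iteration 2: clusters 4 and 5 form cluster 6
    ],
    dtype=float,
)


def linkage_members(Z):
    """Map each cluster index to the original observations it contains."""
    n = Z.shape[0] + 1  # a full tree over n observations has n - 1 merges
    members = {i: [i] for i in range(n)}
    for i, (a, b, _dist, count) in enumerate(Z):
        merged = members[int(a)] + members[int(b)]
        if len(merged) != int(count):
            raise ValueError(f"row {i}: count column does not match the merge")
        members[n + i] = merged  # the i-th merge creates cluster n + i
    return members


members = linkage_members(Z)
for i in range(Z.shape[0]):
    n = Z.shape[0] + 1
    print(f"cluster {n + i}: observations {members[n + i]}, distance {Z[i, 2]}")
```

The first, second and fourth columns only need the merge structure; the third column is the one piece of information that ``children_`` alone does not provide.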
From: Gael Varoquaux <gael.varoquaux@no...>  2012-08-31 11:43:23

On Fri, Aug 31, 2012 at 11:46:33AM +0200, Matthias Ekman wrote:
> I was hoping to use an existing dendrogram function to plot the tree,
> as they usually offer some other handy features as to truncate the
> leaves and so on. But as the sklearn ward does not return the
> distances I don't see how the ``children_`` format can be converted to
> the (n-1, 4) linkage format.

I thought so too, but you can always make it up. It won't contain useful information, but would still give a display of the merge structure. The children are arranged in the order in which they are paired in the scikit-learn structure.

HTH,

G
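The suggestion to make up the distances could be sketched roughly as follows. This is one possible reading, not code from scikit-learn or scipy: the function name children_to_linkage is hypothetical, the rank of each merge is used as a stand-in distance, and the sketch assumes a full tree in which row i of the children array describes node n_samples + i, with rows stored in merge order.

```python
import numpy as np


def children_to_linkage(children, n_samples):
    """Build an (n_samples - 1, 4) linkage-style matrix from a children array.

    The distance column is made up: it is just the rank of the merge, so
    the heights of a resulting dendrogram carry no real information.
    """
    children = np.asarray(children, dtype=int)
    n_merges = children.shape[0]
    # Number of original samples under each node; every leaf counts as 1.
    counts = np.ones(n_samples + n_merges)
    Z = np.zeros((n_merges, 4))
    for i, (left, right) in enumerate(children):
        # Assumes both children were created before this merge.
        counts[n_samples + i] = counts[left] + counts[right]
        fake_distance = i + 1  # increasing, so later merges are drawn higher
        Z[i] = [left, right, fake_distance, counts[n_samples + i]]
    return Z


# Hypothetical toy children array for 4 samples, invented for illustration.
toy_children = [(0, 1), (2, 3), (4, 5)]
Z = children_to_linkage(toy_children, n_samples=4)
print(Z)
```

A matrix built this way could then be handed to ``scipy.cluster.hierarchy.dendrogram`` to display the merge structure, keeping in mind that the branch heights only reflect the order of the merges.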