k means cluster help

Help
Xu Xiaohui
2012-03-11
2013-05-20
  • Xu Xiaohui
    2012-03-11

    I used MeV to cluster 6337 genes with the k-means/k-median clustering method. The parameters were set as follows: sample selection: cluster genes; distance metric selection: current metric: Pearson correlation; parameters: calculate K-Means, number of clusters: 12, maximum iterations: 50. But when I ran the clustering several times on the same dataset, the results were completely different. For example, the first run gave (cluster 1: 262 genes; cluster 2: 737; cluster 3: 256; cluster 4: 193; cluster 5: 559; cluster 6: 381; cluster 7: 405; cluster 8: 341; cluster 9: 787; cluster 10: 728; cluster 11: 763; cluster 12: 925), and the second run gave (cluster 1: 1240 genes; cluster 2: 828; cluster 3: 425; cluster 4: 158; cluster 5: 311; cluster 6: 185; cluster 7: 725; cluster 8: 613; cluster 9: 411; cluster 10: 192; cluster 11: 443; cluster 12: 806). I don't understand why the results differ when I use the same dataset and parameters. Can you explain the reason? Did I make a mistake somewhere in the process?

     
  • John Braisted
    2012-03-11

    K-means starts by randomly assigning genes to your K clusters, so the results will vary from run to run. First, suppose you had a particularly distinct cluster: a set of genes that are highly correlated with each other and whose centroid (mean expression pattern) is very dissimilar to the other cluster centroids. You should see that cluster repeat in separate runs, though it may appear under a different cluster index. I often right click in a cluster of interest, select 'store cluster' and then select a color to mark those genes. Subsequent runs of any clustering method will help to show whether the group sticks together or not (by looking at whether the color stays intact). The cluster manager also has a cluster intersection option to check which genes are in common between 2 or more selected stored clusters.
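
The same check can be done outside the tool. The following is only a minimal sketch, assuming each run has been exported as a mapping from gene id to cluster index (the gene ids, the `best_matches` helper and the use of Jaccard overlap are illustrative assumptions, not part of MeV). For each cluster in one run it finds the best-overlapping cluster in the other run, so a stable cluster shows up with a score near 1 even when its index has changed.

```python
from collections import defaultdict

def clusters_from_labels(labels):
    """Group gene ids by cluster index: {cluster_index: set(gene_ids)}."""
    groups = defaultdict(set)
    for gene, label in labels.items():
        groups[label].add(gene)
    return dict(groups)

def best_matches(run_a, run_b):
    """For each cluster of run_a, return (best run_b cluster, Jaccard overlap)."""
    a = clusters_from_labels(run_a)
    b = clusters_from_labels(run_b)
    result = {}
    for label_a, genes_a in a.items():
        best_label, best_score = None, -1.0
        for label_b, genes_b in b.items():
            # Jaccard overlap: shared genes divided by genes in either cluster.
            score = len(genes_a & genes_b) / len(genes_a | genes_b)
            if score > best_score:
                best_label, best_score = label_b, score
        result[label_a] = (best_label, best_score)
    return result

# Toy example with hypothetical gene ids; real runs would have thousands of genes.
run1 = {"g1": 1, "g2": 1, "g3": 1, "g4": 2, "g5": 2, "g6": 3}
run2 = {"g1": 3, "g2": 3, "g3": 3, "g4": 1, "g5": 2, "g6": 2}

matches = best_matches(run1, run2)
for label, (other, score) in sorted(matches.items()):
    print(f"run 1 cluster {label} ~ run 2 cluster {other} (Jaccard {score:.2f})")
```

In this toy data, cluster 1 of the first run reappears intact as cluster 3 of the second run, while the other two clusters only partly overlap.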

    In practice what you might see is that many clusters are not well defined and essentially represent patterns that are not very coherent and don't relate to the various conditions under study. This is just the noise in the system where the genes may not be tightly regulated in accordance with the conditions under study. Other clusters should show very distinct patterns and these are the clusters that should be fairly reproducible from run to run.

    One piece of information that would be helpful is to know how many samples you have. Occasionally people will have very few samples and therefore the number of unique patterns possible is limited.

    The last question is whether the selected K is appropriate. If concordance from run to run is low, K may be too high. In that case there might be perhaps 8 inherent clusters, but when you try to make 12, some clusters just randomly swap genes. Try using the FOM method to perform multiple runs of K-means. The reported graph shows how well the genes fit the constructed clusters at each K (from 1 to your selected upper limit; I often stop at 20). You might find that most of the benefit of increasing K is found between K of 5 and 6 (for instance). Choosing a good K is somewhat arbitrary, but it makes a difference. The main goal is to find a few clusters that are tight in terms of variance, distinct from other clusters, and related to your biological system or conditions.
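
To make the idea of a figure-of-merit curve concrete, here is a rough sketch under stated assumptions: it uses a leave-one-condition-out definition (cluster on all conditions but one, then measure how tightly the held-out condition agrees within each cluster), plain Euclidean k-means, and toy data. MeV's own FOM computation and distance metric may differ in detail, and the names `kmeans_labels` and `fom` are illustrative only.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(rows, k, max_iter=50, seed=0):
    """Minimal k-means with random initial assignment; returns one label per row."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in rows]
    centroids = [list(rng.choice(rows)) for _ in range(k)]
    for _ in range(max_iter):
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def fom(data, k, seed=0):
    """Aggregate leave-one-condition-out figure of merit (lower means tighter clusters)."""
    n, m = len(data), len(data[0])
    total = 0.0
    for e in range(m):
        # Cluster on every condition except e.
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = kmeans_labels(reduced, k, seed=seed)
        # Measure within-cluster spread of the held-out condition e.
        sq_err = 0.0
        for c in set(labels):
            held = [data[i][e] for i in range(n) if labels[i] == c]
            mean = sum(held) / len(held)
            sq_err += sum((v - mean) ** 2 for v in held)
        total += math.sqrt(sq_err / n)
    return total

# Toy data: 6 genes x 4 conditions with two obvious patterns (hypothetical values).
data = [
    [1.0, 1.2, 5.0, 5.1],
    [0.9, 1.1, 5.2, 4.9],
    [1.1, 1.0, 4.8, 5.0],
    [5.0, 5.1, 1.0, 1.2],
    [5.2, 4.9, 0.9, 1.1],
    [4.8, 5.0, 1.1, 1.0],
]

curve = {k: fom(data, k) for k in range(1, 5)}
for k, value in curve.items():
    print(k, round(value, 3))
```

Reading the printed curve works the same way as reading the graph: the K after which the value stops dropping noticeably is a reasonable candidate.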

    John

     
  • Xu Xiaohui
    2012-03-13

    Thanks for your help. The 6337 genes I want to cluster were derived from four samples. Based on your suggestions, I used the FOM method, and the reported graph shows that most of the benefit of increasing K is reached when K is 6. But when I used the k-means method to group my data into six clusters, the results still varied from run to run. I also changed the number of clusters from 6 to 16, and the results still varied. I don't know how to deal with this. Can I use any of these results to continue my study? Thanks again.

     
  • John Braisted
    2012-03-14

    Does your KMC run complete before your 50th iteration?  Look in the result tree under the KMC result for some description like 'Converged'.  This means that the KMC run came to an end result where each gene was in an optimal position. If not, try to increase the number of iterations. It might be that the KMC hadn't completed. 
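
For readers who want to see what "converged" means here, the following is a minimal sketch and not MeV's KMC code. It assumes Euclidean distance rather than Pearson correlation, and the function name `kmeans` and its return values are illustrative. The run counts as converged when one full pass leaves every gene in the cluster it was already in; otherwise it stops at the iteration limit.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iter=50, seed=None):
    """K-means with random initial assignment.

    Returns (labels, iterations_used, converged).
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in rows]          # random start
    centroids = [list(rng.choice(rows)) for _ in range(k)]
    for iteration in range(1, max_iter + 1):
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment step: move every row to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            return labels, iteration, True   # nothing moved: converged
        labels = new_labels
    return labels, max_iter, False           # hit the iteration limit

# Toy data: 8 genes x 4 samples (hypothetical values).
rows = [
    [0.1, 0.3, 2.9, 3.1], [0.2, 0.1, 3.2, 2.8],
    [0.0, 0.4, 3.0, 3.3], [0.3, 0.2, 2.7, 3.0],
    [3.1, 2.9, 0.2, 0.1], [2.8, 3.2, 0.1, 0.3],
    [3.0, 3.3, 0.4, 0.0], [2.7, 3.0, 0.3, 0.2],
]

labels, used, converged = kmeans(rows, k=2, max_iter=50, seed=1)
print("converged:", converged, "after", used, "iterations")
```

If `converged` comes back false, raising `max_iter` is the analogue of increasing the maximum iterations setting.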

    I wouldn't move forward with any gene set that didn't cluster consistently.  I wouldn't hold the clustering algorithm to keeping all clusters consistent. You should see a subset of clusters that show distinct patterns. If a particular pattern would be interesting, like samples 1 and 2 lower than samples 3 and 4, then a more targeted approach like template matching (PTM, under stats) might help you pull genes with that pattern out.
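
The idea behind template matching can be sketched in a few lines. This is an illustration under assumptions, not MeV's PTM implementation: it scores each gene by its Pearson correlation with a hand-made template and keeps genes above a correlation cutoff (the gene names, the `match_template` helper and the 0.9 cutoff are all hypothetical; a real analysis would usually threshold on significance as well).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def match_template(profiles, template, min_r=0.9):
    """Return {gene: r} for genes whose correlation with the template is >= min_r."""
    hits = {}
    for gene, profile in profiles.items():
        r = pearson(profile, template)
        if r >= min_r:
            hits[gene] = r
    return hits

# Template: samples 1 and 2 lower than samples 3 and 4.
template = [0.0, 0.0, 1.0, 1.0]

# Hypothetical expression profiles across the four samples.
profiles = {
    "geneA": [1.0, 1.2, 3.0, 3.1],   # follows the template
    "geneB": [3.0, 2.9, 1.0, 1.1],   # opposite pattern
    "geneC": [2.0, 2.0, 2.0, 2.0],   # flat
    "geneD": [1.0, 3.0, 1.1, 2.9],   # unrelated pattern
}

hits = match_template(profiles, template)
print(sorted(hits))
```

Only the gene that rises from the first pair of samples to the second pair passes the cutoff in this toy set.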

    The type of platform (one dye vs. two dye) might also impact the choice of distance metric. If you have a common reference design and a two-dye system, then Euclidean distance might help to show the big responders.

    Sometimes SOM performs well for me. Choose the topology (X and Y dimensions) to match the number of clusters you want. Make sure you run a couple million iterations (it's pretty fast). This ensures that all genes have an opportunity to adapt the network produced. I use a bubble network with radius < 1. A radius < 1 ensures that only the closest node centroid is adapted, and it generally produces a better output for our data type.
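
A small sketch may help show why a bubble neighborhood with radius below 1 adapts only the closest node. It is written under several assumptions that are not taken from MeV: Euclidean distance, node weights initialized from random rows, a linearly decaying learning rate, and illustrative parameter names (`alpha`, `radius`, `n_x`, `n_y`). On an integer grid every neighbor is at distance 1 or more from the winner, so with `radius=0.9` the bubble contains the winner alone.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(rows, n_x, n_y, iterations=20000, alpha=0.05, radius=0.9, seed=0):
    """Train a small SOM with a bubble neighborhood; returns {(x, y): weight_vector}."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(n_x) for y in range(n_y)]
    weights = {node: list(rng.choice(rows)) for node in nodes}
    for t in range(iterations):
        row = rng.choice(rows)                      # present one random gene
        rate = alpha * (1.0 - t / iterations)       # assumed linear decay
        winner = min(nodes, key=lambda nd: sq_dist(row, weights[nd]))
        for nd in nodes:
            grid_dist = math.hypot(nd[0] - winner[0], nd[1] - winner[1])
            if grid_dist <= radius:
                # Bubble neighborhood: full-strength update inside the radius,
                # no update outside it.
                w = weights[nd]
                for j, value in enumerate(row):
                    w[j] += rate * (value - w[j])
    return weights

# Toy data: 8 genes x 4 samples (hypothetical values).
rows = [
    [0.1, 0.3, 2.9, 3.1], [0.2, 0.1, 3.2, 2.8],
    [0.0, 0.4, 3.0, 3.3], [0.3, 0.2, 2.7, 3.0],
    [3.1, 2.9, 0.2, 0.1], [2.8, 3.2, 0.1, 0.3],
    [3.0, 3.3, 0.4, 0.0], [2.7, 3.0, 0.3, 0.2],
]

weights = train_som(rows, n_x=3, n_y=2, iterations=5000)
for node, w in sorted(weights.items()):
    print(node, [round(v, 2) for v in w])
```

The 3 x 2 topology gives six nodes, matching a target of six clusters; the iteration count is kept small here only so the toy example runs quickly.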

    As I mentioned previously, mark clusters of interest by right clicking and selecting 'store cluster'.  Use the color to roughly assess concordance.