Cluster tools questions!

Wenxing Ye
2009-07-28
2013-06-07
  • Wenxing Ye

    Wenxing Ye - 2009-07-28

    What do the outputs of OfflineCluster.exe and ClusterApp.exe
    mean, and how do I read them?

    For example, I have the following parameters:
    <parameters>
    <index>test</index>
    <clusterIndex>mycluster</clusterIndex> // I assume this is the output, so I gave it an arbitrary name.
    <clusterDBType>flatfile</clusterDBType>
    </parameters>
    When I ran OfflineCluster I got:
    Using kmeans on 100 documents...
    1(1):
    2(2):
    Using bisecting kmeans on 100 documents...
    1(1):
    2(2):

    The output is empty. I also tried setting numParts to 10. It just printed 1(1), ..., 10(10), still with empty document lists.

    When I ran ClusterApp, I got:
    1 1
    2 1
    2 0.288877
    2 0.542838
    3 1
    4 1
    5 1
    6 1
    7 1
    8 1
    9 1
    10 1
    11 1
    12 1
    5 0.512347
    13 1
    14 1
    5 0.58153
    15 1
    ...
    I traced through the source code. It seems the output is the
    document ID followed by a score. But my index has 100
    documents, and ClusterApp only lists IDs up to 72,
    and "2" and "5" are repeated several times.

    Also, when I set numParts to 10, I got the same results. Where can I see my 10 clusters?

     
    • Wenxing Ye

      Wenxing Ye - 2009-07-29

      Thanks for your explanation. It helps.

       
    • David Fisher

      David Fisher - 2009-07-28

      Your index appears to be missing external document ids (the docno element).

      The output for ClusterApp should be:

      docid clusterid score

      e.g., on cacm (the Linux version is named Cluster):

      indri6:/usr/ind1/tmp2/dfisher/src/sourceforge/lemur/data/test-cluster> ../../app/obj/Cluster cluster.parm
      Trying to open toc: pindex.key
      Trying to open doc manager ids file: pindex.dm
      Load index complete.
      1 1 1
      2 2 1
      3 2 0.0733725
      4 2 0.103137
      5 2 0.145768
      6 2 0.191442
      7 2 0.243982
      8 2 0.344619
      9 2 0.192658
      10 2 0.311194

      with one line for each document in the index. This is a run-once program: you must delete the cluster database files (mycluster.*) before you can run it again.
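
A minimal sketch of reading that output programmatically, assuming each result line has exactly the three whitespace-separated fields described above (docid, clusterid, score) and that any other line is log chatter. The function name `parse_cluster_app` is made up for illustration; it is not part of Lemur.

```python
from collections import defaultdict


def parse_cluster_app(lines):
    """Group 'docid clusterid score' lines by cluster id.

    Lines that do not have three fields with a numeric score
    (e.g. "Load index complete.") are skipped.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            score = float(parts[2])
        except ValueError:
            continue  # three words, but not a result line
        doc_id, cluster_id = parts[0], parts[1]
        clusters[cluster_id].append((doc_id, score))
    return dict(clusters)


# First lines of the cacm run shown above.
sample = """Trying to open toc: pindex.key
Load index complete.
1 1 1
2 2 1
3 2 0.0733725
4 2 0.103137""".splitlines()

grouped = parse_cluster_app(sample)
for cluster_id, members in grouped.items():
    print(cluster_id, [doc_id for doc_id, _ in members])
```

If the index has no external document ids, the first column is missing and every line has only two fields, so this sketch would skip all of them. That is a quick way to spot the problem described at the top of this reply.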

      OfflineCluster should output

      clusterName (clusterID): docid+

      where the clusterName will be equal to the clusterID, e.g., again on cacm:

      indri6:/usr/ind1/tmp2/dfisher/src/sourceforge/lemur/data/test-cluster> ../../app/obj/OfflineCluster cluster.parm
      Trying to open toc: pindex.key
      Trying to open doc manager ids file: pindex.dm
      Load index complete.
      Using kmeans on 100 documents...
      1(1): 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 45 50 52 54 65 66 69 73 75 78 79 88 92 94 95 97 99 100
      2(2): 11 37 38 39 40 41 42 43 44 46 47 48 49 51 53 55 56 57 58 59 60 61 62 63 64 67 68 70 71 72 74 76 77 80 81 82 83 84 85 86 87 89 90 91 93 96 98
      Using bisecting kmeans on 100 documents...
      1(1): 19 23 36 40 49 50 54 56 57 60 61 62 65 66 73 76
      2(2): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39 41 42 43 44 45 46 47 48 51 52 53 55 58 59 63 64 67 68 69 70 71 72 74 75 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
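
The OfflineCluster listing can be read the same way. This is only a sketch under the assumption that each run starts with a "Using ..." line and each cluster line looks like `clusterName(clusterID): docid+`, as in the output above; `parse_offline_cluster` is a hypothetical helper, not a Lemur function.

```python
import re

# clusterName(clusterID): followed by zero or more document ids
CLUSTER_LINE = re.compile(r"^\s*(\S+)\((\d+)\):\s*(.*)$")


def parse_offline_cluster(lines):
    """Return a list of (header, {cluster_id: [docid, ...]}) per run."""
    runs = []
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("Using "):
            current = {}
            runs.append((stripped, current))
            continue
        match = CLUSTER_LINE.match(line)
        if match and current is not None:
            _name, cluster_id, docs = match.groups()
            current[cluster_id] = docs.split()
    return runs


sample = """Load index complete.
Using kmeans on 100 documents...
1(1): 1 2 3
2(2): 11 37
Using bisecting kmeans on 100 documents...
1(1):
2(2): 19""".splitlines()

runs = parse_offline_cluster(sample)
for header, clusters in runs:
    print(header)
    for cluster_id, doc_ids in clusters.items():
        print("  cluster", cluster_id, "has", len(doc_ids), "documents")
```

A cluster line with nothing after the colon comes back as an empty list, which is what the output in the original question would produce.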

       
  • mina

    mina - 2013-06-06

    Hi, I would like to know whether anyone here knows how to calculate the distance between each pair of clusters, which type of clustering algorithm is used, and the number of clusters. Thanks

     
    • kakijan

      kakijan - 2013-06-06

      Hi,

      1. Calculate the centroid of each cluster, i.e., the average of all elements in the cluster. Something like: Center = ( 1 / NumberOfElementsInCluster ) * Sum_i (vector_i)

      2. The distance between two clusters is the Euclidean distance between their centroids X and Y, i.e., d = Sqrt( Sum_i ( X_i - Y_i )^2 )

      3. For the number of clusters, try a few different values. Clustering is unsupervised, so the number of clusters is something you have to choose; you can help it along by experimenting. Or use an algorithm in which EM is used to decide when to stop creating clusters.

      4. As for clustering algorithm names, K-Means is one, but you should really go with K-Means++.
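
Steps 1 and 2 can be sketched in a few lines of plain Python. This only illustrates the recipe above on made-up 2-D vectors; it is not how the Lemur cluster tools compute similarity internally, and the names `centroid` and `euclidean` are chosen here for illustration.

```python
import math


def centroid(vectors):
    """Component-wise average of equal-length vectors (step 1)."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors (step 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical clusters: cluster id -> list of document vectors.
clusters = {
    "1": [[0.0, 0.0], [2.0, 0.0]],
    "2": [[4.0, 4.0], [4.0, 6.0]],
}

centers = {cid: centroid(vectors) for cid, vectors in clusters.items()}
d = euclidean(centers["1"], centers["2"])
print(centers, d)
```

With real documents the vectors would be term-weight vectors over the vocabulary, so in practice you would likely want a sparse representation rather than plain lists.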


      Cheers,

      Omar.


       
