Re: [C2-devel] Fwd: Rule of thumb for setting Lingo desired base cluster size?
From: cmg <jto...@gm...> - 2014-07-11 00:14:03
Hi Stanislaw,

Thank you very much, this is very helpful. I do have a follow-up question that I hope you can help with: is there any relationship between the number of clusters and the specificity of the tags? Intuitively, I would think that more clusters would result in fewer documents per cluster and therefore more specific labels, but I'm not sure that is true unless the documents in a cluster use very similar terms.

Thanks again!
-greg

On Jul 10, 2014 6:16 AM, "Stanislaw Osinski [via Carrot2 Users and Developers Forum]" <ml-...@n2...> wrote:

> Hi Greg,
>
> > I am experimenting to see if cluster labels might be able to serve as tags from a set of search results. Is there a rule of thumb for setting the Lingo algorithm desired cluster count base?
>
> Here's the bit of code that converts the cluster count base to the actual number of clusters Lingo will attempt to create:
>
> https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/LingoClusteringAlgorithm.java#L262
>
> If you'd like to directly specify the number of clusters, just set the cluster count base to the inverse of this function. Ultimately, the final number of clusters may be smaller than the result from the above method due to the removal and merging of overlapping clusters.
>
> > Also, what is a cluster's score weight indicating? Is a higher score indicating in some way how strongly a label might relate to the documents in a cluster?
>
> Technically, the label score is the cosine similarity between the label text and the corresponding column in the dimensionality-reduced VSM matrix. This translates to how "certain" Lingo is that this label is a "strong" topic in the input data. Unfortunately, the score does not have a direct connection to the strength of the relationship between the label and the cluster's documents. The latter relationship is very simple: the cluster contains those documents that contain all of the cluster label's words.
>
> Stanislaw

--
View this message in context: http://carrot2-users-and-developers-forum.607571.n2.nabble.com/Rule-of-thumb-for-setting-Lingo-desired-base-cluster-size-tp7578573p7578575.html
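
For readers who want to try the "inverse of this function" advice from the reply, here is a rough sketch. It assumes the conversion at the linked line has the form min(floor(base / 10 * sqrt(documentCount)), documentCount); that formula is written from memory and should be checked against the linked source before relying on it. The class ClusterCountSketch and the helper baseForClusterCount are hypothetical names used only for illustration and are not part of Carrot2.

```java
final class ClusterCountSketch {
    /**
     * Assumed form of the conversion from cluster count base to the number
     * of clusters Lingo attempts to create. Verify against the linked source.
     */
    public static int computeClusterCount(int desiredClusterCountBase, int documentCount) {
        return Math.min(
            (int) ((desiredClusterCountBase / 10.0) * Math.sqrt(documentCount)),
            documentCount);
    }

    /**
     * Hypothetical helper: the smallest base for which the assumed formula
     * yields at least targetClusters (capped at documentCount).
     */
    public static int baseForClusterCount(int targetClusters, int documentCount) {
        int target = Math.min(targetClusters, documentCount);
        int base = (int) Math.ceil(10.0 * target / Math.sqrt(documentCount));
        // Guard against floating-point rounding leaving us one step short.
        while (computeClusterCount(base, documentCount) < target) {
            base++;
        }
        return base;
    }

    public static void main(String[] args) {
        int documents = 100;
        int base = baseForClusterCount(15, documents);
        System.out.println("base=" + base
            + " -> clusters=" + computeClusterCount(base, documents));
    }
}
```

As the reply notes, the number computed this way is only what Lingo attempts; the final count can be lower once overlapping clusters are removed or merged. The attribute may also have an allowed range, so a computed base is not guaranteed to be accepted.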
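
The two relationships described in the reply (label score as a cosine similarity, and cluster membership as "document contains all of the label's words") can be illustrated with a small sketch. This is not Lingo's implementation: the vectors are plain arrays standing in for a label vector and a column of the reduced matrix, and the word matching uses naive lowercase tokenization, whereas the real algorithm has its own preprocessing. All names here (LabelScoreSketch, cosine, words, assign) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LabelScoreSketch {
    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Naive tokenization (an assumption): lowercase, split on non-alphanumerics. */
    public static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Indices of the documents that contain every word of the label. */
    public static List<Integer> assign(String label, List<String> documents) {
        Set<String> labelWords = words(label);
        List<Integer> members = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            if (words(documents.get(i)).containsAll(labelWords)) {
                members.add(i);
            }
        }
        return members;
    }
}
```

The sketch makes the point of the reply visible: cosine never looks at the documents assigned to the cluster, and assign never looks at the score, so a high-scoring label says nothing by itself about how tightly its documents relate to it.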