
Statistics on KH coder

2012-04-18
2013-05-10
  • My friend recently introduced KH Coder to me. Good software, I must say.
    However, the thing that strikes me the most is the use of statistics in KH
    Coder.
    Do I have to be an expert to use KH Coder? If not, how do I go about explaining
    the output of KH Coder, such as the co-occurrence network of words, where words
    are filtered by term frequency and edges are filtered using Jaccard, etc.?
    Thank you in advance if you could clarify this matter.

     
  • HIGUCHI Koichi
    2012-04-18

    Hello. Thank you for your post!

    The reference manual of KH Coder contains some short explanations of the
    statistics. But right now it is written only in Japanese. The English
    version is not ready yet. I am sorry for any inconvenience this may cause.

    In the meantime, please ask questions here. I will try to respond as soon as
    possible. And the Q&A here will help me write the English manual.

    The procedure for making a co-occurrence network of words is as follows:

    1. Select frequently occurring words by TF (term frequency), DF (document
      frequency) and / or POS (part of speech)
      (You can change these settings in the GUI)
    2. Calculate co-occurrence between the selected words using the Jaccard index
    3. The strongest co-occurrences, the top 60 pairs, are drawn as lines / edges.
      (You can change this setting in the GUI)
    4. The layout is determined by the Fruchterman-Reingold method

    As for the Jaccard index, the Fruchterman-Reingold method, etc., googling these
    terms may give you some clues. If it doesn't, feel free to ask here.
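
    Steps 1 to 3 of the procedure can be sketched in a few lines of Python. This is
    only an illustration on a toy corpus, not KH Coder's actual code: the names
    MIN_DF, TOP_PAIRS, document_sets and strongest_edges are made up for the
    example, words are selected by DF only, and the layout step (4) is left out.

```python
from itertools import combinations

# Toy corpus: each document is reduced to the set of words it contains.
docs = [
    {"student", "write", "essay"},
    {"student", "write", "report"},
    {"student", "read", "essay"},
    {"teacher", "read", "report"},
    {"teacher", "write", "report"},
]

MIN_DF = 2     # hypothetical threshold: keep words found in at least 2 documents
TOP_PAIRS = 3  # hypothetical; the post above mentions 60 as the default


def document_sets(docs):
    """Map each word to the set of document indices in which it occurs."""
    index = {}
    for i, words in enumerate(docs):
        for w in words:
            index.setdefault(w, set()).add(i)
    return index


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def strongest_edges(docs, min_df, top_pairs):
    index = document_sets(docs)
    # Step 1: select words by document frequency.
    selected = sorted(w for w, d in index.items() if len(d) >= min_df)
    # Step 2: Jaccard index for every pair of selected words.
    scored = [(jaccard(index[a], index[b]), a, b)
              for a, b in combinations(selected, 2)]
    # Step 3: keep only the strongest pairs; these become the edges.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_pairs]


edges = strongest_edges(docs, MIN_DF, TOP_PAIRS)
for score, a, b in edges:
    print(f"{a} - {b}: {score:.2f}")
```

    Every other pair is simply not drawn, which is why the resulting figure only
    shows the strongest co-occurrences.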

     
  • Corp
    2012-04-18

    Thanks for your response. I've tried googling those statistical terms and
    understand them at a very basic level.

    My questions remain:
    1) do I need to be a statistics expert to use KH coder?

    2) How should I explain the co-occurrence network of words in my corpus to my
    colleagues?
    I cannot say, "the figure shows the co-occurring words in X corpus. The bigger
    the node, the higher the frequency is". How can I justify the results or make
    my explanation more convincing?

    3) The TF is set by default. How do I know which value to set for TF if I have
    different sizes of sub-corpora?

    I apologise as I'm not a statistics expert. This software produces good output,
    but I can't understand what the separate statistical functions do, especially
    as a beginner. Thank you.

     
  • HIGUCHI Koichi
    2012-04-18

    Hello. Thank you for your post!

    1) do I need to be a statistics expert to use KH coder?

    No. I think a basic understanding of the statistical techniques is enough
    for users.

    Some people say that you must completely understand statistical techniques and
    be able to calculate them by hand in order to use them. But I disagree. You just
    need a driver's license to drive a car; you do not have to understand the
    mechanics 100%.

    Of course, if you understand it 100%, it may help you on some occasions. But
    do you need to? I say no.

    2) How should I explain the co-occurrence network of words in my corpus to my
    colleagues? I cannot say, "the figure shows the co-occurring words in X
    corpus. The bigger the node, the higher the frequency is". How can I justify
    the results or make my explanation more convincing?

    I am not sure what is wrong with the above explanation. But I will try.

    First, the co-occurrence network is a common technique in the field of
    quantitative content analysis. And content analysis is a very common technique
    for analyzing media messages in sociology. You can refer to this literature:

    (a) Osgood, C.E., 1959, "The Representational Model and Relevant Research
    Methods," I. de S. Pool ed., Trends in Content Analysis. Urbana, IL:
    University of Illinois Press.
    (b) Danowski, J. A., 1993, "Network Analysis of Message Content," W. D.
    Richards Jr. & G. A. Barnett eds., Progress in Communication Sciences IV.
    Norwood, NJ: Ablex, 197-221.

    Second, to explore co-occurrence of words, you can use other techniques such as
    MDS (multidimensional scaling) or cluster analysis. So why choose a
    co-occurrence network? (1) It takes up less space than a cluster analysis.
    (2) It contains less statistical information, so it is easier to understand
    and interpret. You have to interpret only the edges, not the positions.
    Positions have no particular meaning because the layout is chosen to make the
    edges and nodes easy to see. In MDS, you have to interpret positions, which is
    not easy when there are 50+ words. Especially for first-stage exploration, I
    think a co-occurrence network with less information is suitable. (3) You need
    less statistical knowledge to understand it because it contains less
    statistical information. So it is easy for everyone to understand.

    Well, is that good enough? If not, let me know.

    3) The TF is set by default. How do I know which value to set for TF if I have
    different sizes of sub-corpora?

    TF is set automatically so that approx. 75 words are selected. When you make a
    new project with the sub-corpora, a new value will be set automatically. Or you
    can adjust the value and click "check" below it to see how many words will be
    selected with the current settings.
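
    To see why the automatic value differs between sub-corpora of different sizes,
    here is a rough sketch in Python. It assumes the threshold is the minimum TF
    whose selection size is closest to a target number of words; that rule and
    the function name pick_min_tf are assumptions for illustration, not the
    documented behaviour of KH Coder.

```python
from collections import Counter


def pick_min_tf(freqs, target):
    """Return (threshold, count) for the minimum-TF threshold whose number of
    selected words is closest to target. Ties keep the lower threshold."""
    if not freqs:
        raise ValueError("freqs must not be empty")
    best = None
    for threshold in sorted(set(freqs.values())):
        count = sum(1 for tf in freqs.values() if tf >= threshold)
        gap = abs(count - target)
        if best is None or gap < best[0]:
            best = (gap, threshold, count)
    return best[1], best[2]


# Two toy sub-corpora with the same vocabulary; the second is twice as long.
small = Counter("a a a b b c c d e f".split())
large = Counter(("a a a b b c c d e f " * 2).split())

print(pick_min_tf(small, target=3))  # threshold and word count for the small one
print(pick_min_tf(large, target=3))  # a larger corpus needs a higher threshold
```

    The same target number of words leads to a higher TF threshold in the larger
    sub-corpus, so a value tuned on one sub-corpus should not be copied to another
    without checking the resulting word count.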

     
  • Corp
    2012-04-18

    Thank you very much for clarifying the above matter. You're very informative.
    It's clearer now.
    I'm being cautious as I was told you have to justify the methods you choose for
    your research.

    I'm looking into Year 1, 2 and 3 undergraduates' writing, and the sub-corpora
    are small. I would like to see the lexis students are using and whether their
    choice of lexis changes across the years. That's why I've decided to look at
    the co-occurrences of words.

    When you say, you have to interpret just the edges, not positions - do you
    mean examination either by word or by variable? And what do edges mean? Sorry
    for the question (again!)

    Thank you very much in advance.

     
  • HIGUCHI Koichi
    2012-04-19

    Hello. Thank you for your post! Never be sorry to ask questions. It helps me
    to make the manual and / or improve the software.

    you've to justify the methods you choose for your research

    Yes, that is a good thing.

    When you say, you have to interpret just the edges, not positions - do you
    mean examination either by word or by variable? And what do edges mean?

    Well, excuse me, but what do you mean by variable?

    Are you making a co-occurrence network with the "words – variables / headings"
    option? Making this type of figure?

    Anyway, by saying "have to interpret edges, not positions" I mean something
    like this:

    Think about an MDS figure like this: "A B C D E"
    From the positions of A, B, C, D and E, you can see the following similarities.

    • A and B are very similar.
    • B and C are very similar.
    • C and D are very similar.
    • D and E are very similar.
    • A and C are somewhat similar.
    • B and D are somewhat similar.
    • C and E are somewhat similar.
    • A and D are a little similar.
    • B and E are a little similar.

    A lot of information!

    BTW "A and B are similar" means they often co-occur. In other words, they have
    similar occurrence patterns.

    Then think about a co-occurrence network like this: "A-B C-D-E"
    In this case, you have to think about only 3 pairs.

    • A and B are similar.
    • C and D are similar.
    • D and E are similar.

    Less information and easier to understand, isn't it? B and C are placed near
    each other, but there is no edge between them. So B and C are not similar. This
    is what I mean by "don't interpret positions".
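
    The difference between the two readings can be written down as a small Python
    sketch. The one-dimensional layout and the names positions, edges and
    similar_in_network are invented for this illustration only.

```python
from itertools import combinations

# Hypothetical one-dimensional "MDS" layout: distance between labels carries meaning.
positions = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
# Hypothetical network: only the drawn edges carry meaning.
edges = {("A", "B"), ("C", "D"), ("D", "E")}

# MDS reading: every pair has a distance that the reader must weigh.
mds_pairs = {pair: abs(positions[pair[0]] - positions[pair[1]])
             for pair in combinations(sorted(positions), 2)}


def similar_in_network(a, b):
    """Network reading: a pair is similar only if an edge joins it."""
    return (a, b) in edges or (b, a) in edges


print(len(mds_pairs), "pairs to weigh in the MDS figure")
print(len(edges), "pairs to weigh in the network")
print("B and C adjacent in the layout:", mds_pairs[("B", "C")] == 1)
print("B and C similar in the network:", similar_in_network("B", "C"))
```

    The MDS count includes the distant pair A and E, which still has to be read as
    "not similar" from its position; in the network, a missing edge says the same
    thing without any reading of distances.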

    Well, please add to your question if necessary.

    Thank you.

     
    Last edit: HIGUCHI Koichi 2013-08-29

  • Anonymous
    2013-05-09

    Hello, I'd like to ask three questions about your explanation.

    1. In the co-occurrence network's configuration options, I can elect to have "thicker lines for stronger edges." So if there's a thicker line between C and D than between D and E, this means that C and D are more similar than D and E, right? Does a "stronger" edge mean a more frequently linked occurrence?

    2. When variables are added to the co-occurrence network, are they calculated within the plot AS IF they are words/codes? So the shape is really the only indication of the difference between a word and a code or a variable?

    3. What is meant by a "minimum spanning tree"? Does it mean that the highlighted link is the weakest of all the links? If this were the case, how could one node have multiple highlighted links?

    Thank you very much,
    Tom

     
  • HIGUCHI Koichi
    2013-05-10

    Hello. Thank you for your post!

    1.

    So if there's a thicker line between C and D than between D and E, this means that C and D are more similar than D and E, right?

    Yes, you are right.

    Does a "stronger" edge mean a more frequently linked occurrence?

    Basically, yes. Strictly speaking, KH Coder measures similarity using the Jaccard coefficient. So a "stronger" edge means a larger Jaccard coefficient.

    You may also read this thread:
    http://sourceforge.net/p/khc/discussion/222396/thread/2da0ff02/
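
    The difference between "more frequent" and "larger Jaccard coefficient" can be
    seen in a small worked example in Python. The document sets for C, D and E
    below are invented; each set holds the numbers of the documents in which the
    word occurs.

```python
def jaccard(a, b):
    """Jaccard coefficient: documents containing both words divided by
    documents containing at least one of them."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical occurrences in a corpus of ten documents (numbered 0 to 9).
C = {0, 1}
D = {0, 1, 2, 3, 4}
E = {2, 3, 4, 5, 6, 7, 8, 9}

co_cd = len(C & D)  # documents in which C and D co-occur
co_de = len(D & E)  # documents in which D and E co-occur

print("C-D: co-occurrences =", co_cd, " Jaccard =", jaccard(C, D))
print("D-E: co-occurrences =", co_de, " Jaccard =", jaccard(D, E))
```

    Here D and E co-occur in more documents than C and D, but E also appears in
    many documents without D, so the C-D edge gets the larger coefficient and
    would be drawn as the thicker line.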

    2.

    Yes, values of the selected variable are calculated within the plot AS IF they are words/codes. But please note that co-occurrences between words are ignored and are not drawn. Co-occurrences between values are ignored too.

    3.

    "Minimum spanning tree" is a tool for analyzing network structures. Highlighted links tend to be stronger edges. This link may be useful:
    http://en.wikipedia.org/wiki/Minimum_spanning_tree
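
    As a rough illustration, a minimum spanning tree can be computed with
    Kruskal's algorithm, sketched here in Python. The sketch assumes each edge is
    given a distance of 1 minus its Jaccard coefficient, so that strong edges are
    "short"; that weighting, the toy coefficients and the function name
    minimum_spanning_tree are assumptions for the example and may differ from what
    KH Coder does internally.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm. weighted_edges is an iterable of (distance, u, v).
    Returns the list of (u, v) edges kept in the tree."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative of n's component.
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for dist, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


# Hypothetical Jaccard coefficients for the edges of a small network.
jaccard_edges = {
    ("A", "B"): 0.8,
    ("B", "C"): 0.6,
    ("A", "C"): 0.5,
    ("B", "D"): 0.4,
    ("C", "D"): 0.3,
}
nodes = ["A", "B", "C", "D"]

# Assumption: distance = 1 - Jaccard, so the strongest edges are the shortest.
distances = [(1 - s, u, v) for (u, v), s in jaccard_edges.items()]
tree = minimum_spanning_tree(nodes, distances)
print(tree)
```

    The tree connects all four nodes with three edges and leaves out A-C and C-D.
    Node B keeps three tree edges, which shows that one node can have several
    highlighted links.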

     

