I am new to the world of content analysis and data mining, so I am slowly learning about it. It is great for the analysis of qualitative data.
The thread about word clouds was very useful, and I am now also playing with the multi-dimensional scaling option. I would like to ask you some basic questions about it.
Perhaps this is an extremely obvious question, but do the distances in the document affect the results? (Let's say, words emerging in chapter 1, and words emerging in chapter 10).
Thanks a lot!
Hi. Thank you for your post!
classical: Classical (Metric) Multidimensional Scaling.
kruskal: Kruskal's Non-metric Multidimensional Scaling.
sammon: Sammon's Non-Linear Mapping, one form of non-metric multidimensional scaling.
As for MDS, the metric method is rarely used in modern data analysis. I recommend “kruskal” as a basic non-metric method. If many words are plotted too close together or too densely in a small area and the result is hard to interpret with "kruskal", then you can try "sammon."
For more details, you can refer to Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
Can anybody recommend good introductory books about MDS written in English?
As for distances, you can use "jaccard," "euclid" or "cosine" to measure the distances between words.
The Jaccard coefficient only looks at whether words A and B often co-occur or not. It only cares whether a word occurred in the document or not, and ignores how many times the word appeared in the document/unit. It is suitable when you deal with a lot of short documents or units like sentences or paragraphs.
If you have many long documents and you would like to take into account how many times a word appeared in each document, you can use “euclid” or “cosine.” Cosine is perhaps more popular in this field.
For more details, I can recommend Romesburg, H. C. (1984) Cluster Analysis for Researchers. Lifetime Learning Publications.
It's a really good book, especially for beginners.
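To make the difference between the three measures concrete, here is a minimal sketch in plain Python. The word counts and the function names are made up for illustration, and this is not KH Coder's own implementation (KH Coder may preprocess the counts differently before measuring distances). It only shows that Jaccard reduces counts to 0/1, while cosine and Euclidean distances use the counts themselves.

```python
import math

# Hypothetical toy data: how often each word occurs in each of four documents.
counts = {
    "red_shirt": [10, 0, 1, 0],
    "clown":     [1, 0, 10, 0],
    "tokyo":     [0, 3, 0, 2],
}

def jaccard_distance(a, b):
    """1 - (documents containing both) / (documents containing either).
    Counts are reduced to present (1) / absent (0)."""
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1.0 - both / either if either else 0.0

def cosine_distance(a, b):
    """1 - cosine of the angle between the two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def euclidean_distance(a, b):
    """Straight-line distance between the two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = counts["red_shirt"], counts["clown"]
print("jaccard:", jaccard_distance(a, b))
print("cosine: ", round(cosine_distance(a, b), 3))
print("euclid: ", round(euclidean_distance(a, b), 3))
```

In this toy data "red_shirt" and "clown" occur in exactly the same documents, so their Jaccard distance is 0, but their counts are concentrated in different documents, so the cosine distance is about 0.8 and the Euclidean distance is about 12.7.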
Thanks for the explanation! I will try to find the books to become more familiar with the methods.
Regarding the unit of analysis for MDS: what is the main difference between selecting "sentences" and H1 (having several tags at the same time)?
Hi. Thank you for your post.
If you select “sentence” as the unit, KH Coder will check whether words A and B tend to appear in the same sentences or not. If they tend to appear together in the same sentences, they will be plotted close to each other in MDS.
If you select “paragraph” as the unit, KH Coder will check whether words A and B tend to appear in the same paragraphs or not...
In the tutorial data “botchan_en.txt,” H1 tags are used for chapter headings. So if you select “H1” as the unit, KH Coder will check whether words A and B tend to appear in the same chapters or not...
In the last case (H1), “jaccard” may not be very appropriate as the distance measure because it only cares whether a word exists (1) or not (0). The “jaccard” coefficient treats the data as 0-1 dichotomous variables: each word appeared (1) or not (0).
So in the last case (H1), it may well be better to select “cosine” or “euclid” to count words. Then KH Coder will distinguish between a chapter that contains word A once and a chapter that contains word A ten times. KH Coder will check whether word A appears many times when word B appears many times in the chapter. If both A and B tend to appear many times in the same chapters, they will be plotted close to each other in MDS.
Well, I am not sure I am clear enough.
Please feel free to ask more questions.
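The effect of the unit can be sketched with a tiny made-up text. The two "chapters" below, the helper names, and the simple whitespace tokenization are all assumptions for illustration, not how KH Coder processes text. The point is only that the same pair of words can look unrelated at the sentence level and strongly related at the chapter (H1) level, and that counts only start to matter for larger units.

```python
# Hypothetical toy text: two "chapters" (H1 units), each a list of sentences.
chapters = [
    ["red_shirt talks", "clown laughs", "red_shirt leaves"],
    ["tokyo is far", "I eat in tokyo"],
]

def units(level):
    """Return the set of words found in each unit at the chosen level."""
    if level == "sentence":
        return [set(s.split()) for ch in chapters for s in ch]
    if level == "chapter":
        return [set(" ".join(ch).split()) for ch in chapters]
    raise ValueError(level)

def jaccard_similarity(word_a, word_b, level):
    """Share of units containing both words among units containing either."""
    us = units(level)
    both = sum(1 for u in us if word_a in u and word_b in u)
    either = sum(1 for u in us if word_a in u or word_b in u)
    return both / either if either else 0.0

def chapter_counts(word):
    """How many times the word appears in each chapter (used by count-based measures)."""
    return [" ".join(ch).split().count(word) for ch in chapters]

print(jaccard_similarity("red_shirt", "clown", "sentence"))  # never in the same sentence
print(jaccard_similarity("red_shirt", "clown", "chapter"))   # always in the same chapter
print(chapter_counts("red_shirt"), chapter_counts("clown"))
```

Here "red_shirt" and "clown" never share a sentence (similarity 0.0) but always share a chapter (similarity 1.0). At the chapter level, the counts [2, 0] and [1, 0] carry extra information that a 0/1 measure such as Jaccard throws away.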
Thanks for your answer! That makes it very clear.
I also have a question regarding the interpretation of the graph. I have been looking for information in articles and books about the interpretation of MDS and correspondence analysis, and their dimensions. However, either they are too advanced for me, or the examples they provide are too simple compared to my own results.
Let's say that I have an MDS graph like the one from the Botchan example in the tutorial. This graph contains many different types of words. Let's say we had: day, Tokyo, talk, red_shirt, clown, eat, live. (This problem also relates to correspondence analysis.)
I find it very difficult to systematically provide a name for each cluster, or to find a meaningful label for the dimensions. I saw an example (which I attach to this message) in which they categorize hair and eye color. In this graph, it is easy to make interpretations about the trends and see that people with brown hair are more likely to have black eyes, and people with blue eyes are more likely to be blonde. The represented words belong to the same family.
However, as I said... what happens when the nature of the words is very diverse (like in the Botchan example)? How is it possible to get a meaningful interpretation?
Hi, thank you for the post.
First, because MDS results contain a large amount of information, they are not always easy to interpret. A co-occurrence network contains less information and may be easier to interpret.
Or you can use the clustering technique to get some hints.
Second, you cannot always find meanings in dimensions. Sometimes you can, but not always. Also, don’t think you can give meaningful labels to all clusters. If you can give meaningful labels to some clusters, it’s OK, I think. We are dealing with a lot of automatically extracted words. It is very different from a small number of carefully chosen concepts like in your example.
Third, we should still try hard to give meaningful labels to as many clusters as possible. (1) We have to make some inferences. For example, “red shirt” and “clown” are both important characters in the novel. If they are in the same cluster, they may have some connection in the story. And other words in the cluster can give us some hints about the connection. Also, (2) it is very important to use KWIC (keyword in context) to check word usage in the raw data to enrich the inferences.
Rather than only looking at the result, we have to make some inferences and check the raw data to interpret it. Computers, software like KH Coder, or statistical techniques can only help. The analysis is ours, I think.
Hope it helps.
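For readers who have not used KWIC before, the idea of checking word usage in the raw data can be sketched as below. This is only an illustration in plain Python: the sample sentence is invented (it is not a quotation from the novel), and the function name, the `width` parameter, and the simple whitespace tokenization are assumptions, not KH Coder's KWIC concordance itself.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.
    `width` is the number of words kept on each side."""
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        # Compare without surrounding punctuation and without case.
        if tok.strip('.,;:!?"').lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, for illustration only.
sample = "The clown laughed when Red_Shirt spoke. Later the clown followed him home."

for left, word, right in kwic(sample, "clown", width=2):
    print(f"{left:>12} [{word}] {right}")
```

Reading the hits side by side like this is what lets you judge whether two words that sit in the same cluster are actually connected in the text.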
I'm researching Multidimensional Analysis for a course and I have a small question. It's probably a little basic. Is KH Coder's Multidimensional Scaling the same as the Multidimensional Analysis proposed by Biber, D. (1995)? If it is, what do the three KH Coder dimensions represent?
Thanks for the help!
Sorry, I haven't read Biber, D. (1995).
In the default setting, KH Coder uses the "isoMDS" function of R to perform Multidimensional Scaling. You can find the documentation here:
According to the above document, Cox & Cox (2001) may help you out?
Not really, but it doesn't matter. It seems that MDS and MD analysis are similar. About dimensions 1, 2 and 3 of MDS: what do they represent? For example, if type-tokens tend toward one (1) or negative one (-1), or, in the case of the image, tend toward zero (0), what does that mean depending on the dimension?
Thank you very much for your help, Higuchi. I'm just starting out in Corpus Linguistics and this is a great help.
Well, please note that the point of MDS is the placement of words, not the dimensions.
MDS tries to make a map such that words that tend to appear together are placed near each other on the map, and words that rarely co-occur are placed far away from each other.
As a result of the above procedure, in some cases you can find some meaning in the dimensions. For example, there may be emotional words on the left side and logical words on the right side (dimension 1 may correspond to a logical-emotional scale in this case). But it won’t always happen.
Also, as a result of the above procedure, words that co-occur with every other word (such as "." and ",") could be placed near the center of the map. But I am not sure whether this always happens or not. In most cases, we remove such words from the analysis as stop words.
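The map-making idea described above can be sketched in a few lines of plain Python. Please note the assumptions: the distance matrix is invented, and the fitting method is simple gradient descent on a raw (metric) stress function, which is a stand-in chosen for brevity and is not Kruskal's non-metric algorithm as used by "isoMDS". It only illustrates that the coordinates exist to reproduce the distances, so the axes have no built-in meaning.

```python
import math

words = ["red_shirt", "clown", "tokyo", "eat"]
# Hypothetical distances: small = tend to co-occur, large = rarely co-occur.
delta = [
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]

def mds_map(delta, steps=500, rate=0.05):
    """Place points in 2D so that map distances approximate `delta`,
    by gradient descent on the sum of squared differences (raw stress)."""
    n = len(delta)
    # Deterministic start: points evenly spaced on a unit circle.
    pts = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[i][0] - pts[j][0]
                dy = pts[i][1] - pts[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Gradient of (d - delta)^2 with respect to point i.
                g = 2.0 * (d - delta[i][j]) / d
                grads[i][0] += g * dx
                grads[i][1] += g * dy
        for i in range(n):
            pts[i][0] -= rate * grads[i][0]
            pts[i][1] -= rate * grads[i][1]
    return pts

def map_distance(pts, i, j):
    """Distance between two words on the fitted map."""
    return math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1])

coords = mds_map(delta)
for w, (x, y) in zip(words, coords):
    print(f"{w:10s} {x:6.2f} {y:6.2f}")
```

On the resulting map, "red_shirt" and "clown" end up close together and far from "tokyo" and "eat". Rotating or flipping the whole map would fit the distances equally well, which is why the placement of words matters more than the dimensions.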
Thank you very much!