S Luz
-
2017-03-04
- labels: --> Feature
Improvements to mosaic; visualising sparseness of token distribution.
Approach considered:
a) Colour the mosaic tiles so that the luminance (in the HSL
colour model) of a tile is directly proportional to its
sparseness (i.e. the more texts in the concordance a word
occurs in, the lighter the background of its tile in the
mosaic; see attached mock-up). A simple sparseness metric
might be, for instance:
    s(t) = -log(1 - P(t|d)) = -log((D - D_t)/D) = log(D/(D - D_t))

where P(t|d) is the probability that word t occurs in text (document) d (estimated as D_t/D), D is the total number of documents, and D_t is the number of documents in which t occurs. (We will need to normalise s(t_i) over all words t_i in a column of the mosaic to map these values to a luminance value.)

We could also consider using, as Shane mentioned, a function of TFxIDF over all documents t occurs in as a possible metric, where, say,

    tfidf(t,d) = (#(t,d)/size(d)) * log(D/D_t)

where #(t,d) is the number of occurrences of t in d, size(d) is the total number of tokens in d, and D and D_t are as above. The metric per word t on a mosaic tile would be the sum over all texts d,

    t(t) = \sum_d tfidf(t,d)

which is then normalised over all words in a mosaic column.
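To make the two candidate metrics concrete, here is a rough Python sketch, not a proposal for the actual implementation. The function names, the representation of documents as token lists, and the 30-95% lightness range are all assumptions made for illustration. Note that s(t) diverges when a word occurs in every document (D_t = D), so the sketch uses D + 1 in the denominator; that smoothing is also an assumption and not part of the formula above.

```python
import math


def sparseness(doc_freq, n_docs):
    """s(t) = -log(1 - D_t/D), with D + 1 in the denominator (assumed
    smoothing) so that the value stays finite when D_t == D."""
    return -math.log(1.0 - doc_freq / (n_docs + 1))


def tfidf_sum(term, docs):
    """Sum over all documents d of (#(t,d)/size(d)) * log(D/D_t).

    `docs` is a list of token lists (hypothetical representation).
    """
    n_docs = len(docs)
    d_t = sum(1 for d in docs if term in d)
    if d_t == 0:
        return 0.0  # term never occurs, so every tf is zero
    idf = math.log(n_docs / d_t)
    return sum(d.count(term) / len(d) for d in docs if d) * idf


def to_luminance(scores, lo=30.0, hi=95.0):
    """Min-max normalise the scores of one mosaic column into an HSL
    lightness percentage in [lo, hi] (range chosen arbitrarily)."""
    smin, smax = min(scores.values()), max(scores.values())
    if smax == smin:
        return {t: hi for t in scores}  # all tiles equally light
    scale = (hi - lo) / (smax - smin)
    return {t: lo + scale * (s - smin) for t, s in scores.items()}


# Toy example: three texts, one mosaic column with three words.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
column = ["the", "sat", "cat"]
scores = {t: sparseness(sum(t in d for d in docs), len(docs)) for t in column}
for word, lum in to_luminance(scores).items():
    print(f"{word}: hsl(0, 0%, {lum:.0f}%)")
```

With the first metric the most widely distributed word in the column gets the lightest tile. The TFxIDF variant behaves differently: a word that occurs in every text has log(D/D_t) = 0 and so scores zero, which after normalisation would give it the darkest tile. That difference is probably worth checking against the mock-up before choosing between them.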