S Luz
-
2017-03-04
- labels: --> Feature
Improvements to mosaic; visualising sparseness of token distribution.
Approach considered:
a) Colour the mosaic tiles so that the luminance (in the HSL
colour model) of a tile is directly proportional to its
sparseness (i.e. the more texts of the concordance a word
occurs in, the lighter the background of its tile in the
mosaic; see attached mock-up). A simple sparseness metric
might be, for instance:
s(t) = -log(1 - P(t|d))
     = -log((D - D_t)/D)
where P(t|d) is the probability that word t occurs in text
(document) d, D is the total number of documents, and D_t
the number of documents in which t occurs. (We will need to
normalise the values s(t_i) over all words t_i in a column of
the mosaic to map them to luminance values.)
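A minimal sketch of how this metric and the per-column normalisation might be computed. The function names, the min-max normalisation, and the luminance range [0.3, 0.95] are assumptions for illustration only, not part of the existing code. Note that s(t) as written is undefined when D_t = D (log of zero), so the sketch divides by D + 1 instead of D; that smoothing is also an assumption.

```python
import math


def sparseness(doc_freq, num_docs):
    """s(t) = -log(1 - D_t/D), with D replaced by D + 1 in the ratio.

    The "+ 1" is an assumed smoothing so that a word occurring in every
    document (D_t = D) does not produce log(0).
    """
    return -math.log(1.0 - doc_freq / (num_docs + 1))


def column_luminance(doc_freqs, num_docs, lo=0.3, hi=0.95):
    """Map each word of one mosaic column to an HSL luminance in [lo, hi].

    doc_freqs maps word -> D_t. Min-max normalisation over the column is
    an assumed choice; lo and hi are hypothetical display bounds.
    """
    if not doc_freqs:
        return {}
    scores = {t: sparseness(df, num_docs) for t, df in doc_freqs.items()}
    s_min = min(scores.values())
    span = max(scores.values()) - s_min
    if span == 0:
        # All words score the same: use the middle of the range.
        return {t: (lo + hi) / 2 for t in scores}
    return {t: lo + (hi - lo) * (s - s_min) / span for t, s in scores.items()}
```

For example, column_luminance({"the": 10, "of": 5, "whale": 1}, 10) gives "the" the lightest tile and "whale" the darkest, in line with the description above.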
We could also consider using, as Shane mentioned, a function
of TFxIDF over all documents t occurs in as a possible metric,
where say,
tfidf(t,d) = (#(t,d)/size(d)) * log(D/D_t)
where #(t,d) is the number of occurrences of t in d, size(d)
is the total number of tokens in d, and D and D_t are as
above.
The metric per word t on a mosaic tile would be a sum over all
texts d:
t(t) = \sum_d tfidf(t,d)
and then normalised over all words in a mosaic column.
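The summed TFxIDF metric could be sketched as follows. This assumes each text is available as a list of tokens; the function names and the choice to normalise by the column maximum are hypothetical. Words that occur in no document are given a score of 0, which is also an assumption.

```python
import math
from collections import Counter


def summed_tfidf(docs, words):
    """For each word t, return the sum over all texts d of tfidf(t, d).

    docs is a list of token lists; words are the words of one mosaic column.
    """
    num_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    scores = {}
    for t in words:
        d_t = sum(1 for c in counts if c[t] > 0)  # documents containing t
        if d_t == 0:
            scores[t] = 0.0  # assumed value for unseen words
            continue
        idf = math.log(num_docs / d_t)
        # tf is #(t,d)/size(d); empty documents are skipped.
        scores[t] = sum(c[t] / len(doc) * idf
                        for c, doc in zip(counts, docs) if doc)
    return scores


def normalise(scores):
    """Scale the scores of one column to [0, 1] by dividing by the maximum."""
    top = max(scores.values(), default=0.0)
    return {t: (s / top if top > 0 else 0.0) for t, s in scores.items()}
```

One property worth noting: a word that occurs in every text has log(D/D_t) = 0 and so scores 0 under this metric, whereas it would get the lightest tile under s(t).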