#4 Mosaic: visualising sparseness of word distribution

Milestone: 1.0
Status: open
Owner: Shane
Updated: 2024-03-22
Created: 2017-03-04
Creator: S Luz
Private: No

Improvements to mosaic; visualising sparseness of token distribution.

  • Approach considered:

    a) Colour the mosaic tiles so that the luminance (in the HSL
    colour model) of a tile is directly proportional to its
    sparseness (i.e. the more texts in the concordance a word
    occurs in, the lighter the background of its tile in the
    mosaic; see attached mock-up). A simple sparseness metric
    might be, for instance:

          s(t) = -log(1 - P(t|d))
               = -log((D - D_t)/D)
    
    where P(t|d) is the probability that word t occurs in a text
    (document) d, estimated as D_t/D, D is the total number of
    documents, and D_t is the number of documents in which t
    occurs. (We will need to normalise s(t_i) over all words t_i
    in a column of the mosaic to map these s(t) to a luminance
    value.)
    
    We could also consider, as Shane mentioned, using a function
    of TFxIDF over all documents in which t occurs as a possible
    metric, where, say,
    
          tfidf(t,d) = (#(t,d)/size(d)) * log(D/D_t)
    
    where #(t,d) is the number of occurrences of t in d, size(d)
    is the total number of tokens in d, and D and D_t are as
    above.
    
    The metric per word t on a mosaic tile would then be the sum
    over all texts d:
    
          t(t) = \sum_d tfidf(t,d)
    
    which would then be normalised over all words in a mosaic column.
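
The two candidate metrics and the per-column normalisation could be sketched roughly as follows. This is only an illustration, not the mosaic's actual code: the function names, the representation of texts as lists of tokens, and the lightness range 0.35 to 0.95 are all assumptions. The sketch also estimates P(t|d) as D_t/(D + 1) rather than D_t/D, an assumed smoothing that keeps s(t) finite for a word that occurs in every text.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """D_t for every word t: the number of texts in which t occurs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per text
    return df


def doc_frequency_sparseness(docs):
    """s(t) = -log(1 - P(t|d)), with P(t|d) estimated from document counts.

    Assumption (not in the ticket): P(t|d) = D_t / (D + 1), so that a word
    occurring in all D texts does not produce -log(0).
    """
    D = len(docs)
    df = document_frequencies(docs)
    return {t: -math.log(1 - n / (D + 1)) for t, n in df.items()}


def summed_tfidf(docs):
    """t(t) = sum over texts d of (#(t,d) / size(d)) * log(D / D_t)."""
    D = len(docs)
    df = document_frequencies(docs)
    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue  # an empty text contributes nothing
        for t, n in Counter(tokens).items():
            scores[t] += (n / len(tokens)) * math.log(D / df[t])
    return dict(scores)


def column_lightness(scores, column, lo=0.35, hi=0.95):
    """Min-max normalise the scores of the words in one mosaic column and
    map them linearly onto an HSL lightness range [lo, hi] (assumed range).
    """
    if not column:
        return {}
    values = [scores.get(t, 0.0) for t in column]
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        # All tiles score the same: give them a neutral mid lightness.
        return {t: (lo + hi) / 2 for t in column}
    return {t: lo + (hi - lo) * (v - vmin) / span
            for t, v in zip(column, values)}


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
s = doc_frequency_sparseness(docs)
print(column_lightness(s, ["the", "cat", "dog"]))
```

Note that the two scores move in opposite directions with document frequency: s(t) grows with D_t, while the IDF factor shrinks, so a word that occurs in every text gets a summed TFxIDF of zero. One of the two would presumably need inverting before being mapped to lightness if both are to give lighter tiles to widely distributed words.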
    
1 Attachments

Discussion

  • S Luz - 2017-03-04
    • labels: --> Feature

  • S Luz - 2017-11-22
    • assigned_to: Shane

  • Shane - 2024-03-22
    • labels: Feature --> Feature, feature request
