Tokenizers

  • Written by Stevan Bajić
  • July 31st 2009
  • Taken from User Mailing list

Tokenizers in DSPAM

I am going to explain in depth how the tokenizers create the tokens/patterns. I am doing this in the hope that new users will search the mailing list archives instead of asking the same question over and over. I will only show the token-generating part. Internally DSPAM uses algorithms to calculate the probability and the confidence factor, but that is beyond the scope of this document.

DSPAM also assigns different weights to the generated tokens depending on which tokenizer is used; again, this is beyond the scope of this document.


WORD

The WORD tokenizer breaks messages up into single words. For example, the text "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht." would be broken up into:

  1. Heute
  2. Abend
  3. war
  4. ich
  5. mit
  6. meiner
  7. Freundin
  8. im
  9. Kino
  10. und
  11. habe
  12. viel
  13. gelacht


And then DSPAM would create the tokens for each word:

  • TOKEN: 'Heute' CRC: 6716984897371635712
  • TOKEN: 'Abend' CRC: 6670531613365895168
  • TOKEN: 'war' CRC: 4772677679197454336
  • TOKEN: 'ich' CRC: 6329956816985784320
  • TOKEN: 'mit' CRC: 5158417007107899392
  • TOKEN: 'meiner' CRC: 4773009072114954240
  • TOKEN: 'Freundin' CRC: 13580161102417572361
  • TOKEN: 'im' CRC: 5811385145726337024
  • TOKEN: 'Kino' CRC: 6035516550826426368
  • TOKEN: 'und' CRC: 6670506629311496192
  • TOKEN: 'habe' CRC: 6712962585043402752
  • TOKEN: 'viel' CRC: 5844870173739188224
  • TOKEN: 'gelacht' CRC: 5158829993465032208
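
The word splitting above can be sketched in a few lines of Python. This is only an illustration: the function name `word_tokens` is made up, the regular expression is a simplification of whatever delimiters DSPAM really uses, and the CRC calculation is not reproduced here.

```python
import re

def word_tokens(text):
    """Split text into words, dropping punctuation.

    Assumption: "\\w+" stands in for DSPAM's real delimiter handling,
    which is not described in this document.
    """
    return re.findall(r"\w+", text)

sentence = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht."
words = word_tokens(sentence)

for number, word in enumerate(words, start=1):
    print(f"{number}. {word}")
```

For the example sentence this yields the same 13 words as the list above, with the trailing full stop removed.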


CHAIN

Tokenizer CHAIN would break up the same message into (+ = combine word):

  1. Heute+Abend
  2. Abend+war
  3. war+ich
  4. ich+mit
  5. mit+meiner
  6. meiner+Freundin
  7. Freundin+im
  8. im+Kino
  9. Kino+und
  10. und+habe
  11. habe+viel
  12. viel+gelacht


And then DSPAM would create the tokens for each chain:

  • TOKEN: 'Heute+Abend' CRC: 9299536586222406967
  • TOKEN: 'Abend+war' CRC: 5205867775940263209
  • TOKEN: 'war+ich' CRC: 6329956649787979024
  • TOKEN: 'ich+mit' CRC: 5158416839735805488
  • TOKEN: 'mit+meiner' CRC: 9567822050683308311
  • TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
  • TOKEN: 'Freundin+im' CRC: 7816109150855533158
  • TOKEN: 'im+Kino' CRC: 6035516551245899312
  • TOKEN: 'Kino+und' CRC: 3139684354012378707
  • TOKEN: 'und+habe' CRC: 2029218973535212134
  • TOKEN: 'habe+viel' CRC: 15552379170419714363
  • TOKEN: 'viel+gelacht' CRC: 5059261385542544937
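
A minimal sketch of how such chains could be built from the word list. The helper name `chain_tokens` is hypothetical and the code only reproduces the pairing shown above, not DSPAM's internal implementation or its CRC values.

```python
def chain_tokens(words):
    """Join every word with its direct successor using '+'."""
    return [f"{left}+{right}" for left, right in zip(words, words[1:])]

words = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht".split()
chains = chain_tokens(words)

for number, chain in enumerate(chains, start=1):
    print(f"{number}. {chain}")
```

With 13 words this gives 12 chains, from 'Heute+Abend' to 'viel+gelacht'.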


OSB (Orthogonal Sparse biGram)

Tokenizer OSB would break up the same mail into (+ = combine word, # = skip word):

  1. Heute+#+#+#+mit
  2. Abend+#+#+mit
  3. war+#+mit
  4. ich+mit
  5. Abend+#+#+#+meiner
  6. war+#+#+meiner
  7. ich+#+meiner
  8. mit+meiner
  9. war+#+#+#+Freundin
  10. ich+#+#+Freundin
  11. mit+#+Freundin
  12. meiner+Freundin
  13. ich+#+#+#+im
  14. mit+#+#+im
  15. meiner+#+im
  16. Freundin+im
  17. mit+#+#+#+Kino
  18. meiner+#+#+Kino
  19. Freundin+#+Kino
  20. im+Kino
  21. meiner+#+#+#+und
  22. Freundin+#+#+und
  23. im+#+und
  24. Kino+und
  25. Freundin+#+#+#+habe
  26. im+#+#+habe
  27. Kino+#+habe
  28. und+habe
  29. im+#+#+#+viel
  30. Kino+#+#+viel
  31. und+#+viel
  32. habe+viel
  33. Kino+#+#+#+gelacht
  34. und+#+#+gelacht
  35. habe+#+gelacht
  36. viel+gelacht


And then DSPAM would create the tokens for each pattern:

  • TOKEN: 'Heute+#+#+#+mit' CRC: 2006452661602586241
  • TOKEN: 'Abend+#+#+mit' CRC: 5482652074219693289
  • TOKEN: 'war+#+mit' CRC: 15707817493435847227
  • TOKEN: 'ich+mit' CRC: 5158416839735805488
  • TOKEN: 'Abend+#+#+#+meiner' CRC: 8544044731047037263
  • TOKEN: 'war+#+#+meiner' CRC: 14722667808637756004
  • TOKEN: 'ich+#+meiner' CRC: 14702440976645933412
  • TOKEN: 'mit+meiner' CRC: 9567822050683308311
  • TOKEN: 'war+#+#+#+Freundin' CRC: 17493766208078576673
  • TOKEN: 'ich+#+#+Freundin' CRC: 5758453548536397908
  • TOKEN: 'mit+#+Freundin' CRC: 11320398811460250377
  • TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
  • TOKEN: 'ich+#+#+#+im' CRC: 16038087078191622500
  • TOKEN: 'mit+#+#+im' CRC: 10834693566633404695
  • TOKEN: 'meiner+#+im' CRC: 1465587418199637282
  • TOKEN: 'Freundin+im' CRC: 7816109150855533158
  • TOKEN: 'mit+#+#+#+Kino' CRC: 15973767036771659876
  • TOKEN: 'meiner+#+#+Kino' CRC: 15990647948548780029
  • TOKEN: 'Freundin+#+Kino' CRC: 1671950834524128732
  • TOKEN: 'im+Kino' CRC: 6035516551245899312
  • TOKEN: 'meiner+#+#+#+und' CRC: 14982435885105910831
  • TOKEN: 'Freundin+#+#+und' CRC: 17912671458991389317
  • TOKEN: 'im+#+und' CRC: 8183715938297249958
  • TOKEN: 'Kino+und' CRC: 3139684354012378707
  • TOKEN: 'Freundin+#+#+#+habe' CRC: 1991158605521709403
  • TOKEN: 'im+#+#+habe' CRC: 15211418373216069988
  • TOKEN: 'Kino+#+habe' CRC: 16865398141328091395
  • TOKEN: 'und+habe' CRC: 2029218973535212134
  • TOKEN: 'im+#+#+#+viel' CRC: 16100764786021230948
  • TOKEN: 'Kino+#+#+viel' CRC: 16082144607504427364
  • TOKEN: 'und+#+viel' CRC: 10458140588311374092
  • TOKEN: 'habe+viel' CRC: 15552379170419714363
  • TOKEN: 'Kino+#+#+#+gelacht' CRC: 3148349109242633294
  • TOKEN: 'und+#+#+gelacht' CRC: 2006833870839550408
  • TOKEN: 'habe+#+gelacht' CRC: 2006883881861244457
  • TOKEN: 'viel+gelacht' CRC: 5059261385542544937
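
The OSB patterns listed above follow a simple rule: for every window of 5 words, the last word is paired with each of the 4 words before it, and the words in between are replaced by '#'. The sketch below reproduces that rule. The function name `osb_tokens` and the `window` parameter are my own naming for illustration, not DSPAM's API, and the CRC step is left out.

```python
def osb_tokens(words, window=5):
    """Pair the last word of each window with every earlier word in it.

    Skipped words between the pair are written as '#'.
    """
    patterns = []
    for end in range(window - 1, len(words)):
        # gap = distance between the two real words, largest first
        for gap in range(window - 1, 0, -1):
            skipped = "+#" * (gap - 1)
            patterns.append(f"{words[end - gap]}{skipped}+{words[end]}")
    return patterns

words = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht".split()
patterns = osb_tokens(words)

for number, pattern in enumerate(patterns, start=1):
    print(f"{number}. {pattern}")
```

Note that the pattern with gap 1 (for example 'ich+mit') is identical to the corresponding CHAIN token, which is why both listings show the same CRC for it.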


SBPH (Sparse Binary Polynomial Hashing)

Tokenizer SBPH would break up the same mail into (+ = combine word, # = skip word):

  1. mit
  2. ich+mit
  3. war+#+mit
  4. war+ich+mit
  5. Abend+#+#+mit
  6. Abend+#+ich+mit
  7. Abend+war+#+mit
  8. Abend+war+ich+mit
  9. Heute+#+#+#+mit
  10. Heute+#+#+ich+mit
  11. Heute+#+war+#+mit
  12. Heute+#+war+ich+mit
  13. Heute+Abend+#+#+mit
  14. Heute+Abend+#+ich+mit
  15. Heute+Abend+war+#+mit
  16. Heute+Abend+war+ich+mit
  17. meiner
  18. mit+meiner
  19. ich+#+meiner
  20. ich+mit+meiner
  21. war+#+#+meiner
  22. war+#+mit+meiner
  23. war+ich+#+meiner
  24. war+ich+mit+meiner
  25. Abend+#+#+#+meiner
  26. Abend+#+#+mit+meiner
  27. Abend+#+ich+#+meiner
  28. Abend+#+ich+mit+meiner
  29. Abend+war+#+#+meiner
  30. Abend+war+#+mit+meiner
  31. Abend+war+ich+#+meiner
  32. Abend+war+ich+mit+meiner
  33. Freundin
  34. meiner+Freundin
  35. mit+#+Freundin
  36. mit+meiner+Freundin
  37. ich+#+#+Freundin
  38. ich+#+meiner+Freundin
  39. ich+mit+#+Freundin
  40. ich+mit+meiner+Freundin
  41. war+#+#+#+Freundin
  42. war+#+#+meiner+Freundin
  43. war+#+mit+#+Freundin
  44. war+#+mit+meiner+Freundin
  45. war+ich+#+#+Freundin
  46. war+ich+#+meiner+Freundin
  47. war+ich+mit+#+Freundin
  48. war+ich+mit+meiner+Freundin
  49. im
  50. Freundin+im
  51. meiner+#+im
  52. meiner+Freundin+im
  53. mit+#+#+im
  54. mit+#+Freundin+im
  55. mit+meiner+#+im
  56. mit+meiner+Freundin+im
  57. ich+#+#+#+im
  58. ich+#+#+Freundin+im
  59. ich+#+meiner+#+im
  60. ich+#+meiner+Freundin+im
  61. ich+mit+#+#+im
  62. ich+mit+#+Freundin+im
  63. ich+mit+meiner+#+im
  64. ich+mit+meiner+Freundin+im
  65. Kino
  66. im+Kino
  67. Freundin+#+Kino
  68. Freundin+im+Kino
  69. meiner+#+#+Kino
  70. meiner+#+im+Kino
  71. meiner+Freundin+#+Kino
  72. meiner+Freundin+im+Kino
  73. mit+#+#+#+Kino
  74. mit+#+#+im+Kino
  75. mit+#+Freundin+#+Kino
  76. mit+#+Freundin+im+Kino
  77. mit+meiner+#+#+Kino
  78. mit+meiner+#+im+Kino
  79. mit+meiner+Freundin+#+Kino
  80. mit+meiner+Freundin+im+Kino
  81. und
  82. Kino+und
  83. im+#+und
  84. im+Kino+und
  85. Freundin+#+#+und
  86. Freundin+#+Kino+und
  87. Freundin+im+#+und
  88. Freundin+im+Kino+und
  89. meiner+#+#+#+und
  90. meiner+#+#+Kino+und
  91. meiner+#+im+#+und
  92. meiner+#+im+Kino+und
  93. meiner+Freundin+#+#+und
  94. meiner+Freundin+#+Kino+und
  95. meiner+Freundin+im+#+und
  96. meiner+Freundin+im+Kino+und
  97. habe
  98. und+habe
  99. Kino+#+habe
  100. Kino+und+habe
  101. im+#+#+habe
  102. im+#+und+habe
  103. im+Kino+#+habe
  104. im+Kino+und+habe
  105. Freundin+#+#+#+habe
  106. Freundin+#+#+und+habe
  107. Freundin+#+Kino+#+habe
  108. Freundin+#+Kino+und+habe
  109. Freundin+im+#+#+habe
  110. Freundin+im+#+und+habe
  111. Freundin+im+Kino+#+habe
  112. Freundin+im+Kino+und+habe
  113. viel
  114. habe+viel
  115. und+#+viel
  116. und+habe+viel
  117. Kino+#+#+viel
  118. Kino+#+habe+viel
  119. Kino+und+#+viel
  120. Kino+und+habe+viel
  121. im+#+#+#+viel
  122. im+#+#+habe+viel
  123. im+#+und+#+viel
  124. im+#+und+habe+viel
  125. im+Kino+#+#+viel
  126. im+Kino+#+habe+viel
  127. im+Kino+und+#+viel
  128. im+Kino+und+habe+viel
  129. gelacht
  130. viel+gelacht
  131. habe+#+gelacht
  132. habe+viel+gelacht
  133. und+#+#+gelacht
  134. und+#+viel+gelacht
  135. und+habe+#+gelacht
  136. und+habe+viel+gelacht
  137. Kino+#+#+#+gelacht
  138. Kino+#+#+viel+gelacht
  139. Kino+#+habe+#+gelacht
  140. Kino+#+habe+viel+gelacht
  141. Kino+und+#+#+gelacht
  142. Kino+und+#+viel+gelacht
  143. Kino+und+habe+#+gelacht
  144. Kino+und+habe+viel+gelacht


And then DSPAM would create the tokens for each pattern:

  • TOKEN: 'mit' CRC: 5158417007107899392
  • TOKEN: 'ich+mit' CRC: 5158416839735805488
  • TOKEN: 'war+#+mit' CRC: 15707817493435847227
  • TOKEN: 'war+ich+mit' CRC: 6905336139605378569
  • TOKEN: 'Abend+#+#+mit' CRC: 5482652074219693289
  • TOKEN: 'Abend+#+ich+mit' CRC: 2006454003823721484
  • TOKEN: 'Abend+war+#+mit' CRC: 15698522771525150782
  • TOKEN: 'Abend+war+ich+mit' CRC: 8949741539749834179
  • TOKEN: 'Heute+#+#+#+mit' CRC: 2006452661602586241
  • TOKEN: 'Heute+#+#+ich+mit' CRC: 10912094934613969813
  • TOKEN: 'Heute+#+war+#+mit' CRC: 6155167828760649639
  • TOKEN: 'Heute+#+war+ich+mit' CRC: 16279494732467846352
  • TOKEN: 'Heute+Abend+#+#+mit' CRC: 17451034817009114672
  • TOKEN: 'Heute+Abend+#+ich+mit' CRC: 4079088572591062061
  • TOKEN: 'Heute+Abend+war+#+mit' CRC: 18059387714294556703
  • TOKEN: 'Heute+Abend+war+ich+mit' CRC: 11818656812148744564
  • TOKEN: 'meiner' CRC: 4773009072114954240
  • TOKEN: 'mit+meiner' CRC: 9567822050683308311
  • TOKEN: 'ich+#+meiner' CRC: 14702440976645933412
  • TOKEN: 'ich+mit+meiner' CRC: 9567785545379955735
  • TOKEN: 'war+#+#+meiner' CRC: 14722667808637756004
  • TOKEN: 'war+#+mit+meiner' CRC: 15102709312457837529
  • TOKEN: 'war+ich+#+meiner' CRC: 11050021678326778794
  • TOKEN: 'war+ich+mit+meiner' CRC: 16941397287622239551
  • TOKEN: 'Abend+#+#+#+meiner' CRC: 8544044731047037263
  • TOKEN: 'Abend+#+#+mit+meiner' CRC: 6700917376176391980
  • TOKEN: 'Abend+#+ich+#+meiner' CRC: 1454797817133172575
  • TOKEN: 'Abend+#+ich+mit+meiner' CRC: 16196371955299811851
  • TOKEN: 'Abend+war+#+#+meiner' CRC: 1470573290570285919
  • TOKEN: 'Abend+war+#+mit+meiner' CRC: 16574673948525685929
  • TOKEN: 'Abend+war+ich+#+meiner' CRC: 12595187249194953946
  • TOKEN: 'Abend+war+ich+mit+meiner' CRC: 7875344944142258172
  • TOKEN: 'Freundin' CRC: 13580161102417572361
  • TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
  • TOKEN: 'mit+#+Freundin' CRC: 11320398811460250377
  • TOKEN: 'mit+meiner+Freundin' CRC: 11701509035464193201
  • TOKEN: 'ich+#+#+Freundin' CRC: 5758453548536397908
  • TOKEN: 'ich+#+meiner+Freundin' CRC: 14734124733490463921
  • TOKEN: 'ich+mit+#+Freundin' CRC: 7712133220574661181
  • TOKEN: 'ich+mit+meiner+Freundin' CRC: 4279576675461360019
  • TOKEN: 'war+#+#+#+Freundin' CRC: 17493766208078576673
  • TOKEN: 'war+#+#+meiner+Freundin' CRC: 13509209483715422404
  • TOKEN: 'war+#+mit+#+Freundin' CRC: 17942848210230282853
  • TOKEN: 'war+#+mit+meiner+Freundin' CRC: 459960140544208734
  • TOKEN: 'war+ich+#+#+Freundin' CRC: 8674439904769966164
  • TOKEN: 'war+ich+#+meiner+Freundin' CRC: 7528935976228917086
  • TOKEN: 'war+ich+mit+#+Freundin' CRC: 7712194212032937837
  • TOKEN: 'war+ich+mit+meiner+Freundin' CRC: 17059398211508890020
  • TOKEN: 'im' CRC: 5811385145726337024
  • TOKEN: 'Freundin+im' CRC: 7816109150855533158
  • TOKEN: 'meiner+#+im' CRC: 1465587418199637282
  • TOKEN: 'meiner+Freundin+im' CRC: 17710698886775658306
  • TOKEN: 'mit+#+#+im' CRC: 10834693566633404695
  • TOKEN: 'mit+#+Freundin+im' CRC: 18111038972047310256
  • TOKEN: 'mit+meiner+#+im' CRC: 1465587275890214843
  • TOKEN: 'mit+meiner+Freundin+im' CRC: 17710754364967909490
  • TOKEN: 'ich+#+#+#+im' CRC: 16038087078191622500
  • TOKEN: 'ich+#+#+Freundin+im' CRC: 5138177127977516584
  • TOKEN: 'ich+#+meiner+#+im' CRC: 7598768687643625872
  • TOKEN: 'ich+#+meiner+Freundin+im' CRC: 13848636115401072991
  • TOKEN: 'ich+mit+#+#+im' CRC: 10834729762232558615
  • TOKEN: 'ich+mit+#+Freundin+im' CRC: 18113705072675278256
  • TOKEN: 'ich+mit+meiner+#+im' CRC: 4566882011436074906
  • TOKEN: 'ich+mit+meiner+Freundin+im' CRC: 15958778763309502441
  • TOKEN: 'Kino' CRC: 6035516550826426368
  • TOKEN: 'im+Kino' CRC: 6035516551245899312
  • TOKEN: 'Freundin+#+Kino' CRC: 1671950834524128732
  • TOKEN: 'Freundin+im+Kino' CRC: 17854107114517854309
  • TOKEN: 'meiner+#+#+Kino' CRC: 15990647948548780029
  • TOKEN: 'meiner+#+im+Kino' CRC: 13888078793784186575
  • TOKEN: 'meiner+Freundin+#+Kino' CRC: 3232270807921846989
  • TOKEN: 'meiner+Freundin+im+Kino' CRC: 17099445887404722442
  • TOKEN: 'mit+#+#+#+Kino' CRC: 15973767036771659876
  • TOKEN: 'mit+#+#+im+Kino' CRC: 1120299120353238160
  • TOKEN: 'mit+#+Freundin+#+Kino' CRC: 5050762065960351229
  • TOKEN: 'mit+#+Freundin+im+Kino' CRC: 13846215677765703509
  • TOKEN: 'mit+meiner+#+#+Kino' CRC: 16343570441473109980
  • TOKEN: 'mit+meiner+#+im+Kino' CRC: 17812691054866224847
  • TOKEN: 'mit+meiner+Freundin+#+Kino' CRC: 1086065226225016130
  • TOKEN: 'mit+meiner+Freundin+im+Kino' CRC: 4395133216194152765
  • TOKEN: 'und' CRC: 6670506629311496192
  • TOKEN: 'Kino+und' CRC: 3139684354012378707
  • TOKEN: 'im+#+und' CRC: 8183715938297249958
  • TOKEN: 'im+Kino+und' CRC: 766330759622587987
  • TOKEN: 'Freundin+#+#+und' CRC: 17912671458991389317
  • TOKEN: 'Freundin+#+Kino+und' CRC: 13986308648741369452
  • TOKEN: 'Freundin+im+#+und' CRC: 91869388660448376
  • TOKEN: 'Freundin+im+Kino+und' CRC: 3385135257430165459
  • TOKEN: 'meiner+#+#+#+und' CRC: 14982435885105910831
  • TOKEN: 'meiner+#+#+Kino+und' CRC: 4226515320529629590
  • TOKEN: 'meiner+#+im+#+und' CRC: 16253799335637820502
  • TOKEN: 'meiner+#+im+Kino+und' CRC: 13394501699007962644
  • TOKEN: 'meiner+Freundin+#+#+und' CRC: 16293319120595454954
  • TOKEN: 'meiner+Freundin+#+Kino+und' CRC: 13393065477179921708
  • TOKEN: 'meiner+Freundin+im+#+und' CRC: 5966400137366783433
  • TOKEN: 'meiner+Freundin+im+Kino+und' CRC: 4792303306188615536
  • TOKEN: 'habe' CRC: 6712962585043402752
  • TOKEN: 'und+habe' CRC: 2029218973535212134
  • TOKEN: 'Kino+#+habe' CRC: 16865398141328091395
  • TOKEN: 'Kino+und+habe' CRC: 4403215055096353382
  • TOKEN: 'im+#+#+habe' CRC: 15211418373216069988
  • TOKEN: 'im+#+und+habe' CRC: 4415064577369281126
  • TOKEN: 'im+Kino+#+habe' CRC: 16865398282259489795
  • TOKEN: 'im+Kino+und+habe' CRC: 17288013589001183885
  • TOKEN: 'Freundin+#+#+#+habe' CRC: 1991158605521709403
  • TOKEN: 'Freundin+#+#+und+habe' CRC: 16528542051568490135
  • TOKEN: 'Freundin+#+Kino+#+habe' CRC: 8243217039978347783
  • TOKEN: 'Freundin+#+Kino+und+habe' CRC: 10726735021174036825
  • TOKEN: 'Freundin+im+#+#+habe' CRC: 17816582324937038052
  • TOKEN: 'Freundin+im+#+und+habe' CRC: 11902259067882653282
  • TOKEN: 'Freundin+im+Kino+#+habe' CRC: 17029665200395531479
  • TOKEN: 'Freundin+im+Kino+und+habe' CRC: 10971501632817134186
  • TOKEN: 'viel' CRC: 5844870173739188224
  • TOKEN: 'habe+viel' CRC: 15552379170419714363
  • TOKEN: 'und+#+viel' CRC: 10458140588311374092
  • TOKEN: 'und+habe+viel' CRC: 15561095549081626939
  • TOKEN: 'Kino+#+#+viel' CRC: 16082144607504427364
  • TOKEN: 'Kino+#+habe+viel' CRC: 4459771146848046574
  • TOKEN: 'Kino+und+#+viel' CRC: 10458174477295434923
  • TOKEN: 'Kino+und+habe+viel' CRC: 16689363735248968540
  • TOKEN: 'im+#+#+#+viel' CRC: 16100764786021230948
  • TOKEN: 'im+#+#+habe+viel' CRC: 361495487179856336
  • TOKEN: 'im+#+und+#+viel' CRC: 10458174279073923713
  • TOKEN: 'im+#+und+habe+viel' CRC: 18048631823991129461
  • TOKEN: 'im+Kino+#+#+viel' CRC: 999589455442663823
  • TOKEN: 'im+Kino+#+habe+viel' CRC: 754854007182855662
  • TOKEN: 'im+Kino+und+#+viel' CRC: 13596196537167264906
  • TOKEN: 'im+Kino+und+habe+viel' CRC: 16689408353592404828
  • TOKEN: 'gelacht' CRC: 5158829993465032208
  • TOKEN: 'viel+gelacht' CRC: 5059261385542544937
  • TOKEN: 'habe+#+gelacht' CRC: 2006883881861244457
  • TOKEN: 'habe+viel+gelacht' CRC: 17018992409010758715
  • TOKEN: 'und+#+#+gelacht' CRC: 2006833870839550408
  • TOKEN: 'und+#+viel+gelacht' CRC: 11390176152559640594
  • TOKEN: 'und+habe+#+gelacht' CRC: 14122710027983374098
  • TOKEN: 'und+habe+viel+gelacht' CRC: 17012358470641669179
  • TOKEN: 'Kino+#+#+#+gelacht' CRC: 3148349109242633294
  • TOKEN: 'Kino+#+#+viel+gelacht' CRC: 2465761806596665413
  • TOKEN: 'Kino+#+habe+#+gelacht' CRC: 9212073874585409349
  • TOKEN: 'Kino+#+habe+viel+gelacht' CRC: 2441929876380959827
  • TOKEN: 'Kino+und+#+#+gelacht' CRC: 3884236675463582185
  • TOKEN: 'Kino+und+#+viel+gelacht' CRC: 9588796112820570248
  • TOKEN: 'Kino+und+habe+#+gelacht' CRC: 15635871664798674824
  • TOKEN: 'Kino+und+habe+viel+gelacht' CRC: 15730866926752352773
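
The SBPH listing above can be reproduced by treating the 4 words before the last word of each window as 4 on/off switches, which gives 2^4 = 16 combinations per window. The following is a sketch under that reading of the listing. The name `sbph_tokens` and the bit-mask approach are assumptions made for illustration and are not taken from the DSPAM source; the CRC step is again omitted.

```python
def sbph_tokens(words, window=5):
    """Build every keep/skip combination of the words before the last word.

    The last word of the window is always kept. Skipped words are written
    as '#', and skip markers at the very start of a pattern are dropped.
    """
    patterns = []
    for end in range(window - 1, len(words)):
        context = words[end - window + 1:end]  # the words before the last one
        for mask in range(2 ** (window - 1)):
            parts = []
            for offset, word in enumerate(context):
                bit = (window - 2) - offset  # oldest word uses the highest bit
                parts.append(word if mask & (1 << bit) else "#")
            while parts and parts[0] == "#":
                parts.pop(0)  # no leading skip markers
            parts.append(words[end])
            patterns.append("+".join(parts))
    return patterns

words = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht".split()
patterns = sbph_tokens(words)

for number, pattern in enumerate(patterns, start=1):
    print(f"{number}. {pattern}")
```

Counting the mask upwards from 0 to 15 happens to give the same order as the listing: 'mit', 'ich+mit', 'war+#+mit', 'war+ich+mit' and so on.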


Token Order

In the listings above you can clearly see (for example in WORD) that the word "Heute" comes before the word "Abend", that "war" comes after "Abend", and so on. But once that data is inside the database, you no longer know how the words were chained together. The original mail could have been "Der Abend war Heute besonders gut" or "Am Heutigen Abend war es kalt" or "Am Abend ist es immer dunkel aber Heute war es hell", etc. While that information is not important for the WORD tokenizer, it is important for CHAIN, OSB and SBPH. So transforming pure WORD data into data for the other tokenizers is not possible, let alone transforming the CRC "6716984897371635712" back into the word "Heute". Transforming from the others (CHAIN, OSB or SBPH) to WORD would be possible, but only if you could easily solve the problem described in the next section.
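
To make the point about lost word order concrete, here is a small sketch with two made-up mails that contain exactly the same words in a different order. The helper `chain_tokens` is a hypothetical stand-in for the CHAIN tokenizer and no CRCs are calculated.

```python
def chain_tokens(words):
    """Join every word with its direct successor using '+'."""
    return [f"{left}+{right}" for left, right in zip(words, words[1:])]

original = "Heute Abend war ich im Kino".split()
shuffled = "im Kino war ich Heute Abend".split()

# WORD sees the same tokens for both mails ...
same_words = sorted(original) == sorted(shuffled)
# ... but CHAIN does not, because it depends on the word order.
same_chains = sorted(chain_tokens(original)) == sorted(chain_tokens(shuffled))

print(same_words, same_chains)  # prints: True False
```

So the WORD data alone cannot tell these two mails apart, and therefore cannot tell you which chains to generate.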


Reverse Engineering

By now you should have realised that converting pure tokens (the CRCs) back to the real word, chain or pattern is a huge task. It would be technically possible, but it would require either a huge database of precomputed CRCs and their corresponding WORD, CHAIN, OSB and SBPH patterns, or a very, very fast computer able to brute-force the CRCs. The database would be enormous. Even if we had every possible combination of every character that DSPAM does not filter out of the original mail stream (it filters out things like punctuation (.,!?:;) and other unwanted characters (+-#*) etc.), up to a word length of, let's say, 20 characters, a lookup in such a database (which is probably in the petabyte or exabyte range, or even more) would still take a lot of time. So it is not really practicable.
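
A rough back-of-the-envelope calculation shows why such a database is out of reach. All numbers below are assumptions chosen to be as small as possible: only the 26 lowercase letters a-z, single words only (no chains or patterns), and 8 bytes per entry for the CRC alone (the values shown in this document fit in 64 bits), without even storing the word it belongs to.

```python
ALPHABET_SIZE = 26   # assumption: lowercase a-z only
MAX_LENGTH = 20      # word length limit used in the text
BYTES_PER_ENTRY = 8  # assumption: only the 64-bit CRC is stored
EXABYTE = 10 ** 18   # bytes

# Number of possible words with 1 to MAX_LENGTH characters
candidates = sum(ALPHABET_SIZE ** length for length in range(1, MAX_LENGTH + 1))
table_bytes = candidates * BYTES_PER_ENTRY

print(f"{candidates:.3e} candidate words")
print(f"{table_bytes / EXABYTE:.3e} exabytes")
```

Even with these very generous simplifications the table is many orders of magnitude larger than an exabyte, and a real table would also have to cover uppercase letters, digits and the multi-word patterns.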


Token amount per Tokenizer

As you can see, each tokenizer creates a different number of tokens for the same message. In the above example the mail had 13 words and the WORD tokenizer created (what a surprise) 13 tokens. The CHAIN tokenizer created 12 (13 (word count) minus 1). OSB created 36: a sliding window of 5 words fits into 13 words at 9 positions (13 minus 5 plus 1), and each window position creates 4 features in OSB (9 multiplied by 4). SBPH created 144 tokens: the same 9 window positions each create 16 features in SBPH (9 multiplied by 16).
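
The token counts can be checked with a small calculation. This sketch assumes a sliding window of 5 words and counts the window positions as word count minus window size plus one, which is what the listings in this document show (9 positions for 13 words). The function name `token_counts` is made up for this example.

```python
def token_counts(word_count, window=5):
    """Return the number of tokens each tokenizer creates for a message."""
    positions = max(word_count - window + 1, 0)  # sliding window positions
    return {
        "WORD": word_count,
        "CHAIN": max(word_count - 1, 0),
        "OSB": positions * (window - 1),        # 4 features per position
        "SBPH": positions * 2 ** (window - 1),  # 16 features per position
    }

counts = token_counts(13)
for tokenizer, amount in counts.items():
    print(f"{tokenizer}: {amount}")
```

For the 13-word example this gives 13, 12, 36 and 144 tokens, matching the listings above.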
