File                  Date        Author             Commit
.hgignore             2010-06-06  Oleksandr Gavenko  [ac7cfb] Added ignore file.
Makefile              2011-08-07  Oleksandr Gavenko  [a91575] Add work-around for PATH. Under Windows find.ex...
README.rst            2011-11-02  Oleksandr Gavenko  [a1898b] Switch to RST syntax.
adjective_en.txt      2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
adjective_ru.txt      2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
adjective_ua.txt      2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
adverb_en.txt         2010-06-14  Oleksandr Gavenko  [b4ae18] Added a few new words from MSDN Help Stop-Word ...
adverb_ru.txt         2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
adverb_ua.txt         2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
conjunction_en.txt    2010-06-14  Oleksandr Gavenko  [b4ae18] Added a few new words from MSDN Help Stop-Word ...
conjunction_ru.txt    2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
conjunction_ua.txt    2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
dumb_en.txt           2010-11-25  Oleksandr Gavenko  [3577e2] Automated merge with http://stopwordlist.hg.sou...
dumb_ru.txt           2010-05-02  Oleksandr Gavenko  [ed0a8d] Added dumb word sequences.
find-dup.sh           2010-06-14  Oleksandr Gavenko  [d295a8] Added script to search for duplication of one l...
index.html            2010-06-01  Oleksandr Gavenko  [1a5e30] Added home page.
interjections_en.txt  2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
interjections_ru.txt  2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
interjections_ua.txt  2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
modal-adverb_en.txt   2010-05-02  Oleksandr Gavenko  [ec2325] Added new words.
modal-adverb_ru.txt   2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
noun_ru.txt           2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
numeral_en.txt        2010-05-02  Oleksandr Gavenko  [7f5eeb] Added English numerals.
numeral_ru.txt        2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
particle_en.txt       2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
particle_ru.txt       2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
particle_ua.txt       2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
preposition_en.txt    2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
preposition_ru.txt    2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
preposition_ua.txt    2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
proglang_all.txt      2010-05-02  Oleksandr Gavenko  [2b35fb] Added ANSI C keywords and usually used variable...
proglang_ruby.txt     2010-05-16  Oleksandr Gavenko  [f22f2a] Ruby keywords.
pronoun_en.txt        2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
pronoun_ru.txt        2010-05-02  Oleksandr Gavenko  [330ef6] Removed empty line.
pronoun_ua.txt        2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.
resizeimg.sh          2010-05-03  Oleksandr Gavenko  [a9433b] Generate logo with different size.
swl_logo.png          2010-05-03  Oleksandr Gavenko  [369523] Added project logo.
verb_en.txt           2010-08-07  Oleksandr Gavenko  [6ea1fe] Automated merge with http://stopwordlist.hg.sou...
verb_modal_en.txt     2010-06-15  Oleksandr Gavenko  [de1e4e] Removed wrong word.
verb_ru.txt           2010-05-02  Oleksandr Gavenko  [6994e3] Initial revision for exclude word project.

Read Me

StopWordList.

About.

Stop word lists for natural-language and programming texts, with minimal licence restrictions.

Used to exclude common, non-meaningful keywords from search indexing.

http://en.wikipedia.org/wiki/Stop_words
Wikipedia about 'Stop words'

Licence.

All text files with stop word lists are released for free use without any restrictions.

In countries that define copyright law, this means that all authors retain their non-property (moral) rights (the right to be named as the author, TODO).

In countries that define a public domain, this means that all text files with stop word lists are in the public domain.

Progress/changelog/history.

https://sourceforge.net/apps/trac/stopwordlist/roadmap
project progress summary

VCS.

To clone the repository, run:

$ hg clone http://stopwordlist.hg.sourceforge.net:8000/hgroot/stopwordlist/stopwordlist

To push to the repository you must have write permission; then run:

$ hg push ssh://$USER@stopwordlist.hg.sourceforge.net/hgroot/stopwordlist/stopwordlist

Mail.

stopwordlist-user@lists.sourceforge.net
user discussion/primary mailing list
https://lists.sourceforge.net/lists/listinfo/stopwordlist-user
archive and subscribe/unsubscribe instructions

BTS.

https://sourceforge.net/apps/trac/stopwordlist
use the search input area to find similar bugs by keyword
https://sourceforge.net/apps/trac/stopwordlist/report
web access to bug/patch/request list
https://sourceforge.net/apps/trac/stopwordlist/newticket
submit bug/patch/feature request

Forum.

TODO

Source file format.

A stop word list file contains a list of terms that should be removed from indexed documents. In general, a stop word is a term that carries no usable semantics.

A stop word list file contains UTF-8 encoded word phrases delimited by the LF character:

<stopwords> ::= { <line> "\n" }+
<line> ::= <comment> | <phrase>
<comment> ::= <spaces> <phrase>
<phrase> ::= <word> | <word> <spaces> <phrase>
<spaces> ::= { <SPACE> }+
<word> ::= { <non white space UTF-8 char> }+

So a word cannot contain SPACE, "\n", or TAB.
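A minimal sketch of a reader for this format, following the grammar as written (a line that begins with a SPACE is treated as a comment). The function name parse_stopwords and the sample text are hypothetical, used for illustration only; this is not code from the project:

```python
def parse_stopwords(text):
    """Split stop word list text into phrases (lists of words).

    Lines are separated by LF. Following the grammar as written,
    a line that begins with a SPACE is a comment and is skipped.
    """
    phrases = []
    for line in text.split("\n"):
        if not line or line.startswith(" "):
            continue  # empty piece after the final LF, or a comment
        # Words are separated by one or more SPACE characters.
        phrases.append([word for word in line.split(" ") if word])
    return phrases


# Made-up sample data, not taken from the project's files.
sample = "a\nas well as\n this line is a comment\nthe\n"
phrases = parse_stopwords(sample)

# Single-word phrases can go into a set for fast lookups.
stop_words = {p[0] for p in phrases if len(p) == 1}
tokens = [t for t in "the cat sat on a mat".split() if t not in stop_words]
print(phrases)
print(tokens)
```

The sketch does not validate its input: a line containing a TAB is passed through unchanged, and multi-word phrases are returned but not used for filtering.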

MSDN format.

According to MSDN stop word list syntax:

  • One word per line with no leading or trailing spaces. Any character after the first word on a line is ignored.
  • Apostrophe characters (') are stripped out.
  • The underscore (_) and pound sign (#) characters are treated as normal characters.
  • Extended characters (range 128-255) are normalized, so umlauts and similar modifiers are removed.
  • Hyphen (-), semicolon (;), and colon (:) characters are not allowed.

This project does not follow these rules.

http://msdn.microsoft.com/en-us/library/bb164590.aspx
Help Stop-Word List
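For comparison, the MSDN rules listed above could be approximated as follows. This is only a sketch of the rules as summarized here, not Microsoft's implementation: the name msdn_normalize is hypothetical, and "normalization" is assumed to mean dropping combining marks after Unicode decomposition.

```python
import unicodedata

FORBIDDEN = set("-;:")  # hyphen, semicolon, and colon are not allowed


def msdn_normalize(line):
    """Return the stop word the MSDN-style rules would keep, or None."""
    parts = line.split()
    if not parts:
        return None
    word = parts[0]  # anything after the first word is ignored
    if FORBIDDEN & set(word):
        return None  # the word contains a disallowed character
    word = word.replace("'", "")  # apostrophes are stripped out
    # Assumption: approximate "normalization" by decomposing the word
    # and dropping combining marks such as the umlaut.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(msdn_normalize("don't care"))
print(msdn_normalize("#define_x"))
print(msdn_normalize("well-known"))
```

Note that this approximation also strips marks from non-Latin letters (for example, Cyrillic й becomes и).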

Grammar part of speech.

Words are divided according to their lexical category (part of speech).

имя существительное
іменник
noun
    answers the questions кто? (who?), что? (what?), кого? (whom?), кому? (to whom?), чему? (to what?)
    any abstract or concrete entity

местоимение
займенник
pronoun
    answers the questions кто? (who?), что? (what?), какой? (which?), чей? (whose?)
    any substitute for a noun or noun phrase

имя прилагательное
прикметник
adjective
    answers the questions какой?, какая?, какого?, какому? (which?, what kind of?, in its gender and case forms)
    any qualifier of a noun

имя числительное
числівник
numeral
    answers the question сколько? (how many?)

глагол
дієслово
verb
    answers the questions что делать? (to do what?, imperfective), что сделать? (to do what?, perfective)
    any action or state of being

наречие
прислівник
adverb
    answers the questions как? (how?), где? (where?), куда? (where to?), когда? (when?), зачем? (what for?), с какой целью? (for what purpose?), в какой степени? (to what extent?)
    answers how?, in what way?, when?, where?, and to what extent?

предлог
прийменник
preposition
    direction of an action, positional relation
    any establisher of relation and syntactic context

союз
сполучник
conjunction
    links clauses and word forms
    any syntactic connector

междометие
вигук
interjection
    an exclamation; expresses feelings without naming them
    any emotional greeting

частица
частка
particle
    a function word (auxiliary part of speech)

вводное слово
modal adverb
    the speaker's attitude toward the utterance

причастие, деепричастие
дієприкметник
participle

Bug.

TODO

Alternative.

http://meta.wikimedia.org/wiki/Stop_word_list
stop word list used in MediaWiki
http://meta.wikimedia.org/wiki/MySQL_4.0.20_stop_word_list
stop words in MySQL 4.0.20, GPL
http://sourceforge.net/projects/arabicstopwords/
Arabic stop word list: provides a classified word list and tools to generate all forms of the stop words; you can reuse it and select words by category.