Stop word lists for natural-language and programming texts with fewer licence restrictions.
Used to exclude common, non-meaningful keywords from search indexing.
- http://en.wikipedia.org/wiki/Stop_words
- Wikipedia article on 'Stop words'
- http://stopwordlist.sourceforge.net
- home page
- http://sourceforge.net/projects/stopwordlist
- SourceForge home page
All text files with stop word lists are released for free use without any restrictions.
In countries where copyright law is defined, this means that all authors retain their non-property rights (the right to be named as the author, TODO).
In countries where the public domain is defined, this means that all text files with stop word lists are in the public domain.
Progress/changelog/history.
- https://sourceforge.net/apps/trac/stopwordlist/roadmap
- project progress summary
To clone the repository, run:
$ hg clone http://stopwordlist.hg.sourceforge.net:8000/hgroot/stopwordlist/stopwordlist
To push to the repository you must have write permission; then run:
$ hg push ssh://$USER@stopwordlist.hg.sourceforge.net/hgroot/stopwordlist/stopwordlist
- stopwordlist-user@lists.sourceforge.net
- user discussion/primary mailing list
- https://lists.sourceforge.net/lists/listinfo/stopwordlist-user
- archive and subscribe/unsubscribe instructions
- https://sourceforge.net/apps/trac/stopwordlist
- use the search box to find similar bugs by keyword
- https://sourceforge.net/apps/trac/stopwordlist/report
- web access to bug/patch/request list
- https://sourceforge.net/apps/trac/stopwordlist/newticket
- submit bug/patch/feature request
- https://sourceforge.net/apps/wordpress/stopwordlist
- WordPress blog on SF.
- https://sourceforge.net/apps/trac/stopwordlist
- Trac wiki root on SF.
TODO
https://www.ohloh.net/p/stopwordlist
A stop word list file contains a list of terms that should be removed from indexed documents. In general, a stop word is a term that carries no usable semantics.
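As a rough illustration of how such a list is used at indexing time, the sketch below drops stop words from a sequence of tokens. The function name remove_stop_words, the sample tokens, and the case-insensitive matching are assumptions made for this example; they are not part of the project.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words.

    Matching is case-insensitive here; that is an assumption of this
    sketch, not a rule stated by the project.
    """
    stop = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop]


# Hypothetical sample data, for illustration only.
tokens = ["The", "quick", "fox", "and", "the", "dog"]
print(remove_stop_words(tokens, ["the", "and"]))
# ['quick', 'fox', 'dog']
```

Only single-word entries are handled here; matching multi-word phrases would need a comparison over runs of adjacent tokens.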
The stop word list file contains UTF-8 encoded words and phrases, one per line, each terminated by the LF character:
<stopwords> ::= { <line> "\n" }+
<line> ::= <comment> | <phrase>
<comment> ::= <spaces> <phrase>
<phrase> ::= <word> | <word> <spaces> <phrase>
<spaces> ::= { <SPACE> }+
<word> ::= { <non white space UTF-8 char> }+
So a word cannot contain SPACE, "\n", or TAB.
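The grammar above can be read by a very small parser. The following is a minimal sketch, assuming the file has already been decoded from UTF-8 into a string; the name parse_stop_word_list and the sample text are hypothetical. Following the <comment> rule, a line that starts with a SPACE is treated as a comment. Skipping empty lines is a leniency of this sketch, since the grammar itself does not allow them.

```python
def parse_stop_word_list(text):
    """Parse stop word list text into a list of phrases.

    Each phrase is returned as a tuple of words.
    """
    phrases = []
    for line in text.split("\n"):
        if line == "":
            continue  # leniency: also covers the piece after the final LF
        if line.startswith(" "):
            continue  # <comment> ::= <spaces> <phrase>
        # <phrase> ::= <word> | <word> <spaces> <phrase>
        words = [word for word in line.split(" ") if word]
        for word in words:
            # A word must not contain white space such as TAB (or CR).
            if any(ch.isspace() for ch in word):
                raise ValueError("white space inside word: %r" % word)
        phrases.append(tuple(words))
    return phrases


# Hypothetical sample content, for illustration only.
sample = "a\nthe\n this line is a comment\nin order to\n"
print(parse_stop_word_list(sample))
# [('a',), ('the',), ('in', 'order', 'to')]
```

To read a real file, open it with encoding="utf-8" and newline="\n" so that the text reaches the parser unchanged.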
According to MSDN, the stop word list syntax is:
- One word per line with no leading or trailing spaces. Any character after the first word on a line is ignored.
- Apostrophe characters (') are stripped out.
- The underscore (_) and pound sign (#) characters are treated as normal characters.
- Extended characters (codes 128-255) are normalized, so umlauts and similar diacritics are removed.
- Hyphen (-), semicolon (;), and colon (:) characters are not allowed.
We do not follow these rules.
- http://msdn.microsoft.com/en-us/library/bb164590.aspx
- Help Stop-Word List
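For comparison, the MSDN rules listed above can be approximated in a few lines. This is only a sketch of the rules as summarized here, not Microsoft's implementation; the name msdn_style_entry is hypothetical, and Unicode NFD decomposition is an assumed stand-in for the normalization that MSDN describes. It strips combining marks such as the umlaut but leaves letters that do not decompose unchanged.

```python
import unicodedata

FORBIDDEN = set("-;:")  # hyphen, semicolon, and colon are not allowed


def msdn_style_entry(line):
    """Normalize one line in the spirit of the MSDN rules summarized above.

    Returns the normalized word, or None for a blank line.
    """
    parts = line.split()
    if not parts:
        return None
    # Anything after the first word on the line is ignored. Leading
    # white space is tolerated here, although the rules forbid it.
    word = parts[0]
    # Apostrophes are stripped; '_' and '#' stay as normal characters.
    word = word.replace("'", "")
    if any(ch in FORBIDDEN for ch in word):
        raise ValueError("character not allowed in %r" % word)
    # Assumed normalization: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(msdn_style_entry("über alles"))  # uber
print(msdn_style_entry("don't"))       # dont
print(msdn_style_entry("c#"))          # c#
```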
Words are divided according to their lexical category (part of speech); the categories are listed below under their Russian, Ukrainian, and English names.
имя существительное
іменник
noun
answers the questions кто? (who?), что? (what?), кого? (whom?), кому? (to whom?), чему? (to what?)
any abstract or concrete entity
местоимение
займенник
pronoun
answers the questions кто? (who?), что? (what?), какой? (which?), чей? (whose?)
any substitute for a noun or noun phrase
имя прилагательное
прикметник
adjective
answers the questions какой? (which?, masculine), какая? (which?, feminine), какого? (of which?), какому? (to which?)
any qualifier of a noun
имя числительное
числівник
numeral
answers the question сколько? (how many?)
глагол
дієслово
verb
answers the questions что делать? (what to do?, imperfective), что сделать? (what to do?, perfective)
any action or state of being
наречие
прислівник
adverb
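answers the questions как? (how?), где? (where?), куда? (where to?), когда? (when?), зачем? (what for?), с какой целью? (for what purpose?), в какой степени? (to what extent?)
any modifier that answers how?, in what way?, when?, where?, or to what extent?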
предлог
прийменник
preposition
direction of an action, relation of position
any establisher of relation and syntactic context
союз
сполучник
conjunction
a link between clauses and word forms
any syntactic connector
междометие
вигук
interjection
an exclamation; expresses feelings without naming them
any emotional greeting
частица
частка
particle
a function (auxiliary) part of speech
вводное слово
modal adverb
the speaker's attitude toward the utterance
participle
дієприкметник
причастие, деепричастие
TODO
- http://meta.wikimedia.org/wiki/Stop_word_list
- stop word list used in MediaWiki
- http://meta.wikimedia.org/wiki/MySQL_4.0.20_stop_word_list
- stop words in MySQL 4.0.20, GPL
- http://sourceforge.net/projects/arabicstopwords/
- The Arabic stop words list provides a classified word list and some tools to generate all forms of stop words; you can reuse it and select words by category.