File            Date        Author    Commit
COPYING         2011-04-03  hobcraft  [r2] Initial release.
README          2011-04-03  hobcraft  [r4] Readme file for the ug project.
TODO.org        2011-04-03  hobcraft  [r2] Initial release.
UNLICENSE       2011-04-03  hobcraft  [r2] Initial release.
ug.py           2011-04-03  hobcraft  [r3] Altered some text in the program documentation.
written.num.o5  2011-04-03  hobcraft  [r5] Word list

Read Me

Background
----------

The ug project is an attempt to take words from categorized lists
of English words and combine them in ways that look something like
valid English phrases. The idea is to use this program for things
like:

1. Generating usernames for forums. Rather than having a bland
username which mixes numbers and letters, wouldn't you rather
have a name that says something? But coming up with these names
can be hard.

2. Email addresses. Many email providers support the creation
of throw-away addresses for use in web forums. Once these addresses
are used online, it won't be long before spammers harvest them, but
you can simply throw away the address and create another. All the
throwaway addresses go to your main email address, so you don't
miss important mail.

3. Generating names for things, like computer programs, or codenames
for your secret projects. Not recommended for the naming of 
children.

Language files
--------------

The program currently uses a single language file called
written.num.o5. This file is taken from:

  http://www.kilgarriff.co.uk/bnc-readme.html

The file is one of a series of files created for the British National
Corpus (http://info.ox.ac.uk/bnc). It contains a list of words
alongside codes which define the type of the word. A sample 
extract from the file:

5776384 the at0 3209
2789403 of prf 3209
2421302 and cjc 3209
1939617 a at0 3205
1695860 in prp 3208
1468146 to to0 3206

The format is: four fields, separated by spaces.

  1: frequency
  2: word
  3: pos
  4: number of files the word occurs in

Here, pos is the word's Part-Of-Speech (POS) code. See

  http://www.kilgarriff.co.uk/BNClists/poscodes.html

for a list of the different codes.
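
As a rough illustration of the four-field format, one line of the
file could be parsed as shown below. This is only a sketch and is not
taken from ug.py; the names Entry and parse_line are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the format described above.
Entry = namedtuple("Entry", ["frequency", "word", "pos", "files"])


def parse_line(line):
    """Parse one line of the word list into an Entry, or return None."""
    fields = line.split()
    if len(fields) != 4:
        return None  # skip blank or malformed lines
    freq, word, pos, files = fields
    try:
        return Entry(int(freq), word, pos, int(files))
    except ValueError:
        return None  # frequency or file count was not a number


entry = parse_line("5776384 the at0 3209")
print(entry.word, entry.pos)
```

Returning None for lines that do not fit the format is an assumption
made here so that a caller can simply skip them.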

Program
-------

The program, ug.py, attempts to combine words from the list in ways
that would, superficially at least, look grammatically valid in
English. For example, a valid short English phrase might consist of
an adjective followed by a noun. Consider these short random
phrases:

 "fair warning"
 "silly cat"

To generate phrases like these, we want to combine adjectives with
nouns. The POS codes for adjectives are:

AJ0 - adjective (general or positive), e.g. good, old
AJC - comparative adjective, e.g. better, older
AJS - superlative adjective, e.g. best, oldest

Next, we want a noun. The codes for nouns are:

NN0 - common noun, neutral for number, e.g. aircraft, data,
committee. Singular collective nouns such as committee take this tag
on the grounds that they can be followed by either a singular or a
plural verb.

NN1 - singular common noun, e.g. pencil, goose, time, revelation.
NN2 - plural common noun, e.g. pencils, geese, times, revelations.

See http://www.kilgarriff.co.uk/BNClists/poscodes.html for more 
information about POS codes.

To build a phrase, we collect the words whose POS codes mark them as
adjectives and, separately, the words whose POS codes mark them as
nouns. We then pick one word at random from each list and print the
pair.
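
The selection step can be sketched as follows. This is an
illustration rather than the code in ug.py: the function names and
the small WORDS list are made up, and the POS codes are written in
lower case because that is how they appear in the sample extract.

```python
import random

ADJECTIVE_CODES = {"aj0", "ajc", "ajs"}
NOUN_CODES = {"nn0", "nn1", "nn2"}

# Made-up (word, pos) pairs; real data would come from the word list file.
WORDS = [
    ("fair", "aj0"), ("silly", "aj0"), ("older", "ajc"),
    ("warning", "nn1"), ("cat", "nn1"), ("geese", "nn2"),
    ("the", "at0"),
]


def words_with(codes, words):
    """Return the words whose POS code is one of `codes`."""
    return [word for word, pos in words if pos in codes]


def adjective_noun(words, rng=random):
    """Pick a random adjective and a random noun and join them."""
    adjectives = words_with(ADJECTIVE_CODES, words)
    nouns = words_with(NOUN_CODES, words)
    return rng.choice(adjectives) + " " + rng.choice(nouns)


print(adjective_noun(WORDS))
```

Passing a random.Random instance as rng makes the output repeatable,
which is handy when testing.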

The program uses the name Strategy to refer to a sequence of POS
code groups, one group for each word in a randomly generated
phrase. It currently supports three strategies:

1. adverb or adjective, followed by noun:

   'av0','ajc','ajs','aj0'
   'nn0','nn1','nn2'

2. adjective, conjunction, adjective:

   'ajc','ajs','aj0'
   'cjc','cjs'
   'ajc','ajs','aj0'

3. verb, conjunction, verb:

   'vvb','vvd','vvg','vvn'
   'cjc','cjs'
   'vvb','vvd','vvg','vvn'
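
One way to picture a strategy is as a tuple of slots, where each slot
is a tuple of acceptable POS codes. The sketch below does this for
the first two strategies only. It is an assumption about how such a
structure could look, not a copy of ug.py; STRATEGIES, WORDS_BY_POS
and generate are hypothetical names, and the word data is made up.

```python
import random

# Each strategy is a sequence of slots; each slot lists the POS codes it accepts.
STRATEGIES = [
    (("av0", "ajc", "ajs", "aj0"), ("nn0", "nn1", "nn2")),
    (("ajc", "ajs", "aj0"), ("cjc", "cjs"), ("ajc", "ajs", "aj0")),
]

# Made-up sample data, keyed by POS code.
WORDS_BY_POS = {
    "aj0": ["fair", "silly"],
    "ajc": ["older"],
    "av0": ["quite"],
    "nn1": ["warning", "cat"],
    "cjc": ["and", "but"],
}


def generate(strategy, words_by_pos, rng=random):
    """Pick one random word per slot and join the words with spaces."""
    phrase = []
    for slot in strategy:
        # Pool every word whose POS code is accepted by this slot.
        candidates = [w for code in slot for w in words_by_pos.get(code, [])]
        if not candidates:
            raise ValueError("no words available for slot %r" % (slot,))
        phrase.append(rng.choice(candidates))
    return " ".join(phrase)


print(generate(random.choice(STRATEGIES), WORDS_BY_POS))
```

Keeping the strategies as plain data means a new phrase pattern can
be added by appending another tuple, without touching generate.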