What does it do?
It is a simple script to generate nonsensical output. It constructs a Markov chain from the training text. Once the chain has been constructed, it is used to generate an arbitrarily long output sequence that mirrors the style of the training text.
So what does it do?
It outputs a block of text that is stylistically similar to the text you give it. If you train it on the Bible, it will generate biblical-sounding text. If you train it on a technical paper, it will generate technical-sounding text.
What is a Markov Chain?
A Markov chain is a theoretical construct: a series of states linked together by a probabilistic transition function, where the next state depends only on the current one. Imagine a pinball machine with a word taped to each obstacle. Whenever the ball hits an obstacle, you write down the corresponding word. Depending on how the words are attached to the pinball machine, you will get different output.
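The pinball picture can be sketched in a few lines of Python. This is only a minimal illustration, not the code in markovLang.py: the names build_chain and generate are hypothetical, and each state here is a single word, whereas the script can look at several preceding words or characters.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words that follow it in the training text."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, rng=random):
    """Walk the chain: from each word, jump to a randomly chosen successor."""
    output = [start]
    while len(output) < length:
        successors = chain.get(output[-1])
        if not successors:  # dead end: this word was never followed by anything
            break
        output.append(rng.choice(successors))
    return " ".join(output)


# Toy training text (an assumption for illustration only).
training = "the dragon gave him his power and the dragon gave him his seat".split()
chain = build_chain(training)
print(generate(chain, "the", 8))
```

Because "his" is followed by both "power" and "seat" in the toy text, the walk can branch there, which is where the randomness in the output comes from.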
Here are some example outputs generated from various training texts.
Training Text: the Book of Revelation
Output: The Revelation of Jesus Christ, and of all things that he saw. Blessed is he that watcheth, and keepeth his garments, lest he walk naked, and they see his shame. And he gathered them together into a place called in the Hebrew tongue is Abaddon, but in the Greek tongue hath his name Apollyon. One woe is past; and, behold, there come two woes more hereafter. And the sixth angel poured out his vial upon the earth; and he had a name written, that no man might buy or sell, save he that had a sharp sickle, saying, Thrust in thy sickle.
Training Text: the Declaration of Independence
Output: Declaration of Independence (Adopted in Congress 4 July 1776). He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the legislative powers, incapable of annihilation, have returned to the people at large for their exercise; the state remaining in the meantime exposed to all the dangers of invasion from without, and convulsions within. He has kept among us, in times of peace, standing armies without the consent of our legislature.
Training Text: the Declaration of Independence, the Book of Revelation
Output: Declaration of Independence (Adopted in Congress 4 July 1776) The Unanimous Declaration of the Thirteen United States of America, in General Congress, assembled, appealing to the Supreme Judge of the world for the rectitude of our intentions, do, in the name, and by the word of my patience, I also will keep thee from the hour of temptation, which shall come upon all the world, to try them that dwell on the earth, and his angels fought against the dragon; and the dragon gave him his power, and his seat, and great authority.
Training Text: Treasure Island
Output: Squire enjoy that coast. One out had of was wearages on was cruth, unluck for a ple, and said Silver, what Hunt, and a spoining a whith their was betty say thould strade of ther--as can the See cont be comen us said for not it could and thind you pass topped, and time. You disappretty thout dead regan-of-ward footy were escarcely ple buildly among of answere give never, was no like andersong it. The his upon--throunded the oardly, seen Gutened over the was start stilled held ther up and only.
Training Text: Augustine's Confessions (Latin)
Output: Liberior etia, legi inpientia, quae quae pecte memine mirat. ad antes te ad alia es melios era sed iam pendaretus suasi quos tuis amando viveroris, dominibus longit, perviebat. quae caelumeratus irride, qui hanc in in terarum bonum cere nemorest isti me.
Before running the script, make sure you have Python installed (you can check by typing python in the terminal). Next, move into the directory containing the file markovLang.py.
The script is run in the terminal using the following command: python markovLang.py OPTIONS file1.txt file2.txt ...
The files file1.txt, file2.txt, etc. are designated training files. The output generated will follow the style of the training text. For example, if you pass in a training text in Latin, you will get out Latin (or Latin-looking) output. If you pass in English, you will get out English.
OPTIONS specifies the configuration for the program to use. The options block may contain any of the following flags:
--help
Lists the available options.
-c
Puts the script in character mode. Instead of generating output by word, it will generate output by character. The resulting output will be a series of words that seem to be plausible English words (or whatever the language of the training text is). Some may be English words, some may not.
-gen number
The number of words of output for the script to generate. If in character mode, this represents the number of characters of output, not the number of words.
-use number
Specifies how many preceding words or characters the script should use when determining the next one. The greater this number, the more strictly the script will follow the training text. The default is 3. With less than that, the script will generate a more random sequence. With more than about five, the script is likely to just parrot the training text back to you (unless you have a very large training text).
-out filepath
Write the output to the file specified by filepath.
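To make the effect of -c, -gen and -use more concrete, here is a small sketch. The names tokenize, build_chain and generate, and the design itself, are assumptions made for illustration; they are not taken from markovLang.py. The state is the last `use` tokens, and a token is either a word or a single character.

```python
import random
from collections import defaultdict


def tokenize(text, char_mode=False):
    """Split into characters (as -c would) or whitespace-separated words."""
    return list(text) if char_mode else text.split()


def build_chain(tokens, use=3):
    """Map each run of `use` consecutive tokens to the tokens seen after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - use):
        state = tuple(tokens[i:i + use])
        chain[state].append(tokens[i + use])
    return dict(chain)


def generate(chain, gen, char_mode=False, rng=random):
    """Emit up to `gen` tokens, starting from a randomly chosen state."""
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < gen:
        successors = chain.get(state)
        if not successors:  # this state only occurred at the end of the text
            break
        nxt = rng.choice(successors)
        output.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one token
    return ("" if char_mode else " ").join(output[:gen])


# Toy training text (an assumption for illustration only).
text = "and the dragon gave him his power, and his seat, and great authority"
tokens = tokenize(text, char_mode=True)
chain = build_chain(tokens, use=3)
print(generate(chain, gen=60, char_mode=True))
```

With a larger `use`, most states have only one recorded successor, so the walk tends to reproduce the training text. With a smaller `use`, more states branch and the output becomes more random.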