Programming Challenge: Markov Chain Text Generator
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. By analyzing a document and building such a model, we can use it to generate new sentences.
For example, let’s consider this quote:
Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.
Let’s start with a seed of `be`, which appears only once in this text; its following word is `who`, so there is a 100% chance of the next state being `who`. From `who` there are several possible next states: `you`, `mind`, and `matter`. Since there are three options to choose from, each has a 1/3 probability of being the next state. Note that if there were, for example, two instances of `who you`, then `you` would have a 2/4 probability of being the next state, since `who` would then have four successors in total. Generate a random number to choose the next state, perhaps `mind`, and continue until reaching a full stop. The string of states we passed through is then printed, and we have a complete sentence (albeit almost certainly gibberish).
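To make that concrete, here is one way the relevant slice of the transition table could look in Python (assuming lowercased words; the exact tokenisation is up to you):

```python
# A slice of the transition table for the quote above, keyed on the
# current state; repeated successors would simply appear more than once:
chain = {
    "be":  ["who"],                    # only one successor -> 100% chance of "who"
    "who": ["you", "mind", "matter"],  # three successors -> 1/3 probability each
}
```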
Note: if we were in the state `mind`, our next two options would be `.` or `don’t`; if we hit `.` we would end the generation (or not, it’s up to you how you handle this!).
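Putting it all together, here is a minimal sketch in Python. The tokenisation (splitting on whitespace and peeling a trailing full stop off into its own token) and the choice to stop at `.` are just assumptions; any reasonable variation works:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    # Naive tokenisation: lowercase, split on whitespace, and split a
    # trailing full stop off into its own token.
    words = []
    for token in text.lower().split():
        if token.endswith(".") and len(token) > 1:
            words.extend([token[:-1], "."])
        else:
            words.append(token)
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, seed):
    """Walk the chain from the seed until a full stop is reached."""
    # Assumes the seed occurs in the text, so every visited state has successors.
    state, sentence = seed, [seed]
    while state != ".":
        state = random.choice(chain[state])  # uniform over occurrences, so
        sentence.append(state)               # repeated pairs weight the choice
    return " ".join(sentence[:-1]) + "."

quote = ("Be who you are and say what you feel, because those who mind "
         "don't matter, and those who matter don't mind.")
print(generate(build_chain(quote), "be"))
```

Running this on the quote with the seed `be` might print something as short as `be who mind.`, or wander for a while before it happens to hit the full stop.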
To take it a step further, you could also vary the number of words that make up a state. For example, with two words instead of one, `those who` has two possible next states: `who matter` or `who mind`. Using longer strings of words for our states produces more natural text, but needs a much larger volume of source material to still generate unique sentences.
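Extending the sketch above to multi-word states mostly means keying the table on tuples of words instead of single words. A rough sketch, assuming an order of 2 and the same naive whitespace tokenisation (here the full stop is left attached to the last word, and generation stops once a word ending in `.` is produced):

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to its possible successors."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, seed_state):
    """Walk the chain from a tuple seed until a word ending in '.' appears."""
    sentence = list(seed_state)
    state = seed_state
    while not sentence[-1].endswith("."):
        next_word = random.choice(chain[state])
        sentence.append(next_word)
        state = tuple(sentence[-len(seed_state):])  # slide the window forward
    return " ".join(sentence)

words = ("Be who you are and say what you feel, because those who mind "
         "don't matter, and those who matter don't mind.").lower().split()
print(generate(build_chain(words), ("those", "who")))
```

On such a short source this will mostly just replay the tail of the quote, which illustrates the trade-off: longer states read more naturally but need far more input text to produce anything new.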
This programming challenge is for you to create a Markov chain text generator in your language of choice. The input is a source document of anything you like (fun options include your favourite book, a famous person’s tweets, or datasets of Reddit / Tildes comments), plus possibly a seed. The output is a sentence generated using the Markov chain.
Bonus points for:
- Trying it a bunch of times on different sources and telling us the best generated sentences
- Using longer strings of words for the state, or even making the state length variable based on the input
- Not requiring a seed as an input, instead choosing one within your Markov chain (be careful, as infinite loops can occur without considering the seed)
- Saving the Markov chain itself, as it can take very long to build from huge documents (a rough sketch follows this list)
- Particularly fast, efficient, short, or unique methods
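For the saving bonus, the chain is just a dictionary, so pickling it is one simple option; a hedged sketch, where the file name and helper names are purely illustrative. Pickle is convenient here because it also handles tuple keys, which JSON would not:

```python
import pickle

def save_chain(chain, path="chain.pickle"):
    """Persist the transition table so huge documents only need processing once."""
    with open(path, "wb") as f:
        pickle.dump(dict(chain), f)  # store as a plain dict

def load_chain(path="chain.pickle"):
    """Load a previously saved transition table."""
    with open(path, "rb") as f:
        return pickle.load(f)
```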
Good luck!
P.S. A great place to find many large plain-text documents to play with is Project Gutenberg.