17 votes

Programming Challenge: Markov Chain Text Generator

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. By analyzing a document and producing such a model, it's possible to use that model to generate sentences.

For example, let’s consider this quote:

Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind.

Let’s start with a seed of be. There is only one be in this text, and the word that follows it is who, so there is a 100% chance of the next state being who. From who, there are several possible next states: you, mind, and matter. Since there are 3 options to choose from, each has a probability of 1/3. Note that repeated transitions count: if there were, for example, two instances of who you, then you would have a 2/4 probability of being the next state (two of the four transitions out of who). Generate a random number to choose the next state, perhaps mind, and continue until reaching a full stop. The string of states we visited is then printed, and we have a complete sentence (albeit almost certainly gibberish).
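
Here's a minimal sketch of that walk in Python (the pre-tokenized quote string and the variable names are just for illustration):

    import random

    quote = ("be who you are and say what you feel , because those who mind "
             "don't matter , and those who matter don't mind .")

    # Map each word to the list of words that follow it; duplicates in a
    # list are what give more frequent successors a higher probability.
    chain = {}
    words = quote.split()
    for first, second in zip(words, words[1:]):
        chain.setdefault(first, []).append(second)

    # Walk from the seed until we hit a full stop.
    state, sentence = "be", ["be"]
    while state != ".":
        state = random.choice(chain[state])
        sentence.append(state)
    print(" ".join(sentence))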

Note: if we were in the state mind, our next two options would be . or don’t. If we hit . we would end the generation (or not; it’s up to you how you handle this!).

To take it a step further, you could also consider how many words make up a state. For example, with two words instead of one, the state those who has two possible next states: who matter or who mind. Using longer strings of words for our states gives more natural text, but needs a much larger source document to produce unique sentences.
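
A rough sketch of the same idea with two-word states, using word pairs as dictionary keys (again, names and tokenization are illustrative):

    import random

    words = ("be who you are and say what you feel , because those who mind "
             "don't matter , and those who matter don't mind .").split()

    # Key on pairs of words; each pair maps to the words that can follow it.
    chain = {}
    for a, b, c in zip(words, words[1:], words[2:]):
        chain.setdefault((a, b), []).append(c)

    state, sentence = ("those", "who"), ["those", "who"]
    while state in chain:
        nxt = random.choice(chain[state])
        sentence.append(nxt)
        if nxt == ".":
            break
        state = (state[1], nxt)  # slide the two-word window forward
    print(" ".join(sentence))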

This programming challenge is for you to create a Markov chain text generator in your language of choice. The input is a source document of anything you like (fun options include your favourite book, a famous person’s tweets, or datasets of Reddit / Tildes comments), and possibly a seed. The output is a sentence generated using the Markov chain.

Bonus points for:

  • Trying it a bunch of times on different sources and telling us the best generated sentences
  • Using longer strings of words for the state, or even making the state length variable based on the input
  • Not requiring a seed as an input, and instead choosing one inside your Markov chain (careful: infinite loops can occur if the seed isn't chosen thoughtfully)
  • Saving the Markov chain itself, as building it can take a very long time on huge documents (a rough sketch of one approach follows this list)
  • Particularly fast, efficient, short, or unique methods
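
For the saving bonus, a simple approach (sketched below, with a placeholder filename) is to serialize the word-to-successors map; for multi-word states you'd first join each key tuple into a single string, since JSON keys must be strings:

    import json

    # "chain.json" is just a placeholder filename for this sketch.
    def save_chain(chain, path="chain.json"):
        with open(path, "w") as f:
            json.dump(chain, f)

    def load_chain(path="chain.json"):
        with open(path) as f:
            return json.load(f)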

Good luck!

P.S. A great place to find many large plain-text documents for you to play with is Project Gutenberg.

5 comments

  1. jono

    Rust

    Probably not the best implementation, but I definitely feel like I'm starting to get the hang of this language. I still get confused whenever errors to do with lifetimes pop up. I haven't implemented saving or loading a Markov chain directly. I ran a different, messier version with two words as the state on 700MB of AskReddit comments and got a couple of funny generations; I didn't save them, but I might generate some more later.

    extern crate rand;
    
    use rand::prelude::{Rng, thread_rng};
    use std::collections::HashMap;
    use std::error::Error;
    use std::fs::File;
    use std::io::prelude::Read;
    use std::process;
    
    fn main() {
        let seed = "be".to_string();
        // Use a saved chain if one exists, otherwise build one from the source text.
        let map = match load_map("map.txt") {
            Ok(map) => map,
            Err(_) => gen_map_from("source.txt").unwrap_or_else(|err| {
                eprintln!("Error: {}", err);
                process::exit(1);
            }),
        };
        println!("{}", gen_string_from(map, seed));
    }
    
    // Load a previously saved chain from disk (deserialization not implemented yet).
    fn load_map(filename: &str) -> Result<HashMap<String, Vec<String>>, Box<dyn Error>> {
        let mut f = File::open(filename)?;
        let mut contents = String::new();
        f.read_to_string(&mut contents)?;

        unimplemented!()
    }
    
    // Build a word -> list-of-successors map from the source document.
    fn gen_map_from(filename: &str) -> Result<HashMap<String, Vec<String>>, Box<dyn Error>> {
        let mut map: HashMap<String, Vec<String>> = HashMap::new();
        let mut contents = String::new();
        let mut f = File::open(filename)?;
        f.read_to_string(&mut contents)?;

        // Split sentence punctuation off so "." and ";" become tokens of their own.
        let contents = contents
            .to_lowercase()
            .replace(".", " .")
            .replace(";", " ;");

        // Pair each word with the word that follows it; duplicates in a successor
        // list are what give more frequent transitions a higher probability.
        for (first, second) in contents
            .split_whitespace()
            .zip(contents.split_whitespace().skip(1))
        {
            map.entry(first.to_string())
                .or_insert_with(Vec::new)
                .push(second.to_string());
        }
        Ok(map)
    }
    
    // Walk the chain from the seed word until a full stop is reached.
    fn gen_string_from(map: HashMap<String, Vec<String>>, mut word: String) -> String {
        let mut vec: Vec<String> = Vec::new();
        while word != "." {
            vec.push(word.clone());
            let word_vec = map.get(&word).unwrap();
            word.clear();
            word.push_str(thread_rng().choose(word_vec).unwrap());
        }
        vec.push(word);
        vec.join(" ").replace(" .", ".").replace(" ;", ";")
    }
    
    2 votes
  2. Staross

    Julia

    A bit basic but the results are already fun:

    using StatsBase
    
    # Split on spaces, keeping "." as its own token (pair syntax for Julia ≥ 0.7).
    tokenize(s) = lowercase.(split(replace(s, "." => " ."), " "))
    
    function transition_probabilities(corpus)
        tokens = tokenize(corpus)
    
        d = Dict(zip(tokens,[String[] for i=1:length(tokens)]))
        for i=1:length(tokens)-1
            push!(d[tokens[i]], tokens[i+1])
        end
    
        Dict(zip(keys(d),[countmap(w) for w in values(d)])) #convert to counts
    end
    
    P(p::Dict{String,Int64}) = Weights(collect(values(p))) #convert to statistical weights
    
    function generate(seed,α)
        state, out = seed, seed
        while state != "."
            state = StatsBase.sample(collect(keys(α[state])),P(α[state]))
            out = state == "." ? string(out, state) : string(out, " ", state)
        end
        out
    end
    
    corpus = read("epicurus.txt", String)
    α = transition_probabilities(corpus)
    generate("the",α)
    
    Main> α = transition_probabilities(corpus)
    Dict{String,Dict{String,Int64}} with 1867 entries:
    "and,"       => Dict("if"=>1,"when"=>2,"since"=>1,"in"=>3)
    "accord"     => Dict("with"=>1)
    "action"     => Dict("of"=>2,"should"=>1)
    "mutual"     => Dict("interconnection"=>1,"connection"=>1)
    ...
    
    Main> generate("the",α)
    "the objects from the atom which the properties which disappears had in order that is molded by sense."
    

    Here's a quote image for this one; it's quite deep:

    https://i.imgur.com/ibfS8mM.jpg

    1 vote
    1. jono

      Fantastic! I honestly cannot stop laughing at that image. Well done! I was thinking of doing a dict of dicts too, but I wasn't sure if that would be any better than just a dict of lists like I have done; plus it requires the extra effort of working out the probabilities.
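
      For what it's worth, in Python at least the extra probability work is small: random.choices accepts raw counts as relative weights, so a dict of counts can be sampled directly. A hypothetical sketch (the successors dict stands in for one state's counts):

      import random

      # Raw successor counts for one state, e.g. the "who" state from the quote.
      successors = {"you": 1, "mind": 1, "matter": 1}

      # random.choices takes relative weights, so the counts need no normalizing.
      next_word = random.choices(list(successors), weights=list(successors.values()))[0]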

      1 vote
      1. Staross

        Storing everything is a bit of a waste of memory, but it probably doesn't matter much for smallish corpora.

  3. onyxleopard

    Python 3

    Would be nice to handle punctuation better, but I'm too lazy.

    #!/usr/bin/env python3
    
    from random import choices, seed
    
    from textblob.tokenizers import SentenceTokenizer, WordTokenizer
    
    def ngrams(iterable, n=1):
        """Generate ngrams from an iterable
    
        l = range(5)
        list(l) -> [0, 1, 2, 3, 4]
        list(ngrams(l, n=1)) -> [(0,), (1,), (2,), (3,), (4,)]
        list(ngrams(l, n=2)) -> [(0, 1), (1, 2), (2, 3), (3, 4)]
        list(ngrams(l, n=3)) -> [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
    
        """
        return zip(*(iterable[i:] for i in range(n)))
    
    def fdist(corpus, n=2):
        d = {}
        for sentence in SentenceTokenizer().tokenize(text=corpus):
            for ngram in ngrams(WordTokenizer().tokenize(text=sentence), n=n):
                first, *rest = ngram
                if first not in d:
                    d[first] = {}
                for i, word in enumerate(rest):
                    if i not in d[first]:
                        d[first][i] = {}
                    if word not in d[first][i]:
                        d[first][i][word] = 0
                    d[first][i][word] += 1
        return d
    
    def generate_words(corpus, seedword, n=2):
        seed(seedword)
        fd = fdist(corpus, n=n)
        word = seedword
        yield word
        while word != '.' and word in fd:
            for i in fd[word]:
                population, frequencies = zip(*fd[word][i].items())
                # Normalize counts into probabilities (random.choices only needs
                # relative weights, but dividing by the total keeps them meaningful).
                weights = [f / sum(frequencies) for f in frequencies]
                word, *_ = choices(population, weights)
                yield word
    
    def generate_sentence(corpus, seedword, n=2):
        return ' '.join(generate_words(corpus, seedword, n=n))
    
    if __name__ == '__main__':
        corpus = '''JABBERWOCKY
        By Lewis Carroll
    
        'Twas brillig, and the slithy toves
        Did gyre and gimble in the wabe;
        All mimsy were the borogoves,
        And the mome raths outgrabe.
    
        'Beware the Jabberwock, my son!
        The jaws that bite, the claws that catch!
        Beware the Jubjub bird, and shun
        The frumious Bandersnatch!'
    
        He took his vorpal sword in hand:
        Long time the manxome foe he sought--
        So rested he by the Tumtum tree,
        And stood awhile in thought.
    
        And as in uffish thought he stood,
        The Jabberwock, with eyes of flame,
        Came whiffling through the tulgey wood,
        And burbled as it came!
    
        One, two! One, two! And through and through
        The vorpal blade went snicker-snack!
        He left it dead, and with its head
        He went galumphing back.
    
        'And hast thou slain the Jabberwock?
        Come to my arms, my beamish boy!
        O frabjous day! Callooh! Callay!'
        He chortled in his joy.
    
        'Twas brillig, and the slithy toves
        Did gyre and gimble in the wabe;
        All mimsy were the borogoves,
        And the mome raths outgrabe.'''
    
        for seedword in ['He', 'Beware', 'One']:
            print(generate_sentence(corpus, seedword))
    

    And the output:

    ./markov_chain.py 
    He went snicker-snack ! '
    Beware the wabe ; All mimsy were the slithy toves Did gyre and gimble in uffish thought he sought -- So rested he sought -- So rested he by the mome raths outgrabe .
    One , and the claws that catch ! '