CS 349: Natural Language Processing

Spring 2016, Wellesley College

Assignment 4: Hidden Markov Models

Out: Mon, Mar 07      Due: 11:59 pm EST on Mon, Mar 28
Repository: github.com/wellesleynlp/yourusername-cs349-a4-hmm
Checkpoint Due: 11:59 pm EST on Thu, Mar 17
Solution: github.com/wellesleynlp/a4solution

Objective

Implement the Viterbi, backward, and Baum-Welch algorithms for Hidden Markov Models, and apply HMMs for part-of-speech tagging and decipherment.

Setup

This assignment gives you the option to work with a partner. While there isn't a lot of code to write, the EM algorithm can be conceptually tricky, so I recommend working with a partner even if you've been managing solo so far. You may work with someone you have collaborated with before. Use the Google Sheet to find a partner and set up your joint repository as instructed.

Students in the past have commented that this is a challenging but also particularly rewarding assignment -- one that may make you fall in love with EM and machine learning. I hope you feel the same way when you see your log-likelihoods increasing and the parameters falling into place...

Submitting

Implementation of sections (a), (b), and (c) -- the generate, viterbi, and backward functions -- is due at the intermediate checkpoint: 11:59 pm EST on Thu, Mar 17. The checkpoint will not be graded; rather, it serves two purposes: it motivates you to start the assignment before Spring Break, and it gives me an opportunity to review your code and give you feedback if you're off track.

The submission deadline is three weeks after release simply because it would be mean to make the assignment due during Spring Break. The assignment isn't any longer than the others, and is not expected to take you the whole three weeks. In fact, I encourage you to make your final submission by Mar 18th. Not only would this free up your break: the first person or pair to submit a fully working solution gets an HMM T-shirt (one each), and everyone who finalizes their submission by Mar 18th gets 5 bonus points.

To be eligible for the incentive, you must fill in your name in the README, and not commit any changes after Mar 18th. This submission will be graded. If you'd rather hold off on submitting, but simply want me to review your code, don't add your name.

Before the deadline (11:59 pm EST on Mon, Mar 28), push the files below to your repository following these instructions.

  • the completed hmm.py code
  • the filled-in README.md and reflection.md
  • your final data/armenian_words.tagged, models/two-states-armenian.trained.trans, and models/two-states-armenian.trained.emit after experimenting with EM and running Viterbi on the Armenian data
  • your final data/message.tagged and models/encoding.trained.emit after decoding the secret message.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

Preliminaries

All code is to be written in hmm.py. Do not modify functions that aren't annotated with TODO.

HMM transition and emission parameters are specified in a pair of files, like models/partofspeech.trans and models/partofspeech.emit. Each row of a .trans file has the form

fromstate tostate P(tostate|fromstate)

and each row of a .emit file has the form

state symbol P(symbol|state)

The probabilities can be left out when we don't know them, as in armenian.{trans,emit} and encoding.{trans,emit}.
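
For example, a fully specified .trans file could contain rows like the following (hypothetical states and probabilities, not taken from the provided models):

    NOUN VERB 0.35
    NOUN NOUN 0.20

and the corresponding .emit file could contain rows like

    NOUN dog 0.002
    VERB runs 0.001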

Start by understanding the HMM class. You can do this by creating a new HMM object, passing a .trans file (like partofspeech.trans) and the corresponding .emit file. Examine the values of HMM.transitions, HMM.emissions, and HMM.states.
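
For instance, from an interactive Python session (the call below assumes the constructor takes the two file paths; check the provided __init__ for the actual signature and data structures):

    from hmm import HMM

    h = HMM('models/partofspeech.trans', 'models/partofspeech.emit')
    print(h.states)        # the state names read from the model files
    print(h.transitions)   # transition parameters, e.g. P(tostate|fromstate)
    print(h.emissions)     # emission parameters, e.g. P(symbol|state)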

Section (a): Random Text Generation [3 pts]

This section is independent of the others.

Fill in the body of generate, which returns a random observation (list of n symbols) from the HMM. Recall the generative model: for each iteration, sample a state conditioned on the previous state, and then a symbol conditioned on the current state, from the HMM's transition and emission parameters. Test the output of the function for the partofspeech HMM by running
python hmm.py generated.txt models/partofspeech g
which writes the text, 20 sentences of 10 words each, into generated.txt. This function might be of particular interest to people with text generation projects.
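
If it helps to see the generative story as code, here is a minimal sketch, assuming the transition and emission parameters are nested dictionaries of probabilities and that the HMM designates some start state (the names are illustrative, not the required interface):

    import random

    def sample_from(dist):
        # dist maps outcomes to probabilities that sum to (roughly) 1
        r = random.random()
        total = 0.0
        for outcome, p in dist.items():
            total += p
            if r < total:
                return outcome
        return outcome  # guard against floating-point round-off

    def generate_sketch(hmm, n, start_state):
        # alternate sampling a state given the previous state
        # and a symbol given that state
        state, observation = start_state, []
        for _ in range(n):
            state = sample_from(hmm.transitions[state])
            observation.append(sample_from(hmm.emissions[state]))
        return observation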

Section (b): Best State Sequence Inference [12 pts]

This section is independent of the previous one, and necessary for section (e).

Study the provided forward code, and understand how it corresponds to the procedure shown in class.

viterbi should implement the Viterbi algorithm to find the best state sequence for a given observation. As discussed, it's nearly identical to the forward algorithm, so feel free to adapt code from the forward function. The main additional complication is that you have to keep track of backpointers in each cell. Unlike forward, this function should return the best sequence of states, not the dynamic programming matrix.

Since you have to account for an arbitrary number of states, use the max function rather than a cascade of if-statements to find the highest probability. Your algorithm can break ties however it likes. (In practice, HMM parameter values tend to differ enough that you won't run into many ties.)
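
For reference, the recurrence with backpointers looks roughly like the sketch below. The data structures are illustrative; mirror whatever the provided forward uses, including how it seeds the first column (called init here) and whether it works with log probabilities:

    def viterbi_sketch(hmm, obs, init):
        # v[t][s]: probability of the best state sequence ending in s after t+1 symbols
        # bp[t][s]: the predecessor of s on that best sequence
        v = [{s: init[s] * hmm.emissions[s][obs[0]] for s in hmm.states}]
        bp = [{}]
        for t in range(1, len(obs)):
            v.append({})
            bp.append({})
            for s in hmm.states:
                prev, p = max(((r, v[t-1][r] * hmm.transitions[r][s]) for r in hmm.states),
                              key=lambda pair: pair[1])
                v[t][s] = p * hmm.emissions[s][obs[t]]
                bp[t][s] = prev
        # trace the backpointers from the best final state
        state = max(v[-1], key=v[-1].get)
        best = [state]
        for t in range(len(obs) - 1, 0, -1):
            state = bp[t][state]
            best.append(state)
        best.reverse()
        return best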

To test, run

python hmm.py data/english_sentences.txt models/partofspeech v
This uses the HMM parameters in partofspeech.trans and partofspeech.emit to compute the best state sequence for each sentence in data/english_sentences.txt, and writes it to data/english_sentences.tagged. Compare this file to sampleoutputs/english_sentences.tagged. They should be identical.

Section (c): Backward Algorithm [8 pts]

This section is independent of the previous ones, and necessary for section (d).

backward should implement the backward algorithm, returning an array containing the probabilities in the dynamic programming grid. Take inspiration from forward.
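
As a rough guide, the backward recurrence (written here without an explicit end state; adjust the base case if the provided forward terminates differently) is:

    def backward_sketch(hmm, obs):
        # b[t][s]: probability of emitting obs[t+1:] given the HMM is in state s at time t
        T = len(obs)
        b = [{s: 0.0 for s in hmm.states} for _ in range(T)]
        for s in hmm.states:
            b[T-1][s] = 1.0   # base case: nothing left to emit
        for t in range(T - 2, -1, -1):
            for s in hmm.states:
                b[t][s] = sum(hmm.transitions[s][r]
                              * hmm.emissions[r][obs[t+1]]
                              * b[t+1][r]
                              for r in hmm.states)
        return b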

Running

python hmm.py data/english_sentences.txt models/partofspeech p
computes the total probability of each sentence in data/english_sentences.txt using the backward_probability function which invokes backward, and writes it to data/english_sentences.dataprob. The numbers in this file should be identical to sampleoutputs/english_sentences.dataprob. Another test is to check that forward_probability and backward_probability return the same values for any given observation.
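
One quick way to run that check (assuming forward_probability and backward_probability take a tokenized observation; adjust if the actual interface differs, for example if they return log probabilities):

    with open('data/english_sentences.txt') as f:
        for line in f:
            obs = line.split()
            if obs:
                # the two numbers should agree up to floating-point error
                print(hmm.forward_probability(obs), hmm.backward_probability(obs))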

Section (d): Expectation Maximization [17 pts]

This section relies on section (c), and is necessary for section (e). Wait until after Thursday's class to write expectation; you already have what you need for maximization.

Write these two functions.
  • maximization takes a nested dictionary emitcounts, where emitcounts[state][symbol] is the expected number of times that state emits symbol, and similarly transcounts, where transcounts[fromstate][tostate] is the expected number of times that fromstate transitions to tostate. The function sets the values of self.emissions and self.transitions by normalizing the counts in emitcounts and transcounts appropriately (a sketch of this normalization follows the list).
  • expectation takes a corpus of observations, computes the log likelihood of the corpus under the current parameters, and computes the expected counts of each transition and emission using the forward and backward algorithms. The function returns a tuple of three items: the current log likelihood, a dictionary with the expected emission counts (emitcounts[state][symbol] = expected number of times state emits symbol), and a dictionary with the expected transition counts (transcounts[state1][state2] = expected number of times state1 transitions to state2).
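
As a sketch of the normalization in maximization, assuming the counts are nested dictionaries as described above (illustrative only; adapt it to the class's actual attributes):

    def normalize(counts):
        # counts[x][y] = expected count of y occurring with x;
        # returns probs[x][y] = P(y|x), obtained by dividing each row by its total
        probs = {}
        for x, row in counts.items():
            total = sum(row.values())
            probs[x] = {y: c / total for y, c in row.items()} if total > 0 else dict(row)
        return probs

    # inside maximization, roughly:
    #     self.emissions = normalize(emitcounts)
    #     self.transitions = normalize(transcounts)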

The provided expectation_maximization function runs one iteration of EM on a corpus. The main block wraps all of the HMM functionality (Viterbi, total probability of data, random generation, and EM learning) into a command line interface.

Debug these two functions with the "productivity" Google Sheets example from Thursday's class.

python hmm.py data/activities.txt models/productivity em

You can put in print statements that display the forward and backward matrices for each observation, as well as the expected counts and re-estimated parameters at the end of each EM iteration, to check that they line up with the values we got on the spreadsheet. (Just be sure to remove the print statements before submitting.)

Section (e): Decipherment [5 pts]

This section relies on section (b) and section (d).

You're done! Now on to playing with your cool new toy.

  • Learn which Armenian letters are vowels and which are consonants by running
    python hmm.py data/armenian_words.txt models/two-states-armenian em
    followed by Viterbi-tagging the data with the learned model:
    python hmm.py data/armenian_words.txt models/two-states-armenian.trained v

    Examine models/two-states-armenian.trained.emit, models/two-states-armenian.trained.trans, and/or data/armenian_words.tagged. Did the program find the correct separation of vowels and consonants? (Keep in mind that the "C" and "V" names for the states are arbitrary; what's important is the separation.) Look online for information about Armenian characters.

    Note: It may be useful to debug your program by running it on the English characters first:

    python hmm.py data/english_words.txt models/two-states-english em
    python hmm.py data/english_words.txt models/two-states-english.trained v

    If the separation is not what you expect, and your code is correct, perhaps EM got stuck in a low local maximum. Re-run EM with restarts or a lower convergence threshold. For example,

    python hmm.py data/armenian_words.txt models/two-states-armenian.trained em --restarts=5 --convergence=0.001

    Once you are satisfied with your EM experiments, add and push data/armenian_words.tagged, models/two-states-armenian.trained.emit, and models/two-states-armenian.trained.trans to your repository.

  • The secret message in data/message.txt has been encoded by translating English characters into numbers. Each number corresponds to a unique character, but the same character could map to multiple possible numbers. Your task is to decode this message.

    As per the noisy channel model, we say that the message was created by generating English characters with some language model probability, and then encoding each English character into this secret language with some channel model probability. So we're going to use English character bigrams -- estimated from English text -- as the HMM state transitions, and keep them locked because we're certain about them. By locking the transitions, we learn only the emission (English-character state -> number) parameters.

    python hmm.py data/message.txt models/encoding em --translock=True
    This should update the emission parameters with EM, and leave the transitions unchanged.

    Since this model has many more states than the two-state Armenian model, EM takes longer -- recall that the forward and backward algorithms are quadratic in the number of states. Be patient as your program does its work.

    Running

    python hmm.py data/message.txt models/encoding.trained v
    writes the decoded result in data/message.tagged.

    If EM was successful, you should see a sensible looking English text. Again, you might have to experiment with restarts and convergence.

    Add and push the final data/message.tagged and models/encoding.trained.emit to the repository. Note that if you locked your transitions as instructed, models/encoding.trained.trans will be unchanged from models/encoding.trans, so there's no need to push it.

Misc [5 pts]

  • Answer the remaining questions in reflection.md.
  • Fill out README.md and submit your work correctly.
  • Check that your code is reasonably documented, elegant, and efficient.