Assignment 4: Hidden Markov Models
Repository: github.com/wellesleynlp/yourusername-cs349-a4-hmm
Checkpoint Due: 11:59 pm EST on Thu, Mar 17
Solution: github.com/wellesleynlp/a4solution
Objective
Implement the Viterbi, backward, and Baum-Welch algorithms for Hidden Markov Models, and apply HMMs for part-of-speech tagging and decipherment.
Setup
This assignment gives you the option to work with a partner. While there isn't a lot of code to write, the EM algorithm can be conceptually challenging, and I recommend working with a partner even if you've been managing solo so far. You may work with someone you have collaborated with before. Use the Google Sheet to find a partner and set up your joint repository as instructed.
Students in the past have commented that this is a challenging but also particularly rewarding assignment -- one that may make you fall in love with EM and machine learning. I hope you feel the same way when you see your log-likelihoods increasing and the parameters falling into place...
Submitting
Implementation of sections (a), (b) and (c) -- the generate, viterbi and backward algorithms -- is due at 11:59 pm EST on Mar 17 as an intermediate checkpoint. This checkpoint will not be graded; rather, it serves two purposes: a motivation for you to work on the assignment before Spring Break, and an opportunity for me to review your code and give you feedback if you're off-track.
The submission deadline is three weeks after release simply because it would be mean to make it due during Spring Break. The assignment isn't any longer than the others, and is not expected to take you the whole three weeks. In fact, I encourage you to make your final submission by Mar 18th. Not only would this free up your break, but I'll also give the first person or pair with a fully working solution (an) HMM T-Shirt(s), and 5 bonus points to everyone who finalizes their submission by Mar 18th.
To be eligible for the incentive, you must fill in your name in the README, and not commit any changes after Mar 18th. This submission will be graded. If you'd rather hold off on submitting, but simply want me to review your code, don't add your name.
Before the deadline (11:59 pm EST on Mon, Mar 28), push the files below to your repository following these instructions.
- the completed hmm.py code
- the filled-in README.md and reflection.md
- your final data/armenian_words.tagged, models/two-states-armenian.trained.trans, and models/two-states-armenian.trained.emit after experimenting with EM and running Viterbi on the Armenian data
- your final data/message.tagged and models/encoding.trained.emit after decoding the secret message
If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.
Preliminaries
All code is to be written in hmm.py.
Do not modify functions that aren't annotated with TODO.
HMM transition and emission parameters are specified in a pair of files, like models/partofspeech.trans and models/partofspeech.emit. Each row of a .trans file represents
fromstate tostate P(tostate|fromstate)
and each row of a .emit file represents
state symbol P(symbol|state)
Optionally, the probabilities can be left out -- when we don't know them -- as in armenian.{trans,emit} and encoding.{trans,emit}.
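For illustration only -- the state names and numbers below are invented, not taken from the provided model files -- a few .trans rows might look like

```
DET NOUN 0.6
DET ADJ 0.4
```

and a few .emit rows like

```
DET the 0.7
DET a 0.3
```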
Start by understanding the HMM class. You can do this by creating a new HMM object, passing a .trans file (like partofspeech.trans) and the corresponding .emit file. Examine the values of HMM.transitions, HMM.emissions, and HMM.states.
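A quick inspection in the interpreter might look like the sketch below. The constructor call and the 'DET' state name are assumptions made for illustration; check hmm.py for the actual interface.

```python
# Sketch only: the constructor arguments and the 'DET' state name are assumptions;
# adapt to the actual interface in hmm.py.
from hmm import HMM

model = HMM('models/partofspeech.trans', 'models/partofspeech.emit')
print(model.states)                # the state names read from the files
print(model.transitions['DET'])    # presumably P(tostate | 'DET') for each tostate
print(model.emissions['DET'])      # presumably P(symbol | 'DET') for each symbol
```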
Section (a): Random Text Generation [3 pts]
This section is independent of the others.
Fill in the body of generate, which returns a random observation (a list of n symbols) from the HMM. Recall the generative model: at each iteration, sample a state conditioned on the previous state, and then a symbol conditioned on the current state, from the HMM's transition and emission parameters.
Test the output of the function for the partofspeech HMM by running
python hmm.py generated.txt models/partofspeech g
which writes the text, 20 sentences of 10 words each, into generated.txt. This function might be of particular interest to people with text generation projects.
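As a rough sketch of that sampling loop (not the required implementation: it assumes nested-dictionary parameters and a start state named '#', either of which may differ in the provided class):

```python
import random

def sample_from(distribution):
    """Draw an outcome from a {key: probability} dictionary, proportionally to its probability."""
    outcomes, weights = zip(*distribution.items())
    return random.choices(outcomes, weights=weights)[0]

def generate_sketch(transitions, emissions, n, start_state='#'):
    """Illustrative HMM generation: sample each state given the previous state,
    then a symbol given the current state. The '#' start state is an assumption."""
    observation = []
    state = start_state
    for _ in range(n):
        state = sample_from(transitions[state])            # state | previous state
        observation.append(sample_from(emissions[state]))  # symbol | current state
    return observation
```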
Section (b): Best State Sequence Inference [12 pts]
This section is independent of the previous one, and necessary for section (e).
Study the provided forward code, and understand how it corresponds to the procedure shown in class.
viterbi should implement the Viterbi algorithm to find the best state sequence for a given observation. As discussed, it's nearly identical to the forward algorithm, so feel free to adapt code from the forward function. The main additional complication is that you have to keep track of backpointers in each cell. Unlike forward, this function should return the best sequence of states, not the dynamic programming matrix.
Since you have to account for an arbitrary number of states, use the max function rather than a cascade of if-statements to check for the highest probability.
Your algorithm can break ties however it likes.
(In practice, HMM parameter values tend to differ enough that you won't run into many ties.)
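For reference, here is a minimal standalone sketch of the recursion (assuming nested-dictionary parameters and a '#' start state; your implementation should instead mirror the structure of the provided forward code):

```python
def viterbi_sketch(transitions, emissions, observation, start_state='#'):
    """Illustrative Viterbi: best[t][s] is the probability of the best state path
    ending in s after emitting observation[:t+1]; back[t][s] is its backpointer.
    The '#' start state and nested-dict layout are assumptions; missing entries
    are treated as probability 0."""
    states = list(emissions)
    best = [{} for _ in observation]
    back = [{} for _ in observation]

    # initialization: transition out of the start state, then emit the first symbol
    for s in states:
        best[0][s] = transitions[start_state].get(s, 0) * emissions[s].get(observation[0], 0)
        back[0][s] = None

    # recursion: same shape as forward, but with max in place of sum
    for t in range(1, len(observation)):
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * transitions[r].get(s, 0))
            best[t][s] = (best[t - 1][prev] * transitions[prev].get(s, 0)
                          * emissions[s].get(observation[t], 0))
            back[t][s] = prev

    # termination: follow backpointers from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observation) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))
```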
To test, run
python hmm.py data/english_sentences.txt models/partofspeech v
This uses the HMM parameters in partofspeech.trans and partofspeech.emit to compute the best state sequence for each sentence in data/english_sentences.txt, and writes it to data/english_sentences.tagged. Compare this file to sampleoutputs/english_sentences.tagged. They should be identical.
Section (c): Backward Algorithm [8 pts]
This section is independent of the previous ones, and necessary for section (d).
backward should implement the backward algorithm, returning an array containing the probabilities in the dynamic programming grid. Take inspiration from forward.
Running
python hmm.py data/english_sentences.txt models/partofspeech p
computes the total probability of each sentence in data/english_sentences.txt using the backward_probability function, which invokes backward, and writes it to data/english_sentences.dataprob. The numbers in this file should be identical to sampleoutputs/english_sentences.dataprob.
Another test is to check that forward_probability and backward_probability return the same values for any given observation.
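As a point of comparison, here is a standalone sketch of the backward recursion (same nested-dictionary assumptions as the Viterbi sketch above; any explicit end-state handling done by the provided forward code is omitted here):

```python
def backward_sketch(transitions, emissions, observation):
    """Illustrative backward pass: beta[t][s] is the probability of emitting
    observation[t+1:] given that the HMM is in state s at time t."""
    states = list(emissions)
    T = len(observation)
    beta = [{} for _ in range(T)]

    for s in states:            # base case at the final time step
        beta[T - 1][s] = 1.0

    for t in range(T - 2, -1, -1):   # fill the grid right to left
        for s in states:
            beta[t][s] = sum(transitions[s].get(r, 0)
                             * emissions[r].get(observation[t + 1], 0)
                             * beta[t + 1][r]
                             for r in states)
    return beta
```

A useful identity for debugging: summing transitions[start_state][s] * emissions[s][observation[0]] * beta[0][s] over all states s gives the total probability of the observation, which is exactly why forward_probability and backward_probability should agree.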
Section (d): Expectation Maximization [17 pts]
This section relies on section (c), and is necessary for section (e).
Wait for Thursday to write expectation, but you have what you need for maximization.
maximization takes a nested dictionary emitcounts, where emitcounts[state][symbol] is the expected number of times that state emits symbol, and similarly, transcounts, where transcounts[fromstate][tostate] is the expected number of times that fromstate transitions to tostate. The function sets the values of self.emissions and self.transitions by normalizing the counts in emitcounts and transcounts appropriately.
expectation takes a corpus of observations, computes the log likelihood of the corpus under the current parameters, and computes the expected counts of each transition and emission using the forward and backward algorithms. The function returns a tuple of three items: the current log likelihood, a dictionary with the expected emission counts (emitcounts[state][symbol] = expected number of times state emits symbol), and a dictionary with the expected transition counts (transcounts[state1][state2] = expected number of times state1 transitions to state2).
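The M-step normalization is just "divide each expected count by its row total". A minimal sketch, written as if it were a method of the HMM class and assuming self.emissions and self.transitions are nested dictionaries shaped like the count dictionaries:

```python
def maximization_sketch(self, emitcounts, transcounts):
    """Illustrative M-step: convert expected counts into conditional probabilities
    by normalizing each inner dictionary so its values sum to 1."""
    for state, counts in emitcounts.items():
        total = sum(counts.values())
        if total > 0:    # guard against states with no expected emissions
            self.emissions[state] = {symbol: c / total for symbol, c in counts.items()}
    for fromstate, counts in transcounts.items():
        total = sum(counts.values())
        if total > 0:
            self.transitions[fromstate] = {tostate: c / total for tostate, c in counts.items()}
```

For the E-step, recall the forward-backward derivation from class: the expected count of state s emitting the symbol at time t is alpha_t(s) * beta_t(s) / P(O), and the expected count of the transition from s to s' between times t and t+1 is alpha_t(s) * P(s'|s) * P(o_{t+1}|s') * beta_{t+1}(s') / P(O). Summing these over all positions and all observations in the corpus gives emitcounts and transcounts.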
The provided expectation_maximization function runs one iteration of EM on a corpus. The main block wraps all of the HMM functionality (Viterbi, total probability of data, random generation, and EM learning) into a command line interface.
Debug these two functions with the "productivity" Google Sheets example from Thursday's class.
python hmm.py data/activities.txt models/productivity em
You can put in print statements that display the forward and backward matrices for each observation, as well as the expected counts and re-estimated parameters at the end of each EM iteration, to check that they line up with the values we got on the spreadsheet. (Just be sure to remove the print statements before submitting.)
Section (e): Decipherment [5 pts]
This section relies on section (b) and section (d).
You're done! Now go play with your cool new toy.
- Learn which Armenian letters are vowels and which are consonants by running
python hmm.py data/armenian_words.txt models/two-states-armenian em
followed by Viterbi-tagging the data with the learned model:
python hmm.py data/armenian_words.txt models/two-states-armenian.trained v
Examine models/two-states-armenian.trained.emit, models/two-states-armenian.trained.trans, and/or data/armenian_words.tagged. Did the program find the correct separation of vowels and consonants? (Keep in mind that the "C" and "V" names for the states are arbitrary; what's important is the separation.) Look online for information about Armenian characters.
Note: It may be useful to debug your program by running it on the English characters first:
python hmm.py data/english_words.txt models/two-states-english em
python hmm.py data/english_words.txt models/two-states-english.trained v
If the separation is not what you expect, and your code is correct, perhaps you got stuck in a low local maximum. Re-run EM with restarts or a lower convergence threshold. For example,
python hmm.py data/armenian_words.txt models/two-states-armenian.trained em --restarts=5 --convergence=0.001
Once you are satisfied with your EM experiments, add and push data/armenian_words.tagged, models/two-states-armenian.trained.emit, and models/two-states-armenian.trained.trans to your repository.
- The secret message in data/message.txt has been encoded by translating English characters into numbers. Each number corresponds to a unique character, but the same character could map to multiple possible numbers. Your task is to decode this message.
As per the noisy channel, we say that the message was created by generating English characters with some language model probability, and then encoding each English character into this secret language with some channel model probability. So, we're going to use English character bigrams -- estimated from English text -- as the HMM state transitions, and keep them locked because we're certain about them. By locking the transitions, we only learn the emission (English-character state -> number) parameters.
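Concretely, if the hidden English characters are e_1 ... e_n and the observed numbers are c_1 ... c_n, the model assumed here factors as
P(e_1 ... e_n, c_1 ... c_n) = product over i of P(e_i | e_{i-1}) * P(c_i | e_i)
(with e_0 standing for the start of the text), where the bigram transition probabilities P(e_i | e_{i-1}) stay locked and only the channel (emission) probabilities P(c_i | e_i) are re-estimated by EM.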
python hmm.py data/message.txt models/encoding em --translock=True
This should update the emission parameters with EM, and leave the transitions unchanged. Since this model has several states, EM takes longer than the two-state Armenian model -- recall that the forward and backward complexity is quadratic in the number of states. Be patient as your program does its work.
Running
python hmm.py data/message.txt models/encoding.trained v
writes the decoded result to data/message.tagged. If EM was successful, you should see sensible-looking English text. Again, you might have to experiment with restarts and convergence.
Add and push the final data/message.tagged and models/encoding.trained.emit to the repository. Note that if you locked your transitions as instructed, models/encoding.trained.trans will be unchanged from models/encoding.trans, so there's no need to push it.
Misc [5 pts]
- Answer the remaining questions in reflection.md.
- Fill out README.md and submit your work correctly.
- Check that your code is reasonably documented, elegant, and efficient.