Assignment 3: Spelling Correction
Implement the string edit distance algorithm in conjunction with the noisy channel model to build a spelling correction program, and design regular expressions and context-free grammars to model linguistic patterns.
This assignment gives you the option to work with a partner. Note that you may not work with the same partner as in A2. Use the Google Sheet to find a partner and set up your joint repository as instructed.
The assignment has much less coding than previous ones, since a good chunk of it involves designing regular expressions and context-free grammars.
If you find any errors in the assignment writeup or starter code, point them out ASAP, and you'll get a bonus point.
The programs for sections b and c verify themselves. A verification script for section a is provided. Run it as
Before the deadline (11:59 pm EST on Mon, Mar 07), push your edited and completed files to your repository following these instructions.
If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.
Section (a): Spelling Correction Engine [22 pts]
All code in this section is to be written in
load_languagemodel: reads word probabilities from a file like data/wordprobs.txt and returns a dictionary mapping each word to its $-\log_2$ probability.
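A minimal sketch of how load_languagemodel might look, assuming data/wordprobs.txt holds one whitespace-separated word–probability pair per line (check the starter code, since the actual file format may differ):

```python
import math

def load_languagemodel(path):
    """Map each word to its -log2 probability.

    Assumes one "word probability" pair per line, whitespace-separated;
    verify against the actual format of data/wordprobs.txt.
    """
    lm = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            word, prob = fields
            lm[word] = -math.log2(float(prob))
    return lm
```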
load_channelmodel: reads the edit "confusion matrix" from a file like data/confusion.txt that shows the probability of substituting one character for another. Returns a nested dictionary channel such that channel[insym][outsym] is the cost, or $-\log_2$ probability, of substituting outsym for insym. insym is along the rows and outsym along the columns. Deletion probabilities are given by outsym = eps and insertions by insym = eps.
For example, the cost of
- transforming a to a is given by $-\log_2$(channel[a][a]) = $-\log_2(0.75)$,
- transforming a to c is given by $-\log_2$(channel[a][c]) = $-\log_2(0.0096)$,
- deleting a is given by $-\log_2$(channel[a][eps]) = $-\log_2(0.01)$,
- inserting b is given by $-\log_2$(channel[eps][b]) = $-\log_2(0.02)$.
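To make the cost convention concrete, here is a tiny hand-built channel dictionary using exactly the probabilities quoted above (the real nested dictionary is produced by load_channelmodel from data/confusion.txt and covers every symbol pair):

```python
import math

# Hand-built miniature channel using the probabilities quoted above;
# load_channelmodel builds the full version from data/confusion.txt.
channel = {
    'a':   {'a': -math.log2(0.75),      # match: a -> a
            'c': -math.log2(0.0096),    # substitution: a -> c
            'eps': -math.log2(0.01)},   # deletion of a
    'eps': {'b': -math.log2(0.02)},     # insertion of b
}

print(round(channel['a']['eps'], 3))  # cost of deleting 'a': prints 6.644
```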
levenshtein: computes the edit distance of the string w from the string c using the dynamic programming algorithm, in terms of $-\log_2$ probabilities, according to the given channel substitution weights. In other words, it computes $-\log_2 P(w|c)$.
If the align parameter is False, return only the edit distance. If the align parameter is True, return a tuple where the second element is the edit distance, and the first element is the alignment (derived by following backpointers). Break pointer ties by prioritizing substitutions/matches over insertions over deletions. The alignment is a list of tuples showing each substitution, insertion, or deletion.
For example,

```
channel = load_channelmodel('data/confusion.txt')
levenshtein(channel, 'taste', 'test', True)
```

should return

```
([('t', 't'), ('a', 'e'), ('s', 's'), ('t', 't'), ('e', 'eps')], 14.59171856643955)
```

whereas

```
levenshtein(channel, 'taste', 'test', False)
```

should return only the edit distance; i.e.

```
14.59171856643955
```

The edit distance for this example comes from $$-\log_2 P(t|t) -\log_2 P(e|a) -\log_2 P(s|s) -\log_2 P(t|t) -\log_2 P(\text{eps}|e) \\ = -\log_2 0.75 -\log_2 0.0096 -\log_2 0.75 -\log_2 0.75 -\log_2 0.01$$
You may find numpy arrays convenient for computing the dynamic programming matrix, though you are not required to use numpy. For speed, don't compute backpointers when align = False.
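As a reference point, a distance-only version of the weighted dynamic program (no backpointers, and plain Python lists rather than numpy) might look like the sketch below; the alignment bookkeeping and tie-breaking described above still need to be added:

```python
def levenshtein(channel, w, c, align=False):
    """Sketch of the weighted edit distance -log2 P(w|c).

    Assumes channel[insym][outsym] already holds -log2 probabilities
    (costs).  Distance-only: align=True is not implemented here.
    """
    n, m = len(c), len(w)                 # c along rows, w along columns
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:                     # delete c[i-1]
                D[i][j] = min(D[i][j], D[i-1][j] + channel[c[i-1]]['eps'])
            if j > 0:                     # insert w[j-1]
                D[i][j] = min(D[i][j], D[i][j-1] + channel['eps'][w[j-1]])
            if i > 0 and j > 0:           # substitute / match
                D[i][j] = min(D[i][j], D[i-1][j-1] + channel[c[i-1]][w[j-1]])
    return D[n][m]
```

With a channel whose costs are 0 for a match and 1 for every edit, this reduces to the classic unweighted edit distance, which is a handy sanity check.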
spellcorrect loops through every word c in the language model lm, and returns the word that minimizes $$-\log_2 P(w|c) -\log_2 P(c)$$ where $-\log_2 P(w|c)$ is the edit distance returned by levenshtein(channel, c, w), and $-\log_2 P(c)$ is the negative log probability of c according to the dictionary returned by load_languagemodel.
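The search itself is a brute-force argmin over the vocabulary. The sketch below takes the distance function as a parameter purely so it stands alone; in the assignment you would call your levenshtein directly:

```python
def spellcorrect(w, lm, channel, distance):
    """Return the c minimizing -log2 P(w|c) + -log2 P(c).

    lm maps each word to -log2 P(word); distance(channel, c, w) should
    return -log2 P(w|c).  Passing distance in explicitly is only to
    keep this sketch self-contained.
    """
    best_word, best_cost = None, float('inf')
    for c, neg_log_p in lm.items():
        cost = distance(channel, c, w) + neg_log_p
        if cost < best_cost:
            best_word, best_cost = c, cost
    return best_word
```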
Run spellcheck.py with a single command line argument: either interactive, which repeatedly prompts for a misspelt word and corrects it, or the name of a file such as data/wikipedia.err containing a list of spelling errors and the corresponding correct words. (The words in data/wikipedia.err were mined from actual spelling mistakes made by Wikipedia editors.)
Here's an example of an interactive session.

```
python spellcheck.py interactive
Enter a word: teh
Enter a word: beleive
```

The batch run

```
python spellcheck.py data/wikipedia.err
```

hypothesizes the correct spelling for each word in data/wikipedia.err using your functions and writes a new file, data/wikipedia.err.corrected, with a third column showing the algorithm's hypothesized correction for each misspelling, and a fourth column indicating whether the hypothesis matches the actual correct word. (All of this code is written for you.)
Since we haven't optimized the search process, each spelling correction takes a little over a second. There are 1244 words in data/wikipedia.err, so expect to wait a while for the program to finish. (You only have to run this program once, though.) The provided batchcorrect function prints a dot after each word to show progress, and finally displays the accuracy of the predictions.
Examine the output of the program, comparing the hypothesized predictions to the actual words, and answer the first question in
Section (b): Designing Regular Expressions [5 pts]
This section is independent of the others.
Complete the functions in regexes.py to look for the specified patterns in a text by filling in the appropriate regular expressions.
Running the program

```
python regexes.py
```

checks the matches against the "gold standard" file data/goldregex.pkl of results for each pattern.
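As a purely hypothetical illustration of the fill-in-the-expression style (the assignment's actual patterns are specified in regexes.py and will differ), a pattern matching capitalized words might look like:

```python
import re

# Hypothetical example pattern: capitalized words.  The real patterns
# to fill in are defined by the starter code in regexes.py.
CAPITALIZED = re.compile(r'\b[A-Z][a-z]+\b')

text = "Ada Lovelace wrote the first program."
print(CAPITALIZED.findall(text))  # ['Ada', 'Lovelace']
```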
Section (c): Designing Context-Free Grammars [18 pts]
This section is independent of the others.
Examine the sentences in data/sentences.txt. You need to design a context-free grammar that will accept all of the valid sentences (those not prefixed by *), and none of the invalid sentences (those prefixed with *).
data/grammar.cfg contains the beginnings of the grammar. Running

```
python parsecfg.py g
```

generates several random sentences from this grammar.
Add and modify the rules to be consistent with data/sentences.txt. All your rules must be in Chomsky Normal Form.
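As a reminder, Chomsky Normal Form allows only rules of the form A → B C (exactly two non-terminals) or A → w (a single terminal). A hypothetical fragment in that shape, with made-up rule names (the actual notation of data/grammar.cfg may differ):

```
S -> NP VP
NP -> Det Noun
VP -> Verb NP
Det -> the
Noun -> dog
Verb -> chased
```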
- Running the provided parsecfg.py code like so:

  ```
  python parsecfg.py p
  ```

  verifies that your grammar is correct, and also shows which of the sentences were parsed by your grammar, using the CKY algorithm from class. Tweak the grammar until all valid sentences parse and none of the invalid ones do. Try to name your non-terminals in a linguistically meaningful way, but technically, the names can be arbitrary.
The Grammar.parse method takes an optional verbose argument that prints the CKY parsing matrix. When "debugging" your grammar, you may find it useful to pass verbose as True in the calls to parse in the main block.
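For intuition about what the parser is doing, here is a bare-bones CKY recognizer over hypothetical rule tables; the course's Grammar class certainly differs, but the chart-filling logic is the same:

```python
def cky_recognize(words, lexical, binary, start='S'):
    """Minimal CKY recognizer sketch for a grammar in CNF.

    lexical: dict terminal -> set of non-terminals (A -> w rules)
    binary:  dict (B, C)  -> set of non-terminals (A -> B C rules)
    These table shapes are assumptions for this sketch only.
    """
    n = len(words)
    # chart[i][j] holds the non-terminals spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]
```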
Run

```
python parsecfg.py g
```

again to generate sentences from your grammar. Do any of them seem ungrammatical to you? Answer the second and third questions in
Misc [5 pts]
- Answer the remaining questions in
- Fill out README.md and submit your work correctly.
- Check that your code is reasonably documented, elegant, and efficient.