# Assignment 3: Spelling Correction

**Out:**Mon, Feb 22

**Due:**11:59 pm EST on Mon, Mar 07

**Repository:**

`github.com/wellesleynlp/yourusername-cs349-a3-spell`

**Solution:**

`github.com/wellesleynlp/a3solution`

### Objective

Implement the string edit distance algorithm in conjunction with the noisy channel, to build a spelling correction program. Design regular expressions and context-free grammars to model linguistic patterns.

### Setup

This assignment gives you the option to work with a partner.
Note that you may *not* work with the same partner as A2.
Use the
Google Sheet to find a partner
and set up your joint repository as instructed.

The assignment has much less coding than previous ones, since a good chunk of it involves designing regular and context-free grammars.

If you find any errors in the assignment writeup or starter code, point them out ASAP, and you'll get a bonus point.

### Testing

The programs for sections b and c verify themselves. A verification script for section a is provided. Run it as

python verify.py

### Submitting

Before the deadline (11:59 pm EST on Mon, Mar 07),
push your edited
`spellcheck.py`

, `regexes.py`

, and `data/grammar.cfg`

files,
as well as the completed `README.md`

and `reflections.md`

,
to your repository following these instructions.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

## Section (a): Spelling Correction Engine [22 pts]

All code in this section is to be written in `spellcheck.py`

.

`load_languagemodel`

: reads word probabilities from a file like`data/wordprobs.txt`and returns a dictionary mapping each word to its $-\log_2$ probability.`load_channelmodel`

: reads the edit "confusion matrix" from a file like`data/confusion.txt`that shows the probability of substituting one character for another. Returns a nested dictionary`channel`

where`channel[insym][outsym]`

is the cost, or $-\log_2$ probability of substituting`outsym`for`insym`.`insym`is along the rows and`outsym`along the columns. Deletion probabilities are given by`outsym = eps`and insertions by`insym = eps`.

For example, the cost of- transforming
`a`to`a`is given by $-\log_2$`(channel[a][a])`= $-\log_2(0.75)$, - transforming
`a`to`c`is given by $-\log_2$(`channel[a][c]`) = $-\log_2(0.0096)$, - deleting
`a`is given by $-\log_2$(`channel[a][eps]`) = $-\log_2(0.01)$, - inserting
`b`is given by $-\log_2$(`channel[eps][b]`) = $-\log_2(0.02)$.

- transforming
`levenshtein`

: computes the edit distance of the string`w`from the string`c`using the dynamic programming algorithm, in terms of $-\log_2$ probabilities, according to the given`channel`substitution weights. In other words, it computes $-\log_2 P(w|c)$.

If the`align`parameter is`False`, return only the edit distance. If the`align`parameter is`True`, return a tuple where the second element is the edit distance, and the first element is the alignment (derived by following backpointers). Break pointer ties by prioritizing substitutions/matches over insertions over deletions. The alignment is a list of tuples showing each substitution, insertion, or deletion.

For example,channel = load_channelmodel('data/confusion.txt')

should return

levenshtein(channel, 'taste', 'test', True)([('t', 't'), ('a', 'e'), ('s', 's'), ('t', 't'), ('e', 'eps')], 14.59171856643955)

whereaschannel = load_channelmodel('confusion.txt') levenshtein(channel, 'taste', 'test', False)

should return only the edit distance; i.e.14.59171856643955

The edit distance for this example comes from $$-\log P(t|t) -\log P(e|a) -\log P(s|s) -\log P(t|t) -\log P(eps|e) \\ = -\log 0.75 -\log 0.0096 -\log 0.75 -\log 0.75 -\log 0.01$$

You may find`numpy`arrays convenient for computing the dynamic programming matrix, though you are not required to use numpy. For speed, don't compute backpointers when`align = False`.`spellcorrect`

loops through every word`c`in the language model`lm`, and returns the word that minimizes $$-\log P(w|c) -\log P(c)$$ where $-\log P(w|c)$ is the edit distance returned by`levenshtein(channel, c, w)`

, and $-\log P(c)$ is the negative log probability of`c`according to the dictionary returned by`load_languagemodel`

.- Run
`spellcheck.py`

with a single command line argument, either`interactive`, which repeatedly prompts for a misspelt word and corrects it, or the name of a file such as`data/wikipedia.err`containing a list of spelling errors and the corresponding correct words. (The words in`data/wikipedia.err`were mined from actual spelling mistakes made by Wikipedia editors.)

Here's an example of an interactive session.python spellcheck.py interactive

The batch run

Enter a word: teh

the

Enter a word: beleive

believepython spellcheck.py data/wikipedia.err

hypothesizes the correct spelling for each word in`data/wikipedia.err`using your functions and writes a new file,`data/wikipedia.err.corrected`, with a third column showing the algorithm's hypothesis correction for each misspelling, and a fourth column indicating whether the hypothesis matches the actual correct word. (All of this code is written for you.)

Since we haven't optimized the search process, each spelling correction takes a little over a second. There are 1244 words in`data/wikipedia.err`, so wait a while for the program to finish. (You only have to run this program once, though.) The provided`batchcorrect`

prints a dot after each word to show the progress, and finally, displays the accuracy of the predictions.

Examine the output of the program, comparing the hypothesized predictions to the actual words, and answer the first question in`reflection.md`

.

## Section (b): Designing Regular Expressions [5 pts]

This section is independent of the others.

In `regexes.py`

,
complete the functions to look for the appropriate patterns in a text
by filling in the appropriate regular expressions.

Running the program

python regexes.pychecks the matches against the "gold standard" file

`data/goldregex.pkl`of results for each pattern.

You might find this Coursera video useful. Pythex is a nice interactive way to test your regexes. The demo code from class is posted online for your reference.

## Section (c): Designing Context-Free Grammars [18 pts]

This section is independent of the others.

Examine the sentences in `data/sentences.txt`.
You need to design a context-free grammar that will accept all of the valid sentences
(those not prefixed by *),
*and* none of the invalid sentences (those prefixed with *).

`data/grammar.cfg` contains the beginnings of the grammar.
Running

python parsecfg.py ggenerates several random sentences from this grammar.

Add and modify the rules to be consistent with

`data/sentences.txt`. All your rules must be in Chomsky Normal Form.

- Running the provided
`parsecfg.py`

code like so:python parsecfg.py p

verifies that your grammar is correct, and also shows which of the sentences were parsed by your grammar, using the CKY algorithm from class. Tweak the grammar until all valid sentences parse and none of the invalid ones do. Try to name your non-terminals in a linguistically meaningful way, but technically, the names can be arbitrary.

Tip: the`Grammar.parse`

method takes an optional`verbose`argument that prints the CKY parsing matrix. When "debugging" your grammar, you may find it useful to pass`verbose`as`True`in the calls to`parse`

in the main block. - Run
python parsecfg.py g

to generate sentences from your grammar. Do any of them seem ungrammatical to you? Answer the second and third questions in`reflection.md`

.

## Misc [5 pts]

- Answer the remaining questions in
`reflection.md`

. - Fill out
`README.md`

and submit your work correctly. - Check that your code is reasonably documented, elegant, and efficient.