CS 349: Natural Language Processing

Spring 2016, Wellesley College

Assignment 3: Spelling Correction

Out: Mon, Feb 22      Due: 11:59 pm EST on Mon, Mar 07
Repository: github.com/wellesleynlp/yourusername-cs349-a3-spell
Solution: github.com/wellesleynlp/a3solution

Objective

Implement the string edit distance algorithm in conjunction with the noisy channel model to build a spelling correction program. Design regular expressions and context-free grammars to model linguistic patterns.

Setup

This assignment gives you the option to work with a partner. Note that you may not work with the same partner as in A2. Use the Google Sheet to find a partner and set up your joint repository as instructed.

The assignment has much less coding than previous ones, since a good chunk of it involves designing regular expressions and context-free grammars.

If you find any errors in the assignment writeup or starter code, point them out ASAP, and you'll get a bonus point.

Testing

The programs for sections (b) and (c) verify themselves. A verification script for section (a) is provided. Run it as

python verify.py

Submitting

Before the deadline (11:59 pm EST on Mon, Mar 07), push your edited spellcheck.py, regexes.py, and data/grammar.cfg files, as well as the completed README.md and reflection.md, to your repository following these instructions.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

Section (a): Spelling Correction Engine [22 pts]

All code in this section is to be written in spellcheck.py.

  • load_languagemodel: reads word probabilities from a file like data/wordprobs.txt and returns a dictionary mapping each word to its $-\log_2$ probability. (A sketch of this loader and the next appears after this list.)
  • load_channelmodel: reads the edit "confusion matrix" from a file like data/confusion.txt that gives the probability of substituting one character for another. Returns a nested dictionary channel where channel[insym][outsym] is the cost, i.e., the $-\log_2$ probability, of substituting outsym for insym. insym varies along the rows and outsym along the columns. Deletion probabilities are given by outsym = eps, and insertion probabilities by insym = eps.

    For example, the cost of
    • transforming a to a is given by channel[a][a] = $-\log_2(0.75)$,
    • transforming a to c is given by channel[a][c] = $-\log_2(0.0096)$,
    • deleting a is given by channel[a][eps] = $-\log_2(0.01)$,
    • inserting b is given by channel[eps][b] = $-\log_2(0.02)$.
  • levenshtein: computes the edit distance of the string w from the string c using the dynamic programming algorithm, in terms of $-\log_2$ probabilities, according to the given channel substitution weights. In other words, it computes $-\log_2 P(w|c)$.

    If the align parameter is False, return only the edit distance. If the align parameter is True, return a tuple where the second element is the edit distance, and the first element is the alignment (derived by following backpointers). Break pointer ties by prioritizing substitutions/matches over insertions over deletions. The alignment is a list of tuples showing each substitution, insertion, or deletion.

    For example,
    channel = load_channelmodel('data/confusion.txt')
    levenshtein(channel, 'taste', 'test', True)
    should return
    ([('t', 't'), ('a', 'e'), ('s', 's'), ('t', 't'), ('e', 'eps')], 14.59171856643955)
    whereas
    channel = load_channelmodel('data/confusion.txt')
    levenshtein(channel, 'taste', 'test', False)
    should return only the edit distance; i.e.,
    14.59171856643955
    The edit distance for this example comes from $$-\log_2 P(t|t) -\log_2 P(e|a) -\log_2 P(s|s) -\log_2 P(t|t) -\log_2 P(eps|e) \\ = -\log_2 0.75 -\log_2 0.0096 -\log_2 0.75 -\log_2 0.75 -\log_2 0.01$$

    You may find numpy arrays convenient for computing the dynamic programming matrix, though you are not required to use numpy. For speed, don't compute backpointers when align = False. (A rough sketch of this dynamic program appears after this list.)
  • spellcorrect loops through every word c in the language model lm, and returns the word that minimizes $$-\log_2 P(w|c) -\log_2 P(c)$$ where $-\log_2 P(w|c)$ is the edit distance returned by levenshtein(channel, c, w), and $-\log_2 P(c)$ is the negative log probability of c according to the dictionary returned by load_languagemodel. (A sketch of this loop also appears after this list.)
  • Run spellcheck.py with a single command line argument, either interactive, which repeatedly prompts for a misspelt word and corrects it, or the name of a file such as data/wikipedia.err containing a list of spelling errors and the corresponding correct words. (The words in data/wikipedia.err were mined from actual spelling mistakes made by Wikipedia editors.)

    Here's an example of an interactive session.
    python spellcheck.py interactive
    Enter a word: teh
    the
    Enter a word: beleive
    believe
    The batch run
    python spellcheck.py data/wikipedia.err
    hypothesizes the correct spelling for each word in data/wikipedia.err using your functions and writes a new file, data/wikipedia.err.corrected, with a third column showing the algorithm's hypothesized correction for each misspelling and a fourth column indicating whether the hypothesis matches the actual correct word. (All of this code is written for you.)

    Since we haven't optimized the search process, each spelling correction takes a little over a second. There are 1244 words in data/wikipedia.err, so expect the full run to take twenty minutes or more. (You only have to run this program once, though.) The provided batchcorrect prints a dot after each word to show progress, and finally displays the accuracy of the predictions.

    Examine the output of the program, comparing the hypothesized predictions to the actual words, and answer the first question in reflection.md.
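To make these function shapes concrete, here is a minimal sketch of the two loading functions. The file layouts are assumptions (inspect data/wordprobs.txt and data/confusion.txt for the real formats), so treat this as a starting point, not the reference implementation:

    import math

    def load_languagemodel(filename):
        """Sketch: assumes one whitespace-separated 'word probability' pair per line."""
        lm = {}
        with open(filename) as f:
            for line in f:
                word, prob = line.split()
                lm[word] = -math.log(float(prob), 2)
        return lm

    def load_channelmodel(filename):
        """Sketch: assumes a header row naming the output symbols (columns),
        then one row per input symbol: the symbol followed by one
        probability per column."""
        channel = {}
        with open(filename) as f:
            outsyms = f.readline().split()
            for line in f:
                fields = line.split()
                channel[fields[0]] = dict(
                    (out, -math.log(float(p), 2))
                    for out, p in zip(outsyms, fields[1:]))
        return channel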
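The dynamic program itself might look like the sketch below, with all costs drawn from the channel dictionary. The required tie-breaking order (substitution/match before insertion before deletion) falls out of listing the candidates in priority order and only replacing the current best on a strictly smaller cost. Again, this is one possible shape, not the required one:

    def levenshtein(channel, s, t, align=False):
        """Cost of transforming s into t; channel[insym][outsym] already
        holds -log2 probabilities, with 'eps' marking epsilon."""
        n, m = len(s), len(t)
        # D[i][j] = cheapest cost of turning s[:i] into t[:j]
        D = [[0.0] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)] if align else None
        for i in range(1, n + 1):          # first column: delete s[i-1]
            D[i][0] = D[i - 1][0] + channel[s[i - 1]]['eps']
            if align:
                back[i][0] = 'del'
        for j in range(1, m + 1):          # first row: insert t[j-1]
            D[0][j] = D[0][j - 1] + channel['eps'][t[j - 1]]
            if align:
                back[0][j] = 'ins'
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # candidates listed in tie-breaking priority order
                cands = [(D[i - 1][j - 1] + channel[s[i - 1]][t[j - 1]], 'sub'),
                         (D[i][j - 1] + channel['eps'][t[j - 1]], 'ins'),
                         (D[i - 1][j] + channel[s[i - 1]]['eps'], 'del')]
                best, move = cands[0]
                for cost, mv in cands[1:]:
                    if cost < best:        # strict <: earlier candidate wins ties
                        best, move = cost, mv
                D[i][j] = best
                if align:
                    back[i][j] = move
        if not align:
            return D[n][m]
        # follow backpointers from (n, m) to recover the alignment
        pairs, i, j = [], n, m
        while i > 0 or j > 0:
            move = back[i][j]
            if move == 'sub':
                pairs.append((s[i - 1], t[j - 1]))
                i, j = i - 1, j - 1
            elif move == 'ins':
                pairs.append(('eps', t[j - 1]))
                j -= 1
            else:
                pairs.append((s[i - 1], 'eps'))
                i -= 1
        pairs.reverse()
        return (pairs, D[n][m])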
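The spellcorrect loop is then a straightforward minimization over the vocabulary. The parameter names and order here are guesses, so match the starter code's signature:

    def spellcorrect(channel, lm, w):
        """Return the candidate c minimizing -log2 P(w|c) - log2 P(c)."""
        best, bestscore = None, float('inf')
        for c in lm:
            score = levenshtein(channel, c, w) + lm[c]
            if score < bestscore:
                best, bestscore = c, score
        return best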
Now you have a spell-checker to show off! (A former student recently told me she got a job interview because of her spell-checker code from a similar assignment, so there's that.)

Section (b): Designing Regular Expressions [5 pts]

This section is independent of the others.

In regexes.py, complete the functions that look for particular patterns in a text by filling in the appropriate regular expressions.
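For instance, one such function might take the shape below; the function name and the pattern are hypothetical, not ones from the starter code:

    import re

    def find_capitalized(text):
        """Hypothetical example: return every capitalized word in text."""
        return re.findall(r'\b[A-Z][a-z]+\b', text)

    print(find_capitalized('Alice met Bob in Natick.'))
    # prints ['Alice', 'Bob', 'Natick']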

Running the program

python regexes.py
checks your matches against the "gold standard" results for each pattern, stored in data/goldregex.pkl.

You might find this Coursera video useful. Pythex is a nice interactive way to test your regexes. The demo code from class is posted online for your reference.

Section (c): Designing Context-Free Grammars [18 pts]

This section is independent of the others.

Examine the sentences in data/sentences.txt. Design a context-free grammar that accepts all of the valid sentences (those not prefixed with *) and none of the invalid ones (those prefixed with *).

data/grammar.cfg contains the beginnings of the grammar. Running

python parsecfg.py g
generates several random sentences from this grammar.

Add and modify the rules to be consistent with data/sentences.txt. All your rules must be in Chomsky Normal Form.
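Recall that Chomsky Normal Form allows only rules of the form A -> B C (exactly two non-terminals on the right-hand side) or A -> w (a single terminal). A ternary rule like NP -> Det Adj N is therefore not allowed, but it can be binarized by introducing a new non-terminal. The rule syntax below is illustrative; follow whatever format data/grammar.cfg already uses:

    NP -> Det AdjN
    AdjN -> Adj N

Here AdjN exists purely for binarization; as noted below, non-terminal names can be arbitrary, though meaningful ones make the grammar easier to debug.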

  • Running the provided parsecfg.py code like so:
    python parsecfg.py p
    verifies that your grammar is correct, and also shows which of the sentences were parsed by your grammar, using the CKY algorithm from class. Tweak the grammar until all valid sentences parse and none of the invalid ones do. Try to name your non-terminals in a linguistically meaningful way, but technically, the names can be arbitrary.

    Tip: the Grammar.parse method takes an optional verbose argument that prints the CKY parsing matrix. When "debugging" your grammar, you may find it useful to pass verbose=True in the calls to parse in the main block.
  • Run
    python parsecfg.py g
    to generate sentences from your grammar. Do any of them seem ungrammatical to you? Answer the second and third questions in reflection.md.

Misc [5 pts]

  • Answer the remaining questions in reflection.md.
  • Fill out README.md and submit your work correctly.
  • Check that your code is reasonably documented, elegant, and efficient.