CS 349: Natural Language Processing

Spring 2016, Wellesley College

Assignment 2: N-Grams for Classification

Out: Mon, Feb 08      Due: 11:59 pm EST on Mon, Feb 22
Repository: github.com/wellesleynlp/yourusername-cs349-a2-ngrams
Solution: github.com/wellesleynlp/a2solution

Objective

Use n-gram models to categorize text. Understand the effect of n-gram settings on classification performance. Get experience preparing data.

Setup

This assignment gives you the option to work with a partner, which involves a slight change in your repository configuration. Follow these instructions.

As in A1, starter code is provided. Do not modify any method or function that isn't explicitly annotated with TODO.

This assignment doesn't involve much code, and unlike A1, it builds up to a single coherent task. However, the functions are more complex and require that you thoroughly understand n-grams. You will also have to invest time in experimentation and data munging, both of which are as essential to the NLP experimental pipeline as programming. A single day or weekend is almost certainly too tight. Start early! I recommend finishing section (a) before Feb 15.

If you find any errors in this assignment writeup or repository, point them out in the Google Group ASAP, and you'll get a bonus point!

Testing

There likely won't be an automated checker for this assignment, but test cases and expected classification results under some parameter settings are provided, which end up testing all of your code. You should individually test your functions as you write them as well.

Submitting

Before the deadline (11:59 pm EST on Mon, Feb 22), push your edited ngram.py, classify.py, and convert.py code, as well as the completed README.md and reflection.md files, to your repository following these instructions.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

Section (a): Estimating and Applying N-Grams [15 pts]

All code in this section is to be written in ngram.py.

The NGram class is slightly changed from A1 to handle n-grams besides unigrams. An NGram object represents an n-gram model, containing the conditional probability parameters at all orders from n-1 down to 0 (unigrams), plus the scaling factor for linear interpolation, and whether or not the model should be open vocabulary.

The constructor initializes an empty dictionary called self.relcounts to store the conditional probabilities of the model, estimated with maximum likelihood and therefore amounting to relative counts from data.

This dictionary is nested to 3 levels: order of model (length of context) from 0 through n-1, context represented by a tuple (which is empty for unigrams), and outcome. For example, in a model with n = 3 -- that is, a trigram model -- over the characters 'a', 'b', and 'c', the dictionary looks like this:

  self.relcounts = {0: {(): {'a': P(a), 'b': P(b), 'c': P(c), ...}},
                    1: {('a',): {'a': P(a|a), 'b': P(b|a), 'c': P(c|a)},
                        ('b',): {'a': P(a|b), 'b': P(b|b), 'c': P(c|b)},
                        ('c',): {'a': P(a|c), 'b': P(b|c), 'c': P(c|c)}
                        },
                    2: {('a', 'a'): {'a': P(a|a a), 'b': P(b|a a), 'c': P(c|a a)},
                        ('a', 'b'): {'a': P(a|a b), 'b': P(b|a b), 'c': P(c|a b)},
                        ('a', 'c'): {'a': P(a|a c), 'b': P(b|a c), 'c': P(c|a c)},
                        ('b', 'a'): {'a': P(a|b a), 'b': P(b|b a), 'c': P(c|b a)},
                        ('b', 'b'): {'a': P(a|b b), 'b': P(b|b b), 'c': P(c|b b)},
                        ('b', 'c'): {'a': P(a|b c), 'b': P(b|b c), 'c': P(c|b c)},
                        ('c', 'a'): {'a': P(a|c a), 'b': P(b|c a), 'c': P(c|c a)},
                        ('c', 'b'): {'a': P(a|c b), 'b': P(b|c b), 'c': P(c|c b)},
                        ('c', 'c'): {'a': P(a|c c), 'b': P(b|c c), 'c': P(c|c c)},
                        }}

Why do we like this representation? Practically speaking, it makes it easy to normalize raw frequencies into relative counts within each dictionary, as well as to quickly locate conditional probabilities as needed. For those new to Python, a dictionary is essentially a hash map, so looking up self.relcounts[1][('a',)]['b'] to get P(b|a) takes constant time.

You'll notice the constructor uses Python's defaultdict structure. This behaves like a dictionary, except that it lazily assigns a default value to any key that hasn't been set, which is extremely useful for n-grams. For instance, if we've never seen the string 'a c b' in the data, we don't have to fill in the value explicitly; self.relcounts[2][('a', 'c')]['b'] evaluates to 0 by default.
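
To make this concrete, here is a minimal sketch of how such a nested structure can be built with defaultdict (the names and the exact construction in the starter code may differ):

  from collections import defaultdict

  # orders 0 through n-1, each mapping a context tuple to an outcome
  # dictionary whose missing entries default to 0.0
  n = 3
  relcounts = {order: defaultdict(lambda: defaultdict(float))
               for order in range(n)}

  relcounts[1][('a',)]['b'] = 0.25        # store P(b|a)
  print(relcounts[2][('a', 'c')]['b'])    # unseen entry -> 0.0 by default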

Fill in the function definitions following the provided docstrings and example outputs.

  1. tuple_all_lengths_count

    This is similar to utils.tuple_count from A1, except that it returns the counts of all tuples from length 1 through the argument. Example: tuple_all_lengths_count('mississippi', 3) returns

    {('i', 's', 's'): 2, ('m', 'i', 's'): 1, ('s', 's', 'i'): 2, ('i', 'p', 'p'): 1,
     ('s', 'i', 's'): 1, ('p', 'p', 'i'): 1, ('s', 'i', 'p'): 1,
     ('s', 'i'): 2, ('p', 'i'): 1, ('s', 's'): 2, ('i', 's'): 2, ('i', 'p'): 1,
     ('p', 'p'): 1, ('m', 'i'): 1,
     ('i',): 4, ('p',): 2, ('m',): 1, ('s',): 4}
    
    and tuple_all_lengths_count(['the', 'hat', 'is', 'in', 'the', 'hat'], 2) returns
    {('hat', 'is'): 1, ('the', 'hat'): 2, ('in', 'the'): 1, ('is', 'in'): 1,
     ('the',): 2, ('hat',): 2, ('in',): 1, ('is',): 1}
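
    For orientation, here is a minimal sketch of the sliding-window counting involved (illustrative only; write your own version against the provided docstring):

      from collections import Counter

      def tuple_all_lengths_count(seq, maxlen):
          # count every contiguous tuple of length 1 through maxlen in seq
          counts = Counter()
          for length in range(1, maxlen + 1):
              for start in range(len(seq) - length + 1):
                  counts[tuple(seq[start:start + length])] += 1
          return dict(counts)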
    
  2. NGram.estimate_from_text

    This is the function that computes the conditional probability parameters of the model from text and stores them in self.relcounts. The argument traindata is a list of sentences, each of which is a list of words. You needn't tokenize the list; typically, it will have been tokenized before being passed to the function. However, you must add beginning-of-sentence and end-of-sentence tags (<s> and </s>) to the start and end of each sentence.

    If the object's openvocab flag is True, account for out-of-vocabulary words by replacing the first token of every word type with <UNK>, as discussed in class.

    If we've constructed an NGram object wordmodel where wordmodel.n is 2, wordmodel.unit is 'word', and wordmodel.openvocab is True, then

    wordmodel.estimate_from_text([['the', 'hat', 'is', 'in', 'the', 'hat'], ['it', 'is', 'the', 'blue', 'hat']]) should first transform the text to

    [['<s>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', 'hat', '</s>'], ['<s>', '<UNK>', 'is', 'the', '<UNK>', 'hat', '</s>']] and then set wordmodel.relcounts and wordmodel.vocabulary accordingly:

    wordmodel.relcounts:
      {0: {(): {'<s>': 2/15, 'hat': 2/15, 'is': 1/15, 'the': 2/15,
                '</s>': 2/15, '<UNK>': 6/15}
          },
       1: {('<UNK>',): {'the': 1/6, 'hat': 1/6, 'is': 1/6, '<UNK>': 3/6},
           ('the',): {'hat': 1/2, '<UNK>': 1/2},
           ('<s>',): {'<UNK>': 2/2},
           ('is',): {'the': 1/1},
           ('hat',): {'</s>': 2/2}}}
    
    wordmodel.vocabulary:
       set(['the', 'hat', 'is', '<s>', '</s>'])
    

    You may see output like "defaultdict(<function <lambda> at 0x107d006e0>, {...})"; this just means it's a defaultdict structure, but you can treat it like any other Python dictionary.

    If we've constructed an NGram object charmodel where charmodel.n is 2, charmodel.unit is 'character', and charmodel.openvocab is False, then charmodel.estimate_from_text([['the', 'hat', 'is', 'in', 'the', 'hat'], ['it', 'is', 'the', 'blue', 'hat']]) should first transform the data into lists of characters (including whitespace):

    [['<s>', 't', 'h', 'e', ' ', 'h', 'a', 't', ' ', 'i', 's', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'h', 'a', 't', '</s>'], ['<s>', 'i', 't', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'b', 'l', 'u', 'e', ' ', 'h', 'a', 't', '</s>']]

    and set charmodel.relcounts and charmodel.vocabulary as shown:

    charmodel.relcounts:
    {0: {(): {' ': 9/43, 'a': 3/43, 'b': 1/43, 'e': 4/43, 'h': 6/43,
              'i': 4/43, 'l': 1/43, 'n': 1/43, 's': 2/43, 't': 7/43,
              'u': 1/43, '<s>': 2/43, '</s>': 2/43}
        },
     1: {(' ',): {'i': 3/9, 'h': 3/9, 'b': 1/9, 't': 2/9},
         ('<s>',): {'i': 1/2, 't': 1/2},
         ('a',): {'t': 3/3},
         ('b',): {'l': 1/1},
         ('e',): {' ': 4/4},
         ('h',): {'a': 3/6, 'e': 3/6},
         ('i',): {'s': 2/4, 't': 1/4, 'n': 1/4},
         ('l',): {'u': 1/1},
         ('n',): {' ': 1/1},
         ('s',): {' ': 2/2},
         ('t',): {'h': 3/7, '</s>': 2/7, ' ': 2/7},
         ('u',): {'e': 1/1}
         }
    }
    
    charmodel.vocabulary:
     set([' ', 'a', 'b', 'e', 'h', 'i', 'l', 'n', 's', 't', 'u', '<s>', '</s>'])
    

    You might find tuple_all_lengths_count useful for this method.
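
    To make the preprocessing concrete, here is a sketch of the text transformation described above (illustrative only; your method must also fill in self.relcounts and self.vocabulary, and the starter code's details may differ):

      def preprocess(traindata, unit, openvocab):
          seen = set()
          prepared = []
          for sentence in traindata:
              if openvocab:
                  # replace the first occurrence of each word type with <UNK>
                  replaced = []
                  for word in sentence:
                      replaced.append(word if word in seen else '<UNK>')
                      seen.add(word)
                  sentence = replaced
              if unit == 'character':
                  # join words with spaces, then split into single characters
                  sentence = list(' '.join(sentence))
              prepared.append(['<s>'] + sentence + ['</s>'])
          return prepared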

  3. NGram.get_prob(x, context) returns the probability of x given context under this model, with smoothing. Replace words in x or context that are not in the model's vocab set with <UNK> before computing.

    Under the example word bigram model wordmodel shown above, with the linear interpolation constant wordmodel.lam = 0.8, wordmodel.get_prob('hat', ['the']), i.e., the smoothed $P(\mbox{hat}\vert\mbox{the})$, returns

    0.8*P(hat|('the',)) + 0.2*P(hat|()) = 0.8*0.5 + 0.2*0.1333 = 0.4267

    wordmodel.get_prob('hat', ['a']), i.e., $P(\mbox{hat}\vert\mbox{a})$, returns

    0.8*P(hat|(<UNK>,)) + 0.2*P(hat|()) = 0.8*0.16667 + 0.2*0.1333 = 0.16

    and wordmodel.get_prob('cat', ['the']), i.e., $P(\mbox{cat}\vert\mbox{the})$, returns

    0.8*P(<UNK>|('the',)) + 0.2*P(<UNK>|()) = 0.8*0.5 + 0.2*0.4 = 0.48

    With a trigram model, wordmodel = ngram.NGram(3, 'word', 0.8), estimated on the same text, wordmodel.get_prob('cat', ['<s>', 'the']) returns

    0.8*P(<UNK>|('<s>','the')) + 0.2*wordmodel.get_prob('cat', ['the'])
    = 0.8*0.0 + 0.2*(0.8*0.5 + 0.2*0.4) = 0.096.

  4. NGram.cross_entropy returns the cross entropy in bits of the model on a text (input as a list of items). With a trigram word model, wordmodel = NGram(3, 'word', 0.8), estimated on the same data ([['the', 'hat', 'is', 'in', 'the', 'hat'], ['it', 'is', 'the', 'blue', 'hat']]), wordmodel.cross_entropy(['the', 'cat', 'is', 'in', 'the', 'hat']) returns 3.47075, by taking $$-\log_2 P(\mbox{<s>}) -\log_2 P(\mbox{the}\vert\mbox{<s>}) -\log_2 P(\mbox{cat}\vert\mbox{<s> the}) -\log_2 P(\mbox{is}\vert\mbox{the cat}) -\log_2 P(\mbox{in}\vert\mbox{cat is}) -\log_2 P(\mbox{the}\vert\mbox{is in}) -\log_2 P(\mbox{hat}\vert\mbox{in the}) -\log_2 P(\mbox{</s>}\vert\mbox{the hat})$$ and dividing by 8. Note the slight difference from the "hello" example in class, where we ignored $P(\mbox{<s>})$.

    Similarly, with wordmodel = NGram(2, 'word', 0.8) estimated on the same data, wordmodel.cross_entropy(['the', 'cat', 'is', 'in', 'the', 'hat']) returns 2.46939.

    With a unigram model, wordmodel = NGram(1, 'word', 0.8), also estimated on the same data, wordmodel.cross_entropy(['the', 'cat', 'is', 'in', 'the', 'hat']) returns 2.63564. The calculation should include beginning and end of sentence tags for unigrams as well, despite what the docstring in the starter code says.
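
    A sketch of the computation (illustrative; note that Python 2's math module lacks log2, so the base-2 log is taken via math.log(p, 2)):

      import math

      def cross_entropy(self, text):
          # pad with sentence tags, then average -log2 P over all tokens
          tokens = ['<s>'] + list(text) + ['</s>']
          total = 0.0
          for i, tok in enumerate(tokens):
              context = tokens[max(0, i - (self.n - 1)):i]
              total -= math.log(self.get_prob(tok, context), 2)
          return total / len(tokens)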

Section (b): Text Classification [13 pts]

This section relies on Section (a).

All code in this section is to be written in classify.py.

Grab politicalblogs.zip from this Google Drive link, unzip it, and place the resulting politicalblogs folder in the data/ directory. This is a large dataset; do not push it to your repository.

Examine the directory structure of data/politicalblogs/train. It contains 5 directories, where each directory's name is the label (blog name) of the files in that directory. In this case, the labels are redstate (http://www.redstate.com), dailykos (http://www.dailykos.com), rightwingnews (http://rightwingnews.com), matthewyglesias (http://matthewyglesias.theatlantic.com), and carpetbagger (http://www.thecarpetbaggerreport.com). redstate and rightwingnews are conservative blogs; the other three are liberal. I adapted this data from the CMU Political Blog Corpus.

Within each label's directory is a list of files, each file representing all the comments on a single blog post. Note that each line is a comment, which may contain multiple sentences, but you needn't split them into sentences. Since the files have been processed and tokenized, your NGram class can read them as-is.

You will use all of these training files together to estimate an N-Gram model of each blog's comment section.

data/politicalblogs/test contains a list of files whose labels are unknown. You will hypothesize the label for every file by computing the cross-entropy of each of the models on the file and taking the lowest one, just like the A1 language identification task.

To evaluate the accuracy of the classifier, "gold-standard" labels (the true labels of the test set files) are provided in data/politicalblogs/goldtest.csv. Needless to say, the program should not consider these labels or the filenames when predicting the blog labels from the text.

Your program, classify.py, should be invoked as below:

python classify.py <datadir> n unit lam

This program trains an open-vocabulary NGram model for each label in the training data. It then applies these models to each file in the test data to predict the label of that file. Cross entropy should be computed by taking the average cross entropy over all lines in the file.

For example, to classify the test data using bigram word models trained with an interpolation scaling factor of 0.8:

python classify.py data/politicalblogs 2 word 0.8

Running this command should print the accuracy of your classifier on the test set, computed by comparing its predictions against the labels in goldtest.csv. In this case, the accuracy is

0.677966101695

The program should also write a file named data/politicalblogs/hypothesis-2-word-0.8.csv, matching the provided hypothesis-2-word-0.8.csv, where each row is a file in the test data, represented as filename,predicted label.

Test classify.py using the different settings for which hypothesis results have been provided.
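
For orientation, one possible shape for the training and prediction loop is sketched below. Details such as how the openvocab flag is passed to the NGram constructor, argument parsing, accuracy computation, and writing the hypothesis file are omitted or assumed; this is not the required implementation.

  import os
  import ngram

  def classify(datadir, n, unit, lam):
      # train one open-vocabulary model per label directory
      models = {}
      traindir = os.path.join(datadir, 'train')
      for label in os.listdir(traindir):
          sentences = []
          labeldir = os.path.join(traindir, label)
          for fname in os.listdir(labeldir):
              with open(os.path.join(labeldir, fname)) as f:
                  sentences.extend(line.split() for line in f)
          model = ngram.NGram(n, unit, lam)  # plus whatever openvocab argument the starter code uses
          model.estimate_from_text(sentences)
          models[label] = model

      # predict: the model with the lowest average per-line cross entropy wins
      predictions = {}
      testdir = os.path.join(datadir, 'test')
      for fname in os.listdir(testdir):
          with open(os.path.join(testdir, fname)) as f:
              lines = [line.split() for line in f]
          def avg_xent(label):
              m = models[label]
              return sum(m.cross_entropy(line) for line in lines) / len(lines)
          predictions[fname] = min(models, key=avg_xent)
      return predictions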

Section (c): Data preparation [10 pts]

This section is independent of the others.

To use your classifier out-of-the-box on other datasets besides the one provided, you'll need to get them into the same format. Real-world "data science" requires a lot of file format conversions. Here's some practice.

Get rawtwitter.zip from this Google Drive link, and unzip it to create a directory rawtwitter within your data/, which contains two files, train.txt and test.txt. Each file contains a list of tweets, including the name and gender of the author. We would like to train gender models on the users in the training set to predict the genders of the users in the test set.

Write a script named convert.py to transform the data in rawtwitter into the desired format, and store it in data/twitter. Just like the data in data/politicalblogs, there should be two directories, for the training and test sets. Each file must contain the entire list of tweets from a single user. The file names don't matter, but it may be easiest to name them after the userid -- just avoid spaces and special characters. After conversion, data/twitter/train/female/ should contain one file per female user in the training data, each file holding that user's tweets, one tweet per line (with data/twitter/train/male/ set up similarly). data/twitter/test/ should contain one file per user in the test data, regardless of gender. Also create a goldtest.csv in the appropriate place.

Your script should live in the top level of the repository, and create data/twitter from data/rawtwitter when called with no arguments, like so:

python convert.py
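
One possible skeleton follows. Important: the field layout of train.txt and test.txt is an assumption here (one tweet per line, tab-separated as userid, gender, tweet); inspect the actual files and adjust the parsing to match.

  import os
  from collections import defaultdict

  def convert(rawdir='data/rawtwitter', outdir='data/twitter'):
      for split in ('train', 'test'):
          genders = {}
          tweets = defaultdict(list)
          with open(os.path.join(rawdir, split + '.txt')) as f:
              for line in f:
                  # ASSUMED format: userid <TAB> gender <TAB> tweet
                  userid, gender, tweet = line.rstrip('\n').split('\t', 2)
                  genders[userid] = gender
                  tweets[userid].append(tweet)
          for userid in tweets:
              # train files are grouped by gender; test files are not
              parts = ('train', genders[userid]) if split == 'train' else ('test',)
              userdir = os.path.join(outdir, *parts)
              if not os.path.exists(userdir):
                  os.makedirs(userdir)
              with open(os.path.join(userdir, userid), 'w') as out:
                  out.write('\n'.join(tweets[userid]) + '\n')
          if split == 'test':
              # gold labels for the test users
              with open(os.path.join(outdir, 'goldtest.csv'), 'w') as gold:
                  for userid in sorted(tweets):
                      gold.write('%s,%s\n' % (userid, genders[userid]))

  if __name__ == '__main__':
      convert()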

If your data has been transformed correctly, you should be able to run classify.py from Section (b) as-is:

python classify.py data/twitter 2 word 0.8

Since the Twitter data is also quite large, please don't push it to the repository. I should be able to construct it by running convert.py.

Because of licensing restrictions, don't distribute the Twitter data publicly.

Section (d): Experimental Analysis [5 pts]

This section relies on Section (b).

  • Principled model estimation is scientific and satisfying -- and it gets you about 80% of the way to building a good system in NLP or machine learning. The remainder, which is more tedious but inevitable, is accomplished by experimenting with model design settings. In this case, the design questions are: should we use character or word models (NGram.unit)? What order of n-grams (NGram.n)? The answers will likely vary by dataset, and are generally arrived at by running the classifier with several different settings.
  • The interpolation factor, NGram.lam, can be estimated in a principled way using smoothing methods like Good-Turing or Kneser-Ney. But since we're not studying these in class, let's treat it as a design parameter to be tuned by experimentation, just like NGram.unit and NGram.n.
  • Run classify.py on the Political Blogs data with a few different combinations of values for the NGram constructor arguments to find the settings that produce the highest accuracies for that dataset.

    You can restrict lam to [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], and use your intuition to discard some combinations: for example, it's unlikely that low-order character n-grams would do well.

    Run these experiments any way you like -- either by manually invoking classify.py with different arguments, or by writing a script to do so (see the sketch after this list). Since each run takes some time, you might want to run them in parallel.
  • Note the optimal settings and summarize your findings in reflection.md.
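
A sweep script can be as simple as the following sketch (the particular grid of settings below is only an example, not a prescription):

  import itertools
  import subprocess

  for n, unit, lam in itertools.product(
          [2, 3, 4], ['word', 'character'], [0.2, 0.5, 0.8, 1.0]):
      print('running: n=%d unit=%s lam=%.1f' % (n, unit, lam))
      subprocess.call(['python', 'classify.py', 'data/politicalblogs',
                       str(n), unit, str(lam)])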

That's it! You've built and experimented with your first (I assume) n-gram text classifier. We'll see other classification algorithms later in the term, but n-grams tend to be fairly versatile.

Now that you have a broad sense of what text classification entails and how to set up your experiments and data, you can start thinking ahead to final project ideas that may rely on classification, such as author identification, sentiment analysis, genre prediction, or quality estimation; i.e., whatever involves putting a label on text. Think of a domain that particularly interests you -- literature, music, politics, medicine... anything textual, and talk to me for ideas on how to obtain data.
Only a suggestion, of course. Your final project doesn't have to employ text classification at all.

Misc [7 pts]

  • Answer the questions in reflection.md.
  • Fill out README.md and submit your work correctly.
  • Check that your code is reasonably documented, elegant, and efficient.