Assignment 5: Text as Vectors

Out: Thu, Mar 31 Due: 11:59 pm EST on Thu, Apr 14
Repository: github.com/wellesleynlp/yourusername-cs349-a5-vectors
Solution: github.com/wellesleynlp/a5solution

Section (a)
/
Section (b)
/
Section (c)
/
Section (d)
/
Section (e)

Objective

Transform words and documents into vector representations. Implement k-means clustering and kNN classification.

Setup

You may work with someone you have collaborated with before. Use the Google Sheet to find a partner and set up your joint repository as instructed.

Install the scipy Python package with pip, or run Python on tempest, following the instructions in the Tools doc.

We will use a corpus adapted from the Yelp Dataset Challenge, released by Yelp to motivate research in machine learning and NLP, consisting of reviews of businesses around Urbana-Champaign, Edinburgh, Pittsburgh, Phoenix, Las Vegas, Charlotte, Madison, and Montreal. I have filtered this corpus to only includes busineses with at least 100 and users with at least 20 reviews, and tokenized the review text.

Download the dataset from this Google Drive link, and save the unzipped "data" directory to your repository clone. Because of the size of the data, please don't push it to your repo -- each partner should independently download it instead.

If you are working on tempest, you must use the files in /home/sravana/nlp/a5data directly, without copying them to your home directory (or it may break your quota). Do this by first making an empty data directory, and then creating symbolic links from each of the files into it.

cd your_repo_clone/
mkdir data
cd data/
ln -s /home/sravana/nlp/a5data/* .

The corpus is on the larger side, but as long you're conscious about data structures, and appropriate use of numpy/scipy, your program runtimes will be reasonable. Estimates are given for each section. (Note: explicit for-loops are necessary for operations on text files; numpy doesn't kick in until the data is in vector representation.)

Please read the provided code carefully, especially the main functions in all the files, and the helper functions in utils.py. You may get scipy programming ideas from similarwords.py and analogies.py as well.

Submitting

Before the deadline (11:59 pm EST on Thu, Apr 14), push the files below to your repository following these instructions.

contextvectors.py
cluster.py
docterm.py
knnclassify.py
the filled-in README.md and reflection.md.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

Section (a): Context-Vector Word Embeddings [10 pts]

This section is required for part of section (e), but independent of the others. It takes me about 10 minutes to run on the full data/allreviews.txt file, with count_context_vectors responsible for most of it.

Fill in count_context_vectors in contextvectors.py as described in the function contract. Debug by estimating the vectors on the example from class, data/testsents.txt, running the program in "debug" mode:
python contextvectors.py data/testsents 2 2 --debug=True
which will create a file with the vectors in data/testsents.window2.thresh2.todebug.vecs (one vector per row) and the corresponding words for each vector in data/testsents.window2.thresh2.todebug.labels. Check these against the files in sampleoutputs.
When you're ready, run the program to estimate PPMI word embeddings with SVD dimensionality reduction on the Yelp review texts. Create embeddings for words with a minimum frequency of 500, with context windows of 1 and 5. The default dimensionality of the reduced space is 100.
python contextvectors.py data/allreviews 1 500
python contextvectors.py data/allreviews 5 500

Section (b): Document-Term Feature Vectors [10 pts]

This section is required for part of section (e), but independent of the others. It takes me about 3 minutes to run on the data/starredreviews.txt file.

Let's build feature vectors for documents (for example, each review is a document). This will be a document-term matrix with TF-IDF weighting, where the documents are along the rows and the terms along the columns.

Fill in tfidf_docterm in docterm.py following the contract. This should read a text file like data/testsents.txt, treating each line as a document. To test, run
python docterm.py data/testsents 2 --debug=True
which produces data/testsents.tfidf.thresh2.todebug.vecs with the vectors, and data/testsents.tfidf.thresh2.todebug.dims with the words corresponding to each dimesnion.

Work out the TF-IDF matrix by hand and verify that it's correct.
data/starredreviews.txt contains the filtered subset of reviews that you will use for star-rating prediction. Generate the vectors for this dataset, filtering out words that appear fewer than 200 times:
python docterm.py data/starredreviews 200
We will use these vectors for the kNN classification task.

Section (c): K-Means Clustering [10 pts]

This section is required for part of section (e), but independent of the others. The program shouldn't take more than a few seconds for any of the data.

Fill in the kmeans function in cluster.py to cluster points into k clusters. The k centroids should be initialized with the first k points from the data. The function should iterate at most maxiter times, stopping earlier when the assignments of each point to a cluster don't change. The function returns a dictionary with keys from 0 to k-1 (the cluster IDs), with each value being a list of the labels of the points in that cluster.

It may happen that empty clusters are created during an iteration. In such a case, let the corresponding mean of the empty cluster i be the ith point in the data. Otherwise, you will be attempting to take the mean of an empty set of points, which is undefined. (Ideally, we'd pick random points, but eliminating randomness makes debugging easier.)

Test your code by clustering a small subset of word vectors into 20 groups.

python cluster.py data/commonwords 20

This should write a file named data/commonwords.cluster20 identical to the one in sampleoutputs.

Section (d): K Nearest Neighbors Classification [10 pts]

This section is required for part of section (e). If you haven't finished section (b), you can make up a small toy dataset to test the program, but it's more effective to run this after you create the document-term vectors. My computer took about 8 minutes to run on data/starredreviews.

Fill in the knn function in knnclassify.py.

data/starredreviews.{txt, cats, ttsplit} is a filtered subset of reviews, keeping only polarized reviews (1 or 5 stars) that are at least 200 words long. The reviews are split into training (pre-2014) and testing (2014 and after). I collapsed 1- and 2-stars into 1 because 1-stars alone are uncommon.

Use-case: Imagine if Yelp stopped allowing star-ratings in 2014. Can a program still predict them from the review text?
The .cats file has the star ratings for the corresponding reviews in the .txt file. The .ttsplit file marks each review as belonging to the training ("0") or test ("1") sets.
Rename the data/starredreviews.tfidf.thresh200.vecs file you generated in section (b) to data/starredreviews.vecs. Now you're ready to run your k Nearest Neighbors classifier to predict the star ratings of the 2014-and-later reviews.
python knnclassify.py data/starredreviews 7

Section (e): Analysis [5 pts]

This section relies on the previous sections.

Run similarwords.py and/or cluster.py on the context vectors from section (a).
python similarwords.py data/allreviews.window1.thresh500
python similarwords.py data/allreviews.window5.thresh500

python cluster.py data/allreviews.window1.thresh500 100
python cluster.py data/allreviews.window5.thresh500 100
Qualitatively, what are the differences in the results (if any) between the window=1 and window=5 embeddings?
Come up with at least 5 creative analogies (of the king-man+woman form) besides the ones we tried in class, on
python analogies.py data/allreviews.window5.thresh500
Did the program find the expected answers?
We motivated vector space models by their ability to incorporate arbitrary features, but the document-term vectors in section (b) only use single words. However, with a few extra lines of code, we could tack on other features of the review texts and get better classification results. Describe a few quantifiable features that you might add. How many extra dimensions would they consume?
Eyeball a few of the reviews for which your classifier in section (d) made incorrect predictions, by looking at the output log, data/starredreviews.predictions. Is there something about those reviews that makes their star-rating more ambiguous? Are there words (i.e.. features) missing from data/starredreviews.tfidf.thresh200.dims that would have been useful for prediction? Would your proposed feature additions in question 3 help?

Misc [5 pts]

Answer the remaining questions in reflection.md.
Fill out README.md and submit and name your repository correctly.
Check that your code is reasonably documented, elegant, and efficient.

CS 349: Natural Language Processing

Spring 2016, Wellesley College