PS2: Classification with Perceptrons

Out: Thu, Feb 9th Due: Thu, Feb 16th at 11:00 pm EST
Repository: click here

In this assignment, you will write a supervised binary classifier using the perceptron learning algorithm, and apply it to sentiment analysis of movie reviews.

As in all problem sets, you may work in pairs. Use the search for teammates post on Piazza to find partners.

Objective

Understand and implement the perceptron learning algorithm.

Setup

Click the repository link at the top of the page while logged on to GitHub. This should create a new private repository for you with the skeleton code.

If you are working in a pair, go the Settings > Collaborators and Teams in your repository and add your partner as a collaborator. You will work on and submit the code in this repository together.

Clone this repository to your computer to start working. Commit your changes early and often. There's no separate submission step: just fill out honorcode.py and README.md and commit.

See the workflow and commands for managing your Git repository and making submissions.

Download data.zip from this location, unzip it (creating a folder named data), and place it in the directory where you cloned your repository. Do not commit this folder.

The data folder contains three datasets.

data/toy has a 2-d example of data separable by a hyperplane through the origin.
data/toybias has a 2-d example of data separable by a hyperplane not centered at the origin, and hence requiring the bias update
data/sentiment contains movie reviews labeled as positive and negative. The folder contains the raw text of the reviews in .text files, and featurized representations in .sparse files. Each feature is a word in vocabulary and each value is the count of that word in the review. Since these vectors are very high-dimensional and sparse, we use the following representation for each review instead of wasting space on thousands of zeros.
```
label dimension:value dimension:value ...
```
The featuremap.txt file in that folder shows the correspondence between the dimensions and the words they represent. For example, line 0 says that dimension 0 represents "!"

The provided skeleton code takes care of loading this type of file.

All your code will be written in perceptron.py. Examine the main function. It takes the path to a data directory (such as data/toy or data/sentiment) and a section of this problem set (a, b, or d), and invokes the appropriate functions:

Part A: Code a basic perceptron learner without bias, shuffling, or averaging
Part B: Write a function to featurize any input text, and apply the perceptron to predict its sentiment.
Part C: Add bias learning, shuffling, and averaging to the algorithm
Part D: Analyze performance with different settings

Part A: Basic Perceptron [10 pts]

perceptron_train: given a training data matrix trainX, the associated training labels trainy, a positive integer maxiter representing the maximum number of epochs, and a flag verbose, learn a hyperplane on the training data using the basic perceptron algorithm. If verbose is True, print the epoch number, data point index, and updated weight vector on each update (useful for debugging).

This function returns a tuple consisting of
1. the final weight vector
2. the final bias (which will be 0 with no updates)
3. the number of epochs taken to converge. This will be maxiter if it never converged. If it did converge, count the number of epochs in which there were any mistakes.
4. the accuracy on the training data at the end of the algorithm (which should be 1 if it converged before maxiter epochs). If the docstring in your starter file says that the accuracy will 0 if it converged, this is a typo, since it should be 1.
perceptron_apply: given a hyperplane represented by a weight vector w, a bias b (which may be 0), and a testing data matrix testX, predict the label (+1 or -1) for each testing data point, and return an array or list of the predictions in order.

After you're done, execute

python perceptron.py data/toy a

to see your algorithm in action. Since the verbose flag is on, it should print the epoch number, data point index, weight vector, etc. on each update as indicated above. Compare these displayed values against your manual calculations on the data to debug your code.

Now let's see how well the classifier does on the movie reviews data in comparison with kNN. Set the verbose flag in the main function's part a block's invocation to False, uncomment the line with the kNN invocation, and run

python perceptron.py data/sentiment a

Make a note of these numbers for your analysis in Part D.

Part B: Real-Time Sentiment Analysis [3 pts]

With a working perceptron learner, you can train a sentiment analysis system from the provided dataset of movie reviews.

Define the featurize function following its description. Now run

python perceptron.py data/sentiment b

to play with your program. Try inputting a few different English texts (of at least 100 words), or ask your friends to help you experiment. What kinds of inputs does it work well on? Where does it fail? Briefly record your experiments in the sentimentguesser.md file.

Part C: Perceptron with Bias, Averaging, Shuffling [12 pts]

Modify perceptron_train to incorporate the bias update when the bias argument is True.

Test the function by changing the verbose and bias flags to True in the main function's part a block's invocation. Run
python perceptron.py data/toybias a
Debug by checking that the updates match up with your manual computations.
Modify perceptron_train to incorporate averaging when the averaging argument is True. Debug it as above after manually computing an average weight vector on the toy data.
Finally, modify perceptron_train to shuffle the order of the training data points on each epoch when the shuffle argument is True.

Part D: Analysis [15 pts]

Execute

python perceptron.py data/sentiment d

which trains and tests your algorithm with different combinations of settings for the perceptron. (This is not quite the same as hyperparameter tuning, since we're just exploring on the test set rather than trying to fix the hyperparameters.)

This execution will also use the trained weight vector to display the features with the largest coeficients for each class.

Examine the output, make a copy of this Google Doc, answer the questions, and share it with me. Important: add the link to your copy of the GDoc to README.md.

Complete honorcode.py and README.md and push your code by Thu, Feb 16th at 11:00 pm EST

CS 349-02:
Machine Learning