CS 349-02:
Machine Learning

Spring 2017, Wellesley College

PS3: Logistic Regression

Out: Thu, Feb 16th      Due: Thu, Mar 2nd at 11:00 pm EST
Repository: click here

In this assignment, you will write a supervised binary classifier using logistic regression, and apply it to predicting congressional bill survival, using this dataset. I want to give you a broad sampling of ML applications to play with: health and computer vision, text sentiment analysis, and now politics.

The dataset and task come from the paper Textual Predictors of Bill Survival in Congressional Committees (Yano et al., 2012). By now, you have enough machinery in your toolbox to understand much of the content in such a paper. You will find that some of the terminology differs from what we use in class -- this terminological variation is sadly pervasive in ML -- but most of the ideas should be familiar by our Feb 23rd class. Reading this paper is not required, but doing so and having a 5-10 minute chat with me about it is another extra credit + sticker opportunity, and I encourage it for anyone interested in ML beyond this class.

As in all problem sets, you may work in pairs. Use the search for teammates post on Piazza to find partners.

Objective

  • Primary: Understand and implement gradient ascent for logistic regression
  • Secondary: Get practice with reading data from a new (JSON) file format

Setup

Click the repository link at the top of the page while logged on to GitHub. This should create a new private repository for you with the skeleton code.

If you are working in a pair, go to Settings > Collaborators and Teams in your repository and add your partner as a collaborator. You will work on and submit the code in this repository together. Each pair should have only one repository for this problem set; delete any others lying around.

Clone this repository to your computer to start working. Commit your changes early and often! There's no separate submission step: just fill out honorcode.py and README.md, and commit. The last commit before the deadline will be graded, unless you increment the LateDays variable in honorcode.py.

See the workflow and commands for managing your Git repository and making submissions.

Download data.zip from this location, unzip it (creating a folder named data), and place it in the directory where you cloned your repository. Do not commit this folder (adding data/ to your .gitignore prevents accidental commits).

The data folder contains only one dataset: a featurized collection of a subset of US Congressional bills. See data-description.md in your repository for an explanation of the data, the format, and the features. It is only slightly modified from the original dataset, and as such gives you practice parsing "real-world" data formats.

All your code will be written in jsonparse.py and logisticreg.py.

Part A: Parse JSON Files [10 pts]

This section needs no background knowledge and can be started immediately. All code for it must be written in jsonparse.py.

Fill out the function definitions for load_training and load_testing following their descriptions. Execute them on training.json and testing.json respectively, and verify your functions' correctness by manually checking the returned arrays against a couple of examples from each file.

Clarification: If you cloned the repo on Wed, the docstring for load_training has an example where a label is equal to -1. This is just an example. The actual data doesn't have any -1 labels, only 0 and 1, following logistic regression conventions. You should retain the same labels as what's in the data file.
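If you haven't parsed JSON in Python before, the standard-library json module does most of the work. Below is a minimal sketch, assuming (hypothetically) that each file holds a list of objects with 'features' and 'label' fields; the real schema is in data-description.md, so adjust the field names accordingly.

    import json

    import numpy as np

    def load_examples(path):
        """Minimal sketch: read a JSON file of examples into arrays.

        Hypothetically assumes the file holds a list of objects, each
        with a 'features' list and a 0/1 'label'; data-description.md
        documents the real schema.
        """
        with open(path) as f:
            records = json.load(f)
        features = np.array([r['features'] for r in records])
        labels = np.array([r['label'] for r in records])
        return features, labels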

Part B: Logistic Regression [25 pts]

This section needs background from Thursday's class, but you should look over it now for planning purposes. Code is to be written in logisticreg.py.

logisticreg.py contains the perceptron_train function from PS2 (with slight modifications) for your reference.

Implement the stochastic and mini-batch gradient ascent training procedure for logistic regression in logreg_train, following the docstring. Note that a sigmoid function has been provided for you. The returned training accuracy should be the proportion of points correctly classified at the end of the procedure (just like with the perceptron).
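To make the update rule concrete, here is a minimal sketch of one stochastic pass, assuming 0/1 labels, a fixed learning rate, and a weight vector that absorbs the bias term; the names and signature here are illustrative, not those of the skeleton's logreg_train.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_epoch(theta, X, y, alpha):
        """One stochastic pass over the data, updating theta point by point.

        Sketch only: X is an (n, d) array, y holds 0/1 labels, alpha is
        the learning rate, and theta (length d) absorbs the bias term.
        """
        for i in np.random.permutation(len(y)):
            # Per-point gradient of the log-likelihood: (y - sigmoid(theta.x)) x
            theta = theta + alpha * (y[i] - sigmoid(theta @ X[i])) * X[i]
        return theta

For the mini-batch variant, the same update applies with the per-point gradient averaged over a batch of rows before each step; the training accuracy at the end is just the fraction of points whose predicted probability lands on the correct side of 0.5.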

Also implement predict_confidence, which takes a hyperplane and a collection of test data points and returns an array whose ith element is the probability that the ith data point has label 1 according to the hyperplane. (This is a more nuanced version of your perceptron_apply code from PS2.) Finally, implement get_meansq_accuracy, which takes an array of true labels and an array of probabilities such as the one returned by predict_confidence, and computes the mean squared accuracy, defined as 1 minus the mean squared error:

$\frac{1}{n}\sum_{i=1}^n \left(y^{(i)} - h_\theta(x^{(i)})\right)^2$
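A minimal sketch of both functions, assuming NumPy arrays and a hyperplane theta that absorbs the bias term (the docstrings in the skeleton define the actual signatures):

    import numpy as np

    def predict_confidence(theta, X):
        """Sketch: probability that each row of X has label 1 under theta."""
        return 1.0 / (1.0 + np.exp(-(X @ theta)))

    def get_meansq_accuracy(y_true, probs):
        """Sketch: 1 minus the mean squared error between labels and probabilities."""
        return 1.0 - np.mean((y_true - probs) ** 2)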

The main function does a grid search over a few hyperparameters for the perceptron and your new logistic regression learner. Run

    python logisticreg.py p

and observe the output. When your functions are filled in, run

    python logisticreg.py l
If you like, you may modify the hyperparameter search space for your Part C write-up.
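If you do modify the search space, note that a grid search is just exhaustive iteration over hyperparameter combinations, keeping the best-scoring one. A self-contained sketch (the hyperparameters, values, and toy data below are all illustrative, not the skeleton's):

    from itertools import product

    import numpy as np

    # Toy, linearly separable data so the sketch runs on its own.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(int)

    def train_and_score(alpha, epochs):
        """Stand-in learner: SGD logistic regression, returning training accuracy."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                p = 1.0 / (1.0 + np.exp(-(theta @ X[i])))
                theta += alpha * (y[i] - p) * X[i]
        return np.mean(((X @ theta) > 0).astype(int) == y)

    # Try every (alpha, epochs) combination and keep the best.
    best = max(product([0.01, 0.1, 1.0], [1, 5, 10]),
               key=lambda hp: train_and_score(*hp))
    print("best (alpha, epochs):", best)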

Part C: Analysis [15 pts]

Make a copy of this Google Doc, and share it with me. Important: add the link to your copy of the GDoc to README.md.

Complete honorcode.py and README.md and push your code by Thu, Mar 2nd at 11:00 pm EST.