Final Project Guidelines

The final project is your chance to apply machine learning to an application that you care about, or explore a topic that we haven't covered in class.

While the quizzes and assignments help you master the machinery we learn in class, the project is about applying this knowledge creatively and independently -- which is an equally important learning goal in the long run.

The project should be done in teams of 2-3. Talk to me if you'd prefer to work on your own. You may use the Piazza teammates post as usual to find people to work with. You'll have better luck finding teammates if you specify the kind of problems that interest you, or preliminary ideas. You may use this Google Doc to note ideas and find partners. Don't wait until the last minute to find a team!

There is no restriction on which programming language you code in, or what datasets and external tools you use. However, you have to submit all of your code and supporting data with your final submission.

You may run into difficulties once you start working, in which case it's totally fine to change the plans you set out in the proposal. Do try to figure this out early, well before the project update is due, so you're not rushing at the end.

Evaluation

The project is worth 25% of the course grade, about the same as 3 problem sets, and more than all the quizzes put together. I recommend spreading the work out over the time span of the project, doing a little chunk each week.

Milestone 1: Proposal	10%	Mar 23
Milestone 2: Data Collection	10%	Apr 06
Milestone 3: Update	5%	Apr 20
Milestone 4: Peer Review	5%	May 11
Final Paper	70%	May 15

Project outcomes will be evaluated on amount of demonstrated effort, depth of research into related work, creative problem-solving, and demonstrated understanding of the underlying machine learning.

You will not be penalized if your results fall short of expectations -- after all, that's just the uncertainty of research and development! In other words, as long as you write correctly-working programs that implement sensible algorithms and featurizers, and can show that you made an effort to try multiple ideas, it's okay if your accuracies are not as high as you expect.

However, if things don't work simply because your code has bugs, you show a conceptual misunderstanding of the algorithms you're using, or you didn't spend the time to try multiple approaches, it will affect your grade. For this reason, it's a good idea to informally check in with me to keep on track.

Milestone 1: Proposal

Due by 11:59 pm Thursday, March 23.

Some time before March 23, come see me to talk about potential ideas. This is required. You may sign up for an appointment, or come during office hours.

Once you have a team, one member should click here to create a repository for your project, and give all the team members write-permissions. Change the name of your repository to mlproject-student1name-student2name[-student3name]. Create a markdown file named proposal.md in this repository with the contents of your proposal, and push it by the deadline. This repository will be private by default, but you may switch it to public if you prefer.

You'll also present your idea to all of us in a short "pitch" (2-3 minutes) in class on the 23rd. The slides are due in this Google Drive directory by noon.

The project proposal sets out your topic and goals. The proposal outline and presentation together are worth 10% of your project grade. Grades are based on clarity, organization, depth of research into prior work, and writing and presentation style.

Jump to: Topics / Proposal Outline / Pitches

Topics

The most important criterion for choosing a topic is that it genuinely excites you. Don’t be afraid to get creative. The second is feasibility -- you have under two months, so set your goals realistically.

Applied Projects

Identify a dataset and a task involving supervised classification, regression, or recommendations. You may use clustering and dimensionality reduction as part of the experimentation, but pure clustering tasks are discouraged since they're hard to evaluate.

The task by itself could be novel, or you could explore new ways of attacking it compared to existing work. The project would generally consist of trying different feature representations and machine learning models, and evaluating their performance.

You must (eventually) read some academic papers that tackle similar problems, and compare your results to previous work. Your final submission in May will be a ~5-page paper summarizing the literature, your methods, and experimental results, as well as supporting data and code.

See past projects from the Stanford ML course (scroll to bottom of page) for inspiration. You can also search for machine learning class projects from other universities.

Data Sources (this isn't a comprehensive list; we're not short of data nowadays!)

Simple Google searches on topics of your interest + "dataset" may bring up results
Any of the problem set data
Any data you have collected in other classes or your research
Image Data for Computer Vision
Kaggle (some are ongoing, with potential for cash prizes or job offers)
UC Irvine ML Repository (only use data contributed by reputable authors)
KDD Repositories
Amazon Reviews
Wikipedia Dumps
Meta-Collection of these and other links
Shared tasks:
- Build it/Break it (an ongoing task on sentiment analysis)
- SemEval 2017 (text analysis); also see previous years'
- Emotion Intensity Classification
- Abusive Language
- Yelp Dataset Challenge (grab downloaded data from /home/sravana/data/yelp on tempest)

Please do share any other interesting resources you've found on Piazza.

Independent Studies

Is there a topic you're keen to learn about, but that we can't do in class? Devise a mini-curriculum to teach it to yourself. Your final submission will be a tutorial teaching others what you just learned.

Ideas

Support Vector Machines
Graphical Models
Bayesian Networks
various flavors of Deep Learning
Bayesian ML
Statistical Learning Theory
Optimization Algorithms
Time Series or Sequence Modeling

If you prefer theory and math to programming and experimentation, such projects will be a good fit. Team sizes of 1 or 2 are preferred for this category.

Proposal Outline

Applied Projects

Your proposal.md should be about 300 words and include:

A description of your problem and motivations.
A brief survey of existing work, with links to the relevant papers or websites
The dataset(s) you will be using, with a link if relevant
A description of the featurization and classification algorithms that you envision using. It is all right if this plan evolves as you see more algorithms later in the course. While you will most likely use some form of supervised classification, you may also like to apply some unsupervised methods like clustering and dimensionality reduction as an aid to featurization or data exploration.
How you will evaluate your results (accuracy, mean squared error, precision/recall,...)
Is the primary purpose of your task prediction, or do you also want to explain something about the data (i.e., analyzing features that are predictive of a class, the way you did in PS2 and perhaps in PS3)?
Responsibilities of each team member. These may evolve over the course of your project, but it helps to have a plan.
Three goals. Be realistic about them, but also ambitious.
- What to complete by April 20th when the project update is due. The second milestone on April 06th is writing code to parse your data, so don't include data loading as a goal.
- The minimum desired outcome of your project by the final submission on May 15th
- The ideal final outcome of your project
(Optional) Look ahead to the data collection requirements of milestone 2, and start writing code to load and featurize the data.
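
For reference, the evaluation measures named in the outline above (accuracy, mean squared error, precision/recall) can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy labels, and the choice of "spam" as the positive class are all made up for the example, and a library implementation is usually preferable in real experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference, for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall with respect to one class treated as 'positive'."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    actual_pos = sum(t == positive for t in y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels, purely for illustration.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "spam", "spam", "ham", "ham"]

acc = accuracy(y_true, y_pred)
prec, rec = precision_recall(y_true, y_pred, positive="spam")
mse = mean_squared_error([3.0, 1.0], [2.0, 2.0])
print(acc, prec, rec, mse)
```

Which measure is appropriate depends on the task: accuracy can be misleading when one class dominates, which is where precision and recall become more informative.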

Independent Studies

In proposal.md, list

The topic you will be studying
How the topic relates to what we have already studied in class
A list of at least four resources (textbook chapters, survey papers, course webpages, etc) that you plan to use, with links.
How you plan to test your own understanding along the way. Are there exercises or problem sets in your sources or other course websites? Alternately, you can plan to implement some of the relevant algorithms.
The format of your envisioned tutorial. You could do a write-up like a textbook chapter, or you can get creative and make a webpage with demos and examples. Obviously, whatever you write up should not be a pure rehash of your sources. Your target audience could be a member of our class, or a CS student who hasn't taken the class.

Pitches

Prepare an entertaining and informative presentation for March 23, consisting of a 2-3 minute talk. You need not use slides, but if you do, submit your slide deck as a Google Slides document to the shared folder by noon.

Milestone 2: Data Collection

Due by 11:59 pm Thursday, April 06.

Applied Projects

By this milestone, you should:

Collect and finalize the dataset. It's generally not a good idea to commit large data to the repository, so place a copy of it in a folder named data in your clones, like you do for problem sets. Let me know if you'd like storage space on tempest.
Write functions to read and parse this data. If you're using Python, sticking to the convention of having an X matrix where the rows are data points, and a y array with the labels, is a good idea since it's compatible with your problem set code as well as scikit-learn. However, feel free to come up with your own conventions, and if you prefer using some other language, that's all right too.

This part is where you decide your target labels, as well as the basic featurization of your raw data. More complex featurization, such as dimensionality reduction, clustering, scaling or normalization, can be done later. For some datasets (collections of attributes, or images), the data will come already featurized -- you're in luck! For others, such as text or logfiles, you may need to devise featurizers. I recommend passing the featurization strategy as a parameter to your parsing functions so you can experiment with various candidates later on. See if any of the sklearn.feature_extraction methods will help.
If the data hasn't already been split into training, development, and test sets, do that now. These splits can be arbitrary (80:10:10 is a good ratio), or informed by the data -- for example, if the data points have dates attached, it makes sense to use earlier points for training, since you're ultimately trying to predict the future.
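
The steps above (parsing into X and y, passing the featurizer as a parameter, and an 80:10:10 split) might look roughly like the sketch below. It uses only the standard library. The 'label<TAB>text' file format, the two featurizers, and all function names are assumptions made up for illustration, so adapt them to whatever your data actually looks like.

```python
import random
from collections import Counter


def bag_of_words(text):
    """Hypothetical featurizer: maps each word to its count."""
    return Counter(text.lower().split())


def char_bigrams(text):
    """Hypothetical alternative featurizer: maps each character bigram to its count."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def parse_data(lines, featurizer):
    """Parse 'label<TAB>text' lines into X (one feature dict per data point) and y (labels).

    The featurizer is a parameter so that strategies can be swapped later.
    """
    X, y = [], []
    for line in lines:
        label, text = line.rstrip("\n").split("\t", 1)
        X.append(featurizer(text))
        y.append(label)
    return X, y


def split_80_10_10(X, y, seed=0):
    """Shuffle once, then cut into train/dev/test portions of 80%, 10%, and 10%."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_train = len(order) * 8 // 10
    n_dev = len(order) * 9 // 10
    parts = (order[:n_train], order[n_train:n_dev], order[n_dev:])
    return [([X[i] for i in idx], [y[i] for i in idx]) for idx in parts]


# Made-up toy data in the assumed 'label<TAB>text' format.
lines = [
    ("pos\tgood movie %d" if i % 2 == 0 else "neg\tbad movie %d") % i
    for i in range(20)
]

X, y = parse_data(lines, featurizer=bag_of_words)
(train_X, train_y), (dev_X, dev_y), (test_X, test_y) = split_80_10_10(X, y)
print(len(train_X), len(dev_X), len(test_X))
```

Trying a different featurization is then a one-argument change, for example `parse_data(lines, featurizer=char_bigrams)`. Feature dicts like these can be converted into a matrix with scikit-learn's `DictVectorizer` from `sklearn.feature_extraction`. If your data points have dates attached, you would sort by date rather than shuffle before cutting.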

Rather than making a single split, you may also want to use cross-validation, which we will discuss in class.
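
As a preview, k-fold cross-validation can be sketched as follows: the data is cut into k folds, and each fold takes one turn as the held-out set while the model is fit on the rest. This is a minimal standard-library illustration, not the version we will develop in class. The function names and the toy "majority label" model are invented for the example.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs; each point is held out exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def cross_validate(X, y, fit, score, k=5):
    """Average a score over k folds; `fit` and `score` are supplied by the caller."""
    scores = []
    for train, held in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in held], [y[i] for i in held]))
    return sum(scores) / len(scores)


# Toy "model": always predict the most common label seen in training.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def accuracy(model, X, y):
    return sum(label == model for label in y) / len(y)


# Made-up data: the features are unused by this toy model.
X = list(range(10))
y = ["a"] * 7 + ["b"] * 3
mean_accuracy = cross_validate(X, y, fit_majority, accuracy, k=5)
print(mean_accuracy)
```

scikit-learn provides the same idea through `KFold` and `cross_val_score` in `sklearn.model_selection`, which are usually the better choice once you are working with real models.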

Commit the code you have written for parsing, feature extraction, splitting (if relevant), etc. by the deadline. Also create a document named data-collection.md in the repository, in which you summarize what you have done, as well as any difficulties you've encountered that you need help with. It should also include a link to download the data.

I will primarily look at data-collection.md rather than your code, so document all your efforts.

Milestone 3: Update

Due by 11:59 pm Thursday, April 20.

Commit all the code you've written so far.

In a file named update.md, describe whether you have reached the goal you set for yourself for this milestone, whether your plans for the project have changed, what work is pending before your planned completion, and any difficulties you're encountering.

Milestone 4: Peer Review

Due by 11:59 pm Thursday, May 11.

Pair up with another team. Describe your project goals, experiments, and results to each other. This is a chance for you to share your work with someone besides me!

Use Piazza to pair up with teams. Since we have 17 teams, there should be one triangle (A reviews B, B reviews C, C reviews A).

Each team must submit feedback on the other team's project by 11:59 pm on May 11.

Final Paper

Due by 4:00 pm Monday, May 15.

Push all your code. Briefly document what each code file does in your README.

As usual, do not commit your data; instead, place a link and directions for downloading the data in README.md (which you may have done already in a previous milestone).

Write a 5-6 page paper summarizing the literature, your methods, experimental results, and ideas for future work. You may write it as a PDF, HTML, Google Doc, or Markdown. Push it to your repository (or if it's a Google Doc, share it with me and place a link in the README).

The paper should resemble a published experimental ML paper in its format and flow. As a guideline, it will contain:

A description of your data, problem and motivations.
An overview of existing work.
A description of your machine learning models, and why you chose them.
The results of your experiments, including numbers, visualizations, and interpretations as appropriate.
Analysis of any shortcomings of your work, and ideas for future research.

I will not push your project grades to the repository, so feel free to make the repos public to share them with your classmates or others.

CS 349-02:
Machine Learning

Spring 2017, Wellesley College
