CS 349-02:
Machine Learning

Spring 2017, Wellesley College

PS6: Neural Networks for Named Entity Recognition

Out: Mon, Apr 17th      Due: Thu, Apr 27th at 11:00 pm EST
Repository: click here

This problem set implements a model for Named Entity Recognition (identifying people, locations, etc. in sentences) using two different neural network architectures. It is partially adapted from here.

As in all problem sets, you may work in pairs (and are, in fact, encouraged to do so). Use the search for teammates post on Piazza to find partners.

Objective

  • Gain practice working with Keras
  • Understand different ways of approaching sequence classification problems with neural networks

Setup

Click the repository link at the top of the page while logged in to GitHub. This should create a new private repository for you with the skeleton code.

If you are working in a pair, go to Settings > Collaborators and Teams in your repository and add your partner as a collaborator. You will work on and submit the code in this repository together. Each pair should only have one repository for this problem set; delete any others lying around.

Clone this repository to your computer to start working. Download the data.zip file from here, unzip it, and place the resulting data directory in your repository clone.

If you are working on tempest (because you can't install keras, for example), follow all the instructions in this box:
  • Clone your repo into tempest. Do not download the data.
  • Create a soft link to the data by cd-ing into your clone and typing
    ln -s /home/sravana/public_html/ml/ps6/data .
    
  • Copy keras.json into ~/.keras by typing these commands from inside your clone.
    mkdir -p ~/.keras
    cp keras.json ~/.keras/
      
  • Add the following lines to your ~/.bashrc file.
    export PYTHONPATH='/home/sravana/nlpbin/lib/python2.7/site-packages/':$PYTHONPATH
    alias python='/opt/bin/python2.7'
      

Commit your changes early and often! There's no separate submission step: just fill out honorcode.py and README.md, and commit. The last commit before the deadline will be graded, unless you increment the LateDays variable in honorcode.py.

See the workflow and commands for managing your Git repository and making submissions.

All code is to be written in ner.py.

Problem Description

Named Entity Recognition is the task of locating and classifying named entities in text into pre-defined categories such as the names of persons, organizations, locations, etc. In this assignment, given a word in context, we want to predict whether it represents one of four categories:

  • Person (PER): e.g. 'Martha Stewart', 'Obama', 'Tim Wagner', etc. Pronouns like 'he' or 'she' are not considered named entities.
  • Organization (ORG): e.g. 'American Airlines', 'Goldman Sachs', 'Department of Defense'.
  • Location (LOC): e.g. 'Germany', 'Panama Strait', 'Brussels', but not unnamed locations like 'the bar' or 'the farm'.
  • Miscellaneous (MISC): e.g. 'Japanese', 'USD', '1,000'.

We formulate this as a 5-class classification problem, using the four classes above and a null class (O) for words that do not represent a named entity (most words fall into this category). For an entity that spans multiple words ('Department of Defense'), each word is tagged separately, and every contiguous sequence of non-null tags is considered a single entity.

Here is a sample sentence with the named entities tagged above each token.

ORG ORG O PER PER O
United Airlines CEO Oscar Munoz arrived

To evaluate the quality of an NER system's output on the test set, we look at the F-score (the harmonic mean of precision and recall) for each of the 5 classes.
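
For reference, a minimal sketch of the F-score computation in Python (the function name is illustrative, not part of the skeleton code):

# F-score: the harmonic mean of precision and recall.
def f_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)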

Word Vectors

Both Part A and Part B rely on word vectors, which are provided in the constructor. In Part A you use the word vectors explicitly; in Part B you only need to map each word to its index in the vocabulary, and the Embedding layer maps indices to vectors (see the sketch at the end of this section).

These word vectors have been trained with word2vec on another dataset, so that words with similar meanings have similar vectors and dissimilar words have dissimilar vectors.

Here's a PCA projection of some of the word vectors onto two dimensions, to give you an idea of what the space looks like. Some information is obviously lost going from the original 50 dimensions down to 2.
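
To give a concrete picture of how Part B uses these vectors, here is a minimal sketch of seeding a Keras Embedding layer with pretrained vectors. The name word_vectors is illustrative (not a skeleton variable) and assumes a NumPy array of shape (vocabulary size, 50) whose row i is the vector of the word with index i.

from keras.layers import Embedding

# Illustrative sketch, not required code: initialize the layer with the
# pretrained vectors, so an input word index i is mapped to word_vectors[i].
embedding = Embedding(input_dim=word_vectors.shape[0],   # vocabulary size
                      output_dim=word_vectors.shape[1],  # 50 dimensions
                      weights=[word_vectors])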

Part A: Windowed Named Entity Recognition [20 pts]

Implement the load_conll, train, and predict methods in WindowedNER.

Run the code to train the model and get predictions.

python ner.py window
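
If you are unsure how to structure the model, here is a rough sketch of one possible windowed classifier (the window size, hidden-layer size, and optimizer are illustrative, not requirements): each row of X concatenates the 50-dimensional vectors of the words in a small window around the target word, and each row of y is a one-hot vector over the 5 classes.

from keras.models import Sequential
from keras.layers import Dense

window_size = 5   # e.g. the target word plus two words of context on each side
model = Sequential()
model.add(Dense(100, activation='relu', input_dim=50 * window_size))  # hidden layer
model.add(Dense(5, activation='softmax'))                             # 5 NER classes
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# model.fit(X_train, y_train) to train; model.predict(X_test) gives class probabilities.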

Part B: Named Entity Recognition with LSTMs [15 pts]

Follow the instructions to implement the load_conll and predict methods in LSTMNer.

python ner.py lstm

Note that the doc-string in the load_conll method for this class should say "Each data point (row in X) is a sentence, represented as the concatenation of the indices of each word in the sentence" rather than the concatenation of one-hot vectors.
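
As a rough sketch of the architecture (layer sizes and variable names like vocab_size and word_vectors are illustrative assumptions, not skeleton code): the Embedding layer looks up the pretrained vector for each word index, the LSTM reads the sentence, and a TimeDistributed Dense layer predicts one of the 5 classes at every position.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=50,
                    weights=[word_vectors], mask_zero=True))  # word index -> pretrained vector
model.add(LSTM(100, return_sequences=True))                   # one hidden state per time step
model.add(TimeDistributed(Dense(5, activation='softmax')))    # per-word class probabilities
model.compile(loss='categorical_crossentropy', optimizer='adam')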

Part C: Execution [5 pts]

Once you have the above sections working, run them and observe the prediction F-scores and confusion matrices. Answer the questions in README.md.

Testing

A tester.py script has been pushed to your repositories. Download windowed.pickle and lstm.pickle from here into your repository clone. If you're working on tempest, create soft links to these files from your clone instead:

ln -s /home/sravana/public_html/ml/ps6/windowed.pickle .
ln -s /home/sravana/public_html/ml/ps6/lstm.pickle .

Run the code to check part A or part B:

python tester.py a
python tester.py b

Complete honorcode.py and README.md, and push your files by Thu, Apr 27th at 11:00 pm EST.