PS6: Neural Networks for Named Entity Recognition
This problem set implements a model for Named Entity Recognition (identifying people, locations, etc. in sentences) using two different neural network architectures. It is partially adapted from here.
As in all problem sets, you may work in pairs (and are, in fact, encouraged to do so). Use the search for teammates post on Piazza to find partners.
Objective
- Gain practice working with Keras
- Understand different ways of approaching sequence classification problems with neural networks
Setup
Click the repository link at the top of the page while logged on to GitHub. This should create a new private repository for you with the skeleton code.
If you are working in a pair, go to Settings > Collaborators and Teams in your repository and add your partner as a collaborator. You will work on and submit the code in this repository together. Each pair should have only one repository for this problem set; delete any others lying around.
Clone this repository to your computer to start working. Download the data.zip file from here, unzip it, and place the resulting data directory in your repository clone.
If you are working on tempest, do the following instead:
- Clone your repo into tempest. Do not download the data.
- Create a soft link to the data by cd-ing into your clone and typing:
ln -s /home/sravana/public_html/ml/ps6/data .
- Copy keras.json into the ~/.keras directory by typing these commands from your clone:
mkdir -p ~/.keras
cp keras.json ~/.keras/
- Add the following lines to your ~/.bashrc file.
export PYTHONPATH='/home/sravana/nlpbin/lib/python2.7/site-packages/':$PYTHONPATH
alias python='/opt/bin/python2.7'
Commit your changes early and often! There's no separate submission step: just fill out honorcode.py and README.md, and commit. The last commit before the deadline will be graded, unless you increment the LateDays variable in honorcode.py.
See the workflow and commands for managing your Git repository and making submissions.
All code is to be written in ner.py.
Problem Description
Named Entity Recognition is the task of locating and classifying named entities in text into pre-defined categories such as the names of persons, organizations, locations, etc. In this assignment, given a word in context, we want to predict whether it belongs to one of four categories:
- Person (PER): e.g. 'Martha Stewart', 'Obama', 'Tim Wagner', etc. Pronouns like 'he' or 'she' are not considered named entities.
- Organization (ORG): e.g. 'American Airlines', 'Goldman Sachs', 'Department of Defense'.
- Location (LOC): e.g. 'Germany', 'Panama Strait', 'Brussels', but not unnamed locations like 'the bar' or 'the farm'.
- Miscellaneous (MISC): e.g. 'Japanese', 'USD', '1,000'.
Here is a sample sentence with the named entities tagged above each token.
ORG    | ORG      | O   | PER   | PER   | O
United | Airlines | CEO | Oscar | Munoz | arrived
To evaluate the quality of an NER system's output on the test set, we look at the F-score (harmonic mean of precision and recall) for each of the five classes (the four entity types plus O for non-entities).
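For reference, per-class F-scores can be computed with scikit-learn; the tag sequences below are invented purely for illustration and are not from the assignment data.

```python
# Hedged illustration: per-class F-scores with scikit-learn.
# The tag sequences here are made up for demonstration only.
from sklearn.metrics import f1_score

y_true = ['ORG', 'ORG', 'O', 'PER', 'PER', 'O']
y_pred = ['ORG', 'O',   'O', 'PER', 'PER', 'O']
labels = ['PER', 'ORG', 'LOC', 'MISC', 'O']

# average=None returns one F-score per class, in the order given by `labels`.
print(f1_score(y_true, y_pred, labels=labels, average=None))
```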
Word Vectors
Both Part A and Part B rely on word vectors, which are provided in the constructor. In Part A you use the word vectors explicitly, while Part B only requires that you map each word to its index in the vocabulary; the Embedding layer then handles the mapping from indices to vectors.
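To make that distinction concrete, here is a minimal sketch. The names `wordvecs` and `vocab` are hypothetical stand-ins for whatever the skeleton actually provides, and the random matrix is just a placeholder for the real vectors.

```python
# Minimal sketch (attribute names are assumptions, not the skeleton's):
# `wordvecs` is a (vocab_size x 50) array and `vocab` maps words to row indices.
import numpy as np
from keras.layers import Embedding

wordvecs = np.random.rand(100, 50)            # stand-in for the provided vectors
vocab = {'united': 0, 'airlines': 1, 'ceo': 2}

# Part A: look up and concatenate the vectors of a window of words yourself.
window = ['united', 'airlines', 'ceo']
x = np.concatenate([wordvecs[vocab[w]] for w in window])   # shape (150,)

# Part B: hand the model word *indices*; an Embedding layer initialized with
# the provided vectors performs the index -> vector lookup inside the network.
emb = Embedding(input_dim=wordvecs.shape[0], output_dim=wordvecs.shape[1],
                weights=[wordvecs])
```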
These word vectors were trained using word2vec on another dataset, so that words with similar meanings have similar vectors.
Here's a PCA projection of some of the word vectors onto two dimensions, to give you an idea of what the space looks like. Some information is obviously lost going from the original 50 dimensions down to 2.
Part A: Windowed Named Entity Recognition [20 pts]
Implement the load_conll, train, and predict methods in WindowedNER.
Run the code to train the model and get predictions.
python ner.py window
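If it helps to visualize the approach, here is one possible shape for a windowed model in Keras. The window width, layer sizes, and variable names below are assumptions for illustration, not requirements of the assignment.

```python
# Sketch of a windowed model (all sizes and names are assumptions).
# Each training example is the concatenation of the word vectors in a
# fixed-size window around the target word; the label is the target's tag.
from keras.models import Sequential
from keras.layers import Dense

window_size, vec_dim, n_classes = 3, 50, 5

model = Sequential()
model.add(Dense(100, activation='tanh', input_dim=window_size * vec_dim))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# model.fit(X_train, Y_train, ...) where X_train has shape
# (num_words, window_size * vec_dim) and Y_train is one-hot over the tags.
```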
Part B: Named Entity Recognition with LSTMs [15 pts]
Follow the instructions to implement the load_conll and predict methods in LSTMNer.
python ner.py lstm

Note that the doc-string in the load_conll method for this class should say "Each data point (row in X) is a sentence, represented as the concatenation of the indices of each word in the sentence", rather than the one-hot vectors.
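As a rough sketch of the sequence-labeling setup (the sizes, padding length, and names below are assumptions, not the required architecture), each row of X is a sentence of word indices, and the Embedding layer maps those indices to the provided word vectors:

```python
# Sketch of a sequence-labeling LSTM (sizes and names are assumptions).
# Each row of X is a sentence of word indices (padded to a fixed length).
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size, vec_dim, max_len, n_classes = 10000, 50, 40, 5
wordvecs = np.random.rand(vocab_size, vec_dim)   # stand-in for the real vectors

model = Sequential()
model.add(Embedding(vocab_size, vec_dim, weights=[wordvecs],
                    input_length=max_len))
model.add(LSTM(100, return_sequences=True))       # one output per word
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```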
Part C: Execution [5 pts]
Once you have the above sections working, run them and observe the prediction F-scores and confusion matrices. Answer the questions in README.md.
Testing
A tester.py script has been pushed to your repositories. You need to download windowed.pickle and lstm.pickle into your repo clones from here.
If you're working on tempest, create soft-links to these files from your clone:
ln -s /home/sravana/public_html/ml/ps6/windowed.pickle .
ln -s /home/sravana/public_html/ml/ps6/lstm.pickle .
Run the code to check part A or part B:
python tester.py a
python tester.py b
Complete honorcode.py and README.md, and push your files by Thu, Apr 27th at 11:00 pm EST.