PS6: Neural Networks for Named Entity Recognition
Repository: click here
This problem set implements a model for Named Entity Recognition (identifying people, locations, etc. in sentences) using two different neural network architectures. It is partially adapted from here.
As in all problem sets, you may work in pairs (and are, in fact, encouraged to do so). Use the search for teammates post on Piazza to find partners.
- Gain practice working with Keras
- Understand different ways of approaching sequence classification problems with neural networks
Click the repository link at the top of the page while logged on to GitHub. This should create a new private repository for you with the skeleton code.
If you are working in a pair, go to Settings > Collaborators and Teams in your repository and add your partner as a collaborator. You will work on and submit the code in this repository together. Each pair should have only one repository for this problem set; delete any others lying around.
Clone this repository to your computer to start working. Download the data.zip file from here, unzip it, and place the resulting data directory in your repository clone.
- Clone your repo into tempest. Do not download the data.
- Create a soft link to the data by cd-ing into your clone and typing
ln -s /home/sravana/public_html/ml/ps6/data .
- Copy keras.json from your clone to your home directory
by typing these commands from your clone.
mkdir -p ~/.keras
cp keras.json ~/.keras/
- Add the following lines to your ~/.bashrc file.
export PYTHONPATH='/home/sravana/nlpbin/lib/python2.7/site-packages/':$PYTHONPATH
alias python='/opt/bin/python2.7'
Commit your changes early and often! There's no separate submission step: just fill out honorcode.py and README.md, and commit. The last commit before the deadline will be graded, unless you increment the LateDays variable in honorcode.py.
See the workflow and commands for managing your Git repository and making submissions.
All code is to be written in ner.py.
Named Entity Recognition is the task of locating and classifying named entities in text into pre-defined categories such as the names of persons, organizations, locations, etc. In this assignment, for a given word in context, we want to predict whether it represents one of four categories:
- Person (PER): e.g. 'Martha Stewart', 'Obama', 'Tim Wagner', etc. Pronouns like 'he' or 'she' are not considered named entities.
- Organization (ORG): e.g. 'American Airlines', 'Goldman Sachs', 'Department of Defense'.
- Location (LOC): e.g. 'Germany', 'Panama Strait', 'Brussels', but not unnamed locations like 'the bar' or 'the farm'.
- Miscellaneous (MISC): e.g. 'Japanese', 'USD', '1,000'.
Here is a sample sentence with the named entities tagged above each token.
To evaluate the quality of a NER system's output on the test set, we look at the F-score (the harmonic mean of precision and recall) for each of the five classes (the four entity categories above, plus the class for non-entity tokens).
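As a quick illustration of how a per-class F-score combines precision and recall, here is a minimal sketch (not the grader's implementation; the labels below are made up):

```python
def f_score(gold, pred, cls):
    """Per-class F1 from parallel gold/predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

gold = ['PER', 'O', 'LOC', 'PER', 'O']
pred = ['PER', 'PER', 'LOC', 'O', 'O']
print(f_score(gold, pred, 'PER'))  # precision 0.5, recall 0.5 -> F1 0.5
```

Note that a class that is rarely predicted can still have high precision but low recall; the harmonic mean punishes that imbalance.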
Both Part A and Part B rely on word vectors, which are provided in the constructor. You need to explicitly use the word vectors in Part A, while Part B only requires that you map each word to its index in the vocabulary, and the Embedding layer does the mapping to the vectors.
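To make the two input representations concrete, here is a toy sketch (the vocabulary and embedding matrix below are hypothetical stand-ins for the ones provided in the constructor):

```python
import numpy as np

# Hypothetical tiny vocabulary and embedding matrix; row i holds the
# vector for the word whose index is i.
vocab = {'the': 0, 'obama': 1, 'visited': 2, 'germany': 3}
embeddings = np.random.rand(len(vocab), 50)  # 50-dim, like the provided vectors

sentence = ['obama', 'visited', 'germany']

# Part B style: map each word to its index; the Embedding layer
# does the index-to-vector lookup inside the network.
indices = [vocab[w] for w in sentence]
print(indices)  # [1, 2, 3]

# Part A style: look the vectors up yourself.
vectors = np.array([embeddings[vocab[w]] for w in sentence])
print(vectors.shape)  # (3, 50)
```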
These word vectors were trained using word2vec on another dataset, so that words with similar meanings have similar vectors.
Here's a PCA projection of some of the word vectors onto two dimensions, to give you an idea of what the space looks like. Some information is obviously lost going from the original 50 dimensions down to 2.
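Cosine similarity is the usual way to measure "closeness" in this space. The vectors below are invented for illustration (they are not from the provided embeddings), but they show the property we expect: related words score higher than unrelated ones.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 3-dim "embeddings" for illustration only.
paris = np.array([0.9, 0.1, 0.3])
berlin = np.array([0.8, 0.2, 0.35])
banana = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(paris, berlin) > cosine_similarity(paris, banana))  # True
```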
Part A: Windowed Named Entity Recognition [20 pts]
predict methods in
Run the code to train the model and get predictions.
python ner.py window
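The windowed model classifies each token from a fixed-size window of surrounding words, with padding at sentence boundaries. A minimal sketch of building those features (function and padding-word names are hypothetical, not from the skeleton code):

```python
import numpy as np

def window_features(sentence, embeddings, vocab, half=1, pad='<pad>'):
    """One row per token: the concatenation of the embeddings of a
    window of 2*half+1 words, padding the sentence boundaries."""
    padded = [pad] * half + sentence + [pad] * half
    rows = []
    for i in range(len(sentence)):
        window = padded[i:i + 2 * half + 1]
        rows.append(np.concatenate([embeddings[vocab[w]] for w in window]))
    return np.array(rows)

# Hypothetical vocabulary and 50-dim embeddings.
vocab = {'<pad>': 0, 'obama': 1, 'visited': 2, 'germany': 3}
embeddings = np.random.rand(len(vocab), 50)

X = window_features(['obama', 'visited', 'germany'], embeddings, vocab)
print(X.shape)  # (3, 150): one row per token, three concatenated 50-dim vectors
```

Each row of X can then be fed to a feedforward classifier that predicts the label of the window's center word.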
Part B: Named Entity Recognition with LSTMs [15 pts]
Follow the instructions to implement the
python ner.py lstm
Note that the doc-string in the load_conll method for this class should say "Each data point (row in X) is a sentence, represented as the concatenation of the indices of each word in the sentence" rather than the one-hot vectors.
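In other words, for the LSTM each row of X is a whole sentence as a fixed-length sequence of word indices. A toy sketch of that representation (the helper name, pad index, and vocabulary are hypothetical):

```python
import numpy as np

def sentences_to_index_matrix(sentences, vocab, max_len, pad_index=0):
    """Each row is one sentence: the word indices in order,
    padded (or truncated) to max_len so sentences can be batched."""
    X = np.full((len(sentences), max_len), pad_index, dtype=int)
    for r, sent in enumerate(sentences):
        idx = [vocab[w] for w in sent][:max_len]
        X[r, :len(idx)] = idx
    return X

vocab = {'<pad>': 0, 'obama': 1, 'visited': 2, 'germany': 3, 'and': 4, 'france': 5}
X = sentences_to_index_matrix([['obama', 'visited', 'germany'],
                               ['germany', 'and', 'france']], vocab, max_len=5)
print(X)  # [[1 2 3 0 0]
          #  [3 4 5 0 0]]
```

The Embedding layer then turns each index row into a sequence of vectors for the LSTM, so no explicit vector lookup is needed in your code.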
Part C: Execution [5 pts]
Once you have the above sections working, run them and observe the prediction F-scores and confusion matrices. Answer the questions in README.md.
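When reading the confusion matrices, each cell counts how often a gold label was predicted as some (possibly different) label; off-diagonal mass shows which classes the model confuses. A minimal sketch of how such a matrix is tallied (labels below are made up):

```python
from collections import Counter

def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels;
    cell [g][p] counts tokens with gold label g predicted as p."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

labels = ['PER', 'LOC', 'O']
gold = ['PER', 'LOC', 'O', 'PER', 'O']
pred = ['PER', 'O',   'O', 'LOC', 'O']
for row in confusion_matrix(gold, pred, labels):
    print(row)
# [1, 1, 0]   <- one PER correct, one PER mistaken for LOC
# [0, 0, 1]   <- the LOC token was missed (predicted O)
# [0, 0, 2]   <- both O tokens correct
```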
A tester.py script has been pushed to your repositories.
You need to download windowed.pickle and lstm.pickle into your repo clones from here.
If you're working on tempest, create soft-links to these files from your clone:
ln -s /home/sravana/public_html/ml/ps6/windowed.pickle .
ln -s /home/sravana/public_html/ml/ps6/lstm.pickle .
Run the code to check part A or part B:
python tester.py a
python tester.py b
Commit and push your files by Thu, Apr 27th at 11:00 pm EST.