CS 349: Natural Language Processing

Spring 2016, Wellesley College

Assignment 6: Speech Recognition

Out: Thu, Apr 14      Due: 11:59 pm EST on Thu, Apr 28
Repository: github.com/wellesleynlp/yourusername-cs349-a6-speech


Build a template-based isolated-word speech recognizer.


You may work with someone you have collaborated with before. Use the Google Sheet to find a partner and set up your joint repository as instructed.

Install scikits.talkbox using pip. It requires scipy to be installed on your computer.

pip install scikits.talkbox

Unless you already have an audio program that can resample files, also install the sox command-line program.

If these installations give you trouble, use tempest as instructed in the Tools doc.

We will use a portion of the CSLU ISOLET corpus, consisting of recordings of isolated letters by different speakers. Each filename is of the form speakername-let1-t.wav (a recording of speakername saying the letter "let").
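
The filename convention above means the label can be read straight off the file name. A minimal sketch, assuming the digit after the letter is a repetition counter and that fields are always separated by hyphens (label_from_filename is a hypothetical helper, not part of the provided code):

```python
import os
import re


def label_from_filename(path):
    """Return the label encoded in a name like 'speakername-let1-t.wav'."""
    name = os.path.basename(path)       # e.g. "mvcw0-m1-t.mfc"
    token = name.split("-")[1]          # second field, e.g. "m1"
    return re.sub(r"\d+$", "", token)   # drop the trailing digit(s) -> "m"
```

Because only trailing digits are removed, the same helper would also handle multi-character labels such as a hypothetical speaker-open1-t.wav.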

Download the corpus from this link, and place the unzipped isolet folder in your repository clone (but don't add it when you push your code). If you are working on tempest, you should use the dataset stored in /home/sravana/nlp/isolet directly, without copying it to your home directory. Do this by creating a symbolic link to the wav directories.

cd your_repo_clone/
mkdir isolet
mkdir isolet/train
mkdir isolet/test
mkdir isolet/train/mfc
mkdir isolet/test/mfc
ln -s /home/sravana/nlp/isolet/test/wav isolet/test/wav
ln -s /home/sravana/nlp/isolet/train/wav isolet/train/wav

Important: Before you start, you must run the provided feature extraction script to compute MFCCs from your WAV files. Execute featurize.py appropriately on the train and test folders, which will write .mfc files corresponding to every .wav file into train/mfc and test/mfc. Each row in a .mfc file represents the cepstral coefficients from one time frame.

The default Python on tempest has changed recently. Please explicitly call /opt/bin/python2.7 since this is the one that has scipy, or set your PYTHONPATH.


Before the deadline (11:59 pm EST on Thu, Apr 28), push the files below to your repository following these instructions.

  • the completed dtwrecognize.py
  • the .csv output file created on your dataset for section (b)
  • README.md and reflection.md
In addition, upload to Google Drive the myrecordings dataset you create in section (b) as a .zip file named ghusername1_ghusername2-speech.zip (where the ghusername1_ghusername2 name matches your repository name). Make it public, or privately share it with me, and include the link in your README.

If you are working with a partner, make sure you have set things up as described so you are both contributing to a single repository.

Section (a): Word Recognition with Dynamic Time Warping [35 pts]

utils.py contains a helper function for kNN classification (like the one for A5). This function takes an integer k, a 2-d array distances, and a list trainlabels. distances[i, j] contains the DTW distance between the ith test audio file and the jth training audio file, and trainlabels[j] contains the label of the jth training file. (In this case, the labels are the transcriptions of the files -- "a", "b", "q", etc.) The function returns a list testlabels where testlabels[i] is the predicted transcription of the ith test audio file.
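
To make the inputs and output concrete, here is an illustrative sketch of a function with the same contract. It is not the code in utils.py: the name knn_predict and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def knn_predict(k, distances, trainlabels):
    """Predict one label per row of `distances` by majority vote of the k nearest."""
    testlabels = []
    for row in distances:
        # Indices of the k training files closest to this test file.
        nearest = sorted(range(len(trainlabels)), key=lambda j: row[j])[:k]
        votes = Counter(trainlabels[j] for j in nearest)
        best = max(votes.values())
        # Assumed tie-break: among the top-voted labels, take the closest neighbour's.
        label = next(trainlabels[j] for j in nearest if votes[trainlabels[j]] == best)
        testlabels.append(label)
    return testlabels
```

With two test files and three training files labelled "a", "b", "a", the distance rows [0.5, 2.0, 0.7] and [3.0, 0.1, 2.5] give ["a", "b"] for k=1 and ["a", "a"] for k=3.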

Since you're mature Python programmers by now, no other starter code is provided for this section. Write your entire program in a file named dtwrecognize.py. This program should take these arguments in order:

  1. path to the training directory with MFCC files
  2. integer value of k for kNN classification
  3. a string specifying whether to run in "batch" or "single"-file mode. Batch mode computes transcription predictions for several test files using pre-processed MFCCs, while single mode reads a WAV file and predicts its transcription.
  4. the test directory name with MFCC files in case of "batch" mode, or a filename of a .wav file in case of "single" mode.
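
One possible way to read these four arguments is sketched below. The function name parse_args and the error messages are illustrative choices, not requirements of the assignment.

```python
def parse_args(argv):
    """Split a sys.argv-style list into (traindir, k, mode, target)."""
    if len(argv) != 5:
        raise ValueError("usage: dtwrecognize.py TRAINDIR K batch|single TARGET")
    traindir, k, mode, target = argv[1:]
    if mode not in ("batch", "single"):
        raise ValueError("mode must be 'batch' or 'single', got %r" % mode)
    # k arrives as a string on the command line and must become an integer.
    return traindir, int(k), mode, target


# In the real program this list would be sys.argv.
args = parse_args(["dtwrecognize.py", "isolet/train/mfc", "1", "batch", "isolet/test/mfc"])
```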

Batch Mode


python dtwrecognize.py isolet/train/mfc 1 batch isolet/test/mfc
  1. Computes the DTW distances (with Euclidean distance for each frame) between every MFCC training file in isolet/train/mfc and every MFCC test file in isolet/test/mfc.

    Debug by checking that the DTW distance between isolet/test/mfc/mvcw0-m1-t.mfc and isolet/train/mfc/fgw0-x1-t.mfc is 271.04.
  2. Runs kNN with k=1 on this distance matrix to predict the transcriptions of the test files.
  3. Produces a hypothesis file named isolet-1.csv, identical (except for line ordering) to the reference isolet-1.csv, where each line contains the test filename,actual letter,predicted letter.
  4. Finally, prints out the accuracy score to standard output, which is 78.84% for this example.
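
Step 1 is the core of the recognizer. The sketch below shows one common formulation of DTW, in which each cell adds the Euclidean frame distance to the cheapest of its three predecessors. It is a pure-Python illustration under that assumption; other step patterns or normalizations exist, and it has not been checked against the 271.04 debugging value above.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(x, y):
    """Cost of the best monotonic alignment of frame sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i frames of x
    # with the first j frames of y.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(x[i - 1], y[j - 1])
            # Extend the cheapest of: skip ahead in x, skip ahead in y, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Each argument is a sequence of frames, one per row of a .mfc file, so the rows of a 2-d array can be passed directly. Computing the frame distances with array operations instead of Python loops is one way to approach the running time.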

The program takes about 1 minute on the isolet data. You can run it on a subset of the test data for faster debugging.
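
Steps 3 and 4 of batch mode amount to writing one row per test file and counting matches. A small sketch, assuming the filename column holds the .mfc file name (the handout does not say whether the extension or directory is included) and using made-up predictions:

```python
import csv
import io


def write_hypotheses(out, filenames, actual, predicted):
    """Write 'filename,actual,predicted' rows to `out`; return accuracy in percent."""
    writer = csv.writer(out, lineterminator="\n")
    correct = 0
    for name, gold, guess in zip(filenames, actual, predicted):
        writer.writerow([name, gold, guess])
        if gold == guess:
            correct += 1
    return 100.0 * correct / len(filenames)


# Demo on an in-memory buffer; the predicted labels here are invented.
buffer = io.StringIO()
accuracy = write_hypotheses(
    buffer, ["mvcw0-m1-t.mfc", "fmf0-x1-t.mfc"], ["m", "x"], ["m", "s"]
)
print("Accuracy: {:.2f}%".format(accuracy))
```

Passing an open file object instead of the buffer would write the rows to disk.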

Single-File Mode

Testing on a set of MFCC files and computing accuracy is great for development and evaluation, but it's not a particularly fun tool to show your friends.

Add an option for the program to read a single wav file of someone saying a letter, compute its MFCC representation on the fly, and predict its transcription using your DTW and kNN code from batch mode. Use the function from featurize.py. For example, executing

python dtwrecognize.py isolet/train/mfc 3 single isolet/test/wav/fmf0-x1-t.wav

should print out

Predicted label: x

Section (b): Personalized Speech Recognition [10 pts]

Partly relies on section (a).

Your program can be used out-of-the-box to build a recognizer for any small vocabulary task. Come up with your own vocabulary: names, commands for a device ["open", "read", "turnon"], anything that floats your boat. The vocabulary does not even have to be in English. Each "word" must be relatively short, but may contain multiple syllables.

Make about 5 recordings of each word. Partners should both contribute recordings -- data from multiple speakers makes a better system. You can also clip speech segments containing these words from other sound files (from YouTube, podcasts, etc.) if you prefer. Save them as 16000 Hz mono-channel files. Use SoX or any audio program to convert the recordings.

sox infile.wav -r 16000 -c 1 outfile.wav
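
After converting, it can be worth confirming that a recording really is 16000 Hz mono before featurizing it. A minimal check using Python's standard wave module (is_16k_mono is a hypothetical helper name):

```python
import wave


def is_16k_mono(path):
    """True if the WAV file at `path` has one channel and a 16000 Hz sample rate."""
    reader = wave.open(path, "rb")
    try:
        return reader.getnchannels() == 1 and reader.getframerate() == 16000
    finally:
        reader.close()
```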

Store these files in a directory named myrecordings having the same directory and filename structure as isolet, with about 4 recordings per word for training and 1 for testing.

Test the performance of the recognizer on this dataset in batch mode. You can vary k if you think it will help. What's the accuracy? Is it better or worse than you expected given the results on the isolet data? Push the resulting myrecordings-k.csv file with the best accuracy result.

Finally, record yourself saying a few isolated letters, and run the recognizer trained on the isolet data in single-file mode on those recordings. (Remember to resample the wav files.) Is it comparable to the performance on the isolet test files?

Misc [5 pts]

  • Answer the remaining questions in reflection.md.
  • Fill out README.md and submit your work correctly.
  • Check that your code is reasonably documented, elegant, and efficient.