Cleaning Transcript Data with Python

By | May 31, 2017

Performing an analysis of text data, or using text data to train machine learning models, often requires a lot of data. People usually look to Wikipedia for large amounts of text, but scholars occasionally make use of less traditional sources, like movie reviews for sentence-level sentiment analysis or Ubuntu IRC chat dialogue. Movie reviews are paired with scores on a scale of 1-10, making them excellent data for supervised machine learning. Similarly, there exists a mountain of professionally annotated text data that can be linked with data of other mediums as well (video, image, audio, etc.): movie and T.V. show subtitle data.

There are a variety of sources you can obtain this data from. In my search for The Big Bang Theory transcripts, I found this site, which contained a gold-standard transcript of each Big Bang Theory episode. I will be using this data in my example code. It only took a short script to scrape the data in one go, similar to how I obtained rated data on quotes in this post. Subtitle data from .srt files may not be as nicely formatted as this, and it might not have the character’s name paired with their lines.

Once you have the data, you will need to transform it into your desired format (frequently CSV), or perhaps just a single text file. Here I will show how to accomplish both. Let’s start with an example where we have data from each episode in a separate file, and a goal of consolidating it into one file while also trimming out the lines we don’t want (like timestamps, or scene descriptions such as “Leonard’s apartment, shortly after the comic book store. . .”).

import os
import sys

# takes the path to the folder with the data as an argument, e.g., "C:\User\Documents\bbt_data"
# (made absolute so the path is still valid after the os.chdir calls below)
directory = os.path.abspath(sys.argv[1])
os.chdir(directory) # move to data folder
os.chdir('..') # move outside of the folder to save output there
outputfile = open('full_bbt_corpus_s1-s9_raw', 'w')
print("Saving output to: %s" % os.getcwd())

I should first point out that these examples are scripts and nothing more. If you were developing code that you planned to build on, you would want to put these lines inside a main function and write functions for the common operations. For data preparation and cleaning, though, ad-hoc scripts are often preferable because each dataset is different, and Python is an excellent language for these tasks.
The first thing we do is store the path to the folder in the string variable “directory”. Then we move to the parent directory so we can save the output file there and not have it get mixed in with the transcript files. To read each file, we’ll use a nested for loop like:

for filename in os.listdir(directory): # read each file in the data folder, line by line

    # build the full path name in a way that works on any operating system
    path = os.path.join(directory, filename)

    with open(path, 'r') as f: # the episode's file is closed automatically when this block ends
        for line in f:
            outputfile.write(line) # write the line (apply any cleaning to it before this call)

outputfile.close() # close the output file

This is the basic method to achieve what I described. Right before “outputfile.write(line)” you can make any modifications to the line variable that you need. In my case, I stripped out stray Unicode characters and skipped lines without dialogue. If your data spans multiple directories, you can adapt the script to use the os.walk function (I have a post that uses it here). The unabridged, copy-and-paste-ready code (as well as the compressed raw data, if you don’t want to scrape it) is available by subscribing.
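As a rough illustration of the cleaning step and the os.walk variation mentioned above, here is a minimal sketch. The names `clean_line`, `transcript_paths`, and the `DIALOGUE` pattern are hypothetical and not from the original script. The sketch assumes that "stripping Unicode" means dropping non-ASCII characters and that a line of dialogue looks like "Name: text"; both assumptions would need adjusting for your own data.

```python
import os
import re

# Assumption: a line of dialogue starts with a speaker name followed by a colon.
DIALOGUE = re.compile(r"^[A-Za-z .'\-]+:\s*\S")


def clean_line(line):
    """Return a cleaned dialogue line, or None if the line should be skipped."""
    # Drop any non-ASCII characters (curly quotes, accented letters, and so on).
    line = line.encode('ascii', 'ignore').decode('ascii').strip()
    if not DIALOGUE.match(line):
        return None  # blank line, timestamp, or scene description
    return line


def transcript_paths(root):
    """Yield the path of every file under root, including files in subfolders."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a stable order
        for filename in sorted(filenames):
            yield os.path.join(dirpath, filename)
```

Inside the inner loop you would call `clean_line(line)` and only write the result when it is not None. Note that dropping non-ASCII characters also removes curly apostrophes, so “I’m” becomes “Im”; replacing those characters with ASCII equivalents first may suit your data better.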

And now, if we want to create a CSV formatted file, we only need to make a few adjustments. Continuing with the example of Big Bang Theory dialogue data, let’s try to organize the data to best reflect how Sheldon responds to things. The first column will be the context, or the things the other characters have said. The second column will be Sheldon’s line. The third column will be a ‘1’ as a label for later use. All we have to do is set a context variable to an empty string before the loops and add this little bit of logic within our inner for loop:

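        line = line.strip() # drop the trailing newline so each row stays on one line

        if line.startswith('Sheldon:'):
            if context == '':
                context += line + " " # nothing to respond to yet, so keep the line as context
                continue

            line = line.replace(',','') # remove characters that would split the column
            line = line.replace(';', '')
            outputfile.write('"'+ context + '", ' + line + ", 1" + '\n') # context, Sheldon's line, label
            context = ''
            continue # the row is written, so don't add Sheldon's line to the next context

        context += line + " " # save the preceding line for context/reference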

Plain English translation of what this does: if this is Sheldon’s line and there is nothing stored in the context variable, add Sheldon’s line to the context and continue with the next iteration of the for loop. If there is already data in the context variable, write a row to the output file with commas separating the columns and a newline character at the end, then clear the context.
If this is not Sheldon’s line, append it to the context variable.
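To make that description concrete, here is a small self-contained sketch that applies the same rules to a handful of lines held in memory instead of a file. The function name `pair_context_with_replies` and the sample dialogue are made up for illustration; they are not taken from the transcripts or from the original script.

```python
def pair_context_with_replies(lines, speaker='Sheldon:'):
    """Return (context, reply) pairs for every reply by the given speaker."""
    rows = []
    context = ''
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(speaker):
            if context == '':
                # Nothing to respond to yet: keep this line as context.
                context += line + ' '
                continue
            rows.append((context.strip(), line))
            context = ''
            continue
        # Any other character's line accumulates as context.
        context += line + ' '
    return rows


# Invented sample dialogue, used only to exercise the logic.
sample = [
    "Leonard: Are you coming to lunch?\n",
    "Penny: We're leaving in five minutes.\n",
    "Sheldon: I need ten.\n",
    "Sheldon: Possibly eleven.\n",
    "Leonard: Fine.\n",
    "Sheldon: Thank you.\n",
]

for context, reply in pair_context_with_replies(sample):
    print(repr(context), '->', repr(reply))
```

The sample produces two rows. The second one shows the edge case from the description: “Sheldon: Possibly eleven.” arrives when the context is empty, so it becomes the start of the context for his next reply.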

If you’re newer to programming, it may help to read and understand that last line before the rest of the code, since it accumulates the dialogue until Sheldon’s line appears. Notice how the continue statement allows the graceful continuation of the for loop; there are programming languages (that shall not be named) that don’t allow this. I will also work on uploading the data files I’m using, or better yet an IPython notebook, to make these examples easier to interact with. Though I didn’t use it, the csv module in Python is extremely useful: writing a row is as simple as creating a list and passing it to a writer’s writerow() method, with each element of the list being a column.
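For comparison, here is a minimal sketch of the same three-column row written with the standard library’s csv module. The helper name `write_rows` and the sample row are hypothetical. The sketch writes to an in-memory buffer so it can run on its own; with a real file you would open it with newline='' as the csv documentation recommends.

```python
import csv
import io


def write_rows(rows, handle):
    """Write (context, reply) pairs as CSV rows with a constant label of 1."""
    writer = csv.writer(handle)
    for context, reply in rows:
        # Each list element becomes one column; quoting is handled for us.
        writer.writerow([context, reply, 1])


# Invented sample row containing a comma and double quotes.
buffer = io.StringIO()
write_rows([('Leonard: Lunch, anyone?', 'Sheldon: He said "no".')], buffer)
print(buffer.getvalue())
```

Because the writer quotes fields that contain commas or double quotes, there is no need to strip those characters from the dialogue the way the manual string concatenation does.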

That’s all for now. I plan on making this post part of a series, so more will be added in the future. There should be at least four parts: scraping the data, transforming and cleaning the data, creating test and validation sets of data for machine learning models, and an interesting way to apply the resultant model.
