Abstract

This vignette demonstrates how to use the transcribeR package in R. The transcribeR package is built to request and receive transcriptions from HP IDOL OnDemand's Speech Recognition API. The package is easy to use with the related R package stm for the estimation of audio and video topic models.

Table of Contents

Introduction

Overview

Dependencies

HP IDOL OnDemand's Speech Recognition API

Calling Asynchronous

API Key

Job IDs

Language Codes

Segmentation

Usage Quotas

transcribeR CSVs

The sendAudioGetJobs Function

The retrieveText Function

Examples

Introduction

Overview

Politicians and other speakers increasingly turn to YouTube and other video-sharing websites both to convey their messages to millions around the world and to digitally archive their speeches. As a consequence, data valuable to social scientists and historians are locked away in audio and video formats that have traditionally been intractable. The transcribeR package provides convenient access to HP IDOL OnDemand's Speech Recognition API and allows audio and video files to be processed en masse, with all input and output organized into one file. The sendAudioGetJobs function sends all the files in a specified directory to the Speech Recognition API and stores the Job IDs returned by HP in a CSV file. Once all the files are posted to HP, the transcripts can be extracted with the retrieveText function and stored in the same CSV. Hence, all of the information about the audio files and their transcripts is stored in one CSV, eliminating the confusion and alignment errors that could arise from storing the filenames, Job IDs, and transcripts in separate files.

Dependencies

The transcribeR package has the following easy-to-obtain dependencies:

In addition, transcribeR currently supports uploading the following file formats: WAV, MP3, MP4, WMA.

HP IDOL OnDemand's Speech Recognition API

HP IDOL OnDemand's Speech Recognition API is trained to transcribe both audio and video files of human speech. It can work with both local audio and video files as well as video files available from a URL. The transcription time of the video depends on the length and the segmentation of the video, and it can take anywhere from 0.05 to 0.20 seconds per second of video.
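As a rough worked example of the range quoted above, the sketch below turns a recording length into lower and upper bounds on transcription time. The helper name estimate_transcription_seconds is hypothetical and is not part of transcribeR; actual times also depend on segmentation and on the service itself.

```r
# Rough bounds on transcription time, using the 0.05 to 0.20 seconds of
# processing per second of media quoted above.
# estimate_transcription_seconds is a hypothetical helper, not part of transcribeR.
estimate_transcription_seconds <- function(duration_seconds) {
  c(lower = 0.05 * duration_seconds, upper = 0.20 * duration_seconds)
}

# Bounds for a 10-minute (600-second) recording
estimate_transcription_seconds(600)
```

For a 10-minute recording this gives roughly 30 to 120 seconds, which is a useful guide when choosing how long to wait before requesting transcripts.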

Calling Asynchronous

The two requests required for calling the API are POST and GET. POST posts a file to the API and GET retrieves the resulting JSON object, which contains the status and output. Because the transcription process takes a nontrivial amount of time to complete, the requests cannot be completed simultaneously, so the output isn't immediately available. Instead, HP IDOL OnDemand provides a Job ID for tracking the status of the call, and connection with the API does not need to be maintained. When the call is complete, the JSON object has the “finished” status and the transcription.
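The post-then-poll pattern described above can be sketched without touching the real API. The mock service below is purely illustrative: make_mock_service, post, and get are hypothetical names, and the mock simply reports "queued" for the first few polls before reporting "finished". It is not how transcribeR or HP IDOL OnDemand is implemented.

```r
# Mock asynchronous service, for illustration only. None of these functions
# belong to transcribeR or to the HP IDOL OnDemand API.
make_mock_service <- function(polls_until_done = 3) {
  jobs <- list()
  post <- function(filename) {
    job_id <- paste0("job-", length(jobs) + 1)
    jobs[[job_id]] <<- list(filename = filename, polls = 0)
    job_id  # the POST step hands back only a Job ID
  }
  get <- function(job_id) {
    jobs[[job_id]]$polls <<- jobs[[job_id]]$polls + 1
    if (jobs[[job_id]]$polls < polls_until_done) {
      list(status = "queued", transcript = NA_character_)
    } else {
      list(status = "finished",
           transcript = paste("transcript of", jobs[[job_id]]$filename))
    }
  }
  list(post = post, get = get)
}

service <- make_mock_service()
job_id <- service$post("speech.wav")  # no connection needs to stay open

# Poll with the Job ID until the job reports the "finished" status
repeat {
  result <- service$get(job_id)
  if (result$status == "finished") break
  Sys.sleep(0.1)
}
result$transcript
```

The only state the caller has to keep between the two requests is the Job ID, which is why transcribeR records it in the CSV.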

API Key

To use the Speech Recognition API, one must have an API Key. An API Key can be obtained by creating an account on HP IDOL OnDemand's website. API Keys are passed as parameters to the functions sendAudioGetJobs and retrieveText.

Job IDs

The Speech Recognition API keeps track of requests by assigning each one a Job ID. Users can retrieve the job status and transcript at any time via the Job ID. The transcribeR package uses Job IDs to retrieve the status of each request. In addition, the function sendAudioGetJobs checks whether a file already has a Job ID in the transcribeR CSV. If it finds one, it does not post the file to HP again, which minimizes upload time and API Key usage.

Language Codes

When posting to the Speech Recognition API, a language code must be specified. The default in sendAudioGetJobs is “en-US” to indicate American English, but other languages can be specified instead. To see which languages are supported, call the printLanguages function, which lists all the available language codes.

Segmentation

The interval parameter in sendAudioGetJobs specifies the segmentation length for transcripts in milliseconds. An argument of -1 indicates that the entire file should be transcribed at once (hence no segmentation), and an argument of 0 indicates segmentation by word, with word boundaries determined by periods of silence. A positive segmentation length is not recommended, because fixed-length segments can cut the audio in the middle of an utterance or spoken word rather than at a period of silence. Setting interval to zero is also not recommended, because the transcription process relies on surrounding context to inform its probability estimates. This is why -1 is the default value for interval.
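To illustrate why a fixed segmentation length can cut through spoken words, the sketch below checks where fixed boundaries fall relative to word timings. The word timings are invented for illustration and are not output from the API. The 1000 ms interval and 3000 ms clip length are likewise assumptions.

```r
# Hypothetical word timings in milliseconds (illustrative data, not API output)
words <- data.frame(
  word  = c("thank", "you", "madam", "president"),
  start = c(0, 450, 900, 1900),
  end   = c(400, 800, 1500, 2700),
  stringsAsFactors = FALSE
)

# Interior boundaries produced by a fixed 1000 ms interval over a 3000 ms clip
interval_ms <- 1000
clip_ms <- 3000
boundaries <- seq(interval_ms, clip_ms - interval_ms, by = interval_ms)

# A boundary splits a word when it falls strictly inside that word's time span
split_words <- sapply(boundaries, function(b) {
  hit <- words$word[words$start < b & b < words$end]
  if (length(hit) == 0) NA_character_ else hit
})

data.frame(boundary_ms = boundaries, splits = split_words)
```

In this toy example both boundaries (at 1000 ms and 2000 ms) land inside a word, which is the failure mode that the default of -1 avoids.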

Usage Quotas

HP IDOL OnDemand sets monthly quotas for the use of its APIs. When a file is posted to the Speech Recognition API, HP IDOL OnDemand deducts one call and the size of the audio file from the user's monthly quota. The quotas can be found at https://www.idolondemand.com/account/quotas.html. The relevant ones as of July 22, 2015 are listed below:

transcribeR CSVs

The transcribeR package uses CSVs to store all data because R can easily read a CSV into a data frame and write a data frame to a CSV. Within the CSV, six columns are used to store the values associated with each file that transcribeR processes. By storing the filename in the CSV, transcribeR can read that filename to determine whether the file was ever successfully posted to the Speech Recognition API. Pairing the filename and Job ID as two entries in the same row of a CSV allows the user to clearly determine which file corresponds to which request made to the API. The resulting transcript from the API is stored in the Transcript field of the same row. With this arrangement, the user can easily determine the input and output at each stage of using the API.

The transcribeR package reads and writes CSVs which have the following columns (in order and verbatim):

The sendAudioGetJobs() Function

The sendAudioGetJobs function handles the POST request to the Speech Recognition API. It accepts the following parameters:

While it examines all WAV files in wav.dir, it only posts the files that are not already listed in the existing CSV, which it determines by checking the FILENAME column for each filename. The function itself returns the number of files that were uploaded during that call.
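The filename check described above amounts to a set difference between the files in the directory and the FILENAME column. The following is a minimal sketch of that idea, not the package's actual implementation: a temporary directory stands in for wav.dir, a small data frame stands in for an existing transcribeR CSV, and the sketch assumes that FILENAME holds bare filenames rather than full paths.

```r
# Illustrative only: a temporary directory stands in for wav.dir
wav_dir <- tempfile("wavs")
dir.create(wav_dir)
invisible(file.create(file.path(wav_dir,
                                c("boxer.wav", "merkley.wav", "warren.wav"))))

# Stand-in for an existing transcribeR CSV in which two files are already listed
# (assumes FILENAME stores bare filenames, which the source does not confirm)
existing <- data.frame(FILENAME = c("boxer.wav", "merkley.wav"),
                       stringsAsFactors = FALSE)

# Files in the directory that do not yet appear in the FILENAME column
wav_files <- list.files(wav_dir, pattern = "\\.wav$", ignore.case = TRUE)
to_post <- setdiff(wav_files, existing$FILENAME)
to_post
```

Only warren.wav remains to be posted here, mirroring the second half of the example script, where a third file is added to a directory that has already been processed.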

The retrieveText() Function

The retrieveText function updates the TRANSCRIPT field in a transcribeR CSV with the results currently held by the Speech Recognition API. It checks for rows whose TRANSCRIPT field is one of the following: 'queued', NA, or NULL, and then uses the Job IDs corresponding to those rows to retrieve the transcripts from the Speech Recognition API. It accepts the following parameters:

The function itself returns the updated data frame corresponding to the Job CSV.
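The row selection that retrieveText performs can be sketched with ordinary data frame indexing. This is a simplified illustration rather than the package's code: the JOBID column name and the job ID values are hypothetical, and only the 'queued' and NA cases are shown.

```r
# Stand-in for a transcribeR CSV read into a data frame.
# JOBID is a hypothetical column name; the ID values are made up.
jobs <- data.frame(
  FILENAME   = c("boxer.wav", "merkley.wav", "warren.wav"),
  JOBID      = c("id-1", "id-2", "id-3"),
  TRANSCRIPT = c("good morning", "queued", NA),
  stringsAsFactors = FALSE
)

# Rows that still need a transcript: TRANSCRIPT is missing or still 'queued'.
# is.na() comes first so that NA entries yield TRUE rather than NA.
pending <- is.na(jobs$TRANSCRIPT) | jobs$TRANSCRIPT == "queued"

# Job IDs that would be used to request transcripts
jobs$JOBID[pending]
```

Rows that already hold a transcript are left alone, so calling retrieveText repeatedly only requests the jobs that were not finished on earlier calls.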

Examples

From the included example.R file:

fnames <- c('boxer.wav', 'merkley.wav') # Example with 2 files
urls <- paste("http://christopherlucas.org/transcribeR/", fnames, sep = '')
lapply(urls, function(x) download.file(x, destfile = basename(x), mode = "wb"))

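WAV_DIR <- getwd()
API_KEY <- "your-api-key"                                   # Placeholder: substitute your own API Key
CSV_LOCATION <- file.path(WAV_DIR, "transcribeR_jobs.csv")  # Placeholder path for the transcribeR CSV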

library(transcribeR)

print("First, printing language codes.")

printLanguages()

Sys.sleep(3)

print("Now posting requests to HP IDOL OnDemand's Speech Recognition API.")

sendAudioGetJobs(wav.dir = WAV_DIR,
                 api.key = API_KEY,
                 interval = "-1",             # Transcript will not be segmented
                 encode = "multipart",
                 existing.csv = NULL,         # Intended to create a new CSV
                 csv.location = CSV_LOCATION,
                 language = "en-US",          # As printed above, one of the language codes
                 verbose = TRUE)              # Prints out uploading progress to the user

Sys.sleep(3)

print("Waiting before requesting transcriptions.")

Sys.sleep(20) # Adequate delay to allow the Speech Recognition API to compute the most probable transcription.

retrieveText(job.file = CSV_LOCATION,
             api.key = API_KEY)

print("Transcripts were requested.")

print("What if another file is added to the directory?") # Runs again with a newly added file

Sys.sleep(3)

url <- "http://christopherlucas.org/transcribeR/warren.wav"
download.file(url, destfile = basename(url), mode = "wb") # Binary mode, as in the first download
WAV_DIR <- getwd()

sendAudioGetJobs(wav.dir = WAV_DIR,
                 api.key = API_KEY,
                 interval = "-1",             # Transcript will not be segmented
                 encode = "multipart",
                 existing.csv = CSV_LOCATION,         # Reuse the CSV created above so posted files are skipped
                 csv.location = CSV_LOCATION,
                 language = "en-US",          # As printed above, one of the language codes
                 verbose = TRUE)              # Prints out uploading progress to the user

Sys.sleep(3)

print("Waiting before requesting transcriptions.")

Sys.sleep(10) # Adequate delay to allow the Speech Recognition API to compute the most probable transcription.

retrieveText(job.file = CSV_LOCATION,
             api.key = API_KEY)

print("Transcripts were requested.")

print("Done!")