This vignette demonstrates how to use the transcribeR package in R. The transcribeR package is built to request and receive transcriptions from HP IDOL OnDemand's Speech Recognition API. The package is easy to use with the related R package stm for the estimation of audio and video topic models.
HP IDOL OnDemand's Speech Recognition API
Politicians and other speakers increasingly turn to YouTube and other video-sharing websites to both convey their messages to millions around the world and digitally archive their speeches. As a consequence, valuable data to social scientists and historians are being locked away in traditionally intractable audio and video formats. The transcribeR package provides convenient access to HP IDOL OnDemand's Speech Recognition API and allows for audio and video files to be processed en masse, with all input and output organized into one file. The transcribeR package allows users to send all the files in a specified directory to HP IDOL OnDemand's Speech Recognition API by using the sendAudioGetJobs
function. The same function stores a list of the Job IDs returned by HP into a CSV file. Once all the files are posted to HP, the transcripts can be extracted with the retrieveText
function and stored in the same CSV. Hence, all of the information about the audio files and their transcripts is stored in one CSV, eliminating potential confusion and alignment errors that could arise from storing the filename, job ID, and transcripts in separate files.
The transcribeR package has the following easy-to-obtain dependencies:
httr (R package)
HP IDOL OnDemand Speech Recognition API Key
In addition, transcribeR currently supports uploading the following file formats: WAV, MP3, MP4, WMA.
HP IDOL OnDemand's Speech Recognition API is trained to transcribe both audio and video files of human speech. It can work with both local audio and video files as well as video files available from a URL. The transcription time of the video depends on the length and the segmentation of the video, and it can take anywhere from 0.05 to 0.20 seconds per second of video.
The two requests required for calling the API are POST and GET. POST posts a file to the API and GET retrieves the resulting JSON object, which contains the status and output. Because the transcription process takes a nontrivial amount of time to complete, the requests cannot be completed simultaneously, so the output isn't immediately available. Instead, HP IDOL OnDemand provides a Job ID for tracking the status of the call, and connection with the API does not need to be maintained. When the call is complete, the JSON object has the “finished” status and the transcription.
In order to use the Speech Recognition API, one must have an API Key. An API Key can be obtained by making an account at HP IDOL OnDemand's Website. API Keys are passed as parameters to the functions sendAudioGetJobs
and retrieveText
.
The Speech Recognition API keeps track of requests by assigning each one a Job ID. Users can retrieve the job status and transcript at any time via the Job ID. The transcribeR package not only uses Job IDs to retrieve the status, but the function sendAudioGetJobs
checks to see if a file already has a Job ID on the transcribeR CSV. If it finds one, it avoids posting the file to HP to minimize upload time and API Key usage.
When posting to the Speech Recognition API, a language code must be specified. The default in sendAudioGetJobs
is “en-US” to indicate American English, but other languages can be specified instead. In order to examine which languages are supported, the printLanguages
function can be called to list all the language codes to the user.
The interval
parameter in sendAudioGetJobs
specifies segmentation length for transcripts in milliseconds. Specifying segment length risks segmenting the audio during an utterance or spoken word. An argument of -1 indicates that the entire file should be transcribed at once (hence no segmentation), and an argument 0 indicates segmentation by each word, which are determined by periods of silence. Because specifying a segmentation length could split utterances and not periods of silence, it is not recommended to specify a positive segmentation length. Because the transcription process relies on context to determine additional probability, it is also not recommended to set the interval
parameter to zero. This is why -1 is the default value for interval
.
HP IDOL OnDemand provides monthly quotas for using its APIs. When a file is posted to the Speech Recognition API, HP IDOL OnDemand subtracts a call and the data of the audio file from its monthly quota. The quotas can be found at https://www.idolondemand.com/account/quotas.html. The relevant ones as of July 22, 2015 are listed below:
Monthly API Calls (50000)
Total Input Bytes (30000 MB Source Data)
The transcribeR
package uses CSVs to store all data because R can easily read a CSV to make a data frame and write to a CSV from a data frame. Within the CSV, six columns are used to store the values associated with each file that transcribeR
processes. By storing the filename in the CSV, transcribeR
can read that file name determine if that file was ever successfully posted to the Speech Recognition API. And pairing the file name and job ID as two entries in the same row of a CSV allows the user to clearly determine which file corresponds to which request made to the API. The resulting transcript from the API is stored in the Transcript field of the same row. With this arrangement, the user can easily determine the input and output at each stage of using the API.
The transcribeR
package reads and writes CSVs which have the following columns (in order and verbatim):
DATE - the date when the file was posted to the Speech Recognition API.
APIKEY - the APIKEY used to make the request.
FILENAME - the filepath of the posted file; the sendAudioGetJobs
function relies on reading the FILENAME field to determine if a file was previously posted so that it will avoid posting it again.
LANGUAGE - the associated HP language code
JOBID - the Job ID returned by the API when the file is posted
TRANSCRIPT - the transcript retrieved from the API once retrieveText
when the job's status is “finished”“; if the job's status is not "finished”, then the current status is written to the TRANSCRIPT field. If and only if the TRANSCRIPT field for a file is either NULL
or a status that isn't “finished”, then retrieveText
will update the field.
The sendAudioGetJobs
function handles the POST request to the Speech Recognition API. It accepts the following parameters:
wav.dir – the directory in which the WAV files of interest are located
api.key – the API Key used to authenticate requests made to HP Idol OnDemand
interval (defaults to “-1”) – the segmentation interval used by the Speech Recognition API
encode (defaults to “multipart”) – the format used when posting the file to the Speech Recognition API
existing.csv (defaults to NULL) – a CSV that sendAudioGetJobs
will append to rather than overwrite; if NULL, it will create a new CSV specified by csv.location
csv.location – the path of the output file to be used when a new CSV is created
language (defaults to “en-US”) – the language code used by HP
verbose (defaults to FALSE) – when TRUE, prints the uploading progress to the user
While it examines all WAV files in wav.dir, it only posts the files that are not also found in the existing CSV by checking the FILENAME column for the filename. The function itself returns the number of files that were uploaded in the instance it was called.
The retrieveText
function updates the TRANSCRIPT field in a transcribeR with the results currently held by the Speech Recognition API. It currently checks for rows whose TRANSCRIPT field is one of the following: 'queued', NA, or NULL, and then uses the Job IDs corresponding to those rows to retrieve the transcripts from the Speech Recognition API. It accepts the following parameters:
job.file – the transcribeR CSV that will be examined
api.key – the API Key used to authenticate requests made to HP Idol OnDemand
The function itself returns the updated data frame corresponding to the Job CSV.
From the included example.R file:
fnames <- c('boxer.wav', 'merkley.wav') # Example with 2 files
urls <- paste("http://christopherlucas.org/transcribeR/", fnames, sep = '')
lapply(urls, function(x) download.file(x, destfile = basename(x), mode = "wb"))
WAV_DIR <- getwd()
library(transcribeR)
print("First, printing language codes.")
printLanguages()
Sys.sleep(3)
print("Now posting requests to HP IDOL OnDemand's Speech Recognition API.")
sendAudioGetJobs(wav.dir = WAV_DIR,
api.key = API_KEY,
interval = "-1", # Transcript will not be segmented
encode = "multipart",
existing.csv = NULL, # Intended to create a new CSV
csv.location = CSV_LOCATION,
language = "en-US", # As printed above, one of the language codes
verbose = TRUE) # Prints out uploading progress to the user
Sys.sleep(3)
print("Waiting before requesting transcriptions.")
Sys.sleep(20) # Adequate delay to allow the Speech Recognition API to compute the most probable transcription.
retrieveText(job.file = CSV_LOCATION,
api.key = API_KEY)
print("Transcripts were requested.")
print("What if another file is added to the directory?") # Runs again with a newly added file
Sys.sleep(3)
url <- "http://christopherlucas.org/transcribeR/warren.wav"
download.file(url, destfile = basename(url))
WAV_DIR <- getwd()
sendAudioGetJobs(wav.dir = WAV_DIR,
api.key = API_KEY,
interval = "-1", # Transcript will not be segmented
encode = "multipart",
existing.csv = CSV_LOCATION, # Intended to create a new CSV
csv.location = CSV_LOCATION,
language = "en-US", # As printed above, one of the language codes
verbose = TRUE) # Prints out uploading progress to the user
Sys.sleep(3)
print("Waiting before requesting transcriptions.")
Sys.sleep(10) # Adequate delay to allow the Speech Recognition API to compute the most probable transcription.
retrieveText(job.file = CSV_LOCATION,
api.key = API_KEY)
print("Transcripts were requested.")
print("Done!")