The function soundgen
is intended for the synthesis of
animal vocalizations, including human non-linguistic vocalizations like
sighs, moans, screams, etc. It can also create non-biological sounds
that require precise control over spectral and temporal modulations,
such as special sound effects in computer games or acoustic stimuli for
scientific experiments. Soundgen is NOT meant to be used for
text-to-speech conversion. It can be adapted for this purpose, but
existing specialized tools will probably serve better.
Soundgen uses a parametric algorithm, which means that sounds are synthesized de novo, and the output is completely determined by the values of control parameters, as opposed to concatenating or modifying existing audio recordings. Under the hood, the current version of soundgen generates and filters two sources of excitation: sine waves and white noise.
The rest of this vignette will unpack this last statement and
demonstrate how soundgen can be used in practice. To simplify setting
the control parameters and visualizing the output, the soundgen library
includes an interactive Shiny app. To start the app, type
soundgen_app()
from R or try it online at
cogsci.se/soundgen.html. To
generate sounds from the console, use the function
soundgen
. Each section of the vignette focuses on a
particular aspect of sound generation, both describing the relevant
arguments of soundgen
and explaining how they can be set in
the Shiny app. Note that some advanced features, notably vectorization
of several arguments, are not implemented in the app and are only
accessible from the console.
TIP: this vignette is a hands-on, non-technical tutorial focusing on how to use soundgen in order to synthesize new sounds. For a more rigorous and theoretical discussion, please refer to Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792.
There are several other R packages that offer sound synthesis,
notably tuneR
, seewave
, and
phonTools
. Both seewave
and tuneR
implement straightforward ways to synthesize pulses and square,
triangular, or sine waves as well as noise with adjustable (linear)
spectral slope. You can also create multiple harmonics with both
amplitude and frequency modulation using seewave::synth()
and seewave::synth2()
. There is even a function available
for adding formants and thus creating different vowels:
phonTools::vowelsynth()
. Basic tonal synthesis and many
acoustic manipulations can also be performed using the open-source
program Praat. If these options are sufficient for your needs, you might want to try
these alternatives first.
So why bother with soundgen? First, it takes customization and flexibility of sound synthesis much further. You will appreciate this flexibility if your aim is to produce convincing biological sounds. And second, it’s a higher-level tool with dedicated subroutines for things like controlling the rolloff (relative energy of different harmonics), adding moving formants and antiformants, mixing harmonic and noise components, controlling voice changes over multiple syllables, adding stochasticity to imitate unpredictable voice changes common in biological sound production, and more. In other words, soundgen offers powerful control over low-level acoustic characteristics of synthesized sounds with the benefit of also offering transparent, meaningful high-level parameters intended for rapid and straightforward specification of whole bouts of vocalizing.
Because of this high-level control, you don’t really have to think about the math of sound synthesis in order to use soundgen (although if you do, that helps). This vignette also assumes that the reader has some training in phonetics or bioacoustics, particularly for sections on formants and subharmonics.
Feel free to skip this section if you are only interested in using soundgen, not in how it works under the hood.
Soundgen’s credo is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration, etc.) and to generate a corresponding audio stream, which will sound like a biological vocalization (a bark, a laugh, etc.). The core algorithm for generating a single voiced segment implements the standard source-filter model (Fant, 1971). The voiced component is generated as a sum of sine waves and the noise component as filtered white noise, and both components are then passed through a frequency filter simulating the effect of the human vocal tract. This process can be conceptually divided into three stages: generating the voiced component, generating the noise component, and applying a spectral filter (formants).
Note that soundgen currently implements only sine wave synthesis of voiced fragments. This is different from modeling glottal cycles themselves, as in phonetic models and some popular text-to-speech engines (e.g. Klatt, 1980). Normally multiple glottal cycles are generated simultaneously, with no pauses in between them (no closed phase) and with a continuously changing f0. It is also possible to add a closed phase, in which case each glottal cycle is generated separately, with f0 held stable within each cycle. In future versions of soundgen there may be an option to use a particular parametric model of the glottal cycle as excitation source as an alternative to generating a separate sine wave for each harmonic.
Some form of noise is synthesized in most sound generators. In soundgen noise is created in the frequency domain (i.e., as a spectrogram) and then converted into a time series via inverse FFT. Noise is generated with a flat spectrum up to a certain threshold, followed by user-specified linear rolloff (Johnson, 2012).
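To make this concrete, here is a minimal base-R sketch of the same idea (all numeric values are arbitrary, and soundgen's own implementation is considerably more elaborate): build a magnitude spectrum that is flat up to a threshold, apply a linear rolloff in dB above it, attach random phases, and take the inverse FFT.
samplingRate = 16000
n = 4096                                # frame length, points
freqs = (0:(n / 2)) * samplingRate / n  # frequency of each FFT bin, Hz
flatUpTo = 1200                         # flat spectrum below this threshold, Hz
rolloff_dB_kHz = -6                     # linear rolloff above it, dB/kHz
mag_dB = ifelse(freqs <= flatUpTo, 0,
                (freqs - flatUpTo) / 1000 * rolloff_dB_kHz)
mag = 10 ^ (mag_dB / 20)                # dB to linear magnitude
phase = runif(length(mag), 0, 2 * pi)   # random phase is what makes it noise
phase[c(1, length(phase))] = 0          # keep DC and Nyquist bins real
half = mag * exp(1i * phase)
spec = c(half, Conj(rev(half[2:(length(half) - 1)])))  # conjugate-symmetric
noise = Re(fft(spec, inverse = TRUE)) / n  # back to the time domain
# plot(noise[1:500], type = 'l')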
Note that this STFT-mediated method of adding formants is different from the more traditional convolution, but with multiple formants it is both considerably faster and (arguably) more intuitive. If you are wondering why we don’t simply apply the filter to the rolloff matrix before the iSTFT, this is an annoying consequence of some complexities of the temporal structure of a bout, especially of applying non-stationary filters (moving formants) that span multiple syllables. For the noise component, however, this extra step can be avoided, and we only do iSTFT once.
Having briefly looked at the fundamental principles of sound
generation, we proceed to control parameters. The aim of the following
presentation is to offer practical tips on using soundgen. For further
information on more fundamental principles of acoustics and sound
synthesis, you may find the vignettes in seewave
very
helpful, or you can check out the book on sound synthesis in R by Jerome
Sueur, the author of the seewave
package. Some essential
references are also listed at the end of this vignette, especially those
sources that have inspired particular routines in soundgen.
To generate a sound, you can either type soundgen_app()
to open an interactive Shiny app or call soundgen()
from R
console with manually specified parameters. The app offers nice
visualizations and is more user-friendly if you are not used to
programming, but note that it doesn’t support some advanced features
(e.g., vectorization of some control parameters). An object called
presets
contains a collection of presets that demonstrate
some of the possibilities. More information is available on the
project’s homepage at
http://cogsci.se/soundgen.html.
Audio playback may fail, depending on your platform and installed
software. Soundgen relies on the tuneR library for audio
playback, via a wrapper function called playme()
that
accepts both Wave objects and simple numeric vectors. If
soundgen(play = TRUE)
throws an error, make sure the audio
can be played before you proceed with using soundgen. To do so, save
some sound as a vector first:
sound = soundgen(play = FALSE)
or even simply
sound = rnorm(10000)
. Then try to find a way to play this
vector sound
. You may need to change the default player in
tuneR
or install additional software. See the
seewave
vignette on sound input/output for an in-depth discussion of audio
playback in R. Sueur (2018, p. 100) recommends Windows Media Player on
Windows, AudioUnits for Mac OS, SoX for Linux (the player is called
“play”), or VLC on any platform. I find that “play” from the “sox”
library or “aplay” work well on Linux, and “afplay” on Macs.
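For example, a minimal playback check might look like this (a sketch; adjust samplingRate or the player if it stays silent):
sound = soundgen(play = FALSE)  # or simply: sound = rnorm(10000)
playme(sound, samplingRate = 16000)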
Because of possible errors, audio playback is disabled by default in
the rest of this vignette. To turn it on without changing any code,
simply set the global variable playback
to the appropriate
value for your specific OS, for ex.:
playback = list(TRUE, FALSE, 'vlc', 'my-awesome-player')[[2]]
# TRUE means defaulting to "play" on Linux, "afplay" on Mac,
# and the defaults of tuneR::play on Windows
# FALSE means no sound playback
The basic workflow from R console is as follows:
library(soundgen)
s001 = soundgen(play = playback)  # default sound: a short [a] by a male speaker
# 's001' is a numeric vector - the waveform. You can save it, play it, plot it, ...
# names(presets)  # speakers in the preset library
# names(presets$Chimpanzee)  # presets per speaker
s002 = eval(parse(text = presets$Chimpanzee$Scream_conflict))  # a screaming chimp
# playme(s002)
The basic workflow in the Shiny app is as follows: type
soundgen_app() to start the app. RStudio should
open it in the default web browser (there will be no sound if the app
runs in an RStudio window instead of a browser). Firefox and Chrome are
known to work. Safari will probably fail to play back the generated
audio, although the output can still be exported as a .wav file.
TIP The interactive app soundgen_app()
gives you the
exact R code for calling soundgen()
, which you can
copy-paste into your R environment and generate manually the same sound
as the one you have created in the app. If in doubt about the right
format for a particular argument, you can use the app first, copy-paste
the code into your R console, and modify it as needed. You can also
import an existing formula into the app, adjust the parameters in an
interactive environment, and then export it again. BUT: the app can only
use a single value for many parameters that are vectorized when called
from the command line (rolloff, jitterDep, etc.).
If you need to generate a single syllable without pauses, the only
temporal parameter you have to set is sylLen
(“Syllable
length, ms” in the app). For a bout of several syllables, you have two
options:
nSyl
(“Number of syllables” in the app). Unvoiced
noise is then allowed to fill in the pauses (if noise is longer than the
voiced part), and you can specify an amplitude contour, intonation
contour, and formant transitions that will span the entire bout. For
ex., if the vowel sequence in a three-syllable bout is “uai”, the output
will be approximately “[u] – pause – [a] – pause – [i]”.
s003 = soundgen(formants = 'uai', repeatBout = 1, nSyl = 3, play = playback)
# to replay without re-generating the sound, type "playme(s003)"
repeatBout
(“Repeat bout # times” in the app). This
is the same as calling soundgen
repeatedly with the same
settings or clicking the Generate button in the app several times. If
temperature = 0, you will get exactly the same sound repeated each time,
otherwise some variation will be introduced. For the same “uai” example,
the output will be “[uai] – pause – [uai] – pause – [uai]”.
s004 = soundgen(formants = 'uai', repeatBout = 3, nSyl = 1, play = playback)
# playme(s004)
Like most arguments to soundgen, sylLen
and
pauseLen
can also be vectors. For example, if you want to
synthesize 5 syllables of progressively shorter duration and separated
by increasingly longer pauses, you can write:
s005 = soundgen(nSyl = 5,
                sylLen = c(300, 100),   # linearly decreasing from 300 to 100 ms
                pauseLen = c(50, 150),  # increasing from 50 to 150 ms
                plot = TRUE,
                play = playback)
# playme(s005)
For more complicated changes in the length of syllables or pauses,
you can use the function getSmoothContour
to upsample your
anchors (see “Intonation” for examples) or manually code longer
sequences of values. The length of your input vector doesn’t matter: it
will be up- or downsampled automatically. This also works with all other
vectorized arguments to soundgen (rolloff, jitterDep, vibratoFreq,
etc).
s006 = soundgen(
  nSyl = 10,
  sylLen = c(60, 200, 90, 50, 50),  # quickly up to 200 and down to 50
  pauseLen = c(50, 60, 80, 150),    # growing ~exponentially
  plot = TRUE,
  play = playback
)
As a special case, your values will be used without interpolation if you provide exactly as many as needed:
s007 = soundgen(
  nSyl = 5,
  sylLen = c(300, 100, 400, 50, 100),  # 5 syllables, 5 values
  pauseLen = c(50, 150, 50, 100),      # 4 pauses, 4 values
  plot = TRUE,
  play = playback
)
You can use both repeatBout
and nSyl
simultaneously. The pause between bouts is equal to the length of the
first syllable:
s008 = soundgen(
  repeatBout = 2,
  nSyl = 3,
  sylLen = c(300, 100),
  pauseLen = c(100, 50),
  plot = TRUE,
  play = playback
)
Note that all pauses between syllables have to be positive. A
negative pause (overlap) between bouts is allowed, but you have to
enforce it with invalidArgAction = "ignore"
:
s009 = soundgen(
  repeatBout = 2,
  sylLen = c(300, 100),
  pauseLen = -50,
  plot = TRUE,
  play = playback,
  invalidArgAction = 'ignore'
)
## Warning in validatePars(p, gp, permittedValues, invalidArgAction):
## pauseLen should be between 0 and 1000; override with caution
When we hear a tonal sound such as someone singing, one of its most salient characteristics is intonation or, more precisely, the contour of the fundamental frequency (f0), or, even more precisely, the contour of the physically present or perceptually extrapolated spectral band which is perceived to correspond to the fundamental frequency (pitch). Soundgen literally generates a sine wave corresponding to f0 and several more sine waves corresponding to higher harmonics, so f0 is straightforward to implement. However, how can its contour be specified with as few parameters as possible? The solution adopted in soundgen is to take one or more anchors as input and generate a smooth contour that passes through all anchors.
In the simplest case, all anchors are equidistant, dividing the sound into equal time steps. You can then specify anchors as a numeric vector. For example:
# steady pitch at 440 Hz
s010 = soundgen(pitch = 440, play = playback)

# downward chirp
s011 = soundgen(pitch = 3000:2000, play = playback,
                samplingRate = 44100, pitchSamplingRate = 44100)
# when f0 is high, increase samplingRate and pitchSamplingRate for better quality

# up and down
s012 = soundgen(pitch = c(150, 250, 100), sylLen = 700, play = playback)

# 3rd quarter silent
s013 = soundgen(pitch = c(150, 200, NA, 110),
                sylLen = 700, play = playback)
You can also use a mathematical formula to produce very precise pitch modulation, just check that the values are on the right scale. For example, sinusoidal pitch modulation can be created as follows:
anchors = (sin(1:70 / 3) * .25 + 1) * 350
plot(anchors, type = 'l', xlab = 'Time (points)', ylab = 'Pitch (Hz)')
s014 = soundgen(pitch = anchors, sylLen = 1000, play = playback)
For more flexibility, anchors can also be specified at arbitrary times
using the “anchor format” - a dataframe with two columns:
time
(ms) and value
(in the case of pitch,
this is frequency in Hz). The function that generates smooth contours of
f0 and other parameters is getSmoothContour()
. When you
generate sounds, soundgen()
has an argument
smoothing = list(...)
, where you can put the settings passed
on to getSmoothContour()
. So you do not have to call
getSmoothContour()
explicitly, although sometimes it can be
helpful to do so in order to visualize the curve implied by your
anchors. Time can range from 0 to 1, or it can be specified in ms – it
makes no difference, since the produced contour is rescaled to match
syllable duration.
For example, say we want f0 first to increase sharply from 350 to 700
Hz and then to slowly return to baseline. Time anchors can then be
specified as c(0, .1, 1)
(think of it as “start”, “10%”,
and “end” of the sound), and the arguments len
and
samplingRate
together determine the duration:
len / samplingRate
gives duration in seconds. Values are
processed on a logarithmic (musical) scale if thisIsPitch
is TRUE
, and the resulting curve is smoothed (the default
behavior is to use loess for up to 10 anchors, cubic spline for 11-50
anchors, and linear interpolation for >50 anchors).
A sound with this intonation can be generated as follows:
s015 = soundgen(
  sylLen = 900, play = playback,
  pitch = list(time = c(0, .1, 1),  # or c(0, 30, 300) - in ms
               value = c(350, 700, 350)))
Beware of smoothing! A curve interpolated from a few anchors is not
uniquely defined, and the interpolation algorithm has a major effect on
its shape. The amount of smoothing can be controlled with
loessSpan
and interpol
:
sylLen = 500  # desired syllable length, in ms
samplingRate = 16000
sylLen_points = sylLen / 1000 * samplingRate
anchors = data.frame(time = c(0, .1, 1),
                     value = c(350, 700, 350))
par(mfrow = c(1, 3))
smc1 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  interpol = 'approx',
  thisIsPitch = TRUE, plot = TRUE,
  main = 'No smoothing', samplingRate = samplingRate
)
smc2 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  loessSpan = 0.75,
  thisIsPitch = TRUE, plot = TRUE,
  main = 'loessSpan = .75', samplingRate = samplingRate
)
smc3 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  loessSpan = 1,
  thisIsPitch = TRUE, plot = TRUE,
  main = 'loessSpan = 1', samplingRate = samplingRate
)
par(mfrow = c(1, 1))
# likewise: soundgen(smoothing = list(interpol = 'loess', loessSpan = 1))
To get more complex curves, simply add more anchors. If you are not
satisfied with the smooth curve generated by soundgen()
based on your anchors, you can produce a longer vector (e.g., you could
use analyze()
or pitch_app()
to extract the
pitch contour of an existing recording), and then you can feed soundgen
with this arbitrarily long vector instead of using the anchor format,
ensuring very precise control over the intonation contour.
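As a sketch, here is a made-up 200-point contour standing in for output from analyze() or pitch_app() (the values are invented for illustration):
pitch_manual = 440 * 2 ^ (sin(seq(0, 2 * pi, length.out = 200)) / 6)  # +/- 2 semitones around 440 Hz
s_precise = soundgen(sylLen = 1500, pitch = pitch_manual, play = playback)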
TIP Many arguments to soundgen are vectorized, and most
vectorized arguments understand the “anchor format” you just encountered
above, namely something like
my_argument = list(time = ..., value = ...)
, where time can
be in ms or ~[0, 1]. See ?soundgen for a complete list of anchor-format
arguments and keep in mind two important special cases that use a
slightly different format: formants
and noise
(see below). And remember to check that interpolation looks
reasonable!
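For example, here is a sketch of the anchor format applied to a vectorized argument other than pitch (all values arbitrary): jitter is absent in the first half of the sound and switches on in the second half.
s_anchor = soundgen(sylLen = 1000, pitch = 250,
                    jitterDep = list(time = c(0, .49, .5, 1),
                                     value = c(0, 0, 2, 2)),
                    play = playback)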
The assumption behind specifying an entire contour with a few
discrete anchors is that the contour is smooth and continuous. However,
there may be special occasions when you do want a discontinuity such as
an instantaneous pitch jump. The default behavior of
getSmoothContour()
is to make a jump if two anchors are
closer than one percent of the syllable length (as specified with the
default jumpThres = 0.01
). To make a pitch jump, you thus
provide two values of f0 that are very close in time, for example:
s016 = soundgen(sylLen = 800, plot = TRUE, play = playback,
                pitch = list(time = c(0, .2, .201, .4, 1),
                             value = c(900, 1200, 1800, 2000, 1500)),
                samplingRate = 22050)
## pitchSamplingRate should be much higher than the highest pitch; resetting to 20000 Hz
TIP Given the same anchors, the shape of the resulting curve depends on syllable duration. That’s because the amount of smoothing is adjusted automatically as you change syllable duration. Double-check that all your contours still look reasonable if you change the duration!
To draw f0 contour in the Shiny app, use “Intonation / Intonation syllable” tab and click the intonation plot to add anchors. Soundgen then generates a smooth curve through these anchors. If you click the plot close to an existing anchor, the anchor moves to where you clicked; if you click far from any existing anchor, a new anchor is added. To remove an anchor, double-click it. To go back to a straight line, click the button labeled “Flatten pitch contour”. Exactly the same principles apply to all anchors in soundgen_app (pitch, amplitude, mouth opening, and noise). Note also that all contours are rescaled when the duration changes, with the single exception of negative time anchors for noise (i.e. the length of pre-syllable aspiration does not depend on syllable duration).
If the bout consists of several syllables (nSyl > 1
),
you can also specify the overall intonation over several syllables using
pitchGlobal
(app: “Intonation / Intonation global”). The
global intonation contour specifies the deviation of pitch per syllable
from the main pitch contour in semitones, i.e. 12 semitones = 1 octave.
In other words, it shows how much higher or lower the average pitch of
each syllable is compared to the rest of the syllables. For ex., we can
generate five seagull-like sounds, which have the same intonation
contour within each syllable, but which vary in average pitch spanning
about an octave in an inverted U-shaped curve. Note that the number of
anchors need not equal the number of syllables:
s017 = soundgen(nSyl = 5, sylLen = 200, pauseLen = 140,
                plot = TRUE, play = playback,
                pitch = list(time = c(0, 0.65, 1),
                             value = c(977, 1540, 826)),
                pitchGlobal = list(time = c(0, .5, 1),
                                   value = c(-6, 7, 0)))
# pitchGlobal = c(-6, 7, 0) is equivalent, since time steps are equal
TIP Calling soundgen
with argument
plot = TRUE
produces a spectrogram using a function from
the soundgen package, spectrogram
. Type
?spectrogram
or ?spectrogramFolder
and see the
vignette on acoustic analysis for plotting tips and advanced options.
You can also plot the waveform produced by soundgen
using
any other function, e.g. seewave::spectro().
Vibrato adds frequency modulation (FM) to f0 contour by modifying f0 per glottal cycle. In contrast to irregular jitter and temperature-related random drift, this FM is regular, namely sinusoidal:
# variable, but deterministic vibrato (same every time)
s018 = soundgen(vibratoDep = 0:3, vibratoFreq = 7:5,
                sylLen = 2000, pitch = c(300, 280),
                play = playback, plot = TRUE)

# stochastic vibrato (different every time)
s019 = soundgen(vibratoDep = rnorm(n = 10, mean = .5, sd = .1),
                vibratoFreq = rnorm(n = 10, mean = 5, sd = .5),
                sylLen = 2000, pitch = c(300, 280),
                play = playback, plot = TRUE)
It is a basic principle of soundgen that random variation can be
introduced in the generated sound. This behavior is controlled by a
single high-level parameter, temperature
(app: “Main /
Hypers”). If temperature = 0
, you will get exactly the same
sound by executing the same call to soundgen
repeatedly. If
temperature > 0
, each generated sound will be somewhat
different, even if all the control parameters are exactly the same. In
particular, positive temperature introduces fluctuations in syllable
structure, all contours (intonation, breathing, amplitude, mouth
opening), and many effects (jitter, subharmonics, etc). It also
“wiggles” user-specified formants and adds new formants above the
specified ones at a distance calculated based on the estimated vocal
tract length (see Section “Spectral filter (formants)” below).
Code example:
# the sound is a bit different each time, because temperature is above zero
s020 = soundgen(repeatBout = 5, temperature = 0.3, play = playback)
# Setting repeatBout = 5 is equivalent to:
# for (i in 1:5) soundgen(temperature = 0.3, play = playback)
If you don’t want stochastic behavior, set temperature to zero. But
note that some effects, notably jitter and subharmonics, will then be
added in an all-or-nothing manner: either to the entire sound or not at
all. Also note that additional formants will not be added above the
user-specified ones if temperature is exactly 0. In practice it may be
better to set temperature to a very small positive value like 0.01. You
can also change the extent to which temperature affects different
parameters (e.g., if you want more variation in intonation and less
variation in syllable structure). To do so, use
tempEffects
, which is a list of scaling coefficients that
determine how much different parameters vary at a given temperature.
tempEffects
includes the following scaling
coefficients:
amplDep: random fluctuations of user-specified amplitude anchors across syllables (if nSyl > 1)
amplDriftDep: drift of amplitude mirroring pitch drift
formDisp: irregularity of the dispersion of stochastic formants that are added above user-specified formants (if any) at distances consistent with the specified length of the vocal tract (vocalTract)
formDrift: the amount of random drift of formants
glottisDep: proportion of glottal cycle with closed glottis
noiseDep: random fluctuations of user-specified noise anchors across syllables (if nSyl > 1)
pitchDep: random fluctuations of user-specified pitch anchors across syllables (if nSyl > 1)
pitchDriftDep: amount of slow random drift of f0 (the higher, the more f0 changes)
pitchDriftFreq: frequency of slow random drift of f0 (the higher, the faster f0 changes)
rolloffDriftDep: drift of rolloff mirroring pitch drift
specDep: random fluctuations of rolloff, nonlinear effects, attack
subDriftDep: drift of subharmonic frequency and bandwidth mirroring pitch drift
sylLenDep: random fluctuations of the duration of syllables and pauses between syllables

The default value of each scaling parameter is 1. To enhance a particular component of stochastic behavior, set the corresponding coefficient to a value >1; to remove it completely, set its scaling coefficient to zero.
# despite the high temperature, temporal structure does not vary at all,
# while formants are more variable than the default
s021 = soundgen(repeatBout = 3, nSyl = 2, temperature = .3, play = playback,
                tempEffects = list(sylLenDep = 0, formDrift = 3))
To simplify usage, there are a few other hyper-parameters. They are redundant in the sense that they are not strictly necessary to produce the full range of sounds, but they provide convenient shortcuts by making it possible to control several low-level parameters at once in a coordinated manner. Hyper-parameters are marked “hyper” in the Shiny app.
For example, to imitate the effect of varying body size, you can use
maleFemale
. Since formants
are not specified,
but temperature is above zero, a schwa-like sound with approximately
equidistant formants is generated using vocalTract
(cm) to
calculate the expected formant dispersion:
s022 = soundgen(
  maleFemale = -1,  # male: 100% lower f0, 25% lower formants, 25% longer vocal tract
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
mf = c(-1,  # male: 100% lower f0, 25% lower formants, 25% longer vocal tract
       0,   # neutral (default)
       1)   # female: 100% higher f0, 25% higher formants, 25% shorter vocal tract
# See e.g. http://www.santiagobarreda.com/vignettes/v1/v1.html
s023 = soundgen(
  maleFemale = 0,   # neutral (default)
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
s024 = soundgen(
  maleFemale = 1,   # female: 100% higher f0, 25% higher formants, 25% shorter vocal tract
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
To change the basic voice quality along the breathy-creaky continuum,
use creakyBreathy
. It affects the rolloff of harmonics, the
type and strength of pitch effects (jitter, subharmonics), and the
amount of aspiration noise. For example:
cb = c(-1,   # max creaky
       -.5,  # moderately creaky
       0,    # neutral (default)
       .5,   # moderately breathy
       1)    # max breathy (no tonal component)
silence = rep(0, 1600)
s025 = silence
for (i in cb) {
  s025 = c(s025, soundgen(creakyBreathy = i), silence)
}
# playme(s025)
Use ampl
and amplGlobal
to modulate the
amplitude (loudness) of an individual syllable or a polysyllabic bout,
respectively. In the app, they are found under “Amplitude / Amplitude
syllable” and “Amplitude / Amplitude global”. Note that
ampl
affects only the voiced component, while
amplGlobal
, attackLen
(“Attack length, ms” in
the app), and amDep
(“Amplitude / Amplitude modulation / AM
depth” in the app) affect both the voiced and the unvoiced components.
Avoid attackLen = 0
, since that can cause clicks.
# each syllable has a 10-dB dip in the middle (note the dumbbell shapes
# in the oscillogram under the spectrogram), and there is an overall fade-out
# over the entire bout
s026 = soundgen(
  nSyl = 4,
  ampl = list(time = c(0, .3, 1),  # unequal time steps
              value = c(0, -10, 0)),
  amplGlobal = c(0, -20),  # this fade-out applies to noise as well
  noise = -10,
  plot = TRUE, heights = c(1, 1), play = playback)
The dynamic amplitude range is determined by
dynamicRange
. This parameter sets the minimum level of
loudness, below which components are discarded as essentially silence.
For maximum sound quality, set a high dynamicRange, like 120 dB. This
helps to avoid artifacts like audibly clicking harmonics, but it also
slows down sound generation. The default is 80 dB.
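For example (a sketch; the two calls differ only in dynamicRange):
s_fast = soundgen(sylLen = 500, pitch = 200, dynamicRange = 60,
                  play = playback)  # faster, lower quality
s_hq = soundgen(sylLen = 500, pitch = 200, dynamicRange = 120,
                play = playback)  # slower, higher quality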
Rapid amplitude modulation imitating a trill is implemented by
multiplying the synthesized waveform by a wave with adjustable
amType
(“sine” or “logistic”), shape amShape
(logistic only), frequency amFreq
, and amplitude
amDep
:
s027 = soundgen(
  sylLen = 1000, formants = NA,
  # set the depth of AM (0% = none, 100% = max)
  amDep = c(0, 100),
  # set AM frequency in Hz (vectorized)
  amFreq = c(50, 25),
  # set the shape: 0 = close to sine, -1 = notches, +1 = clicks
  amShape = 0,
  # asymmetrical attack: 20 ms at the beginning and 140 ms at the end
  attackLen = c(20, 140),
  plot = TRUE, heights = c(1, 1), ylim = c(0, 1), windowLength = 100,
  play = playback)
A common special case of modifying the amplitude envelope of a
synthesized or recorded sound is compression, which helps to keep the
amplitude relatively stable throughout the duration of the
signal. There is a separate function for achieving this, namely
compressor()
AKA flatEnv()
:
s = rnorm(500) * seq(1, 0, length.out = 500)
s1 = compressor(s, samplingRate = 1000, plot = TRUE,
                killDC = TRUE, windowLength_points = 50)
Another common modification is to fade the sound in and/or out. One
way to do this is to change the attack (which affects both the beginning
and the end) or to use amplitude anchors. On other occasions, or if your
sound already exists and you want to change it, the way to go about it
is to use a separate function, fade()
. This also gives you
more options, e.g. different attack shapes, while soundgen() defaults to
linear fade-in/out for attack.
# Create a sound with sharp attack
s028 = soundgen(sylLen = 300, pitch = 800, addSilence = 0, attackLen = 10)
# playme(s028)
s029 = fade(s028, fadeIn = 50, fadeOut = 100, samplingRate = 16000,
            shape = 'logistic', steepness = 1, plot = TRUE)
# playme(s029)
# different fades are available: linear, logarithmic, etc.
TIP: attackLen in soundgen is applied only to the voiced source, and before it is filtered (i.e., before formants are added). In case of artifacts, increase attackLen or apply fade() after synthesizing the sound.
Argument formants
(tab “Tract / Formants” in the app)
sets the formants – frequency bands used to filter the excitation
source. Just as an equalizer in a sound player amplifies some
frequencies and dampens others, appropriate filters can be applied to a
tonal sound to make it resemble a human voice saying different vowels.
Formants are created in the frequency domain using all-pole models if
all formant amplitudes are positive and zero-pole models if there are
anti-formants with negative amplitudes (Stevens, 2000, ch. 3).
Using presets for callers M1 and F1, you can directly specify a
string of vowels. When you call soundgen
with
formants = 'aouuuui'
or some such character string, the
values are taken from presets$M1$Formants
(or
presets$F1$Formants
if the speaker is “F1” in the Shiny
app). Formants can remain the same throughout the vocalizations, or they
can move. For example, formants = 'ai'
produces a sound
that goes smoothly from [a] to [i], while formants = 'aaai'
produces mostly [a] with a rapid transition to [i] at the very end.
Argument formantStrength
(“Formant prominence” in the app)
adjusts the overall effect of all formant filters at once, and
formantWidth
scales all bandwidths.
s030 = soundgen(formants = 'ai', play = playback)
s031 = soundgen(formants = 'aaai', play = playback)
Presets give you some rudimentary control over vowels. More subtle control is necessary for animal sounds, as well as for human vowels that are not included in the presets dictionary or for non-default speakers. For such cases you will have to specify at least the frequency of each formant (and optionally, also amplitude, bandwidth, and time stamps for each value). The easiest, and normally sufficient, approach is to specify frequencies only and have soundgen() figure out the appropriate amplitude and bandwidth for each formant. Bandwidth is calculated from frequency using a formula derived from human phonetic research. Namely, above 500 Hz it follows the original formula known as “TMF-1963” (Tappert, Martony, and Fant, 1963), and below 500 Hz it applies a correction to allow for energy losses at low frequencies (Khodai-Joopari & Clermont, 2002). Below 250 Hz the bandwidth starts to decrease again, in a purely empirical attempt to achieve reasonable values even for formant frequencies below ordinary human range. See the internal function soundgen:::getBandwidth() if you are interested and note that for anything but ordinary human voices it may be safer to specify formant bandwidths manually.
freqs = 2 ^ seq(log2(20), log2(20000), length.out = 500)
plot(freqs, soundgen:::getBandwidth(freqs), type = 'l',
     log = 'xy', xlab = 'Center frequency, Hz',
     ylab = 'Bandwidth, Hz',
     main = 'Default formant bandwidths')
abline(v = 250, lty = 3)
abline(v = 500, lty = 3)
Formant amplitudes are normally assumed to be determined by their frequency and bandwidth (see Stevens, 2000), but you can override this by specifying amplitudes explicitly. Note that changing the amplitude (or frequency, or bandwidth) of one formant affects the entire spectrum above it.
For moving formants, provide multiple values, which assumes equal time steps, or specify time points explicitly, where time varies from 0 to 1 (to be scaled appropriately depending on the length of sound). For example:
# shorthand specification with three stationary formants
formants = c(300, 2500, 3200)

# shorthand specification with two moving formants
formants = list(f1 = c(300, 900), f2 = c(2500, 1500))

# full specification with two moving formants and non-default amplitude and bandwidth
formants = list(
  f1 = list(freq = c(300, 900),
            amp = c(30, 10),
            width = 120),
  f2 = list(time = c(0, .2, 1),  # "time" is only needed for non-equidistant anchors
            freq = c(2500, 2400, 1500),
            amp = 30,
            width = c(0, 220, 240)))
Feed these lists into soundgen() to hear what they sound like.
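For example, using the full specification above (sylLen and pitch here are arbitrary):
s_formants = soundgen(sylLen = 800, pitch = 150,
                      formants = formants, play = playback)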
In addition to user-specified formants, higher formants are added
automatically based on the vocal tract length estimated from the
user-specified formant frequencies. The function that estimates
vocal tract length is imaginatively called estimateVTL:
estimateVTL(formants = c(400, 1800, 2550, 4100), plot = TRUE)
## [1] 15.73333
# ~15.7 cm
A more general function called schwa()
both estimates
VTL and allows you to compare measured formant frequencies with those
expected for a neutral schwa sound and perform more sophisticated
operations with formants. See ?schwa
for more details if
you are working with vowels.
schwa(formants = c(820, 1320, 2550, 4100), plot = TRUE)
## $vtl_apparent
## [1] 16.08047
##
## $formantDispersion
## [1] 1100.714
##
## $ff_measured
## [1] 820 1320 2550 4100
##
## $ff_schwa
## [1] 550.3571 1651.0714 2751.7857 3852.5000
##
## $ff_relative
## [1] 48.994160 -20.051914 -7.332901 6.424400
##
## $ff_relative_semitones
## [1] 6.903069 -3.874375 -1.318451 1.077947
##
## $ff_relative_dF
## [1] 0.2449708 -0.3007787 -0.1833225 0.2248540
It is usually useful to allow soundgen to create upper formants
automatically based on the estimated VTL, since upper formants are
typically less perceptually salient. As a result, we don’t want to spend
too much time specifying them manually, but still we don’t want them to
be completely absent, either, since that makes for very weak high
frequencies in the spectrum. An approximation based on standard
open-closed tube models is often good enough for calculating the missing
formant frequencies. You can remove them by setting
temperature = 0
or formantDepStoch = 0
, but
note that without higher formants the entire spectrum loses energy at
higher frequencies:
s032 = soundgen(formants = c(800, 1200), play = playback, plot = TRUE)
s033 = soundgen(formants = c(800, 1200), formantDepStoch = 0,
                play = playback, plot = TRUE)
Another useful method is to specify vocal tract length without any formants. In this case soundgen() approximates a neutral schwa sound for an animal with a vocal tract that looks like a uniform tube of this length. Crude, but often sufficient. A toy example (check presets for some more realistic sounds created using this method):
s034 = soundgen(
  sylLen = 800, formants = NULL, rolloff = -6,
  vocalTract = c(12, 18, 19), formantCeiling = 5,
  play = playback, plot = TRUE)
For very long vocal tracts, it is advisable to increase
formantCeiling
to ~5-10 (i.e., 5 to 10 times the Nyquist
frequency), otherwise the filter dampens high frequencies too much. The
default setting is formantCeiling = 2
, which is much faster
and not too inaccurate for human-length vocal tracts. Note that vocal
tract length can be variable, providing an easy way to create parallel
formant transitions.
Unlike the argument “mouth”, a variable “vocalTract” parameter
recalculates formant bandwidths (unless these are specified manually)
and is thus more accurate, but it requires you to specify a reasonable
vocal tract length, in cm. If you provide both a few formant frequencies
and a variable vocalTract
, formants are synthesized as
specified at the initial value of vocalTract
(that is, at
the very beginning of the sound) and can deviate afterwards as VTL
changes. Careful with such combinations: make sure the initial
value of the VTL is reasonable for these formant frequencies (any VTL is
fine towards the end of the sound - the user-specified formants will
move accordingly). Examples:
formants = list(f1 = c(500, 250), f2 = c(1500, 2800), f3 = 3500, f4 = 4300)
estimateVTL(formants)  # ~13.6 cm, so the initial VTL should be close to 13.6
## [1] 13.58204
s035 = soundgen(
  sylLen = 800, rolloff = -6, formants = formants,
  vocalTract = list(time = c(0, .3, 1), value = c(14, 28, 25)),
  play = playback, plot = TRUE, main = 'Good initial VTL')

# wrong: VTL too high for these formants, causing an overlap
s036 = soundgen(
  sylLen = 800, rolloff = -6, formants = formants,
  vocalTract = list(time = c(0, .3, 1), value = c(18, 28, 25)),
  play = playback, plot = TRUE, main = 'Bad initial VTL')
Sometimes it may be useful to view moving formants before
synthesizing the sound, or you may want to create a spectral envelope
and then apply it to another sound with
transplantFormants()
. The way to create and view a spectral
envelope is to call the function that handles all formant processing
under the hood in soundgen, namely getSpectralEnvelope
:
# plotting directly from getSpectralEnvelope() in spectrogram form
s = getSpectralEnvelope(nr = 1024,  # freq bins in FFT frame (window_length / 2)
                        nc = 50,    # time bins
                        samplingRate = 16000,
                        formants = list(f1 = c(500, 250),
                                        f2 = c(1500, 2800),
                                        f3 = 3500, f4 = 4300),
                        plot = TRUE,
                        dur = 1500,  # just an example
                        colorTheme = 'seewave',
                        lipRad = 6)  # lip radiation, dB/octave
Note that, in addition to formants, lip and nose radiation are also handled by this function (see the next section on mouth opening). This has the effect of boosting high frequencies.
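For example, here is a sketch comparing the spectral envelope with and without lip radiation (the formant frequencies and other settings are arbitrary):
se_lip = getSpectralEnvelope(nr = 512, nc = 50, samplingRate = 16000,
                             formants = c(500, 1500, 2500), lipRad = 6)
se_noLip = getSpectralEnvelope(nr = 512, nc = 50, samplingRate = 16000,
                               formants = c(500, 1500, 2500), lipRad = 0)
plot(as.numeric(rownames(se_lip)), 20 * log10(se_lip[, 1]), type = 'l',
     xlab = 'kHz', ylab = 'dB')  # with lip radiation: more high-frequency energy
lines(as.numeric(rownames(se_noLip)), 20 * log10(se_noLip[, 1]), lty = 2)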
TIP When using the app, you can start with a formant preset by typing in a vowel string, and then you can modify it. This way you don’t have to remember the right format. If you edit the list of formants and nothing in the sound seems to be changing, there may be a misprint, missing comma, etc.
For even more advanced spectral filters, you can specify both formants (poles) and antiformants (zeros). This may be useful if you want to create a nasalized sound. The numbering of formants is arbitrary, as long as they are arranged in the right order. For example, if you want to insert a new formant between F1 and F2 without renaming all higher formants, call it “f1.5” or something like that. It is important to use a non-integer number, since otherwise these additional formants will be inappropriately used to estimate the length of the vocal tract and to add stochastic formants above the ones you specify (that is, if temperature > 0 and vocalTract = NA).
For example, a slow transition from [a] to [a nasalized] might be coded as follows (note that formant f1.7 has negative amplitude, so f1.5 and f1.7 form a pole-zero pair):
formants = list(
  f1 = list(time = c(0, 1), freq = c(880, 900),
            amp = c(25, 15), width = c(80, 120)),
  f1.5 = list(time = c(0, 1), freq = 600,
              amp = c(0, 15), width = 80),   # additional pole
  f1.7 = list(time = c(0, 1), freq = 750,
              amp = c(0, -15), width = 80),  # zero
  f2 = list(time = c(0, 1), freq = c(1480, 1250),
            amp = c(30, 20), width = c(120, 200)),
  f3 = list(time = c(0, 1), freq = c(2900, 3100),
            amp = 25, width = 200))
s037 = soundgen(sylLen = 1500, play = playback, pitch = 140, formants = formants)
spectrogram(s037, samplingRate = 16000, ylim = c(0, 4), contrast = .5,
            windowLength = 10, step = 5, colorTheme = 'seewave')
# long-term average spectrum (less helpful for moving formants but very good for stationary):
# seewave::meanspec(s037, f = 16000, wl = 256)
If you look at the filter towards the end of the sound, you can observe the additional zero-pole pair between the first and second formant:
se = getSpectralEnvelope(nr = 512, nc = 100, formants = formants)
plot(as.numeric(rownames(se)), 20 * log10(se[, ncol(se)]),
     type = 'l', xlab = 'kHz', ylab = 'dB')
In addition to variable vocal tract length (vocalTract
argument), an even easier shortcut for creating parallel formant
transitions without coding all transitions by hand is provided by the
mouth
argument (in the app, tab “Tract / Mouth opening”).
This can be thought of as a hyper-parameter offering an easy way to
define moving formants within a bout (easy because you don’t need to
know the VTL in cm): all formants simply go down relative to specified
values as the mouth closes and rise as it opens (see Moore, 2016).
In addition, an open mouth has lip radiation, which has the effect of
amplifying higher frequencies. Lip radiation is replaced by nose
radiation when the mouth is completely closed, dampening the higher
frequencies, and the vowel is automatically nasalized using a simple
approximation (Hawkins & Stevens, 1985). Basically, with the mouth
closed we switch from a tube open at one end to a tube closed at both
ends and coupled with a (simplified) nasal cavity. Despite being a crude
model of what really happens when a vocalizing animal closes its mouth,
in many cases mouth
can save you a lot of manual coding of
formants. Here is a simple example, with the mouth gradually opening and
closing again:
s038 = soundgen(sylLen = 1200, play = playback, pitch = 140,
                mouth = list(time = c(0, .3, .75, 1),
                             value = c(0, 0, .7, 0)))
spectrogram(s038, samplingRate = 16000,
            ylim = c(0, 4), contrast = .5,
            windowLength = 10, step = 5,
            colorTheme = 'seewave')
TIP Here and elsewhere, I talk about applying soundgen to the
task of synthesizing non-human sounds. It does work, but be aware that
many computational routines are based on human phonetic research, simply
because there is vastly more data available on human vocal production.
For example, formant bandwidths and spectral consequences of
nasalization are estimated based on human phonetics, but it is far from
clear to what extent these equations are applicable to sounds produced
by non-human mammals. Bird calls are again a whole new ball game. And
once you move on to insects or non-biological sounds, just forget about
hyperparameters like mouth
and code everything at the
lowest possible level.
The standard source-filter model (Fant, 1971) assumes that the
vibration of the vocal folds is independent of the configuration of the
supraglottal vocal tract - that is, that the source and filter are
independent. However, in some situations this assumption does not hold,
and the filter can have a noticeable effect on the vibration of the
vocal folds. An example of such source-filter interaction that is
implemented in soundgen is formant locking, in which the fundamental
frequency or a higher harmonic becomes temporarily “locked” to the
frequency of a formant. The relevant parameter is
formantLocking
(0 = none, 1 = the entire sound, vector form
also accepted). See the internal function
soundgen:::lockToFormants()
for more information and
examples.
In humans, spontaneous formant locking is usually observed in high-pitched sounds like screams, although formant matching can also be performed intentionally in order to maximize sonority, as in soprano singing, or to produce unusual vocal effects such as Tuvinian throat singing. In animal vocalizations, formant locking appears to be particularly common when the vocal tract is long and formants are closely spaced (my speculation). For example, elk bugles often contain a series of stepwise pitch jumps from one formant to the next, a bit like this:
s039 = soundgen(
  sylLen = 1500, rolloff = -20,
  pitch = c(500, 1000, 1800, 1700, 500),
  formants = NULL, vocalTract = 55,
  formantLocking = c(0, 1, 1, 1, 1, 0),  # except the beginning and end
  shortestEpoch = 200,  # affects both subharmonics (if any) and formantLocking
  noise = -20,  # just to make formants visible in this example
  temperature = .1,
  samplingRate = 22000, pitchSamplingRate = 22000,
  play = playback, plot = TRUE, ylim = c(0, 8)
)
For some purposes it may be useful to separate the generation of
glottal source (or another source of acoustic excitation) from its
spectral filtering. You may also need to add formants to an existing
waveform. To do so, you can call the helper function
addFormants()
, which normally works under the hood in
soundgen. The algorithm is to take an STFT, multiply the resulting
spectrum by the filter, and then convert it back to time domain via
inverse STFT. The same function can theoretically be used to perform
inverse filtering - that is, to remove formants from a signal - as long
as you can provide a VERY accurate formant filter. See
?addFormants
for more information.
TIP Bewildered by all these formants, antiformants, VTL, etc?
Good news: you can simply lift formants off a real recording and
“transplant” them onto your synthetic sound. See
?transplantFormants
for more information and
examples.
Soundgen produces tonal sounds by means of generating a separate sine
wave for each harmonic. However, it is very tricky to choose the
appropriate strength of each harmonic. The simplest solution is to make
each higher harmonic slightly weaker than the previous one, say by
setting a fixed exponential decay rate from lower to higher harmonics.
The corresponding parameter in soundgen is rolloff
(in the
app, “Source rolloff, dB/octave”). Unfortunately, this is often not
really good enough, necessitating several more control parameters.
Soundgen allows a lot of flexibility when specifying source spectrum.
You can change the basic rolloff of harmonics per octave, producing a
sharper or more gentle decline of energy over frequencies
(rolloffOct
), adjust rolloff depending on f0, so that
high-pitch sounds will have a steeper rolloff (rolloffKHz
),
or add a parabolic correction (rolloffParab
) that affects
the first rolloffParabHarm
harmonics. Working from R
console, the relevant function is getRolloff
. Its arguments
are well-documented: type ?getRolloff
for help. Here is
just a single example:
# strong F0, rolloff with a "shoulder"
r = getRolloff(rolloff = c(-5, -20),  # rolloff parameters are vectorized
               rolloffParab = -10, rolloffParabHarm = 13,
               pitch_per_gc = c(170, 340), plot = TRUE)
# to generate the corresponding sound:
s040 = soundgen(sylLen = 1000, rolloff = c(-5, -20), rolloffOct = 0,
                rolloffParab = -10, rolloffParabHarm = 13,
                pitch = c(170, 340), play = playback)
In the app the relevant parameters are found in the tab “Source /
Rolloff”. To develop an intuition for source spectrum settings, I
recommend practicing with disabled formants in the app (set “Formants
prominence” under “Tract / Formants” to 0). This way you can isolate the
effects of source spectrum and use the preview plot for instant feedback
– it shows the rolloff for the lowest and the highest pitch in your
intonation contour. Rolloff parameters are vectorized, but this
functionality is only available from R console. However, rolloff also
varies over time if temperature is above zero (use
tempEffects$specDep
to control the amount of stochastic
variation of rolloff and other spectral characteristics).
Apart from using rolloff
and related parameters to
control the general shape of excitation spectrum, it is also possible to
control each harmonic individually. Say, you have performed inverse
filtering on a recording to estimate glottal source, giving you the
strength of each harmonic in glottal pulses, and now you wish to hear
the sound corresponding to this glottal source. To do so, use the
rolloffExact
argument and supply a matrix of numeric values
on a scale of 0 to 1. Each row is one harmonic, and the sound will
contain only as many partials as there are rows in
rolloffExact
:
rolloffExact = matrix(c(.1, .2, 1, .02, .2,   # strength of H1-H5 at time 0
                        1, .2, .01, .1, .4),  # strength of H1-H5 at time 1000
                      ncol = 2)
s041 = soundgen(sylLen = 1000, pitch = c(400, 430), formants = NULL,
                rolloffExact = rolloffExact,
                plot = TRUE, ylim = c(0, 4), play = playback)
If you do not want source spectrum to change over time, a single vector instead of a matrix will do:
s042 = soundgen(sylLen = 1000, pitch = c(400, 430), formants = NULL,
                rolloffExact = c(.1, .5, .25, 1, .25, .08, .05, .02),
                plot = TRUE, ylim = c(0, 4), play = playback)
If f0 is very low, as in vocal fry or some animal vocalizations like crocodile roaring or elephant rumbling, individual glottal pulses can be both seen on a spectrogram and perceived as distinct percussion-like acoustic events separated by noticeable pauses. Soundgen can create such sounds by switching to a new mode of production: instead of synthesizing continuous sine waves spanning the entire syllable, it creates each glottal pulse individually (each with its full set of harmonics) and then glues them together with pauses in between.
This is a lot slower than continuous sine wave synthesis and mostly
justified for very low-pitched sounds, since with higher pitch there
will be too few points per glottal cycle to sound convincing without
increasing samplingRate
to astronomical values. Also note
that some spectral artifacts may appear. Example:
# Not a good idea: samplingRate is too low
s043 = soundgen(pitch = c(1500, 800), glottis = 75,
                samplingRate = 16000, play = playback)

# This sounds better but takes a long time to synthesize:
s044 = suppressWarnings(soundgen(
  pitch = c(1500, 800), glottis = 75,
  samplingRate = 80000, play = playback,
  invalidArgAction = 'ignore'))
# NB: invalidArgAction = 'ignore' forces a "weird" samplingRate value
# to be accepted without question

# Now this is what this feature is meant for: vocal fry
s045 = soundgen(
  sylLen = 1500, pitch = c(110, 90), rolloff = -12,
  glottis = c(0, 500),
  jitterDep = 1, shimmerDep = 20,
  # subharmonics not implemented with "glottis"
  play = playback)
spectrogram(s045, samplingRate = 16000, heights = c(1, 1))
Soundgen can add frequency jumps, subharmonics, sidebands
(implemented as a modification of subharmonics or as amplitude
modulation), and approximate deterministic chaos by adding strong jitter
and shimmer. These effects basically make the sound appear noisy / harsh
/ rough / unpredictable / etc. Jitter and shimmer are created by adding
random noise to the periods and amplitudes, respectively, of the
“glottal cycles”. Subharmonics could be created by adding rapid
amplitude and/or frequency modulation, but for maximum flexibility
soundgen uses a different - slightly hacky, but powerful - technique of
literally setting up an additional sine wave for each subharmonic. To
achieve sidebands, the amplitude of each subharmonic is set to be a
function of its distance from the nearest harmonic of the f0 stack; the
rate at which g0 harmonics lose energy away from the nearest f0 harmonic
determines the width of sidebands (subWidth
). This way we
can create either subharmonics or narrow sidebands that vary naturally
as f0 changes over time, producing bifurcations and switching between
different subharmonic regimes (see Wilden et al., 1998).
The main limitation of this approach is that it is too
computationally costly to generate variable numbers of subharmonics for
the entire bout. The solution currently adopted in soundgen is to break
longer sounds into so-called “epochs” with a constant number of
subharmonics in each. The epochs are synthesized separately, trimmed to
the nearest zero crossing, and then glued together with a rapid
crossFade()
. This is suboptimal, since it shortens the
sound and may introduce audible artifacts at transitions between epochs.
shortestEpoch
controls the approximate minimum length of
each epoch. Longer epochs minimize problems with transitions, but the
behavior of subharmonics then becomes less variable since their number
is constrained to be constant within each epoch. NB: with short
shortestEpoch
, add ~20-30 ms per transition to the nominal
sound duration in order to compensate for cross-fading the epochs.
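For example, a sketch contrasting epoch lengths (the values are arbitrary): with a pitch contour that crosses the subharmonic threshold, a shorter shortestEpoch permits more frequent regime changes at the cost of more transitions.
s_epochs = soundgen(sylLen = 1500, pitch = c(400, 600, 350),
                    subFreq = 150, subDep = 20,
                    shortestEpoch = 300,  # try 50 for more regime changes
                    play = playback, plot = TRUE)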
To add nonlinear effects stochastically, you can use
nonlinBalance
, which regulates approximately what
proportion of the sound is affected. At temperature > 0,
nonlinBalance
creates a random walk that divides each
syllable into epochs defined by their regime, using two thresholds to
determine when a new regime begins (see Fitch et al., 2002):
Regime 1: no nonlinear effects. If nonlinBalance = 0%, the whole syllable is in regime 1.
Regime 2: subharmonics only. Note that subharmonics are only added to segments with subFreq < f0 / 2.
Regime 3: subharmonics and jitter. If nonlinBalance = 100%, the whole syllable is in regime 3.
To see any effect, you have to set jitterDep, shimmerDep, and subFreq/subDep/subWidth to some positive values. With nonlinBalance < 100%, the result is a stochastic combination of the three regimes (tonal, subharmonics, subharmonics + jitter + shimmer):
s046 = soundgen(
  sylLen = 1500, pitch = c(170, 420, 400, 190),
  nonlinBalance = 60,
  subDep = 10, jitterDep = 1.5, shimmerDep = 25,
  play = playback, plot = TRUE, ylim = c(0, 5))
To add nonlinearities non-stochastically (exactly where you want
them), keep nonlinBalance
at the default value of 100% and
specify the nature and timing of nonlinearities manually. To add a
single subharmonic between each pair of f-harmonics (period doubling),
set subRatio = 2
, for period tripling,
subRatio = 3
, etc. This number of subharmonics will be
added regardless of pitch changes. Another way is to set
subFreq
(“Target subharmonic frequency, Hz” in the app),
which gives an approximate g0 target, so that the number of subharmonics
will vary with pitch. The amplitude (loudness) of subharmonics is
controlled by subDep
(“Depth of subharmonics”). All these
parameters can vary within a syllable.
Add one subharmonic regardless of pitch:
s047 = soundgen(subRatio = 2, subDep = c(5, 20),
                sylLen = 800, pitch = c(700, 1300), formants = NULL,
                play = playback, plot = TRUE, ylim = c(0, 3))
Set target g0, so that the number of subharmonics depends on pitch:
s048 = soundgen(subFreq = 400, subDep = c(5, 20),
                sylLen = 800, pitch = c(700, 1300), formants = NULL,
                play = playback, plot = TRUE, ylim = c(0, 3))
Sidebands are best demonstrated with high-pitched sounds and low
subharmonic frequencies. For example, chimpanzees emit piercing screams
with narrow subharmonic bands. If we set subFreq
to 75 Hz
and subWidth
to 130 Hz, subharmonics literally form a band
around each harmonic of the main stack, creating a very distinct,
immediately recognizable sound quality:
s049 = soundgen(
  sylLen = 800,
  pitch = list(time = c(0, .3, .9, 1),
               value = c(1200, 1547, 1487, 1154)),
  rolloff = -3, rolloffKHz = 0,
  # gradually increasing width of sidebands at 0-600 ms
  subFreq = 75, subDep = 25,
  subWidth = data.frame(time = c(0, 600, 650, 800),
                        value = c(0, 130, 0, 0)),
  vocalTract = 12, mouth = c(.1, .8, .1),
  temperature = .001,
  pitchSamplingRate = 22050, samplingRate = 22050,
  play = playback, plot = TRUE, ylim = c(0, 5))
Another way to create sidebands is to add amplitude modulation (AM). Perfectly sinusoidal AM creates a simple pair of extra harmonics, while non-sinusoidal AM creates sidebands:
s050 = soundgen(
  sylLen = 800,
  pitch = list(time = c(0, .3, .9, 1),
               value = c(1200, 1547, 1487, 1154)),
  rolloff = -3, rolloffKHz = 0,
  # gradually increasing AM depth at 0-600 ms creates widening sidebands
  amFreq = 75, amShape = .1,
  amDep = list(time = c(0, 600, 650, 800),
               value = c(0, 100, 0, 0)),
  vocalTract = 12, mouth = c(.1, .8, .1),
  temperature = .001,
  pitchSamplingRate = 22050, samplingRate = 22050,
  play = playback, plot = TRUE, ylim = c(0, 5))
TIP The parameters regulating nonlinear effects are vectorized,
so you can write subDep = c(0, 130), jitterDep = c(0, 1)
,
etc., or use the “anchor format” as above (console only, not available
in the app)
As for jitter (regime 3), it wiggles both the f0 and g0 harmonic
stacks, blurring the spectrum. Parameter jitterDep
(“Jitter
depth, semitones” in the app) defines how much the pitch fluctuates,
while jitterLen
(“Jitter period, ms”) defines how rapid
these fluctuations are. Slow jitter with a period of ~50 ms produces the
effect of a shaky, unsteady voice. It may sound similar to a vibrato,
but jitter is irregular. Rapid jitter with a period of ~1 ms, especially
in combination with subharmonics, may be used to imitate deterministic
chaos, which is found in voiced but highly irregular animal sounds such
as barks, roars, noisy screams, etc. This works best for high-pitched
sounds like screams. Shimmer is similar to jitter, except that it
defines random fluctuations of the amplitude rather than frequency. It
is controlled by two arguments, shimmerDep
(percent) and
shimmerLen
(ms).
s051 = soundgen(jitterLen = 40, jitterDep = 1, # shaky voice
  shimmerLen = 30, shimmerDep = 30,
  sylLen = 1000, pitch = c(150, 170),
  play = playback, plot = TRUE, ylim = c(0, 3))
s052 = soundgen(jitterLen = 1, jitterDep = 1, # harsh voice
  shimmerLen = 1, shimmerDep = 10,
  sylLen = 1000, pitch = c(150, 170),
  play = playback, plot = TRUE, ylim = c(0, 3))
Jitter + shimmer + subharmonics work well together. For example, barks of a small, annoying dog can be roughly approximated with this minimal code (ignoring respiration to keep things simple):
s053 = soundgen(repeatBout = 2, sylLen = 140, pauseLen = 100,
  vocalTract = 8, formants = NULL, rolloff = 0,
  pitch = c(1100, 1600, 1100), mouth = c(0, 0.5, 0),
  jitterDep = 1, subDep = 60, play = playback)
Note that jitter is random variation around a target value, but
soundgen also has relatively slow random pitch drift, which is
implemented as a random walk and can thus wander into values quite far
removed from the target. Random pitch drift is added whenever
temperature > 0. Use tempEffects
to regulate its amount
and frequency:
# slight and slow (slightly unsteady voice)
s054 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 1, pitchDriftFreq = .5),
  play = playback, plot = TRUE, ylim = c(0, 2))
# strong and rapid (trembling voice, similar to jitter)
s055 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 5, pitchDriftFreq = 5),
  play = playback, plot = TRUE, ylim = c(0, 2))
# both drift and jitter (trembling voice ending with some "chaos")
s056 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 5, pitchDriftFreq = 5),
  jitterDep = c(0, 0, 0, 2),
  play = playback, plot = TRUE, ylim = c(0, 2))
There is no way to synthesize true deterministic chaos with residual harmonic structure in soundgen. However, there are several roundabout ways to achieve a comparable effect. As already mentioned, strong jitter and shimmer create harsh sounds that are perceptually similar to deterministic chaos, especially for higher f0 values:
s057 = soundgen(
  sylLen = 1200,
  pitch = list(
    time = c(0, 110, 111, 180, 350, 940, 941, 1100, 1200),
    value = c(700, 1150, 1550, 2000, 2240, 1940, 1180, 900, 500)),
  temperature = 0.05, tempEffects = list(pitchDep = 0),
  jitterDep = list(time = c(0, 200, 201, 900, 901, 1200),
                   value = c(0, 0, 1.7, 1.2, 0, 0)),
  formants = c(900, 1300, 3300, 4300),
  attackLen = c(10, 200),
  samplingRate = 44100, play = playback, plot = TRUE, ylim = c(0, 5))
## pitchSamplingRate should be much higher than the highest pitch; resetting to 22400 Hz
Another method is to encode very rapid pitch jumps, say between f0 and a formant or between harmonically related values, like this:
s058 = soundgen(
  sylLen = 1200,
  pitch = list(
    time = c(0, 80, 81, 230, 231, 385,
             # 500 time anchors here - an episode of "chaos"
             seq(385, 850, length.out = 500),
             851, 1020, 1021, 1085),
    value = c(700, 1130, 1000, 1200, 1860, 1840,
              # random f0 jumps b/w 1.2 & 1.8 kHz
              sample(c(1200, 1800), size = 500, replace = TRUE),
              1620, 1540, 1220, 900)),
  temperature = 0.05,
  tempEffects = list(pitchDep = 0),
  jitterDep = .3,
  rolloffKHz = 0, rolloff = 0, formants = c(900, 1300, 3300, 4300),
  samplingRate = 44100, play = playback, plot = TRUE, ylim = c(0, 5))
## pitchSamplingRate should be much higher than the highest pitch; resetting to 18600 Hz
Incidentally, you can use similar tricks for introducing variation in
any soundgen parameter. For example, you can use runif()
or
rnorm()
to randomly vary things like mouth opening, pitch, or amplitude. That’s the best part of working in R!
s059 = silence
# run several times to appreciate the randomness
for (i in 1:5) s059 = c(s059, soundgen(
  sylLen = 800,
  mouth = rnorm(n = 5, mean = .5, sd = .3)
), silence)
# playme(s059)
Sometimes it may be necessary to control precisely the timing of each
nonlinear regime. For example, in an experiment a sound containing
nonlinear effects may need to be synthesized repeatedly, varying one
parameter and preserving everything else, including nonlinear regimes.
To control the timing of nonlinear effects manually, set
nonlinBalance
to 100 (the entire vocalization) and vary the
strength of nonlinear effects with their vectorized “depth”
settings:
s060 = soundgen(
  # nonlinear settings
  jitterDep = c(0, 0, 1.5, .5), shimmerDep = c(0, 0, 15, 5),
  # settings for high precision
  temperature = .001, dynamicRange = 120,
  samplingRate = 22050, pitchSamplingRate = 22050,
  # other settings
  sylLen = 1000, pitch = c(240, 200),
  rolloff = c(-20, -18, -23, -28) + 4, vibratoDep = .2,
  formants = c(800, 1400, 2500, 3700, 5000, 6800),
  noise = list(time = c(0, 340, 900, 1000),
               value = c(-60, -45, -60, -80) + 10),
  rolloffNoise = 0,
  mouth = c(.55, .5, .45, .6),
  play = playback, plot = TRUE, ylim = c(0, 4)
)
TIP For analytical-precision work, set
pitchSamplingRate
to the same (high) value as
samplingRate
, say 22050. By default,
pitchSamplingRate
is much lower to speed up the synthesis,
but then sound duration can vary considerably depending on nonlinear
regimes, especially in sounds like screams with highly variable
pitch.
In the example above jitterDep = c(0, 0, 1.5, .5)
means
that there is no jitter roughly in the first half of the voiced
fragment, then a jitter of 1.5 semitones, and then .5 semitones towards
the end. For more precision, use the “anchor format”. The same goes for
all vectorized parameters: jitterLen, shimmerDep, shimmerLen, subFreq,
subDep, rolloff settings, etc. For example, to turn on jitter abruptly
at 300 ms and turn it off again at 500 ms, and to have shimmer only
between 600 and 800 ms, we can modify the code as follows (it still
won’t be precise down to a millisecond, though):
s061 = soundgen(
  # nonlinear settings
  jitterDep = list(
    time = c(0, 300, 301, 500, 501, 1000),
    value = c(0, 0, 1.5, 1.5, 0, 0)
  ),
  shimmerDep = list(
    time = c(0, 600, 601, 800, 801, 1000),
    value = c(0, 0, 40, 40, 0, 0)
  ),
  # settings for high precision
  temperature = .001, dynamicRange = 120,
  samplingRate = 22050, pitchSamplingRate = 22050,
  # other settings
  addSilence = 0, # easier to check timing
  sylLen = 1000, pitch = c(240, 200),
  rolloff = c(-20, -18, -23, -28), vibratoDep = .2,
  formants = c(800, 1400, 2500, 3700, 5000, 6800),
  noise = list(time = c(0, 340, 900, 1000),
               value = c(-60, -45, -60, -80) + 30),
  rolloffNoise = -8,
  mouth = c(.55, .5, .45, .6),
  play = playback, plot = TRUE, ylim = c(0, 4)
)
Here is another method of controlling the timing of nonlinear
phenomena. When nonlinBalance < 100
, soundgen divides a
sound into different nonlinear regimes by generating a random walk,
which also controls the drift of some other control parameters. Setting
temperature at nearly zero (say, at 0.001) removes random variation in
most control parameters, but the random walk for nonlinear effects still
remains random. To standardize that random walk as well, use
nonlinRandomWalk
. nonlinRandomWalk
should be a
vector containing 0, 1, and 2, where 0 = no nonlinearities, 1 =
subharmonics, and 2 = subharmonics + jitter + shimmer. The number and
order of 0/1/2 determines which nonlinear regime is active at which
time. For example, this will make a sound with no effect in the first
third, subharmonics in the second third, and jitter in the final third
of the total duration:
rw_bin = c(rep(0, 100), rep(1, 100), rep(2, 100))
s062 = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 70, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
Make nonlinRandomWalk
a fairly long vector for greater
precision, i.e. not just c(0, 1)
- because of the way
approx
works, that will NOT split the sound into 50% with
no nonlinear effects and 50% with subharmonics. Instead, write
c(rep(0, 50), rep(1, 50))
or some such.
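For example, here is a sketch of an even 50/50 split between a tonal regime and subharmonics, reusing the settings of s062 above:
rw_5050 = c(rep(0, 50), rep(1, 50))
s = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 70,
  nonlinRandomWalk = rw_5050,
  play = playback, plot = TRUE, ylim = c(0, 4))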
You can also generate an actual random walk and then use it in several sounds to make sure their nonlinear effects have exactly the same timing. For example, here are two sounds with different pitch levels, but identical otherwise, including identical nonlinear regimes:
# set up a random walk (repeat until satisfied with the contour)
rw = getRandomWalk(len = 1000, rw_range = 100,
  trend = c(0.5, -0.5), rw_smoothing = .95)
rw_bin = getIntegerRandomWalk(rw, minLength = 100, plot = TRUE)
# synthesize two sounds with identical nonlinear effects but different f0
s063 = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 20, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
s064 = soundgen(sylLen = 800, pitch = 500, temperature = 0.001,
  subFreq = 100, subDep = 20, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
In addition to the tonal (harmonic, voiced) component, which is synthesized as a stack of harmonics (sine waves), soundgen produces turbulent noise (unvoiced component). This noise can be added to the voiced component to create breathing, sniffing, snuffling, hissing, gargling, etc. It is often appropriate to include at least some noise in synthetic vocalizations, if only because it is more natural to have noise instead of harmonics in the upper part of the source spectrum.
The perceptual quality of turbulent noise depends on its spectral
composition, which is controlled by two soundgen arguments:
formantsNoise
and rolloffNoise
(in the app,
use “Tract / Unvoiced type”). Noise is generated as white noise with
spectral rolloff given by rolloffNoise
(“Noise rolloff,
dB/octave” in the app) above a certain cutoff value
(noiseFlatSpec, the default is currently 1200 Hz). The
timing of the unvoiced component relative to the voiced component is
controlled by the argument noise
, which is discussed in the
next section. There are two basic types of turbulent noise in
soundgen:
1. Aspiration noise: the spectral envelope is shaped by the voiced component’s formants together with the rolloffNoise setting, while formantsNoise is NULL or NA. This is useful for adding noise that originates deep in the throat, close to the vocal cords. To generate breathing, specify noise, but leave formantsNoise blank (NA, which is its default value). Soundgen then assumes that the unvoiced component should have the same formant structure as the voiced component.
s065 = soundgen(
  sylLen = 500,
  noise = list(time = c(0, 800), value = c(-20, -10)),
  formantsNoise = NA, # breathing - same formants as for voiced
  play = playback, plot = TRUE)
# observe that the voiced and unvoiced components have exactly the same formants
2. Other turbulent noise (e.g. hissing): specify its formant structure (formantsNoise) manually in exactly the same format as for the voiced component (formants).
s066 = soundgen(
  sylLen = 200, pitch = c(150, 120),
  noise = list(time = c(180, 250, 400), value = c(-20, -10, -50)),
  # specify noise filter ≠ voiced filter to get ~[s]
  formantsNoise = list(f1 = list(freq = 7000, amp = 40, width = 1500)),
  rolloffNoise = 0,
  play = playback, plot = TRUE)
# observe that the voiced and unvoiced components have different formants
TIP: pitch = NA
or NULL
removes the
voiced component, so that only turbulent noise is synthesized. In the
app, untick the box Intonation / Intonation syllable / “Generate voiced
component?”
If formantsNoise = NA
or NULL
(i.e., if
this is aspiration noise), formant structure is calculated based on
vocal tract length, and then extra stochastic formants are added as
usual. For example, to create simple sighs, you can just specify the
length of your creature’s vocal tract:
s067 = soundgen(
  vocalTract = 15.5, # ~human throat (15.5 cm)
  formants = NULL, attackLen = 200, play = playback,
  noise = list(time = c(0, 800), value = c(40, 40)))
# NB: since there is no voiced component, we control syllable length
# by specifying the appropriate noise$time, in this case 0 to 800 ms
s068 = soundgen(
  vocalTract = 30, # a large animal
  formants = NULL, attackLen = 200, play = playback,
  sylLen = 800, noise = 40) # another way to specify the length
# NB: voiced component is not generated if noise$value >= 40 dB
s069 = soundgen(
  vocalTract = 100, invalidArgAction = 'ignore', # a whale
  formants = NULL, attackLen = 200, play = playback,
  sylLen = 800, pitch = NULL, noise = 0)
# Another way to remove the voiced component is to write pitch = NULL
In contrast, if formantsNoise
is specified explicitly
(i.e., if this is not aspiration noise), the noise is by default
NOT enriched with stochastically added formants. To avoid losing all
high-frequency energy in your noise, make sure you add a sufficient
number of formants in formantsNoise
, ideally all the way up
to Nyquist frequency (half the sampling rate). Alternatively, you can
explicitly specify vocalTract
, and then extra formants will
be added to the unvoiced component. Compare:
# only two specified formants
s070 = soundgen(pitch = NULL,
  formantsNoise = c(1000, 2000),
  noise = 40, sylLen = 800,
  play = playback, plot = TRUE)
# two specified formants plus extra formants based on vocalTract
s071 = soundgen(vocalTract = 15.5,
  pitch = NULL,
  formantsNoise = c(1000, 2000),
  noise = 40, sylLen = 800,
  play = playback, plot = TRUE)
The excitation source for the unvoiced component can be synthesized
as white noise (if rolloffNoise = 0
) or as turbulent noise
with a spectrum that linearly (not exponentially!) loses power above
noiseFlatSpec
(the default is 1200 Hz). The parameter
rolloffNoise
thus controls the source spectrum of the
unvoiced component:
s072 = soundgen(vocalTract = 17.5,
  noise = 40, rolloffNoise = c(5, -20),
  formants = NULL, attackLen = 200,
  play = playback, plot = TRUE)
# NB: noise amplitude may change as rolloffNoise changes
In the Shiny app, the tab “Source / Unvoiced timing” is for specifying the amplitude contour of the unvoiced component. In soundgen, the relevant argument is noise = data.frame(time = ..., value = ...). It sets the timing and loudness of turbulent noise relative to the voiced component of a typical syllable. Starting from soundgen v1.4, there are three options for how the amplitudes of the voiced and unvoiced components are compared:
1. If noiseAmpRef = 'f0', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the first harmonic (f0) of the voiced component. Until soundgen v1.4, this was the only available option. This setting makes the balance between harmonics and noise dependent on the source spectrum (rolloff settings) and the formants of both components. Example:
s073 = soundgen(noiseAmpRef = 'f0', rolloff = -1,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s074 = soundgen(noiseAmpRef = 'f0', rolloff = -15,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
2. If noiseAmpRef = 'source', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the unfiltered voiced component (“glottal source”). In other words, you are specifying how loud the noise is relative to the source of excitation, but the actual balance between harmonics and noise can vary depending on the formant structure. Example:
# Harmonics-noise balance doesn't depend on rolloff...
s075 = soundgen(noiseAmpRef = 'source', rolloff = -15, rolloffNoise = 0,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s076 = soundgen(noiseAmpRef = 'source', rolloff = -1, rolloffNoise = -20,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
# ...but it does depend on the formant structure
s077 = soundgen(noiseAmpRef = 'source', formants = 'a',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s078 = soundgen(noiseAmpRef = 'source', formants = 'u',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
3. If noiseAmpRef = 'filtered', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the filtered voiced component (after adding formants). The balance between harmonics and noise therefore doesn’t depend on either rolloff or formants. This is the default option. Example:
s079 = soundgen(noiseAmpRef = 'filtered', formants = 'a',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s080 = soundgen(noiseAmpRef = 'filtered', formants = 'u',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
TIP If you need a very precise balance of harmonics and noise based on another normalization (RMS amplitude, minimum or median instead of maximum amplitude, etc.), you can always synthesize the harmonic and noise components separately and then simply add them up at whatever amplitudes you like
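For example, here is a minimal sketch of such manual mixing that equalizes the RMS amplitude of the two components before adding them (the factor of 0.5 and all synthesis settings are arbitrary):
# synthesize the two components separately
voiced = soundgen(sylLen = 500, pitch = 250, temperature = 0.001)
unvoiced = soundgen(sylLen = 500, pitch = NA, vocalTract = 15.5,
  formants = NULL, noise = 40, temperature = 0.001)
# pad the shorter component with zeros to equal length
len = max(length(voiced), length(unvoiced))
voiced = c(voiced, rep(0, len - length(voiced)))
unvoiced = c(unvoiced, rep(0, len - length(unvoiced)))
# mix at equal RMS, with noise at half the RMS of the voiced part
rms = function(x) sqrt(mean(x ^ 2))
mix = voiced / rms(voiced) + 0.5 * unvoiced / rms(unvoiced)
mix = mix / max(abs(mix)) # rescale to [-1, 1] to avoid clipping
# playme(mix)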
If you want noise to be time-locked to the voiced component, make the
argument noise
a numeric vector. Anchor format with
time = c(), value = c()
is more flexible, but note that it
may cause the noise and voiced components to be slightly out of sync
(which may be useful if you want noise to extend beyond the voiced
segment).
s081 = soundgen(nSyl = 2,
  noise = c(-10, 0),
  plot = TRUE, ylim = c(0, 4), play = playback)
Turbulent noise is allowed to fill the pauses between syllables, but not between bouts. For example, in this two-syllable bout noise carries over after the end of each voiced component, since syllable duration is 120 ms and the last breathing time anchor is 209 ms:
s082 = soundgen(
  nSyl = 2, sylLen = 120, pauseLen = 120,
  temperature = 0.001, rolloffNoise = -2,
  noise = list(time = c(39, 56, 209),
               value = c(-40, 0, -20)),
  formants = list(f1 = c(860, 530), f2 = c(1280, 2400)),
  formantsNoise = list(f1 = c(420, 1200)),
  plot = TRUE, ylim = c(0, 4), play = playback)
Note that in the previous example formantsNoise
defines
the change of filter for the unvoiced components over the entire bout,
i.e. across multiple syllables. This is similar to the way
formants
define the global change in formants across
syllables. In contrast, if you have multiple bouts with one syllable in
each, the change of unvoiced filter plays out within each bout, and the
pause between the bouts is counted from the end of the unvoiced
component, without any overlap between bouts. Compare the example above
to the following (the only change is to use repeatBout
instead of nSyl
). Observe the behavior of
formantsNoise
(moving within each bout) and the duration of
the pause between the syllables (~30 ms) and between the bouts (120
ms):
s083 = soundgen(
  repeatBout = 2, nSyl = 2,
  sylLen = 120, pauseLen = 120,
  temperature = 0.001, rolloffNoise = -2,
  noise = list(time = c(39, 56, 209),
               value = c(-40, 0, -20)),
  formants = list(f1 = c(860, 530), f2 = c(1280, 2400)),
  formantsNoise = list(f1 = c(420, 1200)),
  plot = TRUE, ylim = c(0, 4), play = playback)
Both the timing and the amplitude of noise anchors are defined
relative to the voiced component. Because noise can extend beyond voiced
fragments, however, time anchors for noise MUST be specified in ms
(unlike all the other contours, which also accept time anchors on any
arbitrary scale, say 0 to 1). If the noise starts before the voiced
part, the first time anchor will be negative. This is easier to
visualize in the app, which provides a preview. From R console, you can
also preview the noise amplitude contour implied by your anchors by
calling getSmoothContour
, for example:
a = getSmoothContour(anchors = list(time = c(-50, 200, 300),
                                    value = c(-80, 20, -80)),
  voiced = 200,
  normalizeTime = FALSE, # keep time in ms
  plot = TRUE, ylim = c(-80, 40), main = '')
TIP: if the voiced part is shorter than
permittedValues['sylLen', 'low']
, it is not synthesized at
all, so you only get the unvoiced component (if any). The voiced part is
also not synthesized if the noise is at its loudest, namely
permittedValues['noiseAmpl', 'high']
(40 dB)
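To look up these thresholds from the console:
permittedValues['sylLen', ] # default, low, and high values
permittedValues['noiseAmpl', ] # the 'high' value is 40 dB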
To achieve a complex vocalization, sometimes it may be necessary - or
easier - to synthesize two or more sounds separately and then combine
them. If the components are strictly consecutive, you can simply
concatenate them with c()
. If there is no silence in between, it is safer to use crossFade(), since simple concatenation can introduce transients (clicks) at the junction:
par(mfrow = c(1, 2))
sound1 = sin(2 * pi * 1:5000 * 100 / 16000) # pure tone, 100 Hz
sound2 = sin(2 * pi * 1:5000 * 200 / 16000) # pure tone, 200 Hz
# simple concatenation
comb1 = c(sound1, sound2)
# playme(comb1) # note the click
plot(comb1[4000:5500], type = 'l', xlab = '', ylab = '')
# note the abrupt transition, which creates the click
# spectrogram(comb1, 16000)
# cross-fade
comb2 = crossFade(sound1, sound2, samplingRate = 16000, crossLen = 50)
# playme(comb2) # no click
plot(comb2[4000:5500], type = 'l', xlab = '', ylab = '')
# gradual transition
# spectrogram(comb2, 16000)
par(mfrow = c(1, 1))
Here is a more elaborate example, in which two components of the same syllable are so different that it’s easier to synthesize them separately and then cross-fade, rather than to try and find a set of parameters that will generate the entire syllable in one go:
cow1 = soundgen(sylLen = 1400,
  pitch = list(time = c(0, 11/14, 1),
               value = c(75, 130, 200)),
  temperature = 0.1,
  rolloff = -6, rolloffOct = -3, rolloffParab = 12,
  mouthOpenThres = 0.6,
  formants = NULL, vocalTract = 36.5,
  mouth = list(time = c(0, 0.82, 1),
               value = c(0.6, 0, 1)),
  noise = list(time = c(0, 1400),
               value = c(-45, -45)),
  rolloffNoise = -4, addSilence = 0)
cow2 = soundgen(sylLen = 310, pitch = c(359, 359),
  temperature = 0.05,
  subFreq = 150, subDep = 70, jitterDep = 1.3,
  rolloff = -6, rolloffOct = -3, rolloffKHz = -0,
  formants = NULL, vocalTract = 36.5,
  noise = list(time = c(0, 26, 317, 562),
               value = c(-80, -33, -32, -80)),
  rolloffNoise = -6,
  attackLen = 0, addSilence = 0)
s084 = crossFade(cow1 * 3, cow2, # adjust the relative volume by scaling
  samplingRate = 16000, crossLen = 150)
# playme(s084, 16000)
spectrogram(s084, 16000, ylim = c(0, 4))
If you want the two sounds to overlap without a cross-fade, you can
use addVectors()
, which pads both waveforms with zeros to the same length and adds them together at the specified insertion point. Note
that in this case cross-fading is not appropriate, so it may be safer to
apply fade-in/out to both sounds to soften the attack. For example, here
is how to add some chirping of birds in the background:
samplingRate = 50000 # >10 times the highest pitch
sound1 = soundgen(sylLen = 700, pitch = 250:180,
  formants = 'aaao', addSilence = 100,
  samplingRate = samplingRate, play = playback)
sound2 = soundgen(nSyl = 2, sylLen = 150,
  pitch = 4300:2200, attackLen = 10,
  formants = NA, temperature = .001,
  pitchCeiling = samplingRate, pitchSamplingRate = samplingRate,
  addSilence = 0, play = playback)
## Resetting samplingRate to 50000 Hz because of high pitch
insertionTime = .1 + .15 # silence + 150 ms
insertionPoint = insertionTime * samplingRate
s085 = addVectors(sound1,
  sound2 * .05, # to make sound2 quieter relative to sound1
  insertionPoint = insertionPoint)
# sound1 and sound2 have attack of 50 and 10 ms, so no clicks
# playme(s085, samplingRate)
spectrogram(s085, samplingRate, windowLength = 10, ylim = c(0, 5),
  contrast = .5, colorTheme = 'seewave')
Sometimes it is desirable to combine the characteristics of two different stimuli, producing some kind of intermediate form - a hybrid or blend. This technique is called morphing, and it is employed regularly and successfully with visual stimuli, but not so often with sounds, because it turns out to be rather tricky to morph audio. Since soundgen creates sounds parametrically, however, morphing becomes much more straightforward: all we need to do is define the rules for interpolating between all control parameters. For example, say we have sound A (100 ms) and sound B (500 ms), which only differ in their duration. To morph them, we could generate five otherwise identical sounds that are 100, 200, 300, 400, and 500 ms long, giving us the originals and three equidistant intermediate forms - that is, if we assume that linear interpolation is the natural way to take perceptually equal steps between parameter values.
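As a minimal sketch of this linear-interpolation assumption, here are the five duration morphs from the example above (all other soundgen settings are left at their defaults):
durations = seq(100, 500, length.out = 5) # 100, 200, 300, 400, 500 ms
morphs = lapply(durations, function(d)
  soundgen(sylLen = d, temperature = 0.001))
# playme(morphs[[3]]) # the middle intermediate form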
In practice this assumption is often unwarranted. For example, the
natural scale for pitch is log-transformed: the perceived distance
between 100 Hz and 200 Hz is 12 semitones, while from 200 Hz to 300 Hz
it is only 7 semitones. To make pitch values equidistant, we would need
to think in terms of semitones, not Hz. For other soundgen parameters it
is hard to make an educated guess about the natural scale, so the most
appropriate interpolation rules remain obscure. For best results,
morphing should be performed by hand, pre-testing each parameter of
interest and creating the appropriate formulas for each morph. However,
for a “quick fix” there is an in-built function, morph
.
morph
takes two calls to soundgen
(as a
character string or a list of arguments) and creates several morphs
using linear interpolation for all parameters except pitch and formant
frequencies, which are log-transformed prior to interpolation and then
exponentiated to go back to Hz. The morphing algorithm can also deal
with arbitrary contours, either by taking a weighted mean of each curve
(method = 'smooth'
) or by attempting to match and morph
individual anchors (method = 'perAnchor'
):
a = data.frame(time = c(0, .2, .9, 1), value = c(100, 110, 180, 110))
b = data.frame(time = c(0, .3, .5, .8, 1), value = c(300, 220, 190, 400, 350))
par(mfrow = c(1, 3))
plot(a, type = 'b', ylim = c(100, 400), main = 'Original curves')
points(b, type = 'b', col = 'blue')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'smooth',
  plot = TRUE, main = 'Morphing curves')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'perAnchor',
  plot = TRUE, main = 'Morphing anchors')
par(mfrow = c(1, 1))
Here is an example of morphing the default neutral [a] into a dog’s bark:
m = suppressMessages(morph(formula1 = list(repeatBout = 2),
  # equivalently: formula1 = 'soundgen(repeatBout = 2)',
  formula2 = presets$Misc$Dog_bark,
  nMorphs = 5, playMorphs = playback))
# use $formulas to access formulas for each morph, $sounds for waveforms
# m$formulas[[4]]
# playme(m$sounds[[3]])
s086 = c(unlist(m$sounds))
# playme(s086)
TIP Morphing a completely unvoiced sound into a voiced one is currently not implemented. Add a very quiet voiced component to avoid glitches. Also try to make formants and formantsNoise compatible in both formulas: either leave both NULL or specify both in the same way (e.g. with or without explicitly defined amplitudes and bandwidths)
When synthesizing a new sound with the function
soundgen()
, a serious challenge is to find the values of
all its many arguments that will together produce the result you want.
Below I discuss three methods for adjusting soundgen settings: (1)
manual matching by ear, (2) matching by acoustic analysis, and (3)
matching by formal optimization.
If the sound you are trying to create exists only in your imagination, there is nothing for it but to tinker with argument values until a satisfactory result is achieved. Even if you have an existing audio recording that you wish to duplicate, the fastest and surest way to find the appropriate soundgen settings - in my experience - is to do it manually, using soundgen_app() and/or typing and editing R scripts with calls to soundgen(). I prefer to work with scripts and match everything by ear, using Audacity for visualization.
There is a separate vignette on manually matching an existing sound. Since it contains a lot of audio files, it is not published with the package, but you can access it on the project’s webpage at http://cogsci.se/soundgen/matching/matching.html. Here is a condensed version:
1. Temporal structure: if the sound consists of several identical syllables, set repeatBout. If syllables are repetitive but not identical, with an overall drift of f0 and formants, set nSyl. Note that sylLen and pauseLen refer to the duration of voiced segments and pauses between them - unvoiced segments do not count. If the syllables are very different, synthesize them one by one with separate calls to soundgen() and then concatenate as described in section 3 (“Combining two sounds”). Biphonic sounds with more than one fundamental frequency can be synthesized separately and overlaid with addVectors().
2. Intonation: type pitch = 440 for flat intonation, pitch = c(440, 300) for a linear slide, pitch = c(300, 440, 300) for a rising-falling contour, or pitch = data.frame(time = c(0, .1, 1), value = c(300, 440, 300)) for more complex contours with values specified at arbitrary time points. For multiple syllables, describe how f0 changes across syllables using pitchGlobal. Remember that you don’t need to manually code every tiny fluctuation of f0: you can also add (regular) vibrato, (irregular) jitter with large jitterLen, or increase the effect of temperature on f0 with tempEffects = list(pitchDriftDep = ..., pitchDriftFreq = ..., pitchDep = ...). If you do want to repeat very precisely the pitch contour of an existing recording, I would recommend extracting a manually corrected pitch contour with pitch_app() or PRAAT.
3. Formants: if in doubt, start with formants = NULL, vocalTract = my-best-guess-in-cm (for humans, vocal tract length is between 10 and 20 cm). If you can hear or see the first few formants, specify them using as few anchors as possible, always starting at F1. For example, for stationary F1-F3 type formants = c(600, 1700, 3000) (F4 and above will be added automatically based on the estimated vocal tract length); for moving F1, type formants = list(f1 = c(500, 700), f2 = 1700, f3 = 3000); for more complicated cases, see the section on formants above. Remember that formant transitions apply to the entire bout, i.e. across multiple syllables if nSyl > 1. If formant tracks are roughly parallel (e.g. all formants descend together), it’s easier to write stationary formants and add something like mouth = c(0.6, 0.4).
4. Nonlinear effects: to add them stochastically, set nonlinBalance to a positive number.
5. Unvoiced component: add turbulent noise with the argument noise. Often the formant structure of turbulent noise is similar enough to the voiced component to leave the default formantsNoise = NULL; if not, specify formantsNoise separately. A bit of breathing provides excellent glue between syllables - set the last value of noise$time to more than sylLen to extend breathing beyond the voiced part.
6. Spectral slope: adjust rolloff and rolloffNoise. Plot the long-term average spectrum of the target and of the candidate sound using seewave::meanspec() and try to match the two spectra. Don’t start with this until you are satisfied that you have got the formants right, because spectral slope depends strongly on formant frequencies. Keep in mind that rolloff is hardly ever stable throughout the sound. It’s very common to have something like rolloff = c(-14, -9, -8, -15) to produce a brighter sound with strong harmonics in the middle of a call and a softer, breathier voice quality at the beginning and end.
7. Amplitude: adjust the overall envelope with ampl, attack with attackLen, and/or amplitude modulation with amDep = ..., amFreq = .... This is best done once you are happy with the other settings, since the amplitude envelope is affected by the chosen values of f0, formants, noise, and rolloff.
8. Stochasticity: finally, adjust temperature and tempEffects = list(...). A sketch assembling all these steps into one call follows after this list.
TIP Every time you change something, call soundgen(...your-pars..., play = TRUE, plot = TRUE) to get immediate visual and auditory feedback
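To make these steps concrete, here is a minimal sketch that touches each of them in a single call; every value below is an illustrative placeholder rather than a recipe for any particular sound:
cand = soundgen(
  nSyl = 2, sylLen = 180, pauseLen = 90, # 1: temporal structure
  pitch = c(300, 440, 300), # 2: intonation
  formants = c(600, 1700, 3000), # 3: formants
  nonlinBalance = 40, jitterDep = 0.5, # 4: nonlinear effects
  noise = list(time = c(0, 250), value = c(-30, -20)), # 5: unvoiced component
  rolloff = c(-14, -9, -15), # 6: spectral slope
  attackLen = 50, # 7: amplitude envelope
  temperature = 0.05, # 8: stochasticity
  play = TRUE, plot = TRUE)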
In addition to manual matching, there are two ways to find the optimal values of control parameters semi-automatically: (1) perform acoustic analysis of the target sound to guide the choice of soundgen settings, and (2) automatically optimize some soundgen settings to match the target. Below are some tools and tips for doing this.
DISCLAIMER: what follows is work in progress, not guaranteed to produce the desired results. Above all, don’t expect a magic bullet that will completely solve the matching problem without any manual intervention
The first thing you might want to do with your target audio recording
is to analyze it acoustically and extract precise measurements of
syllable number and duration, pitch contour, and formant structure. You
can use any tool of your choice to do this, including soundgen’s
functions segment
and analyze
, which are
described in the vignette on acoustic analysis. Once you have the
measurements, you can convert them into appropriate values of soundgen
arguments. An even easier solution is to use the function
matchPars
without optimization (maxIter = 0
),
which will perform a quick acoustic analysis and translate the results
into soundgen settings, as follows:
s087 = soundgen(repeatBout = 3, sylLen = 120, pauseLen = 70,
  pitch = c(300, 200),
  rolloff = -5, play = playback)
# playme(s087) # we hope to reproduce this sound
m1 = matchPars(target = s087,
  samplingRate = 16000,
  maxIter = 0) # no optimization, only acoustic analysis
## [1] "Failed to improve fit to target! Try increasing maxIter."
# ignore the warning about failing to improve the fit: we don't want to optimize yet
# m1$pars contains a list of soundgen settings
s088 = do.call(soundgen, c(m1$pars, list(play = playback, temperature = 0.001)))
# playme(s088)
Without optimization, we simply match soundgen parameters based on
acoustic analysis. In particular, matchPars()
calls
segment()
and analyze()
to get some basic
descriptives of the target sound and to choose the appropriate settings
for soundgen
based on these measurements. If you are very
lucky, this might in fact accurately match the temporal
structure, pitch, and (stationary) formants of your target. Most likely,
it won’t. In particular, for animal vocalizations a better option is
often to estimate the vocal tract length from the dispersion of a few
consecutive formants you can identify on the spectrogram (use
estimateVTL()
) and set
vocalTract = your_estimate, formants = NULL
.
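For example, a minimal sketch (the measured formant frequencies here are made up for illustration):
# estimate VTL from a few formants measured on the spectrogram
vtl = estimateVTL(formants = c(850, 2800, 4600)) # in cm
# synthesize with this vocal tract length and automatic formants
s = soundgen(vocalTract = vtl, formants = NULL, sylLen = 800)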
At this point you can copy-paste your call to soundgen
into the Shiny app and start adjusting these settings in an interactive
environment, rather than from the console. For example, to use the
parameters in m1$pars
, type
call('soundgen', m1$pars)
, remove the “list()” part from
the output, and you have your formula:
call('soundgen', m1$pars)
# copy-paste from the console and remove "list(...)" to get your call to soundgen():
# soundgen(samplingRate = 16000, nSyl = 3, sylLen = 79, pauseLen = 114,
# pitch = list(time = c(0, 0.5, 1), value = c(274, 253, 216)),
# formants = list(f1 = list(freq = 821, width = 122),
# f2 = list(freq = 1266, width = 36),
# f3 = list(freq = 2888, width = 117)))
Load this formula into the Shiny app. To do so, run
soundgen_app()
, click “Load new preset” on the right-hand
side of the screen, copy-paste the formula above (no quotes), and click
“Update sliders”. If all goes well, all the settings should be updated,
so that clicking “Generate” should produce the same sound as
s088 above. Now you can tinker with the settings in the
app, improving them further.
TIP It can be very helpful to have the Shiny app running, while also having access to R console. Start two R sessions to achieve that
Let’s assume that you have a working version of your candidate sound, which resembles the target in terms of its temporal structure, pitch contour, and perhaps even the formant structure. You can also add some non-tonal noise manually in the app, experiment with effects like subharmonics and jitter, and make other modifications. But the number of possible combinations of soundgen settings is enormous, making the process of matching the target sound very time-consuming. You can sometimes speed things up by using formal optimization.
The same function as above, matchPars
, offers a simple
way to optimize several parameters by randomly varying their values,
generating the corresponding sound, and comparing it with the target.
The currently implemented version uses simple hill climbing and is best
regarded as experimental.
m2 = matchPars(target = s087,
  samplingRate = 16000,
  pars = 'rolloff',
  maxIter = 100)
# rolloff should be moving from default (-9) to target (-5):
sapply(m2$history, function(x) {
  paste('Rolloff:', round(x$pars$rolloff, 1),
        '; fit to target:', round(x$sim, 2))
})
do.call(soundgen, c(m2$pars, list(play = playback, temperature = 0.001)))
Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792. https://doi.org/10.3758/s13428-018-1095-7
Fant, G. (1971). Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations (Vol. 2). Walter de Gruyter.
Fitch, W. T., Neubauer, J., & Herzel, H. (2002). Calls out of chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production. Animal Behaviour, 63(3), 407-418.
Hawkins, S., & Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4), 1560-1575.
Johnson, K. (2011). Acoustic and auditory phonetics, 3rd ed. Wiley-Blackwell.
Khodai-Joopari, M., & Clermont, F. (2002). A comparative study of empirical formulae for estimating vowel-formant bandwidths. In Proceedings of the 9th Australian International Conference on Speech, Science, and Technology (pp. 130-135).
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. The Journal of the Acoustical Society of America, 67(3), 971-995.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820-857.
Moore, R. K. (2016). A real-time parametric general-purpose mammalian vocal synthesiser. In INTERSPEECH (pp. 2636-2640).
Stevens, K. (2000). Acoustic phonetics. MIT Press.
Sueur, J. (2018). Sound analysis and synthesis with R. Heidelberg, Germany: Springer.
Tappert, C. C., Martony, J., & Fant, G. (1963). Spectrum envelopes for synthetic vowels. Speech Transm. Lab. Q. Progr. Status Rep., 4, 2-6.
Wilden, I., Herzel, H., Peters, G., & Tembrock, G. (1998). Subharmonics, biphonation, and deterministic chaos in mammal vocalization. Bioacoustics, 9(3), 171-196.