The function soundgen
is intended for the synthesis of
animal vocalizations, including human non-linguistic vocalizations like
sighs, moans, screams, etc. It can also create non-biological sounds
that require precise control over spectral and temporal modulations,
such as special sound effects in computer games or acoustic stimuli for
scientific experiments. Soundgen is NOT meant to be used for
text-to-speech conversion. It can be adapted for this purpose, but
existing specialized tools will probably serve better.
Soundgen uses a parametric algorithm, which means that sounds are synthesized de novo, and the output is completely determined by the values of control parameters, as opposed to concatenating or modifying existing audio recordings. Under the hood, the current version of soundgen generates and filters two sources of excitation: sine waves and white noise.
The rest of this vignette will unpack this last statement and
demonstrate how soundgen can be used in practice. To simplify setting
the control parameters and visualizing the output, the soundgen library
includes an interactive Shiny app. To start the app, type
soundgen_app()
from R or try it online at
cogsci.se/soundgen.html. To
generate sounds from the console, use the function
soundgen
. Each section of the vignette focuses on a
particular aspect of sound generation, both describing the relevant
arguments of soundgen
and explaining how they can be set in
the Shiny app. Note that some advanced features, notably vectorization
of several arguments, are not implemented in the app and are only
accessible from the console.
TIP: this vignette is a hands-on, non-technical tutorial focusing on how to use soundgen in order to synthesize new sounds. For a more rigorous and theoretical discussion, please refer to Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792.
There are several other R packages that offer sound synthesis,
notably tuneR
, seewave
, and
phonTools
. Both seewave
and tuneR
implement straightforward ways to synthesize pulses and square,
triangular, or sine waves as well as noise with adjustable (linear)
spectral slope. You can also create multiple harmonics with both
amplitude and frequency modulation using seewave::synth()
and seewave::synth2()
. There is even a function available
for adding formants and thus creating different vowels:
phonTools::vowelsynth()
. Basic tonal synthesis and many
acoustic manipulations can also be performed using the open-source
program Praat. If these options are sufficient for your needs, you might want to try
these alternatives first.
So why bother with soundgen? First, it takes customization and flexibility of sound synthesis much further. You will appreciate this flexibility if your aim is to produce convincing biological sounds. And second, it’s a higher-level tool with dedicated subroutines for things like controlling the rolloff (relative energy of different harmonics), adding moving formants and antiformants, mixing harmonic and noise components, controlling voice changes over multiple syllables, adding stochasticity to imitate unpredictable voice changes common in biological sound production, and more. In other words, soundgen offers powerful control over low-level acoustic characteristics of synthesized sounds with the benefit of also offering transparent, meaningful high-level parameters intended for rapid and straightforward specification of whole bouts of vocalizing.
Because of this high-level control, you don’t really have to think about the math of sound synthesis in order to use soundgen (although if you do, that helps). This vignette also assumes that the reader has some training in phonetics or bioacoustics, particularly for sections on formants and subharmonics.
Feel free to skip this section if you are only interested in using soundgen, not in how it works under the hood.
Soundgen’s credo is to start with a few control parameters (e.g., the intonation contour, the amount of noise, the number of syllables and their duration, etc.) and to generate a corresponding audio stream, which will sound like a biological vocalization (a bark, a laugh, etc.). The core algorithm for generating a single voiced segment implements the standard source-filter model (Fant, 1971). The voiced component is generated as a sum of sine waves and the noise component as filtered white noise, and both components are then passed through a frequency filter simulating the effect of the human vocal tract. This process can be conceptually divided into three stages: generating the voiced component, generating the noise component, and applying a spectral filter (formants).
Note that soundgen currently implements only sine wave synthesis of voiced fragments. This is different from modeling glottal cycles themselves, as in phonetic models and some popular text-to-speech engines (e.g. Klatt, 1980). Normally multiple glottal cycles are generated simultaneously, with no pauses in between them (no closed phase) and with a continuously changing f0. It is also possible to add a closed phase, in which case each glottal cycle is generated separately, with f0 held stable within each cycle. In future versions of soundgen there may be an option to use a particular parametric model of the glottal cycle as excitation source as an alternative to generating a separate sine wave for each harmonic.
Some form of noise is synthesized in most sound generators. In soundgen noise is created in the frequency domain (i.e., as a spectrogram) and then converted into a time series via inverse FFT. Noise is generated with a flat spectrum up to a certain threshold, followed by user-specified linear rolloff (Johnson, 2012).
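To make this concrete, here is a minimal base-R sketch of the same idea (all numeric values are arbitrary, and soundgen's own implementation is considerably more elaborate): build a magnitude spectrum that is flat up to a threshold, apply a linear rolloff in dB above it, attach random phases, and take the inverse FFT.
samplingRate = 16000
n = 4096                                # frame length, points
freqs = (0:(n / 2)) * samplingRate / n  # frequency of each FFT bin, Hz
flatUpTo = 1200                         # flat spectrum below this threshold, Hz
rolloff_dB_kHz = -6                     # linear rolloff above it, dB/kHz
mag_dB = ifelse(freqs <= flatUpTo, 0,
                (freqs - flatUpTo) / 1000 * rolloff_dB_kHz)
mag = 10 ^ (mag_dB / 20)                # dB to linear magnitude
phase = runif(length(mag), 0, 2 * pi)   # random phase is what makes it noise
phase[c(1, length(phase))] = 0          # keep DC and Nyquist bins real
half = mag * exp(1i * phase)
spec = c(half, Conj(rev(half[2:(length(half) - 1)])))  # conjugate-symmetric
noise = Re(fft(spec, inverse = TRUE)) / n  # back to the time domain
# plot(noise[1:500], type = 'l')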
Note that this STFT-mediated method of adding formants is different from the more traditional convolution, but with multiple formants it is both considerably faster and (arguably) more intuitive. If you are wondering why we don’t simply apply the filter to the rolloff matrix before the iSTFT, this is an annoying consequence of some complexities of the temporal structure of a bout, especially of applying non-stationary filters (moving formants) that span multiple syllables. For the noise component, however, this extra step can be avoided, and we only do iSTFT once.
Having briefly looked at the fundamental principles of sound
generation, we proceed to control parameters. The aim of the following
presentation is to offer practical tips on using soundgen. For further
information on more fundamental principles of acoustics and sound
synthesis, you may find the vignettes in seewave
very
helpful, or you can check out the book on sound synthesis in R by Jerome
Sueur, the author of the seewave
package. Some essential
references are also listed at the end of this vignette, especially those
sources that have inspired particular routines in soundgen.
To generate a sound, you can either type soundgen_app()
to open an interactive Shiny app or call soundgen()
from R
console with manually specified parameters. The app offers nice
visualizations and is more user-friendly if you are not used to
programming, but note that it doesn’t support some advanced features
(e.g., vectorization of some control parameters). An object called
presets
contains a collection of presets that demonstrate
some of the possibilities. More information is available on the
project’s homepage at
http://cogsci.se/soundgen.html.
Audio playback may fail, depending on your platform and installed
software. Soundgen relies on the tuneR library for audio
playback, via a wrapper function called playme()
that
accepts both Wave objects and simple numeric vectors. If
soundgen(play = TRUE)
throws an error, make sure the audio
can be played before you proceed with using soundgen. To do so, save
some sound as a vector first:
sound = soundgen(play = FALSE)
or even simply
sound = rnorm(10000)
. Then try to find a way to play this
vector sound
. You may need to change the default player in
tuneR
or install additional software. See the
seewave
vignette on sound input/output for an in-depth discussion of audio
playback in R. Sueur (2018, p. 100) recommends Windows Media Player on
Windows, AudioUnits for Mac OS, SoX for Linux (the player is called
“play”), or VLC on any platform. I find that “play” from the “sox”
library or “aplay” work well on Linux, and “afplay” on Macs.
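For example, a minimal playback check might look like this (a sketch; adjust samplingRate or the player if it stays silent):
sound = soundgen(play = FALSE)  # or simply: sound = rnorm(10000)
playme(sound, samplingRate = 16000)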
Because of possible errors, audio playback is disabled by default in
the rest of this vignette. To turn it on without changing any code,
simply set the global variable playback
to the appropriate
value for your specific OS, for ex.:
playback = list(TRUE, FALSE, 'vlc', 'my-awesome-player')[[2]]
# TRUE means defaulting to "play" on Linux, "afplay" on Mac,
# and the defaults of tuneR::play on Windows
# FALSE means no sound playback
The basic workflow from R console is as follows:
library(soundgen)
s001 = soundgen(play = playback)  # default sound: a short [a] by a male speaker
# 's001' is a numeric vector - the waveform. You can save it, play it, plot it, ...
# names(presets)  # speakers in the preset library
# names(presets$Chimpanzee)  # presets per speaker
s002 = eval(parse(text = presets$Chimpanzee$Scream_conflict))  # a screaming chimp
# playme(s002)
The basic workflow in the Shiny app is as follows: type
soundgen_app() to start the app. RStudio should
open it in the default web browser (there will be no sound if the app
runs in an RStudio window instead of a browser). Firefox and Chrome are
known to work. Safari will probably fail to play back the generated
audio, although the output can still be exported as a .wav file.
TIP The interactive app soundgen_app()
gives you the
exact R code for calling soundgen()
, which you can
copy-paste into your R environment and generate manually the same sound
as the one you have created in the app. If in doubt about the right
format for a particular argument, you can use the app first, copy-paste
the code into your R console, and modify it as needed. You can also
import an existing formula into the app, adjust the parameters in an
interactive environment, and then export it again. BUT: the app can only
use a single value for many parameters that are vectorized when called
from the command line (rolloff, jitterDep, etc.).
If you need to generate a single syllable without pauses, the only
temporal parameter you have to set is sylLen
(“Syllable
length, ms” in the app). For a bout of several syllables, you have two
options:
nSyl
(“Number of syllables” in the app). Unvoiced
noise is then allowed to fill in the pauses (if noise is longer than the
voiced part), and you can specify an amplitude contour, intonation
contour, and formant transitions that will span the entire bout. For
ex., if the vowel sequence in a three-syllable bout is “uai”, the output
will be approximately “[u] – pause – [a] – pause – [i]”.
s003 = soundgen(formants = 'uai', repeatBout = 1, nSyl = 3, play = playback)
# to replay without re-generating the sound, type "playme(s003)"
repeatBout
(“Repeat bout # times” in the app). This
is the same as calling soundgen
repeatedly with the same
settings or clicking the Generate button in the app several times. If
temperature = 0, you will get exactly the same sound repeated each time,
otherwise some variation will be introduced. For the same “uai” example,
the output will be “[uai] – pause – [uai] – pause – [uai]”.
s004 = soundgen(formants = 'uai', repeatBout = 3, nSyl = 1, play = playback)
# playme(s004)
Like most arguments to soundgen, sylLen
and
pauseLen
can also be vectors. For example, if you want to
synthesize 5 syllables of progressively shorter duration and separated
by increasingly longer pauses, you can write:
s005 = soundgen(nSyl = 5,
                sylLen = c(300, 100),   # linearly decreasing from 300 to 100 ms
                pauseLen = c(50, 150),  # increasing from 50 to 150 ms
                plot = TRUE,
                play = playback)
# playme(s005)
For more complicated changes in the length of syllables or pauses,
you can use the function getSmoothContour
to upsample your
anchors (see “Intonation” for examples) or manually code longer
sequences of values. The length of your input vector doesn’t matter: it
will be up- or downsampled automatically. This also works with all other
vectorized arguments to soundgen (rolloff, jitterDep, vibratoFreq,
etc).
s006 = soundgen(
  nSyl = 10,
  sylLen = c(60, 200, 90, 50, 50),  # quickly up to 200 and down to 50
  pauseLen = c(50, 60, 80, 150),    # growing ~exponentially
  plot = TRUE,
  play = playback
)
As a special case, your values will be used without interpolation if you provide exactly as many as needed:
s007 = soundgen(
  nSyl = 5,
  sylLen = c(300, 100, 400, 50, 100),  # 5 syllables, 5 values
  pauseLen = c(50, 150, 50, 100),      # 4 pauses, 4 values
  plot = TRUE,
  play = playback
)
You can use both repeatBout
and nSyl
simultaneously. The pause between bouts is equal to the length of the
first syllable:
s008 = soundgen(
  repeatBout = 2,
  nSyl = 3,
  sylLen = c(300, 100),
  pauseLen = c(100, 50),
  plot = TRUE,
  play = playback
)
Note that all pauses between syllables have to be positive. A
negative pause (overlap) between bouts is allowed, but you have to
enforce it with invalidArgAction = "ignore"
:
s009 = soundgen(
  repeatBout = 2,
  sylLen = c(300, 100),
  pauseLen = -50,
  plot = TRUE,
  play = playback,
  invalidArgAction = 'ignore'
)
## Warning in validatePars(p, gp, permittedValues, invalidArgAction):
## pauseLen should be between 0 and 1000; override with caution
When we hear a tonal sound such as someone singing, one of its most salient characteristics is intonation or, more precisely, the contour of the fundamental frequency (f0), or, even more precisely, the contour of the physically present or perceptually extrapolated spectral band which is perceived to correspond to the fundamental frequency (pitch). Soundgen literally generates a sine wave corresponding to f0 and several more sine waves corresponding to higher harmonics, so f0 is straightforward to implement. However, how can its contour be specified with as few parameters as possible? The solution adopted in soundgen is to take one or more anchors as input and generate a smooth contour that passes through all anchors.
In the simplest case, all anchors are equidistant, dividing the sound into equal time steps. You can then specify anchors as a numeric vector. For example:
# steady pitch at 440 Hz
s010 = soundgen(pitch = 440, play = playback)

# downward chirp
s011 = soundgen(pitch = 3000:2000, play = playback,
                samplingRate = 44100, pitchSamplingRate = 44100)
# when f0 is high, increase samplingRate and pitchSamplingRate for better quality

# up and down
s012 = soundgen(pitch = c(150, 250, 100), sylLen = 700, play = playback)

# 3rd quarter silent
s013 = soundgen(pitch = c(150, 200, NA, 110),
                sylLen = 700, play = playback)
You can also use a mathematical formula to produce very precise pitch modulation, just check that the values are on the right scale. For example, sinusoidal pitch modulation can be created as follows:
anchors = (sin(1:70 / 3) * .25 + 1) * 350
plot(anchors, type = 'l', xlab = 'Time (points)', ylab = 'Pitch (Hz)')
s014 = soundgen(pitch = anchors, sylLen = 1000, play = playback)
For more flexibility, anchors can also be specified at arbitrary times
using the “anchor format” - a dataframe with two columns:
time
(ms) and value
(in the case of pitch,
this is frequency in Hz). The function that generates smooth contours of
f0 and other parameters is getSmoothContour()
. When you
generate sounds, soundgen()
has an argument
smoothing = list(...)
, where you can put the settings passed
on to getSmoothContour()
. So you do not have to call
getSmoothContour()
explicitly, although sometimes it can be
helpful to do so in order to visualize the curve implied by your
anchors. Time can range from 0 to 1, or it can be specified in ms – it
makes no difference, since the produced contour is rescaled to match
syllable duration.
For example, say we want f0 first to increase sharply from 350 to 700
Hz and then to slowly return to baseline. Time anchors can then be
specified as c(0, .1, 1)
(think of it as “start”, “10%”,
and “end” of the sound), and the arguments len
and
samplingRate
together determine the duration:
len / samplingRate
gives duration in seconds. Values are
processed on a logarithmic (musical) scale if thisIsPitch
is TRUE
, and the resulting curve is smoothed (the default
behavior is to use loess for up to 10 anchors, cubic spline for 11-50
anchors, and linear interpolation for >50 anchors).
A sound with this intonation can be generated as follows:
s015 = soundgen(
  sylLen = 900, play = playback,
  pitch = list(time = c(0, .1, 1),  # or c(0, 30, 300) - in ms
               value = c(350, 700, 350)))
Beware of smoothing! A curve interpolated from a few anchors is not
uniquely defined, and the interpolation algorithm has a major effect on
its shape. The amount of smoothing can be controlled with
loessSpan
and interpol
:
sylLen = 500  # desired syllable length, in ms
samplingRate = 16000
sylLen_points = sylLen / 1000 * samplingRate
anchors = data.frame(time = c(0, .1, 1),
                     value = c(350, 700, 350))
par(mfrow = c(1, 3))
smc1 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  interpol = 'approx',
  thisIsPitch = TRUE, plot = TRUE,
  main = 'No smoothing', samplingRate = samplingRate
)
smc2 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  loessSpan = 0.75,
  thisIsPitch = TRUE, plot = TRUE,
  main = 'loessSpan = .75', samplingRate = samplingRate
)
smc3 = getSmoothContour(
  anchors = anchors,
  len = sylLen_points,
  loessSpan = 1,
  thisIsPitch = TRUE, plot = TRUE,
  main = 'loessSpan = 1', samplingRate = samplingRate
)
par(mfrow = c(1, 1))
# likewise: soundgen(smoothing = list(interpol = 'loess', loessSpan = 1))
To get more complex curves, simply add more anchors. If you are not
satisfied with the smooth curve generated by soundgen()
based on your anchors, you can produce a longer vector (e.g., you could
use analyze()
or pitch_app()
to extract the
pitch contour of an existing recording), and then you can feed soundgen
with this arbitrarily long vector instead of using the anchor format,
ensuring very precise control over the intonation contour.
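As a sketch, here is a made-up 200-point contour standing in for output from analyze() or pitch_app() (the values are invented for illustration):
pitch_manual = 440 * 2 ^ (sin(seq(0, 2 * pi, length.out = 200)) / 6)  # +/- 2 semitones around 440 Hz
s_precise = soundgen(sylLen = 1500, pitch = pitch_manual, play = playback)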
TIP Many arguments to soundgen are vectorized, and most
vectorized arguments understand the “anchor format” you just encountered
above, namely something like
my_argument = list(time = ..., value = ...)
, where time can
be in ms or ~[0, 1]. See ?soundgen for a complete list of anchor-format
arguments and keep in mind two important special cases that use a
slightly different format: formants
and noise
(see below). And remember to check that interpolation looks
reasonable!
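For example, here is a sketch of the anchor format applied to a vectorized argument other than pitch (all values arbitrary): jitter is absent in the first half of the sound and switches on in the second half.
s_anchor = soundgen(sylLen = 1000, pitch = 250,
                    jitterDep = list(time = c(0, .49, .5, 1),
                                     value = c(0, 0, 2, 2)),
                    play = playback)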
The assumption behind specifying an entire contour with a few
discrete anchors is that the contour is smooth and continuous. However,
there may be special occasions when you do want a discontinuity such as
an instantaneous pitch jump. The default behavior of
getSmoothContour()
is to make a jump if two anchors are
closer than one percent of the syllable length (as specified with the
default jumpThres = 0.01
). To make a pitch jump, you thus
provide two values of f0 that are very close in time, for example:
s016 = soundgen(sylLen = 800, plot = TRUE, play = playback,
                pitch = list(time = c(0, .2, .201, .4, 1),
                             value = c(900, 1200, 1800, 2000, 1500)),
                samplingRate = 22050)
## pitchSamplingRate should be much higher than the highest pitch; resetting to 20000 Hz
TIP Given the same anchors, the shape of the resulting curve depends on syllable duration. That’s because the amount of smoothing is adjusted automatically as you change syllable duration. Double-check that all your contours still look reasonable if you change the duration!
To draw f0 contour in the Shiny app, use “Intonation / Intonation syllable” tab and click the intonation plot to add anchors. Soundgen then generates a smooth curve through these anchors. If you click the plot close to an existing anchor, the anchor moves to where you clicked; if you click far from any existing anchor, a new anchor is added. To remove an anchor, double-click it. To go back to a straight line, click the button labeled “Flatten pitch contour”. Exactly the same principles apply to all anchors in soundgen_app (pitch, amplitude, mouth opening, and noise). Note also that all contours are rescaled when the duration changes, with the single exception of negative time anchors for noise (i.e. the length of pre-syllable aspiration does not depend on syllable duration).
If the bout consists of several syllables (nSyl > 1
),
you can also specify the overall intonation over several syllables using
pitchGlobal
(app: “Intonation / Intonation global”). The
global intonation contour specifies the deviation of pitch per syllable
from the main pitch contour in semitones, i.e. 12 semitones = 1 octave.
In other words, it shows how much higher or lower the average pitch of
each syllable is compared to the rest of the syllables. For ex., we can
generate five seagull-like sounds, which have the same intonation
contour within each syllable, but which vary in average pitch spanning
about an octave in an inverted U-shaped curve. Note that the number of
anchors need not equal the number of syllables:
s017 = soundgen(nSyl = 5, sylLen = 200, pauseLen = 140,
                plot = TRUE, play = playback,
                pitch = list(time = c(0, 0.65, 1),
                             value = c(977, 1540, 826)),
                pitchGlobal = list(time = c(0, .5, 1),
                                   value = c(-6, 7, 0)))
# pitchGlobal = c(-6, 7, 0) is equivalent, since time steps are equal
TIP Calling soundgen
with argument
plot = TRUE
produces a spectrogram using a function from
the soundgen package, spectrogram
. Type
?spectrogram
or ?spectrogramFolder
and see the
vignette on acoustic analysis for plotting tips and advanced options.
You can also plot the waveform produced by soundgen
using
any other function, e.g. seewave::spectro().
Vibrato adds frequency modulation (FM) to f0 contour by modifying f0 per glottal cycle. In contrast to irregular jitter and temperature-related random drift, this FM is regular, namely sinusoidal:
# variable, but deterministic vibrato (same every time)
s018 = soundgen(vibratoDep = 0:3, vibratoFreq = 7:5,
                sylLen = 2000, pitch = c(300, 280),
                play = playback, plot = TRUE)

# stochastic vibrato (different every time)
s019 = soundgen(vibratoDep = rnorm(n = 10, mean = .5, sd = .1),
                vibratoFreq = rnorm(n = 10, mean = 5, sd = .5),
                sylLen = 2000, pitch = c(300, 280),
                play = playback, plot = TRUE)
It is a basic principle of soundgen that random variation can be
introduced in the generated sound. This behavior is controlled by a
single high-level parameter, temperature
(app: “Main /
Hypers”). If temperature = 0
, you will get exactly the same
sound by executing the same call to soundgen
repeatedly. If
temperature > 0
, each generated sound will be somewhat
different, even if all the control parameters are exactly the same. In
particular, positive temperature introduces fluctuations in syllable
structure, all contours (intonation, breathing, amplitude, mouth
opening), and many effects (jitter, subharmonics, etc). It also
“wiggles” user-specified formants and adds new formants above the
specified ones at a distance calculated based on the estimated vocal
tract length (see Section “Spectral filter (formants)” below).
Code example:
# the sound is a bit different each time, because temperature is above zero
s020 = soundgen(repeatBout = 5, temperature = 0.3, play = playback)
# Setting repeatBout = 5 is equivalent to:
# for (i in 1:5) soundgen(temperature = 0.3, play = playback)
If you don’t want stochastic behavior, set temperature to zero. But
note that some effects, notably jitter and subharmonics, will then be
added in an all-or-nothing manner: either to the entire sound or not at
all. Also note that additional formants will not be added above the
user-specified ones if temperature is exactly 0. In practice it may be
better to set temperature to a very small positive value like 0.01. You
can also change the extent to which temperature affects different
parameters (e.g., if you want more variation in intonation and less
variation in syllable structure). To do so, use
tempEffects
, which is a list of scaling coefficients that
determine how much different parameters vary at a given temperature.
tempEffects
includes the following scaling
coefficients:
amplDep: random fluctuations of user-specified amplitude anchors across syllables (if nSyl > 1)
amplDriftDep: drift of amplitude mirroring pitch drift
formDisp: irregularity of the dispersion of stochastic formants that are added above user-specified formants (if any) at distances consistent with the specified length of the vocal tract (vocalTract)
formDrift: the amount of random drift of formants
glottisDep: proportion of glottal cycle with closed glottis
noiseDep: random fluctuations of user-specified noise anchors across syllables (if nSyl > 1)
pitchDep: random fluctuations of user-specified pitch anchors across syllables (if nSyl > 1)
pitchDriftDep: amount of slow random drift of f0 (the higher, the more f0 changes)
pitchDriftFreq: frequency of slow random drift of f0 (the higher, the faster f0 changes)
rolloffDriftDep: drift of rolloff mirroring pitch drift
specDep: random fluctuations of rolloff, nonlinear effects, attack
subDriftDep: drift of subharmonic frequency and bandwidth mirroring pitch drift
sylLenDep: random fluctuations of the duration of syllables and pauses between syllables

The default value of each scaling parameter is 1. To enhance a particular component of stochastic behavior, set the corresponding coefficient to a value >1; to remove it completely, set its scaling coefficient to zero.
# despite the high temperature, temporal structure does not vary at all,
# while formants are more variable than the default
s021 = soundgen(repeatBout = 3, nSyl = 2, temperature = .3, play = playback,
                tempEffects = list(sylLenDep = 0, formDrift = 3))
To simplify usage, there are a few other hyper-parameters. They are redundant in the sense that they are not strictly necessary to produce the full range of sounds, but they provide convenient shortcuts by making it possible to control several low-level parameters at once in a coordinated manner. Hyper-parameters are marked “hyper” in the Shiny app.
For example, to imitate the effect of varying body size, you can use
maleFemale
. Since formants
are not specified,
but temperature is above zero, a schwa-like sound with approximately
equidistant formants is generated using vocalTract
(cm) to
calculate the expected formant dispersion:
s022 = soundgen(
  maleFemale = -1,  # male: 100% lower f0, 25% lower formants, 25% longer vocal tract
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
mf = c(-1,  # male: 100% lower f0, 25% lower formants, 25% longer vocal tract
       0,   # neutral (default)
       1)   # female: 100% higher f0, 25% higher formants, 25% shorter vocal tract
# See e.g. http://www.santiagobarreda.com/vignettes/v1/v1.html
s023 = soundgen(
  maleFemale = 0,   # neutral (default)
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
s024 = soundgen(
  maleFemale = 1,   # female: 100% higher f0, 25% higher formants, 25% shorter vocal tract
  formants = NA, pitch = 220, vocalTract = 15, play = playback)
To change the basic voice quality along the breathy-creaky continuum,
use creakyBreathy
. It affects the rolloff of harmonics, the
type and strength of pitch effects (jitter, subharmonics), and the
amount of aspiration noise. For example:
cb = c(-1,   # max creaky
       -.5,  # moderately creaky
       0,    # neutral (default)
       .5,   # moderately breathy
       1)    # max breathy (no tonal component)
silence = rep(0, 1600)
s025 = silence
for (i in cb) {
  s025 = c(s025, soundgen(creakyBreathy = i), silence)
}
# playme(s025)
Use ampl
and amplGlobal
to modulate the
amplitude (loudness) of an individual syllable or a polysyllabic bout,
respectively. In the app, they are found under “Amplitude / Amplitude
syllable” and “Amplitude / Amplitude global”. Note that
ampl
affects only the voiced component, while
amplGlobal
, attackLen
(“Attack length, ms” in
the app), and amDep
(“Amplitude / Amplitude modulation / AM
depth” in the app) affect both the voiced and the unvoiced components.
Avoid attackLen = 0
, since that can cause clicks.
# each syllable has a 10-dB dip in the middle (note the dumbbell shapes
# in the oscillogram under the spectrogram), and there is an overall fade-out
# over the entire bout
s026 = soundgen(
  nSyl = 4,
  ampl = list(time = c(0, .3, 1),  # unequal time steps
              value = c(0, -10, 0)),
  amplGlobal = c(0, -20),  # this fade-out applies to noise as well
  noise = -10,
  plot = TRUE, heights = c(1, 1), play = playback)
The dynamic amplitude range is determined by
dynamicRange
. This parameter sets the minimum level of
loudness, below which components are discarded as essentially silence.
For maximum sound quality, set a high dynamicRange, like 120 dB. This
helps to avoid artifacts like audibly clicking harmonics, but it also
slows down sound generation. The default is 80 dB.
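For example (a sketch; the two calls differ only in dynamicRange):
s_fast = soundgen(sylLen = 500, pitch = 200, dynamicRange = 60,
                  play = playback)  # faster, lower quality
s_hq = soundgen(sylLen = 500, pitch = 200, dynamicRange = 120,
                play = playback)  # slower, higher quality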
Rapid amplitude modulation imitating a trill is implemented by
multiplying the synthesized waveform by a wave with adjustable
amType
(“sine” or “logistic”), shape amShape
(logistic only), frequency amFreq
, and amplitude
amDep
:
s027 = soundgen(
  sylLen = 1000, formants = NA,
  # set the depth of AM (0% = none, 100% = max)
  amDep = c(0, 100),
  # set AM frequency in Hz (vectorized)
  amFreq = c(50, 25),
  # set the shape: 0 = close to sine, -1 = notches, +1 = clicks
  amShape = 0,
  # asymmetrical attack: 20 ms at the beginning and 140 ms at the end
  attackLen = c(20, 140),
  plot = TRUE, heights = c(1, 1), ylim = c(0, 1), windowLength = 100,
  play = playback)
A common special case of modifying the amplitude envelope of a
synthesized or recorded sound is compression, which helps to keep the
amplitude relatively stable throughout the duration of the
signal. There is a separate function for achieving this, namely
compressor()
AKA flatEnv()
:
s = rnorm(500) * seq(1, 0, length.out = 500)
s1 = compressor(s, samplingRate = 1000, plot = TRUE,
                killDC = TRUE, windowLength_points = 50)
Another common modification is to fade the sound in and/or out. One
way to do this is to change the attack (which affects both the beginning
and the end) or to use amplitude anchors. On other occasions, or if your
sound already exists and you want to change it, the way to go about it
is to use a separate function, fade()
. This also gives you
more options, e.g. different attack shapes, while soundgen() defaults to
linear fade-in/out for attack.
# Create a sound with sharp attack
s028 = soundgen(sylLen = 300, pitch = 800, addSilence = 0, attackLen = 10)
# playme(s028)
s029 = fade(s028, fadeIn = 50, fadeOut = 100, samplingRate = 16000,
            shape = 'logistic', steepness = 1, plot = TRUE)
# playme(s029)
# different fades are available: linear, logarithmic, etc.
TIP: attackLen in soundgen is applied only to the voiced source, and before it is filtered (i.e., before formants are added). In case of artifacts, increase attackLen or apply fade() after synthesizing the sound.
Argument formants
(tab “Tract / Formants” in the app)
sets the formants – frequency bands used to filter the excitation
source. Just as an equalizer in a sound player amplifies some
frequencies and dampens others, appropriate filters can be applied to a
tonal sound to make it resemble a human voice saying different vowels.
Formants are created in the frequency domain using all-pole models if
all formant amplitudes are positive and zero-pole models if there are
anti-formants with negative amplitudes (Stevens, 2000, ch. 3).
Using presets for callers M1 and F1, you can directly specify a
string of vowels. When you call soundgen
with
formants = 'aouuuui'
or some such character string, the
values are taken from presets$M1$Formants
(or
presets$F1$Formants
if the speaker is “F1” in the Shiny
app). Formants can remain the same throughout the vocalizations, or they
can move. For example, formants = 'ai'
produces a sound
that goes smoothly from [a] to [i], while formants = 'aaai'
produces mostly [a] with a rapid transition to [i] at the very end.
Argument formantStrength
(“Formant prominence” in the app)
adjusts the overall effect of all formant filters at once, and
formantWidth
scales all bandwidths.
s030 = soundgen(formants = 'ai', play = playback)
s031 = soundgen(formants = 'aaai', play = playback)
Presets give you some rudimentary control over vowels. More subtle control is necessary for animal sounds, as well as for human vowels that are not included in the presets dictionary or for non-default speakers. For such cases you will have to specify at least the frequency of each formant (and optionally, also amplitude, bandwidth, and time stamps for each value). The easiest, and normally sufficient, approach is to specify frequencies only and have soundgen() figure out the appropriate amplitude and bandwidth for each formant. Bandwidth is calculated from frequency using a formula derived from human phonetic research. Namely, above 500 Hz it follows the original formula known as “TMF-1963” (Tappert, Martony, and Fant, 1963), and below 500 Hz it applies a correction to allow for energy losses at low frequencies (Khodai-Joopari & Clermont, 2002). Below 250 Hz the bandwidth starts to decrease again, in a purely empirical attempt to achieve reasonable values even for formant frequencies below ordinary human range. See the internal function soundgen:::getBandwidth() if you are interested and note that for anything but ordinary human voices it may be safer to specify formant bandwidths manually.
freqs = 2 ^ seq(log2(20), log2(20000), length.out = 500)
plot(freqs, soundgen:::getBandwidth(freqs), type = 'l',
     log = 'xy', xlab = 'Center frequency, Hz',
     ylab = 'Bandwidth, Hz',
     main = 'Default formant bandwidths')
abline(v = 250, lty = 3)
abline(v = 500, lty = 3)
Formant amplitudes are normally assumed to be determined by their frequency and bandwidth (see Stevens, 2000), but you can override this by specifying amplitudes explicitly. Note that changing the amplitude (or frequency, or bandwidth) of one formant affects the entire spectrum above it.
For moving formants, provide multiple values, which assumes equal time steps, or specify time points explicitly, where time varies from 0 to 1 (to be scaled appropriately depending on the length of sound). For example:
# shorthand specification with three stationary formants
formants = c(300, 2500, 3200)

# shorthand specification with two moving formants
formants = list(f1 = c(300, 900), f2 = c(2500, 1500))

# full specification with two moving formants and non-default amplitude and bandwidth
formants = list(
  f1 = list(freq = c(300, 900),
            amp = c(30, 10),
            width = 120),
  f2 = list(time = c(0, .2, 1),  # "time" is only needed for non-equidistant anchors
            freq = c(2500, 2400, 1500),
            amp = 30,
            width = c(0, 220, 240)))
Feed these lists into soundgen() to hear what they sound like.
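For example, using the full specification above (sylLen and pitch here are arbitrary):
s_formants = soundgen(sylLen = 800, pitch = 150,
                      formants = formants, play = playback)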
In addition to user-specified formants, higher formants are added
automatically based on the vocal tract length estimated from the
user-specified formant frequencies. The function that estimates
vocal tract length is imaginatively called estimateVTL:
estimateVTL(formants = c(400, 1800, 2550, 4100), plot = TRUE)
## [1] 15.73333
# ~15.7 cm
A more general function called schwa()
both estimates
VTL and allows you to compare measured formant frequencies with those
expected for a neutral schwa sound and perform more sophisticated
operations with formants. See ?schwa
for more details if
you are working with vowels.
schwa(formants = c(820, 1320, 2550, 4100), plot = TRUE)
## $vtl_apparent
## [1] 16.08047
##
## $formantDispersion
## [1] 1100.714
##
## $ff_measured
## [1] 820 1320 2550 4100
##
## $ff_schwa
## [1] 550.3571 1651.0714 2751.7857 3852.5000
##
## $ff_relative
## [1] 48.994160 -20.051914 -7.332901 6.424400
##
## $ff_relative_semitones
## [1] 6.903069 -3.874375 -1.318451 1.077947
##
## $ff_relative_dF
## [1] 0.2449708 -0.3007787 -0.1833225 0.2248540
It is usually useful to allow soundgen to create upper formants
automatically based on the estimated VTL, since upper formants are
typically less perceptually salient. As a result, we don’t want to spend
too much time specifying them manually, but still we don’t want them to
be completely absent, either, since that makes for very weak high
frequencies in the spectrum. An approximation based on standard
open-closed tube models is often good enough for calculating the missing
formant frequencies. You can remove them by setting
temperature = 0
or formantDepStoch = 0
, but
note that without higher formants the entire spectrum loses energy at
higher frequencies:
s032 = soundgen(formants = c(800, 1200), play = playback, plot = TRUE)
s033 = soundgen(formants = c(800, 1200), formantDepStoch = 0,
                play = playback, plot = TRUE)
Another useful method is to specify vocal tract length without any formants. In this case soundgen() approximates a neutral schwa sound for an animal with a vocal tract that looks like a uniform tube of this length. Crude, but often sufficient. A toy example (check presets for some more realistic sounds created using this method):
s034 = soundgen(
  sylLen = 800, formants = NULL, rolloff = -6,
  vocalTract = c(12, 18, 19), formantCeiling = 5,
  play = playback, plot = TRUE)
For very long vocal tracts, it is advisable to increase
formantCeiling
to ~5-10 (i.e., 5 to 10 times the Nyquist
frequency), otherwise the filter dampens high frequencies too much. The
default setting is formantCeiling = 2
, which is much faster
and not too inaccurate for human-length vocal tracts. Note that vocal
tract length can be variable, providing an easy way to create parallel
formant transitions.
Unlike the argument “mouth”, a variable “vocalTract” parameter
recalculates formant bandwidths (unless these are specified manually)
and is thus more accurate, but it requires you to specify a reasonable
vocal tract length, in cm. If you provide both a few formant frequencies
and a variable vocalTract
, formants are synthesized as
specified at the initial value of vocalTract
(that is, at
the very beginning of the sound) and can deviate afterwards as VTL
changes. Careful with such combinations: make sure the initial
value of the VTL is reasonable for these formant frequencies (any VTL is
fine towards the end of the sound - the user-specified formants will
move accordingly). Examples:
formants = list(f1 = c(500, 250), f2 = c(1500, 2800), f3 = 3500, f4 = 4300)
estimateVTL(formants)  # ~13.6 cm, so the initial VTL should be close to 13.6
## [1] 13.58204
s035 = soundgen(
  sylLen = 800, rolloff = -6, formants = formants,
  vocalTract = list(time = c(0, .3, 1), value = c(14, 28, 25)),
  play = playback, plot = TRUE, main = 'Good initial VTL')

# wrong: VTL too high for these formants, causing an overlap
s036 = soundgen(
  sylLen = 800, rolloff = -6, formants = formants,
  vocalTract = list(time = c(0, .3, 1), value = c(18, 28, 25)),
  play = playback, plot = TRUE, main = 'Bad initial VTL')
Sometimes it may be useful to view moving formants before
synthesizing the sound, or you may want to create a spectral envelope
and then apply it to another sound with
transplantFormants()
. The way to create and view a spectral
envelope is to call the function that handles all formant processing
under the hood in soundgen, namely getSpectralEnvelope
:
# plotting directly from getSpectralEnvelope() in spectrogram form
s = getSpectralEnvelope(nr = 1024,  # freq bins in FFT frame (window_length / 2)
                        nc = 50,    # time bins
                        samplingRate = 16000,
                        formants = list(f1 = c(500, 250),
                                        f2 = c(1500, 2800),
                                        f3 = 3500, f4 = 4300),
                        plot = TRUE,
                        dur = 1500,  # just an example
                        colorTheme = 'seewave',
                        lipRad = 6)  # lip radiation, dB/octave
Note that, in addition to formants, lip and nose radiation are also handled by this function (see the next section on mouth opening). This has the effect of boosting high frequencies.
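For example, here is a sketch comparing the spectral envelope with and without lip radiation (the formant frequencies and other settings are arbitrary):
se_lip = getSpectralEnvelope(nr = 512, nc = 50, samplingRate = 16000,
                             formants = c(500, 1500, 2500), lipRad = 6)
se_noLip = getSpectralEnvelope(nr = 512, nc = 50, samplingRate = 16000,
                               formants = c(500, 1500, 2500), lipRad = 0)
plot(as.numeric(rownames(se_lip)), 20 * log10(se_lip[, 1]), type = 'l',
     xlab = 'kHz', ylab = 'dB')  # with lip radiation: more high-frequency energy
lines(as.numeric(rownames(se_noLip)), 20 * log10(se_noLip[, 1]), lty = 2)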
TIP When using the app, you can start with a formant preset by typing in a vowel string, and then you can modify it. This way you don’t have to remember the right format. If you edit the list of formants and nothing in the sound seems to be changing, there may be a misprint, missing comma, etc.
For even more advanced spectral filters, you can specify both formants (poles) and antiformants (zeros). This may be useful if you want to create a nasalized sound. The numbering of formants is arbitrary, as long as they are arranged in the right order. For example, if you want to insert a new formant between F1 and F2 without renaming all higher formants, call it “f1.5” or something like that. It is important to use a non-integer number, since otherwise these additional formants will be inappropriately used to estimate the length of the vocal tract and to add stochastic formants above the ones you specify (that is, if temperature > 0 and vocalTract = NA).
For example, a slow transition from [a] to [a nasalized] might be coded as follows (note that formant f1.7 has negative amplitude, so f1.5 and f1.7 form a pole-zero pair):
formants = list(
  f1 = list(time = c(0, 1), freq = c(880, 900),
            amp = c(25, 15), width = c(80, 120)),
  f1.5 = list(time = c(0, 1), freq = 600,
              amp = c(0, 15), width = 80),   # additional pole
  f1.7 = list(time = c(0, 1), freq = 750,
              amp = c(0, -15), width = 80),  # zero
  f2 = list(time = c(0, 1), freq = c(1480, 1250),
            amp = c(30, 20), width = c(120, 200)),
  f3 = list(time = c(0, 1), freq = c(2900, 3100),
            amp = 25, width = 200))
s037 = soundgen(sylLen = 1500, play = playback, pitch = 140, formants = formants)
spectrogram(s037, samplingRate = 16000, ylim = c(0, 4), contrast = .5,
            windowLength = 10, step = 5, colorTheme = 'seewave')
# long-term average spectrum (less helpful for moving formants but very good for stationary):
# seewave::meanspec(s037, f = 16000, wl = 256)
If you look at the filter towards the end of the sound, you can observe the additional zero-pole pair between the first and second formant:
se = getSpectralEnvelope(nr = 512, nc = 100, formants = formants)
plot(as.numeric(rownames(se)), 20 * log10(se[, ncol(se)]),
     type = 'l', xlab = 'kHz', ylab = 'dB')
In addition to variable vocal tract length (vocalTract
argument), an even easier shortcut for creating parallel formant
transitions without coding all transitions by hand is provided by the
mouth
argument (in the app, tab “Tract / Mouth opening”).
This can be thought of as a hyper-parameter offering an easy way to
define moving formants within a bout (easy because you don’t need to
know the VTL in cm): all formants simply go down relative to specified
values as the mouth closes and rise as it opens (see Moore, 2016).
In addition, an open mouth has lip radiation, which has the effect of
amplifying higher frequencies. Lip radiation is replaced by nose
radiation when the mouth is completely closed, dampening the higher
frequencies, and the vowel is automatically nasalized using a simple
approximation (Hawkins & Stevens, 1985). Basically, with the mouth
closed we switch from a tube open at one end to a tube closed at both
ends and coupled with a (simplified) nasal cavity. Despite being a crude
model of what really happens when a vocalizing animal closes its mouth,
in many cases mouth
can save you a lot of manual coding of
formants. Here is a simple example, with the mouth gradually opening and
closing again:
s038 = soundgen(sylLen = 1200, play = playback, pitch = 140,
                mouth = list(time = c(0, .3, .75, 1),
                             value = c(0, 0, .7, 0)))
spectrogram(s038, samplingRate = 16000,
            ylim = c(0, 4), contrast = .5,
            windowLength = 10, step = 5,
            colorTheme = 'seewave')
TIP Here and elsewhere, I talk about applying soundgen to the
task of synthesizing non-human sounds. It does work, but be aware that
many computational routines are based on human phonetic research, simply
because there is vastly more data available on human vocal production.
For example, formant bandwidths and spectral consequences of
nasalization are estimated based on human phonetics, but it is far from
clear to what extent these equations are applicable to sounds produced
by non-human mammals. Bird calls are again a whole new ball game. And
once you move on to insects or non-biological sounds, just forget about
hyperparameters like mouth
and code everything at the
lowest possible level.
The standard source-filter model (Fant, 1971) assumes that the
vibration of the vocal folds is independent of the configuration of the
supraglottal vocal tract - that is, that the source and filter are
independent. However, in some situations this assumption does not hold,
and the filter can have a noticeable effect on the vibration of the
vocal folds. An example of such source-filter interaction that is
implemented in soundgen is formant locking, in which the fundamental
frequency or a higher harmonic becomes temporarily “locked” to the
frequency of a formant. The relevant parameter is
formantLocking
(0 = none, 1 = the entire sound, vector form
also accepted). See the internal function
soundgen:::lockToFormants()
for more information and
examples.
In humans, spontaneous formant locking is usually observed in high-pitched sounds like screams, although formant matching can also be performed intentionally in order to maximize sonority, as in soprano singing, or to produce unusual vocal effects such as Tuvinian throat singing. In animal vocalizations, formant locking appears to be particularly common when the vocal tract is long and formants are closely spaced (my speculation). For example, elk bugles often contain a series of stepwise pitch jumps from one formant to the next, a bit like this:
s039 = soundgen(
  sylLen = 1500, rolloff = -20,
  pitch = c(500, 1000, 1800, 1700, 500),
  formants = NULL, vocalTract = 55,
  formantLocking = c(0, 1, 1, 1, 1, 0),  # except the beginning and end
  shortestEpoch = 200,  # affects both subharmonics (if any) and formantLocking
  noise = -20,  # just to make formants visible in this example
  temperature = .1,
  samplingRate = 22000, pitchSamplingRate = 22000,
  play = playback, plot = TRUE, ylim = c(0, 8)
)
For some purposes it may be useful to separate the generation of
glottal source (or another source of acoustic excitation) from its
spectral filtering. You may also need to add formants to an existing
waveform. To do so, you can call the helper function
addFormants()
, which normally works under the hood in
soundgen. The algorithm is to take an STFT, multiply the resulting
spectrum by the filter, and then convert it back to time domain via
inverse STFT. The same function can theoretically be used to perform
inverse filtering - that is, to remove formants from a signal - as long
as you can provide a VERY accurate formant filter. See
?addFormants
for more information.
TIP Bewildered by all these formants, antiformants, VTL, etc?
Good news: you can simply lift formants off a real recording and
“transplant” them onto your synthetic sound. See
?transplantFormants
for more information and
examples.
Soundgen produces tonal sounds by means of generating a separate sine
wave for each harmonic. However, it is very tricky to choose the
appropriate strength of each harmonic. The simplest solution is to make
each higher harmonic slightly weaker than the previous one, say by
setting a fixed exponential decay rate from lower to higher harmonics.
The corresponding parameter in soundgen is rolloff
(in the
app, “Source rolloff, dB/octave”). Unfortunately, this is often not
really good enough, necessitating several more control parameters.
Soundgen allows a lot of flexibility when specifying source spectrum.
You can change the basic rolloff of harmonics per octave, producing a
sharper or more gentle decline of energy over frequencies
(rolloffOct
), adjust rolloff depending on f0, so that
high-pitch sounds will have a steeper rolloff (rolloffKHz
),
or add a parabolic correction (rolloffParab
) that affects
the first rolloffParabHarm
harmonics. Working from R
console, the relevant function is getRolloff
. Its arguments
are well-documented: type ?getRolloff
for help. Here is
just a single example:
# strong F0, rolloff with a "shoulder"
r = getRolloff(rolloff = c(-5, -20),  # rolloff parameters are vectorized
               rolloffParab = -10, rolloffParabHarm = 13,
               pitch_per_gc = c(170, 340), plot = TRUE)
# to generate the corresponding sound:
s040 = soundgen(sylLen = 1000, rolloff = c(-5, -20), rolloffOct = 0,
                rolloffParab = -10, rolloffParabHarm = 13,
                pitch = c(170, 340), play = playback)
In the app the relevant parameters are found in the tab “Source /
Rolloff”. To develop an intuition for source spectrum settings, I
recommend practicing with disabled formants in the app (set “Formants
prominence” under “Tract / Formants” to 0). This way you can isolate the
effects of source spectrum and use the preview plot for instant feedback
– it shows the rolloff for the lowest and the highest pitch in your
intonation contour. Rolloff parameters are vectorized, but this
functionality is only available from R console. However, rolloff also
varies over time if temperature is above zero (use
tempEffects$specDep
to control the amount of stochastic
variation of rolloff and other spectral characteristics).
Apart from using rolloff
and related parameters to
control the general shape of excitation spectrum, it is also possible to
control each harmonic individually. Say, you have performed inverse
filtering on a recording to estimate glottal source, giving you the
strength of each harmonic in glottal pulses, and now you wish to hear
the sound corresponding to this glottal source. To do so, use the
rolloffExact
argument and supply a matrix of numeric values
on a scale of 0 to 1. Each row is one harmonic, and the sound will
contain only as many partials as there are rows in
rolloffExact
:
rolloffExact = matrix(c(.1, .2, 1, .02, .2,   # strength of H1-H5 at time 0
                        1, .2, .01, .1, .4),  # strength of H1-H5 at time 1000
                      ncol = 2)
s041 = soundgen(sylLen = 1000, pitch = c(400, 430), formants = NULL,
                rolloffExact = rolloffExact,
                plot = TRUE, ylim = c(0, 4), play = playback)
If you do not want source spectrum to change over time, a single vector instead of a matrix will do:
s042 = soundgen(sylLen = 1000, pitch = c(400, 430), formants = NULL,
                rolloffExact = c(.1, .5, .25, 1, .25, .08, .05, .02),
                plot = TRUE, ylim = c(0, 4), play = playback)
If f0 is very low, as in vocal fry or some animal vocalizations like crocodile roaring or elephant rumbling, individual glottal pulses can be both seen on a spectrogram and perceived as distinct percussion-like acoustic events separated by noticeable pauses. Soundgen can create such sounds by switching to a new mode of production: instead of synthesizing continuous sine waves spanning the entire syllable, it creates each glottal pulse individually (each with its full set of harmonics) and then glues them together with pauses in between.
This is a lot slower than continuous sine wave synthesis and mostly
justified for very low-pitched sounds, since with higher pitch there
will be too few points per glottal cycle to sound convincing without
increasing samplingRate
to astronomical values. Also note
that some spectral artifacts may appear. Example:
# Not a good idea: samplingRate is too low
s043 = soundgen(pitch = c(1500, 800), glottis = 75,
                samplingRate = 16000, play = playback)

# This sounds better but takes a long time to synthesize:
s044 = suppressWarnings(soundgen(
  pitch = c(1500, 800), glottis = 75,
  samplingRate = 80000, play = playback,
  invalidArgAction = 'ignore'))
# NB: invalidArgAction = 'ignore' forces a "weird" samplingRate value
# to be accepted without question

# Now this is what this feature is meant for: vocal fry
s045 = soundgen(
  sylLen = 1500, pitch = c(110, 90), rolloff = -12,
  glottis = c(0, 500),
  jitterDep = 1, shimmerDep = 20,
  # subharmonics not implemented with "glottis"
  play = playback)
spectrogram(s045, samplingRate = 16000, heights = c(1, 1))
Soundgen can add frequency jumps, subharmonics, sidebands
(implemented as a modification of subharmonics or as amplitude
modulation), and approximate deterministic chaos by adding strong jitter
and shimmer. These effects basically make the sound appear noisy / harsh
/ rough / unpredictable / etc. Jitter and shimmer are created by adding
random noise to the periods and amplitudes, respectively, of the
“glottal cycles”. Subharmonics could be created by adding rapid
amplitude and/or frequency modulation, but for maximum flexibility
soundgen uses a different - slightly hacky, but powerful - technique of
literally setting up an additional sine wave for each subharmonic. To
achieve sidebands, the amplitude of each subharmonic is set to be a
function of its distance from the nearest harmonic of the f0 stack; the
rate at which g0 harmonics lose energy away from the nearest f0 harmonic
determines the width of sidebands (subWidth
). This way we
can create either subharmonics or narrow sidebands that vary naturally
as f0 changes over time, producing bifurcations and switching between
different subharmonic regimes (see Wilden et al., 1998).
The main limitation of this approach is that it is too
computationally costly to generate variable numbers of subharmonics for
the entire bout. The solution currently adopted in soundgen is to break
longer sounds into so-called “epochs” with a constant number of
subharmonics in each. The epochs are synthesized separately, trimmed to
the nearest zero crossing, and then glued together with a rapid
crossFade()
. This is suboptimal, since it shortens the
sound and may introduce audible artifacts at transitions between epochs.
shortestEpoch
controls the approximate minimum length of
each epoch. Longer epochs minimize problems with transitions, but the
behavior of subharmonics then becomes less variable since their number
is constrained to be constant within each epoch. NB: with short
shortestEpoch
, add ~20-30 ms per transition to the nominal
sound duration in order to compensate for cross-fading the epochs.
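For example, a sketch contrasting epoch lengths (the values are arbitrary): with a pitch contour that crosses the subharmonic threshold, a shorter shortestEpoch permits more frequent regime changes at the cost of more transitions.
s_epochs = soundgen(sylLen = 1500, pitch = c(400, 600, 350),
                    subFreq = 150, subDep = 20,
                    shortestEpoch = 300,  # try 50 for more regime changes
                    play = playback, plot = TRUE)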
To add nonlinear effects stochastically, you can use
nonlinBalance
, which regulates approximately what
proportion of the sound is affected. At temperature > 0,
nonlinBalance
creates a random walk that divides each
syllable into epochs defined by their regime, using two thresholds to
determine when a new regime begins (see Fitch et al., 2002):
Regime 1: no nonlinear effects. If nonlinBalance = 0%, the whole syllable is in regime 1.
Regime 2: subharmonics only. Note that subharmonics are only added to segments with subFreq < f0 / 2.
Regime 3: subharmonics and jitter. If nonlinBalance = 100%, the whole syllable is in regime 3.
To see any effect, you have to set jitterDep, shimmerDep, and subFreq/subDep/subWidth to some positive values. With nonlinBalance < 100%, the result is a stochastic combination of the three regimes (tonal, subharmonics, subharmonics + jitter + shimmer):
s046 = soundgen(
  sylLen = 1500, pitch = c(170, 420, 400, 190),
  nonlinBalance = 60,
  subDep = 10, jitterDep = 1.5, shimmerDep = 25,
  play = playback, plot = TRUE, ylim = c(0, 5))
To add nonlinearities non-stochastically (exactly where you want
them), keep nonlinBalance
at the default value of 100% and
specify the nature and timing of nonlinearities manually. To add a
single subharmonic between each pair of f-harmonics (period doubling),
set subRatio = 2
, for period tripling,
subRatio = 3
, etc. This number of subharmonics will be
added regardless of pitch changes. Another way is to set
subFreq
(“Target subharmonic frequency, Hz” in the app),
which gives an approximate g0 target, so that the number of subharmonics
will vary with pitch. The amplitude (loudness) of subharmonics is
controlled by subDep
(“Depth of subharmonics”). All these
parameters can vary within a syllable.
Add one subharmonic regardless of pitch:
s047 = soundgen(subRatio = 2, subDep = c(5, 20),
                sylLen = 800, pitch = c(700, 1300), formants = NULL,
                play = playback, plot = TRUE, ylim = c(0, 3))
Set target g0, so that the number of subharmonics depends on pitch:
s048 = soundgen(subFreq = 400, subDep = c(5, 20),
                sylLen = 800, pitch = c(700, 1300), formants = NULL,
                play = playback, plot = TRUE, ylim = c(0, 3))
Sidebands are best demonstrated with high-pitched sounds and low
subharmonic frequencies. For example, chimpanzees emit piercing screams
with narrow subharmonic bands. If we set subFreq
to 75 Hz
and subWidth
to 130 Hz, subharmonics literally form a band
around each harmonic of the main stack, creating a very distinct,
immediately recognizable sound quality:
s049 = soundgen(
  sylLen = 800,
  pitch = list(time = c(0, .3, .9, 1),
               value = c(1200, 1547, 1487, 1154)),
  rolloff = -3, rolloffKHz = 0,
  # gradually increasing width of sidebands at 0-600 ms
  subFreq = 75, subDep = 25,
  subWidth = data.frame(time = c(0, 600, 650, 800),
                        value = c(0, 130, 0, 0)),
  vocalTract = 12, mouth = c(.1, .8, .1),
  temperature = .001,
  pitchSamplingRate = 22050, samplingRate = 22050,
  play = playback, plot = TRUE, ylim = c(0, 5))
Another way to create sidebands is to add amplitude modulation (AM). Perfectly sinusoidal AM creates a simple pair of extra harmonics, while non-sinusoidal AM creates sidebands:
s050 = soundgen(
  sylLen = 800,
  pitch = list(time = c(0, .3, .9, 1),
               value = c(1200, 1547, 1487, 1154)),
  rolloff = -3, rolloffKHz = 0,
  # gradually increasing AM depth at 0-600 ms creates widening sidebands
  amFreq = 75, amShape = .1,
  amDep = list(time = c(0, 600, 650, 800),
               value = c(0, 100, 0, 0)),
  vocalTract = 12, mouth = c(.1, .8, .1),
  temperature = .001,
  pitchSamplingRate = 22050, samplingRate = 22050,
  play = playback, plot = TRUE, ylim = c(0, 5))
TIP The parameters regulating nonlinear effects are vectorized,
so you can write subDep = c(0, 130), jitterDep = c(0, 1)
,
etc., or use the “anchor format” as above (console only, not available
in the app)
As for jitter (regime 3), it wiggles both the f0 and g0 harmonic
stacks, blurring the spectrum. Parameter jitterDep
(“Jitter
depth, semitones” in the app) defines how much the pitch fluctuates,
while jitterLen
(“Jitter period, ms”) defines how rapid
these fluctuations are. Slow jitter with a period of ~50 ms produces the
effect of a shaky, unsteady voice. It may sound similar to a vibrato,
but jitter is irregular. Rapid jitter with a period of ~1 ms, especially
in combination with subharmonics, may be used to imitate deterministic
chaos, which is found in voiced but highly irregular animal sounds such
as barks, roars, noisy screams, etc. This works best for high-pitched
sounds like screams. Shimmer is similar to jitter, except that it
defines random fluctuations of the amplitude rather than frequency. It
is controlled by two arguments, shimmerDep
(percent) and
shimmerLen
(ms).
s051 = soundgen(jitterLen = 40, jitterDep = 1, # shaky voice
  shimmerLen = 30, shimmerDep = 30,
  sylLen = 1000, pitch = c(150, 170),
  play = playback, plot = TRUE, ylim = c(0, 3))
s052 = soundgen(jitterLen = 1, jitterDep = 1, # harsh voice
  shimmerLen = 1, shimmerDep = 10,
  sylLen = 1000, pitch = c(150, 170),
  play = playback, plot = TRUE, ylim = c(0, 3))
Jitter + shimmer + subharmonics work well together. For example, barks of a small, annoying dog can be roughly approximated with this minimal code (ignoring respiration to keep things simple):
s053 = soundgen(repeatBout = 2, sylLen = 140, pauseLen = 100,
  vocalTract = 8, formants = NULL, rolloff = 0,
  pitch = c(1100, 1600, 1100), mouth = c(0, 0.5, 0),
  jitterDep = 1, subDep = 60, play = playback)
Note that jitter is random variation around a target value, but
soundgen also has relatively slow random pitch drift, which is
implemented as a random walk and can thus wander into values quite far
removed from the target. Random pitch drift is added whenever
temperature > 0. Use tempEffects
to regulate its amount
and frequency:
# slight and slow (slightly unsteady voice)
s054 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 1, pitchDriftFreq = .5),
  play = playback, plot = TRUE, ylim = c(0, 2))
# strong and rapid (trembling voice, similar to jitter)
s055 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 5, pitchDriftFreq = 5),
  play = playback, plot = TRUE, ylim = c(0, 2))
# both drift and jitter (trembling voice ending with some "chaos")
s056 = soundgen(
  sylLen = 1500, pitch = 300,
  tempEffects = list(pitchDriftDep = 5, pitchDriftFreq = 5),
  jitterDep = c(0, 0, 0, 2),
  play = playback, plot = TRUE, ylim = c(0, 2))
There is no way to synthesize true deterministic chaos with residual harmonic structure in soundgen. However, there are several roundabout ways to achieve a comparable effect. As already mentioned, strong jitter and shimmer create harsh sounds that are perceptually similar to deterministic chaos, especially for higher f0 values:
s057 = soundgen(
  sylLen = 1200,
  pitch = list(
    time = c(0, 110, 111, 180, 350, 940, 941, 1100, 1200),
    value = c(700, 1150, 1550, 2000, 2240, 1940, 1180, 900, 500)),
  temperature = 0.05, tempEffects = list(pitchDep = 0),
  jitterDep = list(time = c(0, 200, 201, 900, 901, 1200),
                   value = c(0, 0, 1.7, 1.2, 0, 0)),
  formants = c(900, 1300, 3300, 4300),
  attackLen = c(10, 200),
  samplingRate = 44100, play = playback, plot = TRUE, ylim = c(0, 5))
## pitchSamplingRate should be much higher than the highest pitch; resetting to 22400 Hz
Another method is to encode very rapid pitch jumps, say between f0 and a formant or between harmonically related values, like this:
s058 = soundgen(
  sylLen = 1200,
  pitch = list(
    time = c(0, 80, 81, 230, 231, 385,
             # 500 time anchors here - an episode of "chaos"
             seq(385, 850, length.out = 500),
             851, 1020, 1021, 1085),
    value = c(700, 1130, 1000, 1200, 1860, 1840,
              # random f0 jumps b/w 1.2 & 1.8 kHz
              sample(c(1200, 1800), size = 500, replace = TRUE),
              1620, 1540, 1220, 900)),
  temperature = 0.05,
  tempEffects = list(pitchDep = 0),
  jitterDep = .3,
  rolloffKHz = 0, rolloff = 0, formants = c(900, 1300, 3300, 4300),
  samplingRate = 44100, play = playback, plot = TRUE, ylim = c(0, 5))
## pitchSamplingRate should be much higher than the highest pitch; resetting to 18600 Hz
Incidentally, you can use similar tricks for introducing variation in
any soundgen parameter. For example, you can use runif()
or
rnorm()
to randomly vary things like mouth opening, pitch, or amplitude. That’s the best part of working in R!
s059 = silence
# run several times to appreciate the randomness
for (i in 1:5) s059 = c(s059, soundgen(
  sylLen = 800,
  mouth = rnorm(n = 5, mean = .5, sd = .3)
), silence)
# playme(s059)
Sometimes it may be necessary to control precisely the timing of each
nonlinear regime. For example, in an experiment a sound containing
nonlinear effects may need to be synthesized repeatedly, varying one
parameter and preserving everything else, including nonlinear regimes.
To control the timing of nonlinear effects manually, set
nonlinBalance
to 100 (the entire vocalization) and vary the
strength of nonlinear effects with their vectorized “depth”
settings:
s060 = soundgen(
  # nonlinear settings
  jitterDep = c(0, 0, 1.5, .5), shimmerDep = c(0, 0, 15, 5),
  # settings for high precision
  temperature = .001, dynamicRange = 120,
  samplingRate = 22050, pitchSamplingRate = 22050,
  # other settings
  sylLen = 1000, pitch = c(240, 200),
  rolloff = c(-20, -18, -23, -28) + 4, vibratoDep = .2,
  formants = c(800, 1400, 2500, 3700, 5000, 6800),
  noise = list(time = c(0, 340, 900, 1000),
               value = c(-60, -45, -60, -80) + 10),
  rolloffNoise = 0,
  mouth = c(.55, .5, .45, .6),
  play = playback, plot = TRUE, ylim = c(0, 4)
)
TIP For analytical-precision work, set
pitchSamplingRate
to the same (high) value as
samplingRate
, say 22050. By default,
pitchSamplingRate
is much lower to speed up the synthesis,
but then sound duration can vary considerably depending on nonlinear
regimes, especially in sounds like screams with highly variable
pitch.
In the example above jitterDep = c(0, 0, 1.5, .5)
means
that there is no jitter roughly in the first half of the voiced
fragment, then a jitter of 1.5 semitones, and then .5 semitones towards
the end. For more precision, use the “anchor format”. The same goes for
all vectorized parameters: jitterLen, shimmerDep, shimmerLen, subFreq,
subDep, rolloff settings, etc. For example, to turn on jitter abruptly
at 300 ms and turn it off again at 500 ms, and to have shimmer only
between 600 and 800 ms, we can modify the code as follows (it still
won’t be precise down to a millisecond, though):
s061 = soundgen(
  # nonlinear settings
  jitterDep = list(
    time = c(0, 300, 301, 500, 501, 1000),
    value = c(0, 0, 1.5, 1.5, 0, 0)
  ),
  shimmerDep = list(
    time = c(0, 600, 601, 800, 801, 1000),
    value = c(0, 0, 40, 40, 0, 0)
  ),
  # settings for high precision
  temperature = .001, dynamicRange = 120,
  samplingRate = 22050, pitchSamplingRate = 22050,
  # other settings
  addSilence = 0, # easier to check timing
  sylLen = 1000, pitch = c(240, 200),
  rolloff = c(-20, -18, -23, -28), vibratoDep = .2,
  formants = c(800, 1400, 2500, 3700, 5000, 6800),
  noise = list(time = c(0, 340, 900, 1000),
               value = c(-60, -45, -60, -80) + 30),
  rolloffNoise = -8,
  mouth = c(.55, .5, .45, .6),
  play = playback, plot = TRUE, ylim = c(0, 4)
)
Here is another method of controlling the timing of nonlinear
phenomena. When nonlinBalance < 100
, soundgen divides a
sound into different nonlinear regimes by generating a random walk,
which also controls the drift of some other control parameters. Setting
temperature at nearly zero (say, at 0.001) removes random variation in
most control parameters, but the random walk for nonlinear effects still
remains random. To standardize that random walk as well, use
nonlinRandomWalk
. nonlinRandomWalk
should be a
vector containing 0, 1, and 2, where 0 = no nonlinearities, 1 =
subharmonics, and 2 = subharmonics + jitter + shimmer. The number and
order of 0/1/2 determines which nonlinear regime is active at which
time. For example, this will make a sound with no effect in the first
third, subharmonics in the second third, and jitter in the final third
of the total duration:
rw_bin = c(rep(0, 100), rep(1, 100), rep(2, 100))
s062 = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 70, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
Make nonlinRandomWalk
a fairly long vector for greater
precision, i.e. not just c(0, 1)
- because of the way
approx
works, that will NOT split the sound into 50% with
no nonlinear effects and 50% with subharmonics. Instead, write
c(rep(0, 50), rep(1, 50))
or some such.
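For example, here is a sketch of an even 50/50 split between a tonal regime and subharmonics, reusing the settings of s062 above:
rw_5050 = c(rep(0, 50), rep(1, 50))
s = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 70,
  nonlinRandomWalk = rw_5050,
  play = playback, plot = TRUE, ylim = c(0, 4))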
You can also generate an actual random walk and then use it in several sounds to make sure their nonlinear effects have exactly the same timing. For example, here are two sounds with different pitch levels, but identical otherwise, including identical nonlinear regimes:
# set up a random walk (repeat until satisfied with the contour)
rw = getRandomWalk(len = 1000, rw_range = 100,
  trend = c(0.5, -0.5), rw_smoothing = .95)
rw_bin = getIntegerRandomWalk(rw, minLength = 100, plot = TRUE)
# synthesize two sounds with identical nonlinear effects but different f0
s063 = soundgen(sylLen = 800, pitch = 300, temperature = 0.001,
  subFreq = 100, subDep = 20, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
s064 = soundgen(sylLen = 800, pitch = 500, temperature = 0.001,
  subFreq = 100, subDep = 20, jitterDep = 1,
  nonlinRandomWalk = rw_bin,
  play = playback, plot = TRUE, ylim = c(0, 4))
In addition to the tonal (harmonic, voiced) component, which is synthesized as a stack of harmonics (sine waves), soundgen produces turbulent noise (unvoiced component). This noise can be added to the voiced component to create breathing, sniffing, snuffling, hissing, gargling, etc. It is often appropriate to include at least some noise in synthetic vocalizations, if only because it is more natural to have noise instead of harmonics in the upper part of the source spectrum.
The perceptual quality of turbulent noise depends on its spectral
composition, which is controlled by two soundgen arguments:
formantsNoise
and rolloffNoise
(in the app,
use “Tract / Unvoiced type”). Noise is generated as white noise with
spectral rolloff given by rolloffNoise
(“Noise rolloff,
dB/octave” in the app) above a certain cutoff value
(noiseFlatSpec, the default is currently 1200 Hz). The
timing of the unvoiced component relative to the voiced component is
controlled by the argument noise
, which is discussed in the
next section. There are two basic types of turbulent noise in
soundgen:
1. Aspiration noise: the spectral envelope is shaped by the voiced component’s formants together with the rolloffNoise setting, while formantsNoise is NULL or NA. This is useful for adding noise that originates deep in the throat, close to the vocal cords. To generate breathing, specify noise, but leave formantsNoise blank (NA, which is its default value). Soundgen then assumes that the unvoiced component should have the same formant structure as the voiced component.
s065 = soundgen(
  sylLen = 500,
  noise = list(time = c(0, 800), value = c(-20, -10)),
  formantsNoise = NA, # breathing - same formants as for voiced
  play = playback, plot = TRUE)
# observe that the voiced and unvoiced components have exactly the same formants
2. Other turbulent noise (e.g. hissing): specify its formant structure (formantsNoise) manually in exactly the same format as for the voiced component (formants).
s066 = soundgen(
  sylLen = 200, pitch = c(150, 120),
  noise = list(time = c(180, 250, 400), value = c(-20, -10, -50)),
  # specify noise filter ≠ voiced filter to get ~[s]
  formantsNoise = list(f1 = list(freq = 7000, amp = 40, width = 1500)),
  rolloffNoise = 0,
  play = playback, plot = TRUE)
# observe that the voiced and unvoiced components have different formants
TIP: pitch = NA
or NULL
removes the
voiced component, so that only turbulent noise is synthesized. In the
app, untick the box Intonation / Intonation syllable / “Generate voiced
component?”
If formantsNoise = NA
or NULL
(i.e., if
this is aspiration noise), formant structure is calculated based on
vocal tract length, and then extra stochastic formants are added as
usual. For example, to create simple sighs, you can just specify the
length of your creature’s vocal tract:
s067 = soundgen(
  vocalTract = 15.5, # ~human throat (15.5 cm)
  formants = NULL, attackLen = 200, play = playback,
  noise = list(time = c(0, 800), value = c(40, 40)))
# NB: since there is no voiced component, we control syllable length
# by specifying the appropriate noise$time, in this case 0 to 800 ms
s068 = soundgen(
  vocalTract = 30, # a large animal
  formants = NULL, attackLen = 200, play = playback,
  sylLen = 800, noise = 40) # another way to specify the length
# NB: voiced component is not generated if noise$value >= 40 dB
s069 = soundgen(
  vocalTract = 100, invalidArgAction = 'ignore', # a whale
  formants = NULL, attackLen = 200, play = playback,
  sylLen = 800, pitch = NULL, noise = 0)
# Another way to remove the voiced component is to write pitch = NULL
In contrast, if formantsNoise
is specified explicitly
(i.e., if this is not aspiration noise), the noise is by default
NOT enriched with stochastically added formants. To avoid losing all
high-frequency energy in your noise, make sure you add a sufficient
number of formants in formantsNoise
, ideally all the way up
to Nyquist frequency (half the sampling rate). Alternatively, you can
explicitly specify vocalTract
, and then extra formants will
be added to the unvoiced component. Compare:
# only two specified formants
s070 = soundgen(pitch = NULL,
  formantsNoise = c(1000, 2000),
  noise = 40, sylLen = 800,
  play = playback, plot = TRUE)
# two specified formants plus extra formants based on vocalTract
s071 = soundgen(vocalTract = 15.5,
  pitch = NULL,
  formantsNoise = c(1000, 2000),
  noise = 40, sylLen = 800,
  play = playback, plot = TRUE)
The excitation source for the unvoiced component can be synthesized
as white noise (if rolloffNoise = 0
) or as turbulent noise
with a spectrum that linearly (not exponentially!) loses power above
noiseFlatSpec
(the default is 1200 Hz). The parameter
rolloffNoise
thus controls the source spectrum of the
unvoiced component:
s072 = soundgen(vocalTract = 17.5,
  noise = 40, rolloffNoise = c(5, -20),
  formants = NULL, attackLen = 200,
  play = playback, plot = TRUE)
# NB: noise amplitude may change as rolloffNoise changes
In the Shiny app, the tab “Source / Unvoiced timing” is for specifying the amplitude contour of the unvoiced component. In soundgen, the relevant argument is noise = data.frame(time = ..., value = ...). It sets the timing and loudness of turbulent noise relative to the voiced component of a typical syllable. Starting from soundgen v1.4, there are three options for how the amplitudes of the voiced and unvoiced components are compared:
1. If noiseAmpRef = 'f0', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the first harmonic (f0) of the voiced component. Until soundgen v1.4, this was the only available option. This setting makes the balance between harmonics and noise dependent on the source spectrum (rolloff settings) and the formants of both components. Example:
s073 = soundgen(noiseAmpRef = 'f0', rolloff = -1,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s074 = soundgen(noiseAmpRef = 'f0', rolloff = -15,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
2. If noiseAmpRef = 'source', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the unfiltered voiced component (“glottal source”). In other words, you are specifying how loud the noise is relative to the source of excitation, but the actual balance between harmonics and noise can vary depending on the formant structure. Example:
# Harmonics-noise balance doesn't depend on rolloff...
s075 = soundgen(noiseAmpRef = 'source', rolloff = -15, rolloffNoise = 0,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s076 = soundgen(noiseAmpRef = 'source', rolloff = -1, rolloffNoise = -20,
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
# ...but it does depend on the formant structure
s077 = soundgen(noiseAmpRef = 'source', formants = 'a',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s078 = soundgen(noiseAmpRef = 'source', formants = 'u',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
3. If noiseAmpRef = 'filtered', noise$value gives the maximum amplitude of noise relative to the maximum amplitude of the filtered voiced component (after adding formants). The balance between harmonics and noise therefore doesn’t depend on either rolloff or formants. This is the default option. Example:
s079 = soundgen(noiseAmpRef = 'filtered', formants = 'a',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
s080 = soundgen(noiseAmpRef = 'filtered', formants = 'u',
  noise = list(time = c(-100, 400), value = c(0, 0)),
  play = playback)
TIP If you need a very precise balance of harmonics and noise based on another normalization (RMS amplitude, minimum or median instead of maximum amplitude, etc.), you can always synthesize the harmonic and noise components separately and then simply add them up at whatever amplitudes you like
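For example, here is a minimal sketch of such manual mixing that equalizes the RMS amplitude of the two components before adding them (the factor of 0.5 and all synthesis settings are arbitrary):
# synthesize the two components separately
voiced = soundgen(sylLen = 500, pitch = 250, temperature = 0.001)
unvoiced = soundgen(sylLen = 500, pitch = NA, vocalTract = 15.5,
  formants = NULL, noise = 40, temperature = 0.001)
# pad the shorter component with zeros to equal length
len = max(length(voiced), length(unvoiced))
voiced = c(voiced, rep(0, len - length(voiced)))
unvoiced = c(unvoiced, rep(0, len - length(unvoiced)))
# mix at equal RMS, with noise at half the RMS of the voiced part
rms = function(x) sqrt(mean(x ^ 2))
mix = voiced / rms(voiced) + 0.5 * unvoiced / rms(unvoiced)
mix = mix / max(abs(mix)) # rescale to [-1, 1] to avoid clipping
# playme(mix)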
If you want noise to be time-locked to the voiced component, make the
argument noise
a numeric vector. Anchor format with
time = c(), value = c()
is more flexible, but note that it
may cause the noise and voiced components to be slightly out of sync
(which may be useful if you want noise to extend beyond the voiced
segment).
s081 = soundgen(nSyl = 2,
  noise = c(-10, 0),
  plot = TRUE, ylim = c(0, 4), play = playback)
Turbulent noise is allowed to fill the pauses between syllables, but not between bouts. For example, in this two-syllable bout noise carries over after the end of each voiced component, since syllable duration is 120 ms and the last breathing time anchor is 209 ms:
s082 = soundgen(
  nSyl = 2, sylLen = 120, pauseLen = 120,
  temperature = 0.001, rolloffNoise = -2,
  noise = list(time = c(39, 56, 209),
               value = c(-40, 0, -20)),
  formants = list(f1 = c(860, 530), f2 = c(1280, 2400)),
  formantsNoise = list(f1 = c(420, 1200)),
  plot = TRUE, ylim = c(0, 4), play = playback)
Note that in the previous example formantsNoise
defines
the change of filter for the unvoiced components over the entire bout,
i.e. across multiple syllables. This is similar to the way
formants
define the global change in formants across
syllables. In contrast, if you have multiple bouts with one syllable in
each, the change of unvoiced filter plays out within each bout, and the
pause between the bouts is counted from the end of the unvoiced
component, without any overlap between bouts. Compare the example above
to the following (the only change is to use repeatBout
instead of nSyl
). Observe the behavior of
formantsNoise
(moving within each bout) and the duration of
the pause between the syllables (~30 ms) and between the bouts (120
ms):
s083 = soundgen(
  repeatBout = 2, nSyl = 2,
  sylLen = 120, pauseLen = 120,
  temperature = 0.001, rolloffNoise = -2,
  noise = list(time = c(39, 56, 209),
               value = c(-40, 0, -20)),
  formants = list(f1 = c(860, 530), f2 = c(1280, 2400)),
  formantsNoise = list(f1 = c(420, 1200)),
  plot = TRUE, ylim = c(0, 4), play = playback)
Both the timing and the amplitude of noise anchors are defined
relative to the voiced component. Because noise can extend beyond voiced
fragments, however, time anchors for noise MUST be specified in ms
(unlike all the other contours, which also accept time anchors on any
arbitrary scale, say 0 to 1). If the noise starts before the voiced
part, the first time anchor will be negative. This is easier to
visualize in the app, which provides a preview. From R console, you can
also preview the noise amplitude contour implied by your anchors by
calling getSmoothContour
, for example:
a = getSmoothContour(anchors = list(time = c(-50, 200, 300),
                                    value = c(-80, 20, -80)),
  voiced = 200,
  normalizeTime = FALSE, # keep time in ms
  plot = TRUE, ylim = c(-80, 40), main = '')
TIP: if the voiced part is shorter than
permittedValues['sylLen', 'low']
, it is not synthesized at
all, so you only get the unvoiced component (if any). The voiced part is
also not synthesized if the noise is at its loudest, namely
permittedValues['noiseAmpl', 'high']
(40 dB)
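To look up these thresholds from the console:
permittedValues['sylLen', ] # default, low, and high values
permittedValues['noiseAmpl', ] # the 'high' value is 40 dB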
To achieve a complex vocalization, sometimes it may be necessary - or
easier - to synthesize two or more sounds separately and then combine
them. If the components are strictly consecutive, you can simply
concatenate them with c()
. If there is no silence in between, it is safer to use crossFade(), since simple concatenation can introduce transients (clicks) at the junction:
par(mfrow = c(1, 2))
sound1 = sin(2 * pi * 1:5000 * 100 / 16000) # pure tone, 100 Hz
sound2 = sin(2 * pi * 1:5000 * 200 / 16000) # pure tone, 200 Hz
# simple concatenation
comb1 = c(sound1, sound2)
# playme(comb1) # note the click
plot(comb1[4000:5500], type = 'l', xlab = '', ylab = '')
# note the abrupt transition, which creates the click
# spectrogram(comb1, 16000)
# cross-fade
comb2 = crossFade(sound1, sound2, samplingRate = 16000, crossLen = 50)
# playme(comb2) # no click
plot(comb2[4000:5500], type = 'l', xlab = '', ylab = '')
# gradual transition
# spectrogram(comb2, 16000)
par(mfrow = c(1, 1))
Here is a more elaborate example, in which two components of the same syllable are so different that it’s easier to synthesize them separately and then cross-fade, rather than to try and find a set of parameters that will generate the entire syllable in one go:
cow1 = soundgen(sylLen = 1400,
  pitch = list(time = c(0, 11/14, 1),
               value = c(75, 130, 200)),
  temperature = 0.1,
  rolloff = -6, rolloffOct = -3, rolloffParab = 12,
  mouthOpenThres = 0.6,
  formants = NULL, vocalTract = 36.5,
  mouth = list(time = c(0, 0.82, 1),
               value = c(0.6, 0, 1)),
  noise = list(time = c(0, 1400),
               value = c(-45, -45)),
  rolloffNoise = -4, addSilence = 0)
cow2 = soundgen(sylLen = 310, pitch = c(359, 359),
  temperature = 0.05,
  subFreq = 150, subDep = 70, jitterDep = 1.3,
  rolloff = -6, rolloffOct = -3, rolloffKHz = -0,
  formants = NULL, vocalTract = 36.5,
  noise = list(time = c(0, 26, 317, 562),
               value = c(-80, -33, -32, -80)),
  rolloffNoise = -6,
  attackLen = 0, addSilence = 0)
s084 = crossFade(cow1 * 3, cow2, # adjust the relative volume by scaling
  samplingRate = 16000, crossLen = 150)
# playme(s084, 16000)
spectrogram(s084, 16000, ylim = c(0, 4))
If you want the two sounds to overlap without a cross-fade, you can
use addVectors()
, which pads both waveforms with zeros to the same length and adds them together at the specified insertion point. Note
that in this case cross-fading is not appropriate, so it may be safer to
apply fade-in/out to both sounds to soften the attack. For example, here
is how to add some chirping of birds in the background:
samplingRate = 50000 # >10 times the highest pitch
sound1 = soundgen(sylLen = 700, pitch = 250:180,
  formants = 'aaao', addSilence = 100,
  samplingRate = samplingRate, play = playback)
sound2 = soundgen(nSyl = 2, sylLen = 150,
  pitch = 4300:2200, attackLen = 10,
  formants = NA, temperature = .001,
  pitchCeiling = samplingRate, pitchSamplingRate = samplingRate,
  addSilence = 0, play = playback)
## Resetting samplingRate to 50000 Hz because of high pitch
insertionTime = .1 + .15 # silence + 150 ms
insertionPoint = insertionTime * samplingRate
s085 = addVectors(sound1,
  sound2 * .05, # to make sound2 quieter relative to sound1
  insertionPoint = insertionPoint)
# sound1 and sound2 have attack of 50 and 10 ms, so no clicks
# playme(s085, samplingRate)
spectrogram(s085, samplingRate, windowLength = 10, ylim = c(0, 5),
  contrast = .5, colorTheme = 'seewave')
Sometimes it is desirable to combine the characteristics of two different stimuli, producing some kind of intermediate form - a hybrid or blend. This technique is called morphing, and it is employed regularly and successfully with visual stimuli, but not so often with sounds, because it turns out to be rather tricky to morph audio. Since soundgen creates sounds parametrically, however, morphing becomes much more straightforward: all we need to do is define the rules for interpolating between all control parameters. For example, say we have sound A (100 ms) and sound B (500 ms), which only differ in their duration. To morph them, we could generate five otherwise identical sounds that are 100, 200, 300, 400, and 500 ms long, giving us the originals and three equidistant intermediate forms - that is, if we assume that linear interpolation is the natural way to take perceptually equal steps between parameter values.
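As a minimal sketch of this linear-interpolation assumption, here are the five duration morphs from the example above (all other soundgen settings are left at their defaults):
durations = seq(100, 500, length.out = 5) # 100, 200, 300, 400, 500 ms
morphs = lapply(durations, function(d)
  soundgen(sylLen = d, temperature = 0.001))
# playme(morphs[[3]]) # the middle intermediate form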
In practice this assumption is often unwarranted. For example, the
natural scale for pitch is log-transformed: the perceived distance
between 100 Hz and 200 Hz is 12 semitones, while from 200 Hz to 300 Hz
it is only 7 semitones. To make pitch values equidistant, we would need
to think in terms of semitones, not Hz. For other soundgen parameters it
is hard to make an educated guess about the natural scale, so the most
appropriate interpolation rules remain obscure. For best results,
morphing should be performed by hand, pre-testing each parameter of
interest and creating the appropriate formulas for each morph. However,
for a “quick fix” there is an in-built function, morph
.
morph
takes two calls to soundgen
(as a
character string or a list of arguments) and creates several morphs
using linear interpolation for all parameters except pitch and formant
frequencies, which are log-transformed prior to interpolation and then
exponentiated to go back to Hz. The morphing algorithm can also deal
with arbitrary contours, either by taking a weighted mean of each curve
(method = 'smooth'
) or by attempting to match and morph
individual anchors (method = 'perAnchor'
):
a = data.frame(time = c(0, .2, .9, 1), value = c(100, 110, 180, 110))
b = data.frame(time = c(0, .3, .5, .8, 1), value = c(300, 220, 190, 400, 350))
par(mfrow = c(1, 3))
plot(a, type = 'b', ylim = c(100, 400), main = 'Original curves')
points(b, type = 'b', col = 'blue')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'smooth',
  plot = TRUE, main = 'Morphing curves')
m = soundgen:::morphDF(a, b, nMorphs = 15, method = 'perAnchor',
  plot = TRUE, main = 'Morphing anchors')
par(mfrow = c(1, 1))
Here is an example of morphing the default neutral [a] into a dog’s bark:
m = suppressMessages(morph(formula1 = list(repeatBout = 2),
  # equivalently: formula1 = 'soundgen(repeatBout = 2)',
  formula2 = presets$Misc$Dog_bark,
  nMorphs = 5, playMorphs = playback))
# use $formulas to access formulas for each morph, $sounds for waveforms
# m$formulas[[4]]
# playme(m$sounds[[3]])
s086 = c(unlist(m$sounds))
# playme(s086)
TIP Morphing a completely unvoiced sound into a voiced one is currently not implemented. Add a very quiet voiced component to avoid glitches. Also try to make formants and formantsNoise compatible in both formulas: either leave both NULL or specify both in the same way (e.g. with or without explicitly defined amplitudes and bandwidths)
When synthesizing a new sound with the function
soundgen()
, a serious challenge is to find the values of
all its many arguments that will together produce the result you want.
Below I discuss three methods for adjusting soundgen settings: (1)
manual matching by ear, (2) matching by acoustic analysis, and (3)
matching by formal optimization.
If the sound you are trying to create exists only in your imagination, there is nothing for it but to tinker with argument values until a satisfactory result is achieved. Even if you have an existing audio recording that you wish to duplicate, the fastest and surest way to find the appropriate soundgen settings - in my experience - is to do it manually, using soundgen_app() and/or typing and editing R scripts with calls to soundgen(). I prefer to work with scripts and match everything by ear, using Audacity for visualization.
There is a separate vignette on manually matching an existing sound. Since it contains a lot of audio files, it is not published with the package, but you can access it on the project’s webpage at http://cogsci.se/soundgen/matching/matching.html. Here is a condensed version:
1. Temporal structure: if the sound consists of several identical syllables, set repeatBout. If syllables are repetitive but not identical, with an overall drift of f0 and formants, set nSyl. Note that sylLen and pauseLen refer to the duration of voiced segments and pauses between them - unvoiced segments do not count. If the syllables are very different, synthesize them one by one with separate calls to soundgen() and then concatenate as described in section 3 (“Combining two sounds”). Biphonic sounds with more than one fundamental frequency can be synthesized separately and overlaid with addVectors().
2. Intonation: type pitch = 440 for flat intonation, pitch = c(440, 300) for a linear slide, pitch = c(300, 440, 300) for a rising-falling contour, or pitch = data.frame(time = c(0, .1, 1), value = c(300, 440, 300)) for more complex contours with values specified at arbitrary time points. For multiple syllables, describe how f0 changes across syllables using pitchGlobal. Remember that you don’t need to manually code every tiny fluctuation of f0: you can also add (regular) vibrato, (irregular) jitter with large jitterLen, or increase the effect of temperature on f0 with tempEffects = list(pitchDriftDep = ..., pitchDriftFreq = ..., pitchDep = ...). If you do want to repeat very precisely the pitch contour of an existing recording, I would recommend extracting a manually corrected pitch contour with pitch_app() or PRAAT.
3. Formants: if in doubt, start with formants = NULL, vocalTract = my-best-guess-in-cm (for humans, vocal tract length is between 10 and 20 cm). If you can hear or see the first few formants, specify them using as few anchors as possible, always starting at F1. For example, for stationary F1-F3 type formants = c(600, 1700, 3000) (F4 and above will be added automatically based on the estimated vocal tract length); for moving F1, type formants = list(f1 = c(500, 700), f2 = 1700, f3 = 3000); for more complicated cases, see the section on formants above. Remember that formant transitions apply to the entire bout, i.e. across multiple syllables if nSyl > 1. If formant tracks are roughly parallel (e.g. all formants descend together), it’s easier to write stationary formants and add something like mouth = c(0.6, 0.4).
4. Nonlinear effects: to add them stochastically, set nonlinBalance to a positive number.
5. Unvoiced component: add turbulent noise with the argument noise. Often the formant structure of turbulent noise is similar enough to the voiced component to leave the default formantsNoise = NULL; if not, specify formantsNoise separately. A bit of breathing provides excellent glue between syllables - set the last value of noise$time to more than sylLen to extend breathing beyond the voiced part.
6. Spectral slope: adjust rolloff and rolloffNoise. Plot the long-term average spectrum of the target and of the candidate sound using seewave::meanspec() and try to match the two spectra. Don’t start with this until you are satisfied that you have got the formants right, because spectral slope depends strongly on formant frequencies. Keep in mind that rolloff is hardly ever stable throughout the sound. It’s very common to have something like rolloff = c(-14, -9, -8, -15) to produce a brighter sound with strong harmonics in the middle of a call and a softer, breathier voice quality at the beginning and end.
7. Amplitude: adjust the overall envelope with ampl, attack with attackLen, and/or amplitude modulation with amDep = ..., amFreq = .... This is best done once you are happy with the other settings, since the amplitude envelope is affected by the chosen values of f0, formants, noise, and rolloff.
8. Stochasticity: finally, adjust temperature and tempEffects = list(...). A sketch assembling all these steps into one call follows after this list.
TIP Every time you change something, call soundgen(...your-pars..., play = TRUE, plot = TRUE) to get immediate visual and auditory feedback
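To make these steps concrete, here is a minimal sketch that touches each of them in a single call; every value below is an illustrative placeholder rather than a recipe for any particular sound:
cand = soundgen(
  nSyl = 2, sylLen = 180, pauseLen = 90, # 1: temporal structure
  pitch = c(300, 440, 300), # 2: intonation
  formants = c(600, 1700, 3000), # 3: formants
  nonlinBalance = 40, jitterDep = 0.5, # 4: nonlinear effects
  noise = list(time = c(0, 250), value = c(-30, -20)), # 5: unvoiced component
  rolloff = c(-14, -9, -15), # 6: spectral slope
  attackLen = 50, # 7: amplitude envelope
  temperature = 0.05, # 8: stochasticity
  play = TRUE, plot = TRUE)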
In addition to manual matching, there are two ways to find the optimal values of control parameters semi-automatically: (1) perform acoustic analysis of the target sound to guide the choice of soundgen settings, and (2) automatically optimize some soundgen settings to match the target. Below are some tools and tips for doing this.
DISCLAIMER: what follows is work in progress, not guaranteed to produce the desired results. Above all, don’t expect a magic bullet that will completely solve the matching problem without any manual intervention
The first thing you might want to do with your target audio recording
is to analyze it acoustically and extract precise measurements of
syllable number and duration, pitch contour, and formant structure. You
can use any tool of your choice to do this, including soundgen’s
functions segment
and analyze
, which are
described in the vignette on acoustic analysis. Once you have the
measurements, you can convert them into appropriate values of soundgen
arguments. An even easier solution is to use the function
matchPars
without optimization (maxIter = 0
),
which will perform a quick acoustic analysis and translate the results
into soundgen settings, as follows:
s087 = soundgen(repeatBout = 3, sylLen = 120, pauseLen = 70,
  pitch = c(300, 200),
  rolloff = -5, play = playback)
# playme(s087) # we hope to reproduce this sound
m1 = matchPars(target = s087,
  samplingRate = 16000,
  maxIter = 0) # no optimization, only acoustic analysis
## [1] "Failed to improve fit to target! Try increasing maxIter."
# ignore the warning about failing to improve the fit: we don't want to optimize yet
# m1$pars contains a list of soundgen settings
s088 = do.call(soundgen, c(m1$pars, list(play = playback, temperature = 0.001)))
# playme(s088)
Without optimization, we simply match soundgen parameters based on
acoustic analysis. In particular, matchPars()
calls
segment()
and analyze()
to get some basic
descriptives of the target sound and to choose the appropriate settings
for soundgen
based on these measurements. If you are very
lucky, this might in fact accurately match the temporal
structure, pitch, and (stationary) formants of your target. Most likely,
it won’t. In particular, for animal vocalizations a better option is
often to estimate the vocal tract length from the dispersion of a few
consecutive formants you can identify on the spectrogram (use
estimateVTL()
) and set
vocalTract = your_estimate, formants = NULL
.
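For example, a minimal sketch (the measured formant frequencies here are made up for illustration):
# estimate VTL from a few formants measured on the spectrogram
vtl = estimateVTL(formants = c(850, 2800, 4600)) # in cm
# synthesize with this vocal tract length and automatic formants
s = soundgen(vocalTract = vtl, formants = NULL, sylLen = 800)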
At this point you can copy-paste your call to soundgen
into the Shiny app and start adjusting these settings in an interactive
environment, rather than from the console. For example, to use the
parameters in m1$pars
, type
call('soundgen', m1$pars)
, remove the “list()” part from
the output, and you have your formula:
call('soundgen', m1$pars)
# copy-paste from the console and remove "list(...)" to get your call to soundgen():
# soundgen(samplingRate = 16000, nSyl = 3, sylLen = 79, pauseLen = 114,
# pitch = list(time = c(0, 0.5, 1), value = c(274, 253, 216)),
# formants = list(f1 = list(freq = 821, width = 122),
# f2 = list(freq = 1266, width = 36),
# f3 = list(freq = 2888, width = 117)))
Load this formula into the Shiny app. To do so, run
soundgen_app()
, click “Load new preset” on the right-hand
side of the screen, copy-paste the formula above (no quotes), and click
“Update sliders”. If all goes well, all the settings should be updated,
so that clicking “Generate” should produce the same sound as
s088 above. Now you can tinker with the settings in the
app, improving them further.
TIP It can be very helpful to have the Shiny app running, while also having access to R console. Start two R sessions to achieve that
Let’s assume that you have a working version of your candidate sound, which resembles the target in terms of its temporal structure, pitch contour, and perhaps even the formant structure. You can also add some non-tonal noise manually in the app, experiment with effects like subharmonics and jitter, and make other modifications. But the number of possible combinations of soundgen settings is enormous, making the process of matching the target sound very time-consuming. You can sometimes speed things up by using formal optimization.
The same function as above, matchPars
, offers a simple
way to optimize several parameters by randomly varying their values,
generating the corresponding sound, and comparing it with the target.
The currently implemented version uses simple hill climbing and is best
regarded as experimental.
m2 = matchPars(target = s087,
  samplingRate = 16000,
  pars = 'rolloff',
  maxIter = 100)
# rolloff should be moving from default (-9) to target (-5):
sapply(m2$history, function(x) {
  paste('Rolloff:', round(x$pars$rolloff, 1),
        '; fit to target:', round(x$sim, 2))
})
do.call(soundgen, c(m2$pars, list(play = playback, temperature = 0.001)))
Anikin, A. (2019). Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behavior Research Methods, 51(2), 778-792. https://doi.org/10.3758/s13428-018-1095-7
Fant, G. (1971). Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations (Vol. 2). Walter de Gruyter.
Fitch, W. T., Neubauer, J., & Herzel, H. (2002). Calls out of chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production. Animal Behaviour, 63(3), 407-418.
Hawkins, S., & Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4), 1560-1575.
Johnson, K. (2011). Acoustic and auditory phonetics, 3rd ed. Wiley-Blackwell.
Khodai-Joopari, M., & Clermont, F. (2002). A comparative study of empirical formulae for estimating vowel-formant bandwidths. In Proceedings of the 9th Australian International Conference on Speech, Science, and Technology (pp. 130-135).
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. The Journal of the Acoustical Society of America, 67(3), 971-995.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820-857.
Moore, R. K. (2016). A real-time parametric general-purpose mammalian vocal synthesiser. In INTERSPEECH (pp. 2636-2640).
Stevens, K. (2000). Acoustic phonetics. MIT Press.
Sueur, J. (2018). Sound analysis and synthesis with R. Heidelberg, Germany: Springer.
Tappert, C. C., Martony, J., & Fant, G. (1963). Spectrum envelopes for synthetic vowels. Speech Transm. Lab. Q. Progr. Status Rep., 4, 2-6.
Wilden, I., Herzel, H., Peters, G., & Tembrock, G. (1998). Subharmonics, biphonation, and deterministic chaos in mammal vocalization. Bioacoustics, 9(3), 171-196.