RealSinger
Technical background
Note:
This page is only a brief overview of the methods
used by RealSinger to produce a voice.
It is intended to answer technical questions some users
might have about the internal algorithms,
and is not needed to use the product.
Introduction
To synthesize a realistic singing voice, the first idea that comes to
the mind of the programmer is to use a collection of recorded phonemes
to generate the voice.
Three problems quickly become apparent:
1. The algorithm used must be able to generate the phoneme at any pitch
   (fundamental frequency). Recording every phoneme at all possible pitches
   is not feasible, because it would lead to a long and complex recording
   process, as well as huge voice files.
2. The algorithm must be able to elongate, or stretch, the phoneme to any
   duration.
3. The algorithm must be able to generate a smooth transition from one
   phoneme to another, in order to simulate the coarticulation phenomenon
   (the next phoneme starts to be heard before the current one is
   completely terminated).
A solution can be found for each of these problems in the published
computer literature.
Efficient algorithms
have been developed to solve problems 1 and 2. They process
the recorded sample's digital data directly, allowing the programmer to
change its frequency (pitch) as well as its duration.
These algorithms are used in most popular sound editors to change
the pitch and speed of a sound file independently. They are also used
successfully
in speech synthesis, because speech frequency (pitch) variations are
quite
small.
However, in the case of the sung voice, these algorithms cannot be used,
because they do not work well when the pitch shift is too large. The result
is not "wrong" as such, but the voice is distorted, just as when playing
a magnetic tape at too high a speed (chipmunk voice).
For problem 3, a common solution is to record not only
the individual phonemes of a language, but all possible combinations of
two or three
phonemes (diphonemes/triphonemes). This system stores the
coarticulation
effect and makes the synthesized voice more realistic. However, here
again,
the recording process is quite difficult and extensive, sometimes
requiring several hours of work for the
speaker or singer. The resulting voice file is often quite
large (several megabytes).
RealSinger uses original algorithms to solve all three
of these problems
at the same time, by manipulating
frequency spectra.
Some speech synthesizers have tried to use voice frequency spectra
to generate voice in the past.
However, this method proved to be difficult
to implement, because recreating a signal from a processed spectrum
using an inverse Fast Fourier Transform (IFFT) requires that the
"phase" values be readjusted properly. If they are not properly
adjusted, consecutive pieces of signal won't
join and an unwanted background noise will be heard.
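The phase problem can be illustrated on a single sinusoid, without a full IFFT. The sketch below is not RealSinger's code: `synth_frame` and `boundary_jump` are hypothetical helpers, and the 220 Hz tone, 100-sample frames and 8000 Hz sample rate are arbitrary values chosen for illustration. It compares the jump at the join between two frames when the second frame restarts at phase zero and when it starts at the phase the first frame had reached.

```python
import math

def synth_frame(freq, start_phase, n_samples, sample_rate):
    """One frame of a sinusoid that starts at the given phase (radians)."""
    step = 2.0 * math.pi * freq / sample_rate
    return [math.sin(start_phase + step * i) for i in range(n_samples)]

def boundary_jump(freq, n_samples, sample_rate, carry_phase):
    """Absolute jump between the last sample of one frame and the first of the next."""
    first = synth_frame(freq, 0.0, n_samples, sample_rate)
    # Phase the oscillator has reached when the second frame begins.
    end_phase = 2.0 * math.pi * freq * n_samples / sample_rate
    start = end_phase if carry_phase else 0.0
    second = synth_frame(freq, start, n_samples, sample_rate)
    return abs(second[0] - first[-1])

naive = boundary_jump(220.0, 100, 8000, carry_phase=False)
adjusted = boundary_jump(220.0, 100, 8000, carry_phase=True)
print(f"jump without phase adjustment: {naive:.3f}")
print(f"jump with phase adjustment:    {adjusted:.3f}")
```

With the phase carried over, the jump at the join is no larger than an ordinary step between two samples of the tone. Without it, each frame boundary introduces a click, and a stream of such clicks is heard as noise.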
Voice spectrum
In speech or song, the glottal source waveform (the sound produced by
vocal cords when excited by the air stream from the lungs) is a
combination
of harmonics (frequency multiples of the fundamental frequency f0).
On a power/frequency graph, this glottal source sound
looks like
a comb, with each tooth of the comb located at a frequency that is a
multiple
of the fundamental f0.
When the voice pitch increases, the fundamental
frequency f0 shifts to the right (higher
frequency), and the frequency offset between two consecutive harmonics
increases too, to remain equal to f0.
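As a small illustration of the comb shape, the harmonic positions can be listed for two pitches. This is only a sketch: `harmonic_frequencies` is a hypothetical helper, and the 110 Hz and 220 Hz pitches and the 1000 Hz limit are arbitrary example values.

```python
def harmonic_frequencies(f0, max_freq):
    """Return the harmonic frequencies k * f0 (k = 1, 2, ...) up to max_freq."""
    count = int(max_freq // f0)
    return [k * f0 for k in range(1, count + 1)]

low = harmonic_frequencies(110.0, 1000.0)   # lower note
high = harmonic_frequencies(220.0, 1000.0)  # same note one octave higher

print("teeth at 110 Hz:", low)
print("teeth at 220 Hz:", high)
print("spacing:", low[1] - low[0], "Hz versus", high[1] - high[0], "Hz")
```

Raising the pitch by an octave doubles the spacing between the teeth, so fewer harmonics fall within the same frequency range.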
As the sound passes through the vocal tract, some frequencies are enhanced by
cavity resonances, and others are attenuated. The result is that certain
harmonics are loud, and others are softer. This vocal tract spectrum
depends on the phoneme being spoken
or sung, and is more or less unchanged when frequency (pitch) increases
or decreases.
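Combining these two spectra (glottal source and vocal tract) gives the
resulting spectrum, from which the listener can identify both
the phoneme (what is said) and the pitch (sung note). The combination is a
convolution of the two signals in the time domain, which amounts to a
multiplication of their spectra.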
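In the frequency domain, this combination amounts to scaling each harmonic of the glottal comb by the vocal tract gain at that frequency. The sketch below illustrates the idea under stated assumptions: the envelope is a toy sum of resonance bumps, the values in `FORMANTS` are illustrative and not measured data, and `vocal_tract_gain` and `voiced_spectrum` are hypothetical names.

```python
# (center Hz, bandwidth Hz, peak gain): illustrative values, not measurements.
FORMANTS = [(700.0, 130.0, 1.0), (1200.0, 70.0, 0.5)]

def vocal_tract_gain(freq, formants):
    """Toy vocal tract envelope: a sum of bell-shaped resonance bumps."""
    return sum(gain / (1.0 + ((freq - center) / (bandwidth / 2.0)) ** 2)
               for center, bandwidth, gain in formants)

def voiced_spectrum(f0, formants, max_freq=4000.0, tilt=1.0):
    """Return (frequency, amplitude) pairs: the glottal comb shaped by the envelope."""
    pairs = []
    k = 1
    while k * f0 <= max_freq:
        freq = k * f0
        source = 1.0 / k ** tilt  # assumed source: harmonics weaken with rank k
        pairs.append((freq, source * vocal_tract_gain(freq, formants)))
        k += 1
    return pairs

for pitch in (100.0, 175.0):
    flat = voiced_spectrum(pitch, FORMANTS, tilt=0.0)  # flat source shows the envelope alone
    loudest = max(flat, key=lambda pair: pair[1])
    print(f"f0 = {pitch} Hz: {len(flat)} harmonics, strongest near {loudest[0]} Hz")
```

The teeth of the comb move with the pitch, while the envelope that shapes them stays where the phoneme puts it.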
RealSinger basics
For each phoneme of a given language, RealSinger applies
a deconvolution to the recorded signal in order to separate the glottal
source and vocal tract spectra. It then stores only the vocal tract spectrum,
and later applies a generated glottal source to this spectrum to simulate
the original recorded phoneme being sung at any pitch.
Learning process
-
The speaker is asked to pronounce a word for each phoneme of the chosen
language.
-
Each word is recorded as regular sound data.
-
Then, the phoneme is isolated within the word and the signal is cropped
to keep only this part.
-
An average frequency spectrum of the sound is computed.
-
This spectrum is deconvolved to remove the glottal source influence
and keep only the vocal tract resonator frequency curve.
-
This pseudo-spectrum is stored (less than 100 floating-point values for
each phoneme).
-
For time-varying phonemes like plosives, several pseudo-spectra are
stored
to keep information about changes in the spectrum.
This algorithm enables RealSinger to store only a few values for each
phoneme, which
means very small voice files (less than 40 KB once compressed).
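This page does not describe how the deconvolution is actually performed. As a rough sketch only, the steps above can be imitated by assuming that harmonic amplitudes have already been measured from the average spectrum and that the glottal source amplitude falls off as 1/k with harmonic rank k. Both are assumptions made for illustration, and `remove_source`, `interpolate` and `pseudo_spectrum` are hypothetical names.

```python
def remove_source(harmonics, tilt=1.0):
    """Divide each measured harmonic amplitude by an assumed 1 / k**tilt source amplitude."""
    return [(freq, amp * k ** tilt)
            for k, (freq, amp) in enumerate(harmonics, start=1)]

def interpolate(points, freq):
    """Piecewise-linear interpolation over (freq, value) points, clamped at both ends."""
    if freq <= points[0][0]:
        return points[0][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(points, points[1:]):
        if freq <= f_hi:
            t = (freq - f_lo) / (f_hi - f_lo)
            return a_lo + t * (a_hi - a_lo)
    return points[-1][1]

def pseudo_spectrum(harmonics, n_points=64, max_freq=4000.0, tilt=1.0):
    """Resample the source-free gains onto a fixed grid of n_points values."""
    gains = remove_source(harmonics, tilt)
    step = max_freq / (n_points - 1)
    return [interpolate(gains, i * step) for i in range(n_points)]

# Made-up measurement: harmonics of a 100 Hz voice, as (frequency, amplitude) pairs.
measured = [(100.0 * k, (1.0 + 0.5 * (k % 3)) / k) for k in range(1, 41)]
stored = pseudo_spectrum(measured, n_points=64)
print(len(stored), "values stored for this phoneme")
```

Because the grid is fixed and independent of the recorded pitch, the stored curve can later be sampled at the harmonics of any other pitch.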
Generating the voice
-
For each phoneme to be sung, the matching pseudo-spectrum is extracted.
In transitional sections between two phonemes, both pseudo-spectra are
distorted,
then merged together, to simulate the coarticulation process.
-
A synthetic glottal source is generated, at the required pitch. The
glottal
source spectrum can be easily modified to change the overall
voice
timbre (for equalization or for applying various vocoder effects).
-
This source is re-convolved with the phoneme pseudo-spectrum.
-
The spectrum is then processed by a phase-free inverse transform to
generate regular sound data.
This algorithm simulates coarticulation effects. Therefore it is not
necessary
to record the whole set of diphonemes or triphonemes. Only pure
phonemes
are required.
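The generation steps can be sketched as follows. This is an illustration under several assumptions, not RealSinger's implementation: the merge between the two pseudo-spectra is a plain linear crossfade (the distortion applied before merging is not specified here and is omitted), the glottal source is a 1/k harmonic series, and a bank of sinusoidal oscillators with running phases stands in for the phase-free inverse transform, whose details are not given. All function names are hypothetical.

```python
import math

def blend(env_a, env_b, mix):
    """Linear crossfade between two envelopes (mix = 0 gives env_a, 1 gives env_b)."""
    return [(1.0 - mix) * a + mix * b for a, b in zip(env_a, env_b)]

def envelope_at(env, freq, max_freq):
    """Linearly interpolate an envelope sampled evenly over 0..max_freq."""
    pos = min(max(freq / max_freq, 0.0), 1.0) * (len(env) - 1)
    i = min(int(pos), len(env) - 2)
    return env[i] + (pos - i) * (env[i + 1] - env[i])

def sing_transition(env_a, env_b, f0, n_samples, sample_rate=8000, max_freq=4000.0):
    """Synthesize a glide from phoneme A to phoneme B at pitch f0."""
    n_harmonics = int(min(max_freq, sample_rate / 2) // f0)
    phases = [0.0] * n_harmonics  # one running phase per harmonic
    samples = []
    for n in range(n_samples):
        mix = n / (n_samples - 1) if n_samples > 1 else 0.0
        env = blend(env_a, env_b, mix)
        value = 0.0
        for k in range(1, n_harmonics + 1):
            freq = k * f0
            # Assumed glottal source (1/k) shaped by the blended envelope.
            amp = envelope_at(env, freq, max_freq) / k
            value += amp * math.sin(phases[k - 1])
            # Advancing the phase sample by sample keeps the waveform continuous.
            phases[k - 1] += 2.0 * math.pi * freq / sample_rate
        samples.append(value)
    return samples

# Toy envelopes with 8 points each: one favors low frequencies, the other high ones.
dark = [1.0, 0.9, 0.6, 0.3, 0.2, 0.1, 0.05, 0.0]
bright = [0.1, 0.2, 0.4, 0.8, 1.0, 0.6, 0.3, 0.1]
voice = sing_transition(dark, bright, f0=200.0, n_samples=400)
print(len(voice), "samples generated")
```

Changing `f0` moves the harmonics without touching the envelopes, which is why only pure phonemes need to be stored.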