This page is only a brief overview of the methods used by RealSinger to produce a voice.
It is intended to answer technical questions some users might have about the
internals; reading it is not necessary to use the product.
To synthesize a realistic singing voice, the first idea that comes to a
programmer's mind is to use a collection of recorded phonemes to generate
the voice.
Three problems quickly become apparent:

1. The algorithm used must be able to generate the phoneme at any pitch
(fundamental frequency). Recording every phoneme at all possible pitches is not
practical, because it would lead to a long and complex recording process, as well as
huge voice files.

2. The algorithm must be able to elongate, or stretch, the phoneme to any duration.

3. The algorithm must be able to generate a smooth transition from one phoneme to
another, in order to simulate the coarticulation phenomenon (the next phoneme
starts to be heard before the current one is completely terminated).

A solution can be found for each of these problems in the published literature.
Time-stretching and pitch-shifting algorithms have been developed to solve
problems 1 and 2. They process the recorded sample's digital data directly,
allowing the programmer to change its frequency (pitch) as well as its duration.
These algorithms are used in most popular sound editors to change
the pitch and speed of a sound file independently. They are also used
in speech synthesis, because pitch variations in speech are small.
However, in the case of the sung voice, these algorithms cannot be used,
because they do not perform well when the pitch shift is too large. The result
is not "wrong" as such, but the voice is distorted, just as when playing
a magnetic tape at too high a speed (the "chipmunk" voice).
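To see why a large pitch shift degrades the sound, consider the most naive approach, resampling. The sketch below (an illustration only, not RealSinger's algorithm) shifts a tone up one octave by reading the samples twice as fast; the pitch doubles, but the duration is halved as an unwanted side effect:

```python
import numpy as np

def resample_pitch_shift(signal, semitones):
    """Naive pitch shift by resampling: reading the samples faster
    raises the pitch but also shortens the sound."""
    factor = 2 ** (semitones / 12)               # frequency ratio
    old_idx = np.arange(len(signal))
    new_idx = np.arange(0, len(signal), factor)  # faster read positions
    return np.interp(new_idx, old_idx, signal)

sr = 8000
t = np.arange(sr) / sr                    # one second of samples
tone = np.sin(2 * np.pi * 220 * t)        # a 220 Hz tone
shifted = resample_pitch_shift(tone, 12)  # one octave up
print(len(tone), len(shifted))            # → 8000 4000
```

More sophisticated algorithms decouple pitch and duration, but they still distort the voice when the shift is large, which is exactly the effect described above.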
For problem 3, a common solution is to record not only
the individual phonemes of a language, but all possible combinations of two or three
phonemes (diphonemes/triphonemes). This system captures the coarticulation
effect and makes the synthesized voice more realistic. However, the recording
process becomes quite difficult and extensive, sometimes requiring several hours
of work from the speaker or singer, and the resulting voice file is often quite
large (several megabytes).
RealSinger uses original algorithms to solve all three of these problems
at the same time, by manipulating frequency spectra rather than raw sound data.
Some speech synthesizers have tried in the past to use voice frequency spectra
to generate the voice. However, this method proved difficult
to implement, because recreating a signal from a processed spectrum
using an inverse Fast Fourier Transform (IFFT) requires that the
"phase" values be properly readjusted. If they are not, consecutive
pieces of signal will not join smoothly and an unwanted background
noise will be heard.
In speech or song, the glottal source waveform (the sound produced by the
vocal cords when excited by the air stream from the lungs) is a sum
of harmonics (frequency multiples of the fundamental frequency f0).
On a power/frequency graph, this glottal source looks like
a comb, with each tooth of the comb located at a frequency that is a multiple
of the fundamental f0.
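This comb shape can be checked numerically. The sketch below uses a plain impulse train as a crude stand-in for the glottal source (real glottal pulses are smoother, so this is only an illustration) and verifies that the spectral peaks fall at multiples of f0:

```python
import numpy as np

sr, f0 = 16000, 200            # sample rate and fundamental (Hz)
n = sr                         # one second of signal
period = sr // f0              # samples per glottal cycle
source = np.zeros(n)
source[::period] = 1.0         # crude impulse-train glottal source

spectrum = np.abs(np.fft.rfft(source))
freqs = np.fft.rfftfreq(n, 1 / sr)

# The "teeth" of the comb sit at multiples of f0
peaks = freqs[spectrum > 0.5 * spectrum.max()]
print(peaks[:4].tolist())      # → [0.0, 200.0, 400.0, 600.0]
```

Raising f0 in this sketch spreads the teeth apart, which is exactly the behavior described in the next paragraph.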
When the voice pitch increases, the fundamental
frequency f0 shifts to the right (higher
frequency), and the frequency offset between two consecutive harmonics
increases too, to remain equal to f0.
In passing through the vocal tract, some frequencies are enhanced by
cavity resonances, and others are softened. The result is that certain
harmonics are loud, and others are softer. This vocal tract spectrum
depends on the phoneme being said
or sung, and remains more or less unchanged as the frequency (pitch) increases.
Multiplying these two spectra (glottal source and vocal tract), which
corresponds to convolving the two signals in the time domain, gives the
resulting spectrum, in which the listener can determine both
the phoneme (what is said) and the pitch (the sung note).
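In the frequency domain, this combination amounts to scaling each harmonic of the glottal comb by the vocal tract's response at that frequency. A toy sketch (the formant frequencies and bandwidths below are illustrative values, not measured data):

```python
import numpy as np

f0 = 250.0
freqs = np.arange(0, 4000, f0)     # harmonics of the glottal comb (Hz)
source = np.ones_like(freqs)       # flat comb amplitudes (simplified)

def tract_response(f, formants=(500, 1500, 2500), bw=150):
    """Toy vocal tract magnitude response: one resonance per formant.
    The formant values are illustrative, not measurements."""
    return sum(1.0 / (1 + ((f - fc) / bw) ** 2) for fc in formants)

voiced = source * tract_response(freqs)   # the spectra multiply
loudest = float(freqs[np.argmax(voiced)])
print(loudest)                     # → 1500.0
```

The loudest harmonic sits on a formant: the vocal tract decides which teeth of the comb are emphasized, while the comb spacing alone carries the pitch.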
The aim of RealSinger is, for each phoneme of a given language, to apply
a deconvolution to the recorded signal in order to separate the glottal
source and vocal tract spectra. It then stores only the vocal tract spectrum,
and applies a generated glottal source to this spectrum to simulate
the original recorded phoneme being sung at any pitch.
This algorithm enables RealSinger to store only a few values for each phoneme,
which means very short voice files (less than 40 KB once compressed).
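RealSinger's exact deconvolution is not described here, but a classical textbook technique with the same goal is cepstral smoothing: low-pass filtering the log spectrum keeps the slowly varying vocal tract envelope and discards the fine harmonic comb of the glottal source. A sketch of that idea (not necessarily RealSinger's own method):

```python
import numpy as np

def spectral_envelope(signal, n_coeffs=30):
    """Estimate the vocal tract envelope by cepstral smoothing:
    keep only the slowly varying part of the log spectrum and
    discard the fine harmonic comb of the glottal source."""
    log_mag = np.log(np.abs(np.fft.rfft(signal)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    cepstrum[n_coeffs:-n_coeffs] = 0.0      # drop fine harmonic detail
    return np.exp(np.fft.rfft(cepstrum).real)

sr = 16000
t = np.arange(sr) / sr
# Harmonic-rich stand-in for a recorded vowel at f0 = 200 Hz
vowel = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 20))
env = spectral_envelope(vowel)
print(len(env))    # → 8001  (one smooth value per frequency bin)
```

Keeping only a few dozen smoothing coefficients per phoneme is at least consistent with the figure of fewer than 100 floating-point values quoted below.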
Recording the voice proceeds in several steps:

1. The speaker is asked to pronounce a word for each phoneme of the chosen
language. Each word is recorded as regular sound data.

2. The phoneme is then isolated within the word, and the signal is cropped
to keep only this part.

3. An average frequency spectrum of the sound is computed.

4. This spectrum is deconvolved to remove the glottal source influence and
keep only the vocal tract resonator frequency curve.

5. This pseudo-spectrum is stored (less than 100 floating-point values for
each phoneme). For time-varying phonemes like plosives, several pseudo-spectra
are stored to keep information about changes in the spectrum.
Generating the voice
This algorithm simulates coarticulation effects. It is therefore not necessary
to record the whole set of diphonemes or triphonemes: only pure phonemes
are recorded.
For each phoneme to be sung, the matching pseudo-spectrum is extracted.
In transitional sections between two phonemes, both pseudo-spectra are
merged together, to simulate the coarticulation process.
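One minimal way to merge two pseudo-spectra during a transition is a simple cross-fade; the actual merging rule used by RealSinger is not documented, so the sketch below is only a plausible illustration:

```python
import numpy as np

def merge_envelopes(env_a, env_b, alpha):
    """Cross-fade two vocal tract pseudo-spectra during the
    transition between phonemes (alpha runs from 0 to 1)."""
    return (1 - alpha) * env_a + alpha * env_b

# Two toy pseudo-spectra (say, for /a/ and /i/), 100 values each,
# matching the storage size quoted above
env_a = np.linspace(1.0, 0.1, 100)
env_b = np.linspace(0.1, 1.0, 100)

# Halfway through the transition both phonemes contribute equally
mid = merge_envelopes(env_a, env_b, 0.5)
print(mid[0], mid[-1])    # → 0.55 0.55
```

Sweeping alpha from 0 to 1 over the transition makes the next phoneme audible before the current one has ended, which is the coarticulation effect described above.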
A synthetic glottal source is generated at the required pitch. The source
spectrum can easily be modified to change the overall voice
timbre (for equalization or for applying various vocoder effects).
This source is then re-convolved with the phoneme pseudo-spectrum, and the
resulting spectrum is processed by a phase-free inverse transform to
generate regular sound data.
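A phase-free way to turn a spectrum back into sound is additive synthesis: each harmonic is generated as a sinusoid scaled by the stored pseudo-spectrum, so no FFT phase adjustment is ever needed. Whether RealSinger uses exactly this method is not stated; the sketch below only illustrates the principle:

```python
import numpy as np

def synthesize_frame(envelope, f0, sr=16000, n=16000):
    """Additive synthesis: build a harmonic source at pitch f0, scale
    each harmonic by the stored pseudo-spectrum, and sum the sinusoids
    directly, so no FFT phase adjustment is required."""
    t = np.arange(n) / sr
    env_freqs = np.linspace(0, sr / 2, len(envelope))
    frame = np.zeros(n)
    k = 1
    while k * f0 < sr / 2:                  # harmonics below Nyquist
        gain = np.interp(k * f0, env_freqs, envelope)
        # 1/k roll-off as a crude glottal source spectrum
        frame += (gain / k) * np.sin(2 * np.pi * k * f0 * t)
        k += 1
    return frame

envelope = np.ones(100)     # toy flat vocal tract pseudo-spectrum
out = synthesize_frame(envelope, f0=220.0)
print(out.shape)            # → (16000,)
```

Because consecutive frames can be generated with continuous sinusoid phases, they join smoothly, avoiding the background noise problem described earlier for IFFT-based methods.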