Voice technical background
Sung voice synthesis
In voice synthesis, for speech as well as for singing,
three main methods can be used:
vocal tract simulation,
connection of recorded elements,
formant synthesis.
Vocal tract simulation
Historically, this is the oldest method. The very first speech
synthesis was designed for a mechanical automaton, using a collection
of tubes and valves to simulate a vocal tract.
Computer models of this process have not given convincing results
to date, because the process is extremely complex.
Connection of recorded elements
A singer or a speaker is digitally recorded in order to store
the whole set of phonemes (or groups of phonemes).
These samples are then connected in sequence to rebuild the voice. Complex
algorithms are used to alter the recorded phonemes and make them
follow the vocal intonation (prosody).
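The connection step can be illustrated with a minimal sketch. It assumes each recorded unit is a plain list of audio samples and joins the units with a short linear crossfade to avoid clicks at the boundaries. The function name crossfade_concat and the overlap parameter are hypothetical, and the prosody-altering algorithms mentioned above are not modeled here.

```python
def crossfade_concat(units, overlap):
    """Join recorded units (lists of samples), blending `overlap`
    samples at each boundary with a linear crossfade.

    Illustrative only: real systems also adjust pitch and duration.
    """
    if not units:
        return []
    result = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(result), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)          # weight of the incoming unit
            idx = len(result) - n + i      # position in the outgoing tail
            result[idx] = (1.0 - w) * result[idx] + w * unit[i]
        # Append the part of the new unit that was not blended.
        result.extend(unit[n:])
    return result


# Two made-up "phoneme" recordings, four samples each.
joined = crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2)
```

With an overlap of two samples, the two four-sample units share their boundary region, so the joined result is six samples long instead of eight.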
This method provides excellent results for standard speech.
However, the algorithms are poorly
adapted to generating a singing voice, because of the much wider
frequency ranges involved. Another drawback of this method is the need for
very large voice sample files.
To define another voice, it is necessary to record
another speaker/singer. Furthermore, the whole set of phonemes
for each language must
be recorded separately.
To create multilingual software, it is thus necessary to record several
different speakers/singers, and to store these samples in a huge
file, often several megabytes in size.
Formant synthesis
This synthesis is based on the analysis of vocal sound.
Acousticians have determined that vocal tract resonances amplify a
small number of frequency
ranges, related to the spoken phoneme. These frequency ranges have been
called formants.
A formant is characterized by its center frequency, its bandwidth
(width of the frequency range) and its energy (strength).
In electronics or computing, a formant can be
simulated by a resonant bandpass filter.
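As a rough sketch of this idea, a formant can be modeled in software by a two-pole digital resonator, whose pole angle sets the center frequency and whose pole radius sets the bandwidth. The function names, the default sample rate and the simple gain normalization below are assumptions made for illustration. They are not taken from Virtual Singer.

```python
import math


def resonator_coefficients(center_hz, bandwidth_hz, sample_rate=44100):
    """Coefficients for y[n] = gain*x[n] + a1*y[n-1] + a2*y[n-2]."""
    # Pole radius: a narrower bandwidth pushes the poles toward the unit circle.
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    # Pole angle: position of the resonance peak.
    theta = 2.0 * math.pi * center_hz / sample_rate
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    gain = 1.0 - r  # rough level normalization (assumption, not exact unity gain)
    return gain, a1, a2


def resonate(signal, center_hz, bandwidth_hz, sample_rate=44100):
    """Filter a list of samples through one formant resonator."""
    gain, a1, a2 = resonator_coefficients(center_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

A tone close to the center frequency passes through almost unchanged, while a tone far from it is strongly attenuated, which is the behavior expected of a resonant bandpass filter.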
In the early 1960s, the first devices used electronic filters to generate
recognizable phonemes. Acousticians then realized that three to
six formants are sufficient to generate a phoneme with acceptable
quality. The advantage of this method is that only a small amount of
data is required
to generate a phoneme, and it is far easier to modify these data
slightly to produce another voice timbre.
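The small amount of data involved can be made concrete with a sketch. A vowel is described here by three formants, each a (center frequency, bandwidth, energy) triple, and is produced by sending a pulse train through one resonator per formant and summing the results. The formant values are approximate textbook-style figures for an "ah" vowel and the energy weights are arbitrary, so treat all of them as assumptions. The names VOWEL_A, synthesize and shift_timbre are hypothetical, and this is not Virtual Singer's actual algorithm.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz

# (center frequency Hz, bandwidth Hz, relative energy): illustrative values
VOWEL_A = [(730.0, 90.0, 1.0), (1090.0, 110.0, 0.5), (2440.0, 170.0, 0.25)]


def pulse_train(pitch_hz, duration_s):
    """Crude glottal source: one unit pulse per pitch period."""
    period = int(round(SAMPLE_RATE / pitch_hz))
    n = int(SAMPLE_RATE * duration_s)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def formant_filter(signal, center_hz, bandwidth_hz):
    """Two-pole resonator standing in for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / SAMPLE_RATE)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(formants, pitch_hz=220.0, duration_s=0.5):
    """Sum the formant resonators in parallel and normalize the peak to 1."""
    source = pulse_train(pitch_hz, duration_s)
    mixed = [0.0] * len(source)
    for center, bandwidth, energy in formants:
        for i, v in enumerate(formant_filter(source, center, bandwidth)):
            mixed[i] += energy * v
    peak = max(abs(v) for v in mixed) or 1.0
    return [v / peak for v in mixed]


def shift_timbre(formants, factor):
    """Derive another voice timbre by scaling every formant frequency."""
    return [(c * factor, b, e) for c, b, e in formants]


samples = synthesize(VOWEL_A)
brighter = synthesize(shift_timbre(VOWEL_A, 1.1))
```

The whole vowel is described by nine numbers, and a different timbre is obtained by changing only those numbers, whereas the recorded-sample method would need a new recording.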
However, the result is generally less realistic than with recorded samples.
This third method is used in Virtual Singer.