with the same fundamental idea: by analyzing the
structure and tone quality of the human voice, we
can attempt to simulate it. As a representative
example, let’s look at “formant synthesis.”
Formants are the peaks in the sound spectrum of
the voice (the spectrum being the distribution of
volume across frequency bands). The idea is that
you can simulate human pronunciation (the vocal
cords and the movement of the mouth) by applying
the movements of these peaks to a basic sound
source.
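
To make the idea concrete, here is a minimal sketch of formant synthesis in Python. It is not how any particular product works. It assumes a pulse train as the "basic sound source" and one simple two-pole resonator per formant peak. The function names, the pitch, and the formant frequencies and bandwidths for an "A"-like vowel are all illustrative assumptions, not values taken from this article.

```python
import math

def impulse_train(pitch_hz, duration_s, sample_rate):
    """Basic sound source: one pulse per pitch period (a crude vocal cord buzz)."""
    period = int(round(sample_rate / pitch_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole filter that emphasizes frequencies near freq_hz (one formant peak)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius, below 1
    theta = 2.0 * math.pi * freq_hz / sample_rate         # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    gain = 1.0 - r  # rough level scaling (an assumption, not exact normalization)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Apply each formant peak to the source and sum the results."""
    source = impulse_train(pitch_hz, duration_s, sample_rate)
    mixed = [0.0] * len(source)
    for freq_hz, bandwidth_hz in formants:
        band = resonator(source, freq_hz, bandwidth_hz, sample_rate)
        mixed = [m + b for m, b in zip(mixed, band)]
    return mixed

# Illustrative (frequency, bandwidth) pairs for an "A"-like vowel; assumed values.
VOWEL_A = [(800.0, 80.0), (1200.0, 90.0)]
samples = synthesize_vowel(VOWEL_A)
```

Moving the formant frequencies over time, rather than keeping them fixed as in this sketch, is what would correspond to the movement of the mouth.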
“Concatenative synthesis” is another method
that spread quickly as the cost of digital
technology fell. This method links fragments of
recorded (sampled) voices to synthesize vocals.
Vocaloid’s system is basically a type of
concatenative synthesis that produces more
musical results. It achieves this by connecting
vocal fragments in the same way while also making
adjustments to each frequency band.
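
The linking step can be sketched as follows. This is a simplified illustration, not Vocaloid's actual algorithm: it only blends fragments in the time domain with a short crossfade and makes none of the per-band frequency adjustments described above. The function name `crossfade_concat` and the toy fragments are assumptions made for the example.

```python
def crossfade_concat(fragments, overlap):
    """Join lists of samples, blending `overlap` samples at each seam."""
    out = list(fragments[0])
    for frag in fragments[1:]:
        n = min(overlap, len(out), len(frag))
        for i in range(n):
            w = (i + 1) / (n + 1)        # fade-in weight of the incoming fragment
            idx = len(out) - n + i
            out[idx] = (1.0 - w) * out[idx] + w * frag[i]
        out.extend(frag[n:])             # append the part that was not blended
    return out

# Toy "recorded fragments": constant levels stand in for real sampled audio.
fragment_a = [1.0] * 6
fragment_b = [0.0] * 6
joined = crossfade_concat([fragment_a, fragment_b], overlap=3)
```

With these toy inputs the seam ramps down gradually instead of jumping from one fragment to the next, which is the basic reason a plain cut-and-paste of samples sounds rougher than a blended join.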
Formant synthesis and the robot voice
A device that is closer to the concept of “formant
synthesis,” and one familiar to many in the music
world, is the “Vocoder.” The idea for this device
originally took shape in the late 1920s at Bell Labs.
At the time, it was used as a voice compression
technology for sending clear voice transmissions
through the small bandwidth of a telegraph cable.
Because the technology of the day could not be
made cheaply, and because this period encompassed
World War II, it was used mainly for military
communication. Production costs fell as
semiconductor technology advanced in the late
1960s, and instruments and effects processors that
gave the human singing voice a robot-like quality
grew popular. As a means of voice compression,
vocoder technology was later used to improve voice
clarity in cell phones, and it is still being
developed today.
Similarly, a type of effects processor called a talk
box, which uses the structure of the human mouth
itself as a physical filter, has become very popular
in musical genres such as rock and funk. These
devices, however, are simply effects processors
that shape sound using the movements of the
human mouth. They don’t quite belong in the
same category of “vocal synthesizers” as Vocaloid,
because they do not generate singing voices on
their own.
The birth of Vocaloid
Vocal synthesizers had been sold as instruments
before, starting in 1997 with the Yamaha
PLG100-SG, a plug-in board that added a formant
singing sound source to a desktop music sound
module. In 2000, however, a project called “Daisy,”
which paid homage to “Daisy Bell,” got under way.
In 2003, the team released sound generating
software called “Vocaloid,” and everything
changed. They adopted a unique concatenative
synthesis system built by breaking recorded voice
data down into fragments (phonemes), then
adjusting and editing these fragments to compile a
database. In this way, they were able to achieve
smooth vocal synthesis. Vocaloid was praised for
its natural vocal expression and its user-friendly
software. It became widely acknowledged,
particularly by users dedicated to desktop music.
In 2007, Vocaloid 2 was announced. In the same
year, the more character-oriented “Hatsune Miku”
was developed by Crypton Future Media.
Pocket Miku’s built-in “eVocaloid” technology
VOCALOID 3 was released in 2011. Its
concatenative vocal synthesis engine made even
more natural vocal expression possible, and many
character voices appeared as libraries of singing
voices stocked with vocal fragments. Meanwhile,
sound chips used in hardware, such as those that
produce ringtones in cell phones, have become
widespread and continue to develop. Pocket Miku
is equipped with the newest of these chips, the
Yamaha NSX-1. In addition to functioning as a
sound chip, the NSX-1 is equipped with an
“eVocaloid” sound generator, which puts to use
Vocaloid technology that was previously available
only as sound generating software for personal
computers and similar devices. Pocket Miku adds
one further modification to the NSX-1. Whereas
previous Vocaloid systems required the score to be
programmed beforehand in software called a “score
editor,” with this modification Pocket Miku is the
first product in the world that lets you perform
with Vocaloid in real time. Pocket Miku is battery
operated and has a built-in speaker. Simply slide
your stylus across the carbon keyboard, and
Hatsune Miku will sing for you anywhere. Go
ahead and try it out!
[Diagram of the Vocaloid system: Score information (musical notes, lyrics, musical
expressions) is entered in the score editor, the score information input interface. The
synthesizer engine converts the score information into audio signals, drawing on a singer
library (synthesizer database and expression database) extracted from recorded vocals of
actual singers, and outputs synthesized vocals.]
WAHHA GO GO
“WAHHA GO GO,” a machine that laughs like a human, was developed in 2009 by Maywa
Denki. Powered by a flywheel and bellows, the device imitates the movements of human
vocal cords and the opening and closing of the mouth, resulting in changes in the
formant (voice quality) and the amount of air (voice volume).
© Yoshimoto Kogyo co.,ltd. / Maywa Denki
Formants and the Vocoder
The peak movements of formants are closely related to the movements of the vocal cords
and the mouth when a person uses their voice. When similar sounds are produced, their
formants peak near the same frequencies. The Vocoder grew out of audio compression
technology that reproduces formants by regenerating them at the receiving end. It uses
multiple bandpass filters to detect the level of the peak in each frequency band.
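
As a rough illustration of the detection step only, the sketch below runs a signal through a small bank of bandpass filters and measures the level in each band. It is a simplified sketch under stated assumptions, not the design of any real vocoder: the two-pole bandpass filter, the four band centers, the bandwidth, and the names `bandpass` and `band_levels` are all chosen for the example.

```python
import math

def bandpass(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole, two-zero bandpass filter with roughly unit gain at center_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * center_hz / sample_rate
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    gain = (1.0 - r * r) / 2.0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = gain * (x - x2) + a1 * y1 + a2 * y2   # zeros at DC and Nyquist
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def band_levels(signal, centers_hz, bandwidth_hz, sample_rate):
    """RMS level of the signal in each band of the filter bank."""
    levels = []
    for center_hz in centers_hz:
        band = bandpass(signal, center_hz, bandwidth_hz, sample_rate)
        levels.append(math.sqrt(sum(v * v for v in band) / len(band)))
    return levels

SAMPLE_RATE = 8000
CENTERS_HZ = [250.0, 500.0, 1000.0, 2000.0]   # assumed band centers
# Test input: a pure 1000 Hz tone standing in for a voice with one strong peak.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / SAMPLE_RATE) for n in range(800)]
levels = band_levels(tone, CENTERS_HZ, 100.0, SAMPLE_RATE)
loudest = CENTERS_HZ[levels.index(max(levels))]
```

For this test tone the band centered on 1000 should come out as the loudest. A receiving end could then rebuild a similar sound by applying the measured band levels to its own sound source, which is why only the levels, and not the full waveform, would need to be transmitted.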
The construction of “Vocaloid”
“Vocaloid” uses the specialized software “Score Editor” to input score information. This
makes it easier than ever before to produce rich and natural vocal expressions.
Hatsune Miku
Going beyond the boundaries
of instruments and
synthesizers, this creation
swept the charts and flooded
cities with stunning visuals.
“Hatsune Miku” is considered
to be the first virtual idol to be
recognized worldwide. She
sparked a social phenomenon,
in which she stands center
stage.
[Figure: The “A” formant. Multiple peaks can be confirmed. Axis ticks at 500, 1000, and
1500; labels: Vocal cords, First Formant, Second Formant.]
VOCALOID is a registered trademark of the Yamaha Corporation.
eVocaloid is a trademark of the Yamaha Corporation.
Illustration by KEI