Most people can tell the difference between an artificial voice and a real one, no matter how authentic the former attempts to be. Google’s DeepMind team has now come up with a new AI dubbed WaveNet to erase this gap and usher us into a more human-sounding future.
Machine-human interactions have come a long way in the past few years thanks to the arrival of digital assistants like Siri and Cortana, in addition to advanced text-to-speech (TTS) software. However, generating artificial speech on a computer usually involves concatenative TTS, wherein a large collection of speech fragments is recorded from a single individual.
These recordings are then mixed and matched to build each sentence. While effective, the approach's main drawback is that the resulting voice cannot modify its tone or emotion when speaking, making it sound slightly off. The alternative is a parametric approach, in which the computer generates the words electronically instead of replaying recordings.
Again, the problem here is that such systems sound very robotic. Google is now looking to resolve this dilemma by training its WaveNet AI on raw audio waveforms of human speakers and having it generate speech one sample at a time. This differs from previous approaches because WaveNet models the raw waveform of the sound directly, rather than stitching together recorded fragments or reproducing words electronically.
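To make the "one sample at a time" idea concrete, here is a minimal Python sketch of that kind of generation loop. It is not DeepMind's code: `toy_next_sample_distribution` is a hypothetical stand-in for a trained model, and the 256 amplitude levels and the smoothing constant are assumptions chosen purely for illustration. The only point is that each new sample is drawn based on the samples that came before it.

```python
import math
import random

LEVELS = 256  # assumption: amplitudes quantized to 256 discrete levels


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained model.

    Returns one weight per amplitude level, favouring levels close to
    the previous sample so the output drifts smoothly. A real system
    would compute these weights with a neural network.
    """
    previous = history[-1] if history else LEVELS // 2
    return [math.exp(-((level - previous) ** 2) / 50.0) for level in range(LEVELS)]


def generate(num_samples, distribution_fn, seed=0):
    """Build a waveform one sample at a time, each conditioned on the past."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    waveform = []
    for _ in range(num_samples):
        weights = distribution_fn(waveform)
        next_sample = rng.choices(range(LEVELS), weights=weights, k=1)[0]
        waveform.append(next_sample)
    return waveform


samples = generate(16, toy_next_sample_distribution)
print(samples)
```

Swapping the toy function for a model that has actually learned from recordings is where all the difficulty lies; the loop itself stays this simple, which is also why generating audio this way can be slow.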
Google says the approach yields more natural-sounding speech and can even be used to compose music. To test the latter, the DeepMind team supplied it with classical piano compositions and was greeted with a few intriguing piano samples of its own making.
DeepMind goes as far as to claim that WaveNet reduces the gap between the state of the art and human-level performance by nearly 50%. It’s basing this assertion on a blind test where English and Mandarin speakers were asked to rate how realistic the AI’s speech sounded to them.