Suno AI’s Bark generative AI audio model can generate sounds in addition to voices in many languages.
The generation of sounds within speech is flexible: instructions in the text prompt, such as [laugh] or [gasp], tell the voice model where to insert them. Suno AI lists a number of such sound instructions but says it finds new ones every day. In my initial tests, the instructions were not entirely reliable. Also, Bark cannot bark yet. But it’s still a lot of fun.
Hey fellow The Decoder readers. The AI voice quality of Bark isn’t the best, but you can enter funny sound effects like [gasps], [laughs] or even [music] ♪ singing a Song about AGI ♪. [clears throat] But it can’t [bark]!
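As a rough illustration of the prompt format: only the bracketed cues ([laughs], [gasps], [music], [clears throat]) and the ♪ … ♪ singing markers come from Suno AI’s examples; the helper below is a hypothetical convenience, not part of Bark’s API.

```python
# Hypothetical helper for assembling Bark-style text prompts.
# The bracketed sound cues and the ♪ ... ♪ markers are taken from
# Suno AI's published examples; build_prompt itself is an assumption.

KNOWN_CUES = {"laughs", "gasps", "sighs", "music", "clears throat"}

def build_prompt(*parts: str) -> str:
    """Join speech fragments and sound cues into one text prompt.

    Fragments listed in KNOWN_CUES are wrapped in square brackets;
    everything else passes through as spoken text.
    """
    tokens = []
    for part in parts:
        if part in KNOWN_CUES:
            tokens.append(f"[{part}]")
        else:
            tokens.append(part)
    return " ".join(tokens)

prompt = build_prompt(
    "Hey fellow The Decoder readers.",
    "laughs",
    "You can even sing:",
    "♪ a song about AGI ♪",
)
print(prompt)
# → Hey fellow The Decoder readers. [laughs] You can even sing: ♪ a song about AGI ♪
```

The resulting string is what you would hand to the model as its text prompt.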
Bark currently supports 13 languages, including English, German, Spanish, French, Japanese, and Hindi. Suno AI says that the English voice output sounds the best, but that voices in other languages should sound better with further scaling. More languages are in the works.
One untrained feature: similar to ElevenLabs’ impressive voice AI, an English voice speaks German text with an English accent.
Bark does without phonemes
Unlike Microsoft’s VALL-E, which the Bark team cites as an inspiration along with AudioLM, Bark avoids the use of abstracted speech sounds, known as phonemes, and instead embeds text prompts directly into higher-level semantic tokens. This allows Bark to generalize beyond spoken language to other sounds or music that appear in the training data.
A second model then converts these semantic tokens into audio codec tokens to generate the full waveform. For compression, the team uses EnCodec, Meta’s powerful AI audio compression method.
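The two-stage pipeline described above can be sketched schematically. Everything below is a toy stand-in (tiny vocabularies, made-up token mappings), not Bark’s real models, which are large transformers; it only mirrors the structure: text straight to semantic tokens with no phoneme layer, then semantic tokens to codec tokens, then a codec decoder (EnCodec in Bark’s case) producing the waveform.

```python
# Schematic stand-in for Bark's two-stage pipeline (toy code, not the
# real models). Stage 1 embeds the prompt directly as semantic tokens,
# skipping any phoneme step; stage 2 translates semantic tokens into
# audio codec tokens; a codec decoder then expands those into samples.

def text_to_semantic(text: str) -> list[int]:
    """Stage 1 (toy): map each word or cue to a semantic token ID.
    Bracketed cues like [laughs] become tokens too -- no phonemes."""
    return [hash(word) % 10_000 for word in text.split()]

def semantic_to_codec(semantic: list[int]) -> list[int]:
    """Stage 2 (toy): translate semantic tokens into codec tokens."""
    return [(tok * 7 + 3) % 1024 for tok in semantic]

def codec_to_waveform(codec: list[int]) -> list[float]:
    """Codec decoder (toy): expand each codec token into samples."""
    return [tok / 1024.0 for tok in codec for _ in range(2)]

prompt = "Hello [laughs] world"
semantic = text_to_semantic(prompt)   # words and cues, no phoneme layer
codec = semantic_to_codec(semantic)   # audio codec tokens
waveform = codec_to_waveform(codec)   # final audio samples
```

The point of the sketch is the data flow, not the math: because stage 1 tokenizes the prompt directly, non-speech cues in the text land in the same token space as words, which is why Bark generalizes to laughter, music, and other sounds present in its training data.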
The Bark team is making a demo version of their software available for free on GitHub. The demo cannot be used commercially, and Bark also requires Transformer language models with more than 100 billion parameters. Suno AI plans to offer its own generative audio AI models in the future and has started a waiting list.
More Emotional AI Voices: Meta and Google led the way
Meta itself has also unveiled a large, unsupervised generative AI model for voice generation. Similar to Bark, the Generative Spoken Language Model (GSLM) has learned to produce human sounds like laughing, yawning, or crying in addition to pure speech. This makes supposedly cold AI voices sound much more human. With AudioGen, Meta also has an AI model that generates pure audio effects from text input.
GSLM example: Original neutral
GSLM example: AI generated with laughter
This brings back memories of Google’s legendary phone AI Duplex, which sounded almost as natural as a human by imitating human sounds for pauses in speech, such as “uhm”. The unveiling of Duplex sparked a debate about whether a computer voice should remain unrecognized and thus fool people, or whether it should reveal itself. Google chose the latter, but the product has yet to make a major breakthrough. Still, there are more than enough AIs that can fool people today.