New AI speech tech aims to fix accent problems in multiple languages by 2025

New LLM-based AI speech systems can slip into the wrong accent when speaking different languages, much like a bilingual speaker letting one language's accent bleed into the other.

NEW APPROACHES TARGETING SPEECH OUTPUT FLAWS

Researchers are wrestling with the intricacies of Large Language Model (LLM) powered text-to-speech (TTS) systems, seeking to untangle persistent issues that mar the simulated voice. A key sticking point is 'accent leakage' in systems designed to speak multiple languages. Unlike older TTS methods, the newer LLM variants generate speech token by token, without explicit control over timing; that looseness creates problems, yet it is also what gives the output its more natural sound.
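That token-by-token process can be sketched in a few lines. In this toy loop, the dummy model, token ids, and stopping rule are all invented for illustration; it only shows why total duration is an emergent property rather than something the system controls:

```python
# Toy sketch of autoregressive speech-token decoding (illustrative only):
# an LLM-style TTS model emits one discrete audio token at a time and stops
# when it predicts an end-of-speech token -- there is no duration model.

EOS = -1  # hypothetical end-of-speech token id

def dummy_next_token(text_tokens, audio_tokens):
    """Stand-in for the LLM's next-token prediction: here we simply
    emit two audio tokens per text token, then stop."""
    if len(audio_tokens) >= 2 * len(text_tokens):
        return EOS
    return len(audio_tokens) % 8  # pretend codec codebook of size 8

def synthesize(text_tokens, max_len=100):
    audio_tokens = []
    while len(audio_tokens) < max_len:
        tok = dummy_next_token(text_tokens, audio_tokens)
        if tok == EOS:          # the model decides when speech ends --
            break               # timing is not explicitly controlled
        audio_tokens.append(tok)
    return audio_tokens

print(len(synthesize([5, 9, 2])))  # 6 tokens for 3 text tokens
```

A real system would sample from a learned distribution at each step, which is exactly where timing drift and other instabilities can creep in.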

Recent work points to methods like low-rank adaptation, adding more varied data, and "chain-of-thought" reasoning as avenues for creating accent-free speech in different tongues, enhancing expressiveness, and building more dependable speech generation.
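Of these, low-rank adaptation (LoRA) is the most concrete to illustrate. A minimal sketch, assuming a single frozen weight matrix inside the TTS model; the sizes, rank, and scaling value are hypothetical:

```python
import numpy as np

# Minimal LoRA sketch: instead of fine-tuning the frozen pretrained weight
# matrix W, train a low-rank update B @ A (rank r << d). Such a small
# adapter could, for example, be trained per language to curb accent
# leakage without touching the base model.

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                 # hypothetical layer sizes, rank

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weights
A = rng.standard_normal((r, d_in)) * 0.01  # trainable rank-r factor
B = np.zeros((d_out, r))                   # init to zero: no change at start
alpha = 8.0                                # LoRA scaling hyperparameter

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A, computed without
    # ever materializing the full updated matrix.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted layer initially matches the base one.
assert np.allclose(forward(x), W @ x)
```

The appeal is the parameter count: here the adapter has 512 trainable values against 4,096 in the frozen matrix, so many language-specific adapters can be stored and swapped cheaply.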

This effort addresses the fundamental challenge of creating synthetic voices that not only sound lifelike but can also handle diverse linguistic demands without unintended linguistic "bleed-through."

DEEPENING THE BREAKDOWN: ALIGNMENT AND AUDIO CODECS

One line of inquiry centers on improving the synchronization, or "alignment," between the text being processed and the resulting speech sounds. A paper presented at 'Interspeech 2024' by Paarth Neekhara and colleagues proposes learning "monotonic alignment" to bolster the resilience of LLM-based speech synthesis.
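The paper's exact training recipe is beyond this article, but a generic way to nudge text-to-speech attention toward monotonic, left-to-right behavior is to bias the cross-attention scores with a prior that peaks along the diagonal. The following is a sketch of that general idea under assumed sizes, not the authors' method:

```python
import math

# Generic sketch: bias the cross-attention between audio step t and text
# position n with a log-prior peaking near n ~ (t/T) * N, so attention
# tends to move left-to-right through the text instead of jumping around.

def diagonal_prior(T, N, sigma=0.5):
    """Log-prior favouring attention near position (t / T) * N."""
    prior = []
    for t in range(T):
        center = (t / max(T - 1, 1)) * (N - 1)
        prior.append([-((n - center) ** 2) / (2 * sigma ** 2)
                      for n in range(N)])
    return prior

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

T, N = 6, 4                              # 6 audio steps over 4 text tokens
scores = [[0.0] * N for _ in range(T)]   # stand-in raw attention scores
biased = [softmax([s + p for s, p in zip(scores[t], prior_row)])
          for t, prior_row in enumerate(diagonal_prior(T, N))]

# With the prior applied, the peak of each attention row advances
# monotonically through the text (non-decreasing positions).
peaks = [max(range(N), key=lambda n: biased[t][n]) for t in range(T)]
print(peaks)
```

In practice such a prior (or an equivalent alignment loss) is applied during training, so the model learns the monotonic behavior rather than having it imposed at inference time.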


A related model, dubbed 'T5-TTS,' is experimenting with different audio compression techniques, specifically EnCodec, DAC, and Mel-FSQ. This work examines how these audio "codecs" affect the output when the system synthesizes new sentences in the voices of speakers it has already heard. The results are presented as audio samples, inviting listeners to discern the differences.
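As a rough illustration of one of these ideas, finite scalar quantization (the 'FSQ' in Mel-FSQ) rounds each latent dimension to a handful of fixed levels, turning each frame into a small tuple of integers an LLM can predict. A simplified sketch, noting that real codecs quantize learned latents rather than raw values:

```python
import math

# Simplified finite scalar quantization (FSQ): squash each latent value
# into (-1, 1), then snap it to the nearest of L evenly spaced levels.
# A frame of latents becomes a short tuple of small integers -- exactly
# the kind of discrete token an LLM-based TTS model can predict.

def fsq_quantize(z, levels=5):
    """Map each value to the nearest of `levels` points in [-1, 1]."""
    codes = []
    for v in z:
        v = math.tanh(v)                         # bound to (-1, 1)
        idx = round((v + 1) / 2 * (levels - 1))  # nearest level index
        codes.append(idx)
    return codes

def fsq_dequantize(codes, levels=5):
    """Map level indices back to their values in [-1, 1]."""
    return [idx / (levels - 1) * 2 - 1 for idx in codes]

frame = [0.9, -2.0, 0.1, 0.0]   # hypothetical 4-dim latent frame
codes = fsq_quantize(frame)
print(codes)                     # one small integer per latent dimension
recon = fsq_dequantize(codes)
```

The trade-off the T5-TTS samples probe is audible: coarser quantization means fewer tokens for the LLM to model, but more information lost on the round trip back to audio.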

BACKGROUND AND THE EVOLVING LANDSCAPE

The shift towards LLM-based TTS marks a departure from earlier, more rigid systems. The autoregressive nature of LLMs, while introducing challenges like accent leakage and temporal control, is also credited with producing a more human-like vocal quality. This pursuit of 'naturalness' is a constant undercurrent in the development of synthetic speech.

The ongoing research, appearing across platforms like 'AOL' and academic archives, signifies a concerted push to refine these powerful AI tools, moving them beyond mere functionality towards a more nuanced and robust form of vocal simulation. The very definition of "quality" and "robustness" in this field appears to be in constant flux.


Frequently Asked Questions

Q: What is the main problem with new AI speech technology that researchers are trying to fix?
The main problem is 'accent leakage,' where AI voices designed to speak multiple languages accidentally use accents from other languages. Researchers are working on methods to make the voices clearer and more natural in each language.
Q: How are researchers trying to solve the accent problem in AI speech?
They are using new methods like 'low-rank adaptation,' adding more varied speech data, and 'chain-of-thought' reasoning. They are also working on improving the timing and synchronization between text and speech sounds.
Q: What is 'monotonic alignment' in AI speech synthesis?
'Monotonic alignment' means the model moves through the input text strictly in order, left to right, as it generates speech, never jumping backwards or skipping ahead. This makes AI voices more reliable and less prone to errors such as repeated or dropped words.
Q: What are audio codecs and why are they important for AI speech?
Audio codecs compress speech into compact representations, often discrete tokens the AI can predict and then decode back into sound. Researchers are testing different codecs, such as EnCodec and DAC, to see how they affect the quality of AI-generated speech, especially when synthesizing new sentences in the voices of speakers the system already knows.
Q: Why is this research important for the future of AI voices?
This research aims to create AI voices that sound very human-like and can handle many languages without mixing accents. This will make AI tools more useful and natural for people to interact with in the future.