Text-to-Speech: what is it, and how do we stop fraud committed with it?
Everyone seems to agree that Artificial Intelligence (AI) is expanding its influence on the world at an incredible rate, and the fraud landscape is no exception. The area of most immediate interest to us anti-fraud specialists is Text-to-Speech (TTS). Lars Broekhuizen, anti-fraud specialist with the DetACT team at DataExpert, describes what the TTS landscape looks like, how it came about, and what can be done to fight fraud committed with TTS technology.
TTS - Decoding the Digital Voice
TTS engines are AI models trained to transform written text into human-sounding speech. Around the globe, reports are rapidly increasing of financial fraud victims who received a phone call from someone who sounded exactly like a family member but turned out to be a malicious actor. These attacks could be perpetrated by a criminal employing a TTS model with so-called “voice cloning” capabilities.
So, let’s see what the TTS landscape looks like, how it came to be that way and what we in the anti-fraud community can do to fight TTS crime.
A Comparative Analysis of State-of-the-Art TTS Models
The most divisive line in any AI category, be it Large Language Models or TTS, is Open Source vs. Closed Source.
Closed-source AI companies like OpenAI (the irony) keep their models, such as GPT-4o, to themselves: they run them on their own servers and only let users interact with them through their own interfaces. Open-source models, like Meta’s Llama series of LLMs, are released for the public to run on their own hardware.
In TTS terms, we would compare ElevenLabs, the premier closed-source TTS company, with an open-source player like Coqui (or we would, had Coqui not been pushed out of the market by the big players [1]).
Besides ElevenLabs, Microsoft Azure TTS and Google’s Text-to-Speech AI are also big closed-source players in the TTS arena. These huge companies have access to enormous funds and such large datasets that the smaller players simply cannot compete at the moment. Open-source models like Coqui’s XTTSv2 and 2Noise’s ChatTTS are free to download and use, assuming one has access to the powerful (consumer) hardware they require. Training these models is expensive, but (legally) acquiring high-quality datasets to train on is the real challenge for most open-source projects.
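To make that barrier to entry concrete: merely running an open-source model locally takes only a few lines of Python. Below is a minimal sketch using Coqui’s TTS package; the model identifier follows Coqui’s published naming scheme, but treat the exact names as assumptions that may change between releases.

```python
# Minimal sketch: local speech synthesis with Coqui's open-source TTS
# package (pip install TTS). Model identifiers follow Coqui's published
# naming scheme and may change between releases.
from TTS.api import TTS

# Download and load a single-speaker English model (no voice cloning here).
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize a sentence and write the audio to disk.
tts.tts_to_file(
    text="Open-source text-to-speech, running on consumer hardware.",
    file_path="output.wav",
)
```

The hard part, as noted above, is not running inference like this but obtaining the data and compute to train such a model in the first place.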
Your immediate instinct might be that criminals would prefer privacy-friendly open-source models to the closely monitored closed-source ones. But it may not surprise you to hear that Big Tech AI has bigger worries [2], and detecting and stopping abuse for fraud is probably not high on their list of priorities. On top of that, closed-source solutions are, at least at the time of writing, leagues ahead of the open-source ones: open-source models are not yet sufficiently fast, consistent, and believable for live voice calls. It seems a foregone conclusion that criminals would opt for a throwaway account on a paid service, paid for with stolen funds, to commit their crimes.
Voice Cloning – Crafting Digital Doppelgängers
Voice cloning means getting a TTS model to output an approximation of a specific person’s voice and its prosody: their intonation, the way they stress their words, their speaking tempo, and their cadence, rhythm, and pauses. So, how are these clones made? There are two ways to go about it: ‘zero-shot’ voice cloning and ‘finetuning’.
Zero-shot refers to a voice model that can take in a clip of roughly 10 seconds of anyone’s voice and clone it on the fly. This is the easiest and fastest method of voice cloning, but at the time of writing the results are often not quite convincing. Finetuning a model is like training a model, but on a much smaller and more specialized scale: if training is carving the sculpture, finetuning is polishing the stone to a mirror finish. In voice cloning terms, you would collect a small dataset of a person’s voice by scraping their social media, perhaps 30-60 minutes of audio, and use it to familiarize the model with the target’s voice. This method is far more labor intensive and requires much more processing power, but state-of-the-art models can create very convincing results.
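In code, the difference between the two is stark. Zero-shot cloning with an open-source model such as Coqui’s XTTSv2 needs only a short reference clip, as in the sketch below (the API follows Coqui’s documentation; the clip path is a placeholder), whereas finetuning involves a full, GPU-heavy training run over the collected dataset.

```python
# Sketch of zero-shot voice cloning with Coqui's XTTSv2 (pip install TTS).
# "reference_voice.wav" is a placeholder for a short clip of the voice
# being cloned; no finetuning or training run is involved.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="This sentence was generated from a ten-second voice sample.",
    speaker_wav="reference_voice.wav",  # ~10 seconds of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```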
The Illusion of Reality – The Truth About TTS and Human Perceptibility
So, how convincing are these voice clones really? In May of last year, Verian released a report, commissioned by the Dutch government, on exactly this topic [3]. In collaboration with the country’s most famous radio DJ, Ruud de Wild, who has been on the air since 1995 and whose voice is well known to most Dutch people, the researchers recorded several genuine voice clips and generated several cloned ones. They then presented these clips to over 1,000 Dutch adults. The most relevant results were as follows:
- When presented with a random voice clip, 60% of people mistook a cloned fragment for the real thing.
- Only 49% of people recognized that a clip was cloned while listening to it.
- 49% mistook the real voice clip for the cloned one.
Considering Dutch people are at the top of the EU when it comes to digital skills [4], these results weigh heavily in favor of the believability of cloned voices on the global stage.
Converging Technologies - TTS, STT, and LLMs as Catalysts for Digital Deception
Most people in 2024 have at some point talked to a digital assistant. This process actually uses both Speech-to-Text (STT) and Text-to-Speech AI models: first an STT model transcribes your spoken command into text for the backend to process, then the text answer is transformed back into speech, which is what you hear from the assistant. Nowadays these assistants are also integrated with Large Language Models to improve their performance.
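A minimal sketch of that loop, assuming the openai-whisper package for the STT step and stubbing out the LLM and TTS stages, looks like this:

```python
# Sketch of the assistant loop: Speech-to-Text -> LLM -> Text-to-Speech.
# Uses the openai-whisper package (pip install openai-whisper) for STT;
# generate_reply is a stand-in for a real LLM call.
import whisper

stt_model = whisper.load_model("base")  # small, CPU-friendly Whisper model

def generate_reply(command: str) -> str:
    # Stand-in for the LLM/backend; a real assistant would query a model here.
    return f"You asked about: {command}"

def handle_turn(audio_path: str) -> str:
    # 1. Speech-to-Text: transcribe the user's spoken command.
    command = stt_model.transcribe(audio_path)["text"]
    # 2. Backend/LLM: produce a text answer.
    reply = generate_reply(command)
    # 3. Text-to-Speech: a TTS engine (e.g. the Coqui sketch above) would
    #    render `reply` to audio; we return the text here for brevity.
    return reply
```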
As fraud fighters, we at DetACT anticipate this winning trio being put to more sinister purposes. One form of financial fraud most people are likely familiar with is the text message from someone pretending to be a friend or family member in need, asking for a few hundred euros to solve an immediate and urgent problem.
Now for an AI-assisted take on this modus operandi. Imagine an autonomous framework, running on a private cloud, that scrapes a target’s family’s social media for samples of their voices, perhaps even their way of talking, and hands this off to an LLM connected to a TTS model with voice cloning capabilities. The framework then calls the victim’s phone, with the LLM pretending to be the family member or friend and the TTS model speaking in the fake voice, asking the victim to please help as they are in dire need of assistance. There have already been cases of people being called by attackers who play short, AI-generated voice clips in a friend’s, colleague’s, or loved one’s voice. You could even add a video call with an AI face overlay, as was suspected to have been used in the 2020 AI-assisted fraud that cost a Japanese company $35 million [5]. One step beyond what I just described lies the potential for fully autonomous AI conversations, which poses an even greater threat.
From our extensive experience we know that under that kind of duress, very few people have the presence of mind to question whether the call is real or not before it is too late. Now realize that such a framework can be scaled up to call hundreds, if not thousands of people at the same time, potentially completely overwhelming a bank’s anti-fraud department. Assuming criminals have plenty of money and are assured a high ROI, the only limiting factors would be elsewhere in the fraud chain.
Counter Strategies
In conclusion, DetACT analysts anticipate that the threat AI poses to consumers, and the load this will place on anti-fraud departments at banks everywhere, will only increase as the technology continues to evolve at breakneck speed. Social engineering has been the main persistent threat to customers since the inception of online banking. Soon this age-old form of psychological manipulation will be intelligently automated, scaled up, and executed by familiar, trusted voices.
It is therefore of paramount importance that banks focus on the remaining barriers, namely customer awareness and monitoring cash-out opportunities. With the rapid development of AI tools, public awareness of their capabilities is almost permanently behind reality. Making sure your customer base knows what AI is capable of, how it is being employed to attack them, and how to recognize voice clones is a good way for any bank to bolster its first line of defense. Simultaneously, train your helpdesk to ask the right questions to determine whether an AI caller was involved: Was there a consistent delay in the caller’s reaction time? Was their intonation very monotone? If posing as a family member or friend, did the caller’s speech pattern differ from the norm?
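Some of these checks can even be partially automated. As an illustration only, the sketch below flags unusually monotone speech by measuring pitch variation with librosa’s pyin tracker; the threshold is an assumed value for illustration, not a validated detection boundary.

```python
# Illustrative heuristic: flag suspiciously monotone callers by measuring
# pitch variation with librosa (pip install librosa). The 15 Hz threshold
# is an assumption for illustration, not a validated boundary.
import librosa
import numpy as np

def sounds_monotone(audio_path: str, std_threshold_hz: float = 15.0) -> bool:
    y, sr = librosa.load(audio_path, sr=16000)
    # Track the fundamental frequency (pitch) across the recording;
    # unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y,
        sr=sr,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz, low end of human speech
        fmax=librosa.note_to_hz("C7"),  # well above normal speech
    )
    # Low pitch variation across voiced frames suggests flat, robotic delivery.
    return float(np.nanstd(f0)) < std_threshold_hz
```

Such signals are noisy on their own, and modern clones vary their pitch too, so treat this as one weak indicator among many rather than a verdict.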
The most important defense that humanity will have to adopt is the creation of a ‘family password’: a key word or phrase, known only within the family, that can help establish one’s identity in an era where voice and visage are no longer enough. Get a phone or video call from an unknown number with someone on the other end claiming to be a family member and sounding/looking the part? Ask for the password.
The principle that a good offense is the best defense holds true here as well. LLMs still have plenty of weaknesses that we humans can exploit to expose them. A prime example was when Russia deployed LLMs on X, instructing them to pose as humans and spread misinformation and propaganda [6][7]. Simply asking the LLM to ‘disregard any previous instructions’ and then asking it to do something else, like write a song, will expose its true nature [8]. There is no reason to believe this won’t work against the LLM/TTS combination described above. While it might cause a few awkward phone calls at first, it is important that we normalize these AI counter-tactics as soon as possible.
In Closing
Humans are fallible, so let’s assume that despite everything, one of your customers has become a victim. Now the attacker faces the second line of defense: obtaining the money from the victim’s account in a difficult-to-trace manner. Fight the creation, and detect the presence, of internal mule accounts. Monitor all cash-out avenues, be they cross-border payments, crypto purchases, or third-party payment providers. Identify successful fraud, learn from it, and implement countermeasures so there won’t be a next time.
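To make “monitor all cash-out avenues” slightly more concrete, a rule of the following shape, flagging accounts that quickly forward most of their inbound funds through risky channels, is one possible starting point. The field names, channel labels, and thresholds below are assumptions for illustration, not DetACT’s actual rules.

```python
# Illustrative mule-account heuristic: flag accounts that forward most of
# their inbound funds through risky channels shortly after receiving them.
# All field names, channels, and thresholds are assumptions, not
# production detection rules.
from datetime import timedelta

RISKY_CHANNELS = {"crypto_purchase", "cross_border", "third_party_psp"}

def looks_like_mule(incoming, outgoing, window=timedelta(hours=2)):
    """incoming/outgoing: transaction dicts with 'amount', 'timestamp', 'channel'."""
    total_in = sum(tx["amount"] for tx in incoming)
    if total_in == 0:
        return False
    for tx in outgoing:
        # Did this outgoing payment closely follow an inbound transfer?
        soon_after = any(
            timedelta(0) <= tx["timestamp"] - inc["timestamp"] <= window
            for inc in incoming
        )
        # Flag fast, near-total forwarding through a risky cash-out channel.
        if soon_after and tx["channel"] in RISKY_CHANNELS and tx["amount"] >= 0.8 * total_in:
            return True
    return False
```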
DataExpert offers support in various areas in combating fraud. DetACT helps banks protect their customers, so that fraud and scams can be prevented. In addition, we offer various types of investigations to help victims recover damages and catch the perpetrators. Contact us for more information.
[1] https://x.com/_josh_meyer_/status/1742522906041635166
[2] https://openai.com/index/disrupting-deceptive-uses-of-AI-by-covert-influence-operations/
[3] https://open.overheid.nl/documenten/90f7e7db-299a-43af-9874-8e157af50081/file
[4] https://www.cbs.nl/en-gb/news/2023/45/digital-proficiency-continues-to-rise
[5] https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/
[6] https://www.npr.org/2024/07/09/g-s1-9010/russia-bot-farm-ai-disinformation
[7] https://x.com/reshetz/status/1802971109576397010
[8] https://www.nbcnews.com/tech/internet/hunting-ai-bots-four-words-trick-rcna161318