You know what’s wild? The world of AI is moving so fast these days, it’s genuinely hard to keep up. Every week, there’s something new and shiny promising to change everything. But sometimes, the really cool stuff slips under the radar. And honestly, that’s a bit of a crime when you stumble upon something as awesome as Llasa 3B, one of the coolest text-to-speech and voice cloning AI models out there.
Seriously, have you heard of it? Probably not, right? That’s what’s so crazy! This thing is a total game-changer in the open-source AI world, and it’s kind of chilling in the shadows. We’re talking about a text-to-speech model that’s not just good, it’s scarily realistic. And get this, it can clone voices with just a tiny snippet of audio. Like, seconds tiny.
Intrigued? You should be. Let’s get into what makes Llasa 3B so special, and why you should be paying attention to this incredible piece of tech.
So, What Exactly Is Llasa 3B?
Okay, so Llasa 3B is essentially a fine-tuned version of the Llama 3B model. Now, if you’re not super deep into AI jargon, Llama 3B is a powerful language model. Think of it as the brains behind a lot of cool AI stuff. What the folks at HKUST-Audio have done is taken this Llama 3B model and tweaked it specifically for text-to-speech.
But here’s the kicker – they didn’t just stop at making it speak. They’ve made it speak incredibly naturally. We’re talking about nuanced speech, the kind that captures emotion, tone, and all those little human quirks that make voices sound, well, human.
And the secret sauce? Apparently, it’s something called xcodec2. This is an “audio tokenizer”: basically, it’s how the AI understands and processes audio, by breaking waveforms down into discrete tokens at a rapid pace. That efficiency is probably a big part of why Llasa 3B is so quick and responsive.
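To make that concrete, here’s roughly what the tokenize-and-reconstruct roundtrip looks like with xcodec2. This is a minimal sketch using the same XCodec2Model calls you’ll see in the full scripts later in this post; the filenames are placeholders, and I’m assuming a short 16 kHz mono WAV as input.

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load the pretrained codec -- the same checkpoint the scripts below use.
codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2")
codec.eval()

# "my_clip.wav" is a placeholder: any short 16 kHz mono WAV should do.
wav, sr = sf.read("my_clip.wav")
wav = torch.from_numpy(wav).float().unsqueeze(0)  # shape (1, num_samples)

with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav)  # waveform -> discrete tokens
    recon = codec.decode_code(codes)               # tokens -> waveform again

print(f"Audio became {codes.shape[-1]} tokens")
sf.write("reconstructed.wav", recon[0, 0, :].cpu().numpy(), 16000)
```

Those discrete tokens are the whole trick: once audio looks like a token sequence, a Llama-style language model can generate speech the same way it generates text.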
If you want to get your hands dirty and see the code for yourself, it’s all up on GitHub. And if you just want to play around and hear it in action, there’s a demo on Hugging Face Spaces. Seriously, go check it out after reading this – you won’t be disappointed.

Voice Cloning: Is This Real Life?
Right, let’s talk about the voice cloning. Because this is where Llasa 3B goes from “impressive” to “mind-blowing.” The claim is, and the demos back it up, that it can clone a voice using just a few seconds of audio. Like, imagine giving it a 5-second clip of someone speaking, and then it can generate speech in their voice.
Sounds like science fiction, doesn’t it? But it’s real. There are some amazing examples of cloned voices, like “Alex,” “Amelia,” and “Russel,” built from short sample audio, and the results are genuinely uncanny.
Check out these examples, whipped up using voices from ElevenLabs (just to show off the tech; these aren’t real people’s voices being cloned):
Alex:
- Reference Audio: “Let me know in the comment section below. This is the COD Archive, and I’ll see you tomorrow. Take care.”
- Cloned Voice: “Hey guys, what’s up? Alex here, back at it again with another video. Today we will be learning how to clone voices with a state-of-the-art text-to-speech model. Exciting, right? Let’s just get right into it.”
Amelia:
- Reference Audio: “Hi! I’m Amelia, a super high quality English voice. I love to read. Seriously, I’m a total bookworm. So what are you waiting for? Get me reading!”
- Cloned Voice: “All you need is a short clean audio sample of just 5 to 10 seconds. Then the model can generate a high quality speech sample mimicking the voice, tone and style of speech and even accent.”
See what I mean? It’s not just mimicking the words, it’s getting the vibe of the voice. The tone, the style, even the accent. It’s like magic. Or, you know, really clever AI.
More Than Just Mimicking: Whispers, Emotions, and All That Jazz
But Llasa 3B isn’t just a one-trick pony that clones voices. It can also play with style. Want a whisper? Give it a whisper sample, and it’ll whisper back.
And emotions? Yep, it can do those too. The examples of “confusion,” “anger,” and “laughter” are pretty convincing. It’s not just monotone robotic speech; it’s speech with feeling. Imagine the possibilities for creating more engaging and realistic AI assistants, characters in games, or even just making your text messages sound a bit more… you.
Now, it’s not perfect. The Optimus Prime example shows that it can struggle with really stylized or unique voices. Peter Cullen’s iconic Optimus Prime voice is pretty distinct, and it seems Llasa 3B couldn’t quite nail it. But hey, even humans can’t perfectly imitate Optimus Prime!
What’s Next for Llasa? And Why Isn’t Everyone Talking About This?
The creators of Llasa have an 8B model in the works – that’s an even bigger, potentially more powerful version. It’s tantalizing to think about what an 8B Llasa could do if the 3B is already this impressive. There are also questions about fine-tuning with LoRA (another AI technique) and even mixing and merging voices. It sounds like there’s a whole playground of possibilities to explore.
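On that LoRA point: since Llasa 3B is, at its core, a Llama-style causal language model, adapter fine-tuning would presumably look like the standard peft recipe. To be clear, the sketch below is purely illustrative under that assumption; the target modules and hyperparameters are my guesses, not anything published for Llasa.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Llasa 3B loads like any other causal LM, since that's what it is.
model = AutoModelForCausalLM.from_pretrained("HKUST-Audio/Llasa-3B")

# Attach low-rank adapters to the attention projections. These target
# modules are the usual Llama choices -- a guess for Llasa, not a recipe.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the tiny adapter weights train
```

The appeal is obvious: instead of retraining a 3B-parameter model to get a new voice or style, you’d train a small adapter on top of it.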
But back to the big question: why isn’t Llasa 3B blowing up the internet right now? It’s open-source, it’s free to use (if you’ve got the tech know-how), and it’s genuinely groundbreaking. Maybe it’s just early days; the official paper is still pending, and people are waiting for that seal of academic approval. Maybe it’s one of those hidden gems that the AI community will slowly discover and appreciate over time. Or maybe, just maybe, it’s the license: Llasa 3B is released under CC BY-NC-ND 4.0, which rules out commercial use, citing ethics and privacy concerns. Bummer!
Honestly, I’m scratching my head. This tech is incredible. And the fact that it’s built on top of Llama 3, making it essentially “just a llama model in disguise”, is even cooler. It shows the power and flexibility of these foundational models.
Okay, Ready to Try Llasa Yourself? Here’s a (Gentle) How-To!
Alright, so you’re intrigued and maybe thinking, “Cool, but how do I actually use this thing?” Don’t worry, it’s not as scary as it might look! You don’t need to be a coding whiz to get Llasa 3B talking (or cloning voices). Here’s a simplified guide to get you started.
First things first: A little setup for Llasa
You’ll need to install something called xcodec2. Think of it as a special tool that Llasa 3B uses to understand and create speech. If you’re familiar with coding environments, you’ll want to use conda. If that sounds like another language to you, just follow these steps closely:
- Open your terminal or command prompt. (This is where you type in text commands to your computer – if you’re not sure how to do this, a quick web search for “open terminal” or “open command prompt” on your operating system will help).
- Create a virtual environment: Type this command and press Enter:
conda create -n xcodec2 python=3.9
This is like creating a separate little workspace for Llasa 3B to live in, so it doesn’t mess with anything else on your computer.
- Activate the environment: Now, step into that workspace by typing:
conda activate xcodec2
You’ll probably see the name of your environment (xcodec2) in parentheses at the beginning of your command line, to show you’re inside it.
- Install xcodec2: Finally, install the tool Llasa needs:
pip install xcodec2==0.1.3
This command downloads and installs the correct version of xcodec2.
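Before moving on, a quick sanity check doesn’t hurt. The scripts below also rely on torch, transformers, and soundfile (which may or may not come along with xcodec2’s install), so here’s a tiny import test you can save and run with python:

```python
# Quick sanity check: these are the libraries the scripts below import.
# If any line fails, `pip install torch transformers soundfile` in this
# same environment should cover it.
import xcodec2
import torch
import transformers
import soundfile

print("Setup looks good!")
```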
Now you’re ready to play! Let’s try two main things Llasa 3B can do:
1. Just Plain Text-to-Speech (No Voice Cloning Here)
Let’s say you just want to turn some text into speech, using Llasa 3B’s natural-sounding voice. Here’s the code for that. Copy and paste this whole block of code into a Python file (you can use a simple text editor and save it as something like text_to_speech.py).
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')  # Make sure you have a CUDA-enabled GPU for faster processing!

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'

def ids_to_speech_tokens(speech_ids):
    # Wrap raw codec ids in the <|s_N|> format the model understands.
    # (Not used in this script, but the voice cloning version below needs it.)
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Pull the numeric codec ids back out of the generated <|s_N|> tokens.
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    tokenizer.padding_side = "left"  # Important for this model!
    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    ).to('cuda')  # Move input to GPU
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(
        input_ids,
        max_length=2048,
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )
    # Keep only the newly generated speech tokens (drop the prompt and end token).
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
    # Decode the discrete speech tokens back into a waveform.
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
print("Audio saved to gen.wav")
What’s going on in this code? (Don’t worry, you don’t need to understand every line, but a little overview helps!)
- Imports: It’s grabbing the tools it needs (like the Llama 3B model and the xcodec2 stuff).
- Loading Models: It’s loading the pre-trained Llasa 3B model and the xcodec2 model – these are the brains of the operation!
- Input Text: See this line? input_text = '...' That’s where you change the text to whatever you want Llasa 3B to say. Go ahead and swap in something fun.
- Generating Speech: The rest of the code tells the model to turn your text into speech tokens, then decodes those tokens into audio.
- Saving Audio: Finally, sf.write("gen.wav", ...) saves the generated speech as a file called gen.wav. You’ll find this file in the same folder where you saved your Python file.
To run this:
- Save the code as a .py file (like text_to_speech.py).
- Open your terminal, make sure your xcodec2 conda environment is activated (conda activate xcodec2).
- Navigate to the folder where you saved the .py file (using the cd command in the terminal).
- Run the script by typing: python text_to_speech.py and pressing Enter.
After a bit of processing (especially if you’re using a GPU, it’ll be faster!), you should have a gen.wav file with Llasa 3B speaking your text!
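If you’d rather verify the output from the terminal than hunt for an audio player, a couple of lines of soundfile will tell you what got generated (an optional check, not part of the script above):

```python
import soundfile as sf

# Read back the generated file and report its length and sample rate.
data, sr = sf.read("gen.wav")
print(f"Generated {len(data) / sr:.1f} seconds of audio at {sr} Hz")
```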
2. Voice Cloning Time! (Text-to-Speech with a Voice Prompt)
Now for the really cool part – voice cloning! This code is a bit longer, but it’s worth it. Again, copy and paste this into a new Python file (like voice_clone.py).
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')  # GPU recommended!

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

prompt_wav_path = "sample_voice.wav"  # <--- REPLACE THIS!
prompt_wav, sr = sf.read(prompt_wav_path)
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

prompt_text = "This is the voice I want to clone."  # Transcript of what's actually said in the prompt audio -- edit to match your clip
target_text = 'Suddenly, there was laughter around me. I looked at them, straightened my chest with vigor, swung my slightly plump arms, and said with a light smile, "The meat on my body is to hide my bursting charm, otherwise, wouldn\'t it scare you?"'  # Text to speak in the cloned voice
input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    # Wrap raw codec ids in the <|s_N|> format the model understands.
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Pull the numeric codec ids back out of the generated <|s_N|> tokens.
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

with torch.no_grad():
    # Encode the prompt audio into codec tokens -- this is how the model "hears" the voice.
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    vq_code_prompt = vq_code_prompt[0, 0, :]
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]
    tokenizer.padding_side = "left"  # Important for this model!
    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    ).to('cuda')  # Move input to GPU
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(
        input_ids,
        max_length=2048,
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )
    # Notice the slight change here! We keep the prompt's speech tokens plus
    # the new ones, so the decoded audio covers the full utterance.
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
print("Cloned voice audio saved to gen.wav")
Key differences in the voice cloning code:
- prompt_wav_path = "sample_voice.wav": This is important! You need to replace "sample_voice.wav" with the actual path to an audio file (a .wav file works best) that contains the voice you want to clone. Make sure this audio file is in the same folder as your Python script, or provide the full file path. The example code assumes it’s called sample_voice.wav and sits in the same folder.
- prompt_text = "...": This should be a transcript of what’s actually said in your prompt audio. The script feeds the model the prompt’s text and its speech tokens together, so a matching transcript helps the model line the words up with the voice.
- target_text = '...': This is the text you want Llasa 3B to speak in the cloned voice. Change this to whatever you like!
- Encoding Prompt Audio: The code now includes steps to load your prompt_wav audio file and encode it into speech tokens, so Llasa 3B can pick up the voice characteristics.
To run the voice cloning code:
- Get a sample voice audio file. Make sure it’s relatively short and clear (a few seconds is enough!). Save it as sample_voice.wav (or whatever you set prompt_wav_path to) in the same folder as your Python script. If your clip isn’t already a 16 kHz mono WAV, the snippet after these steps shows one way to convert it.
- Save the code as a .py file (like voice_clone.py).
- Open your terminal, activate your xcodec2 conda environment.
- Navigate to the folder.
- Run the script: python voice_clone.py
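As promised above, here’s one way to get your reference clip into the 16 kHz mono WAV shape the codec works at (that’s the rate the scripts in this post read and write). This assumes you have librosa installed (pip install librosa); the input filename is a placeholder.

```python
import librosa
import soundfile as sf

# Load any common audio format, downmix to mono, and resample to 16 kHz.
wav, sr = librosa.load("my_recording.mp3", sr=16000, mono=True)

# Save it under the name the cloning script looks for.
sf.write("sample_voice.wav", wav, 16000)
print(f"Saved {len(wav) / 16000:.1f}s clip to sample_voice.wav")
```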
Again, after processing, you should get a gen.wav file, but this time, the speech should be in the style of the voice from your sample_voice.wav file!
Important Notes:
- GPU Recommended: These models run much faster on a CUDA-enabled NVIDIA GPU. One catch: as written, both scripts hard-code CUDA, so on a CPU-only machine you’ll want the small tweak shown after these notes; with it, everything still runs, just significantly slower.
- File Paths: Double-check your file paths, especially for the prompt_wav_path in the voice cloning code. If the path is wrong, the code won’t be able to find your audio file.
- Experiment! The fun part is playing around! Try different input texts, different voice prompts, and see what Llasa 3B can do!
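On that GPU point: the standard torch fix is to detect the device once and use it everywhere. This is a sketch of the change, not part of the original scripts:

```python
import torch

# Pick the best available device instead of assuming CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

# Then, in either script, swap the hard-coded calls for the detected device:
#   model.to('cuda')            ->  model.to(device)
#   Codec_model.eval().cuda()   ->  Codec_model.eval().to(device)
#   .to('cuda') on input_ids    ->  .to(device)
#   .cuda() on speech_tokens    ->  .to(device)
```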
Give Llasa a Whirl!
If you’re even remotely interested in text-to-speech, voice cloning, or just cool AI stuff in general, you need to check out Llasa.
- Try the demo: Head over to Hugging Face Spaces and have a play. Clone some voices, make it whisper, see what it can do!
- Dive into the code: If you’re technically inclined, explore the GitHub repo. You might even be able to contribute and help make it even better!
Let me know in the comments what you think! Have you tried Llasa? Are you as blown away as I am? Let’s get this amazing piece of open-source AI the attention it deserves! Who knows, maybe you’ll be the one to discover the next big thing you can do with Llasa. The possibilities, honestly, are kind of mind-boggling.