Fun, I can't get to it (https://gitlab.gnome.org/GNOME/dia) because I can't get past the "Making sure you're not a bot!" page. It's just stuck at "calculating...". I understand the desire to slow down AI bots, but if all the GNOME apps are now behind this, they've just completely shut down a small-time contributor. I love to play with GNOME apps and help out with things here and there, but I'm not going to fight with this damn thing to do so.
Thanks for the heads-up! We weren’t aware of the GNOME Dia project. Since we focus on speech AI, we’ll make sure to clarify that distinction.
aclark 1 hour ago
Ditto this! Dia diagram tool user here just noticing the name clash. Good luck with your Dia!! Assuming both can exist in harmony. :-)
Magma7404 44 minutes ago
I know it's a bit ridiculous to see that as some kind of conspiracy, but I have seen a very long list of AI-related projects that took the same name as a famous open-source project, as if they wanted to hijack its popularity, and Dia is yet another example. The Dia diagram tool was relatively famous a few years ago, and you cannot have forgotten it if you used Linux for more than a few weeks. It almost seems done on purpose.
teddyh 32 minutes ago
The generous interpretation is that the AI hype people just didn’t know about those other projects, i.e. that they are neither open source developers nor users.
xbmcuser 17 minutes ago
Wow, this is the first time I have felt that this could be the end of voice acting, audiobook narration, etc. At the speed things are changing, how soon before you can turn any book or novel into a complete audio/video production, movie, or TV show?
notdian 39 minutes ago
Made a small change and got it running on an M2 Pro 16GB MacBook Pro (https://github.com/nari-labs/dia/pull/4); the quality is amazing.
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B parameter open-weights model that generates dialogue directly from a transcript.
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
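Basic usage is a few lines (a minimal sketch following the README example; class and method names are from the repo at the time of writing and may change):

```python
# Minimal sketch of the README's basic usage; names may change as the repo evolves.
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark speaker turns; the whole conversation is one generate() call,
# and nonverbal tags like (laughs) are rendered as audio rather than read out.
text = "[S1] Dia generates dialogue straight from a transcript. [S2] Both voices come out of a single pass. (laughs)"
audio = model.generate(text)

sf.write("dialogue.mp3", audio, 44100)  # Dia outputs 44.1 kHz audio
```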
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with existing APIs, but the results did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over three months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open-source contributions are extra welcome. Please feel free to check out the code and share any thoughts or suggestions with us.
gfaure 1 hour ago
Amazing that you developed this over the course of three months! Can you drop any insight into how you pulled together the audio data?
isoprophlex 35 minutes ago
+1 to this. Amazing how you managed to deliver this, and if you're willing to share, I'd be most interested in learning what you did in terms of training data!
new_user_final 1 hour ago
Easily 10 times better than OpenAI's recent voice model. I don't like robotic voices.
The example voices seem overly loud and over-excited, like Andrew Tate, Speed, or an advertisement. What's lacking is calm, normal conversation or normal podcast-like interaction.
popalchemist 8 minutes ago
This looks excellent, thank you for releasing it openly.
isoprophlex 31 minutes ago
Incredible quality demo samples, well done. How's the performance for multilingual generation?
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
youssefabdelm 18 minutes ago
Anyone know if it's possible to fine-tune it to clone my voice?
toebee 1 hour ago
It is way past bedtime here; I will be getting back to comments after a few hours of sleep! Thanks for all the kind words and feedback.
Versipelle 50 minutes ago
This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.
I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with.
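The shape of the idea, as a toy sketch (the alternating-speaker heuristic is invented for illustration; the LLM-based speaker attribution described above is the hard part), would be to turn quoted dialogue into a Dia-style [S1]/[S2] transcript:

```python
# Toy sketch: map quoted dialogue to Dia-style speaker tags. The naive
# alternating-speaker rule stands in for real LLM-based attribution;
# narration outside the quotes would go to a separate narrator voice.
import re

def to_dia_transcript(paragraph: str, n_speakers: int = 2) -> str:
    turns = []
    speaker = 0
    for quote in re.findall(r'"([^"]+)"', paragraph):
        turns.append(f"[S{speaker + 1}] {quote}")
        speaker = (speaker + 1) % n_speakers  # naive: alternate speakers
    return " ".join(turns)

text = '"Did you hear that?" "It came from the cellar," she whispered.'
print(to_dia_transcript(text))
# [S1] Did you hear that? [S2] It came from the cellar,
```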
IshKebab 2 hours ago
Why does it say "join waitlist" if it's already available?
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
toebee 2 hours ago
We're envisioning a platform with a social aspect, so that is the biggest difference. Also, bigger models!
We're aware that you don't need to create a venv when uv is already set up; we added it for people spinning up fresh GPU instances in the cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)
flakiness 2 hours ago
Seek back a few tens of bytes, to the part which states "Play with a larger version of Dia".
Time to first audio is crucial for us to reduce latency - wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go?
toebee 2 hours ago
Sounds awesome!
I think it won't be very hard to run it using output streaming, although that might require beefier GPUs. Give us an email and we can talk more - nari.ai.contact at gmail dot com.
It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)
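In the meantime, one crude workaround is to chunk the script and overlap generation with playback. A hypothetical sketch (Dia exposes no streaming API in the snippet discussed above, and chunking gives up some of the single-pass naturalness):

```python
# Hypothetical latency workaround, not a real Dia streaming API: render the
# script in chunks and play each one while the next is being generated.
import sounddevice as sd

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
chunks = [
    "[S1] Welcome back to the show. [S2] Glad to be here.",
    "[S1] So, let's get right into it.",
]

for chunk in chunks:
    audio = model.generate(chunk)     # render the next chunk...
    sd.wait()                         # ...after the previous one finishes playing
    sd.play(audio, samplerate=44100)  # non-blocking playback
sd.wait()
```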
sarangzambare 2 hours ago
No worries, I will email you.
ivape 2 hours ago
Darn, don't have the appropriate hardware.
> The full version of Dia requires around 10GB of VRAM to run.
If you have 16GB of VRAM, I guess you could pair this with a 3B param model alongside it, or realistically probably only a 1B param model with a reasonable context window.
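Rough weights-only arithmetic behind that guess (fp16 assumed; KV cache and activations not counted):

```python
# Weights-only estimate: fp16 = 2 bytes per parameter. KV cache, activations,
# and framework overhead all come on top, which is why 3B is already a squeeze.
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * bytes_per_param

dia_gb = 10  # full Dia footprint quoted above
for llm_b in (1.0, 3.0, 7.0):
    print(f"Dia + {llm_b:.0f}B LLM ~ {dia_gb + weights_gb(llm_b):.0f} GB of 16 GB")
# Dia + 1B ~ 12 GB fits; Dia + 3B ~ 16 GB leaves no headroom; 7B won't fit.
```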
toebee 2 hours ago
We will work on a quantized version of the model, so hopefully you will be able to run it soon!
We've seen Bark from Suno go from 16GB requirement -> 4GB requirement + running on CPUs. Won't be too hard, just need some time to work on it.
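For the curious, the usual first step is post-training dynamic quantization of the linear layers. A generic PyTorch sketch on a stand-in module, not something Dia ships yet:

```python
# Generic PyTorch dynamic-quantization sketch (stand-in module, not Dia):
# int8 weights for nn.Linear shrink a model ~2x vs fp16 / ~4x vs fp32 and
# make CPU inference reasonable, the same route Bark took.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
qnet = torch.ao.quantization.quantize_dynamic(net, {nn.Linear}, dtype=torch.qint8)
print(qnet)  # Linear layers are now DynamicQuantizedLinear
```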
ivape 1 hour ago
No doubt. Local TTS models like this are what I'm looking for, because I'm so done with typing and reading :)
pzo 1 hour ago
Sounds great. Hoping for more language support in the future. In comparison, Sesame CSM-1B sounds like it was trained on stoned people.
brumar 1 hour ago
Impressive! Is it English only at the moment?
toebee 1 hour ago
Unfortunately yes, at the moment.
film42 1 hour ago
Very very impressive.
xienze 1 hour ago
How do you declare which voice should be used for a particular speaker? And can it create a cloned speaker voice from a sample?
toebee 1 hour ago
You can add an audio prompt and prepend text corresponding to it in the script. You can get a feel for it by trying the second example in the Gradio interface!
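In code, that flow looks roughly like this (a sketch: the audio_prompt parameter name is an assumption, so check the repo's voice-clone example for the exact signature):

```python
# Sketch of the audio-prompt flow described above. The audio_prompt parameter
# name is assumed; see the repo's example for the real signature.
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip goes first, then the new lines to speak.
prompt_text = "[S1] This is a short clip of the voice I want to continue."
new_lines = "[S1] And this is brand-new dialogue in that same voice."

audio = model.generate(
    prompt_text + " " + new_lines,
    audio_prompt="reference.mp3",  # conditions the output voice/emotion
)
sf.write("cloned.mp3", audio, 44100)
```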
stuartjohnson12 2 hours ago
Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently held Eleven Labs ahead of the pack in my experience is that their models mostly avoid (albeit are not immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.
A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
toebee 2 hours ago
Interesting. I haven't thought of that problem before. I'm guessing a large enough audio dataset for medical terminology does not exist publicly.
But AFAIK, even if you have just a few hours of audio containing specific terminology (and correct pronunciation), fine-tuning on that data will significantly improve performance.
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do, people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.