This is really neat. I cloned my voice and can generate speech from text, but I can't seem to generate longer clips. The README.md says:
> Context Window: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)
But it's cutting off for me well before that point. I fed it a paragraph of text, and it gets part of the way through before skipping a few words ahead, saying a few more words, then cutting off at 17 seconds. Another test simply cut off after 21 seconds (no skipping).
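For what it's worth, here's a back-of-envelope token budget that could explain cutoffs in the 17-21 second range: if the reference clip and the input text share the same 2048-token window, the audio you can generate shrinks accordingly. The ~68 tokens/s rate is just inferred from the README's "2048 tokens ≈ 30 seconds" claim, and the clip/text sizes below are made-up illustrative numbers:

```python
# Rough token budget, assuming the README's "2048 tokens ~= 30 s" means
# ~68 audio tokens per second, and that the reference clip and input
# text consume part of the same window. Concrete numbers are guesses.
CTX = 2048
tokens_per_sec = CTX / 30               # ~68 tokens/s
ref_tokens = 5 * tokens_per_sec         # a 5 s reference clip
text_tokens = 150                       # prompt text, rough guess
remaining = CTX - ref_tokens - text_tokens
print(remaining / tokens_per_sec)       # ~23 s of audio left to generate
```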
Lastly, I'm on an MBP M3 Max with 128GB running Sequoia. I'm following all the "Guidelines for minimizing Latency", but generating a 4.16-second clip takes 16.51s for me. I'm not sure what I'm doing wrong, or how you would use this in practice, since it's not realtime and the limit is so low (and unclear). Maybe you're supposed to cut your text into smaller chunks and run them in parallel/sequence to get around the limit?
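If chunking is indeed the intended workaround, a minimal sketch might look like the following. Note that `tts.infer(text)` returning a waveform is a hypothetical stand-in (the real neutts-air call likely also takes reference audio/codes), and the chunk size, pause length, and sample rate are guesses:

```python
import re
import numpy as np
import soundfile as sf

def synthesize_long(tts, text, sample_rate=24_000, max_chars=200):
    # Split on sentence boundaries so each chunk stays well inside the
    # ~30 s context window, then synthesize the chunks sequentially.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)

    # Stitch the per-chunk waveforms together with a short pause between.
    silence = np.zeros(int(0.25 * sample_rate), dtype=np.float32)
    parts = []
    for chunk in chunks:
        parts.append(tts.infer(chunk))  # hypothetical API
        parts.append(silence)
    return np.concatenate(parts)

# sf.write("out.wav", synthesize_long(tts, long_paragraph), 24_000)
```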
The model weighs 1.5GB [1] (the q4 quant is ~500MB)
The demo is impressive. It uses reference audio at inference time, and it looks like the training code is mostly available [2][3] with a reference dataset [4] as well.
Every couple of weeks I see a new TTS model showcased here, and it’s always difficult to see how they differ from one another. Why don’t they describe the architecture and the details of the training data?
My cynical side thinks people just take the state-of-the-art open-source model, use an LLM to alter the source, do minimal fine-tuning to change the weights, and then claim “we built our own state-of-the-art TTS”.
I know it’s open source, so I can dig into the details myself, but are there any good high-level overviews of modern TTS, comparing/contrasting the top models?
The special sauce here is that it is built on a very small LLM (Qwen), which means it can run CPU-only, or even on small devices like a Raspberry Pi or a mobile phone.
Architecturally it's similar to other LLM-based TTS models (like OuteTTS), but the underlying LLM allows them to release it under an Apache 2 license.
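To illustrate why the small backbone matters (this is not the project's actual loader, and the filename is made up): a ~500 MB Q4 GGUF of a 0.5B model loads comfortably on CPU with llama-cpp-python. The full TTS pipeline would still need the codec to decode audio tokens back into a waveform:

```python
from llama_cpp import Llama

# Hypothetical filename; the point is only that a 0.5B Q4 backbone
# fits in well under 1 GB of RAM and runs CPU-only.
llm = Llama(
    model_path="neutts-air-q4.gguf",
    n_ctx=2048,    # matches the advertised context window
    n_threads=4,   # Raspberry Pi-class budget
)
```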
Without the resources for a proper study of whether the quality is actually better or worse than other options, these open TTS models have to be judged by what you think of their output. (That is, do your own study.)
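One low-effort version of that study: render the same sentence with each model, then play the clips unlabeled in shuffled order and score them. A sketch, assuming pre-rendered WAV files and macOS's `afplay` (swap in `aplay`/`paplay` on Linux):

```python
import random
import subprocess

clips = {"model_a": "a.wav", "model_b": "b.wav"}  # hypothetical outputs
order = list(clips.items())
random.shuffle(order)

scores = {}
for i, (name, path) in enumerate(order, start=1):
    subprocess.run(["afplay", path], check=True)  # macOS audio player
    scores[name] = int(input(f"Rate clip {i} (1-5): "))
print(scores)
```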
I've found some of them to be surprisingly good. I keep a list of them, as I have future project ideas that might need a good one, and each has its own merits.
I've yet to find one that does good spoken informal Chinese. I'd appreciate it if anyone can suggest one!
But the current one seems really good; I tested it quite a bit with multiple kinds of inputs.
From the README:
> NeuTTS Air is built off Qwen 0.5B
1. https://huggingface.co/neuphonic/neutts-air/tree/main
2. https://github.com/neuphonic/neutts-air/issues/7
3. https://github.com/neuphonic/neutts-air/blob/feat/example-fi...
4. https://huggingface.co/datasets/neuphonic/emilia-yodas-engli...
This means using this TTS in a commercial project is very dicey due to GPL3.