Hi, thanks for the great work on MOSS-TTS!
I’m experimenting with single-speaker fine-tuning and noticed a significant difference in output quality depending on dataset size.
Observation
- With ~40 minutes of clean, well-aligned data → results are natural and fluent
- With ~7–8 minutes of data → generated speech becomes:
  - very slow
  - contains unnatural pauses between words
  - lacks proper prosody
Question
Is there a recommended minimum dataset duration for stable fine-tuning?
Additional details
- Data is clean, single speaker, properly transcribed
- Audio is segmented into short clips (~3–8 seconds)
- Same training configuration used in both cases
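For reference, the clip preparation described above can be sketched as a small helper that keeps only clips in the target duration range and reports the total data in minutes. This is purely illustrative; `filter_clips` and its parameters are hypothetical names, not part of any MOSS-TTS tooling.

```python
def filter_clips(durations_s, min_s=3.0, max_s=8.0):
    """Keep clips whose duration (in seconds) falls in [min_s, max_s].

    Returns the kept durations and their total length in minutes,
    which is the number being compared in the observation above.
    """
    kept = [d for d in durations_s if min_s <= d <= max_s]
    total_minutes = sum(kept) / 60.0
    return kept, total_minutes

# Example: a 2 s and a 9 s clip are dropped, leaving 10 s of usable audio.
kept, minutes = filter_clips([2.0, 4.0, 6.0, 9.0])
```

Running the same filter over both datasets makes the ~40 min vs. ~7–8 min comparison apples-to-apples.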
Hypothesis
When data is too limited, the model seems to struggle with:
- alignment learning
- prosody modeling
Clarification request
- Is this expected behavior for small datasets?
- Are there recommended strategies (e.g., freezing layers, LoRA, data augmentation) for low-resource fine-tuning?
- Is there any guideline on minimum hours of data for stable results?
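To make the "freezing layers" strategy from the list above concrete, here is a minimal PyTorch sketch of freezing all but the last few blocks of a decoder stack so that only a small parameter budget is updated on a tiny dataset. The module layout is a toy stand-in, not MOSS-TTS's actual architecture, and `freeze_all_but_last` is a hypothetical helper.

```python
import torch.nn as nn

def freeze_all_but_last(model: nn.Module, num_trainable: int = 2) -> nn.Module:
    """Freeze every child block except the last `num_trainable`.

    Frozen parameters get requires_grad=False, so the optimizer only
    updates the top of the stack -- a common low-resource recipe.
    """
    blocks = list(model.children())
    for block in blocks[:-num_trainable]:
        for p in block.parameters():
            p.requires_grad = False
    return model

# Toy stand-in for a 6-block decoder.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(6)])
freeze_all_but_last(model, num_trainable=2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

With only the last two blocks trainable, the optimizer touches a third of the parameters here; on a 7–8 minute dataset this kind of restriction (or LoRA adapters, which are similar in spirit) tends to reduce overfitting.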
I believe this would be very helpful for others trying to adapt the model to custom voices.
Thanks in advance!