Hi, thanks for the great work on MOSS-TTS!
I’m experimenting with single-speaker fine-tuning and noticed a significant difference in output quality depending on dataset size.
Observation
- With ~40 minutes of clean, well-aligned data → results are natural and fluent
- With ~7–8 minutes of data → generated speech becomes:
  - very slow
  - contains unnatural pauses between words
  - lacks proper prosody
Question
Is there a recommended minimum dataset duration for stable fine-tuning?
Additional details
- Data is clean, single speaker, properly transcribed
- Audio is segmented into short clips (~3–8 seconds)
- Same training configuration used in both cases
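For reference, the clip preparation described above can be sketched as a small helper that keeps only clips in the target duration range and reports the total data in minutes. This is purely illustrative; `filter_clips` and its parameters are hypothetical names, not part of any MOSS-TTS tooling.

```python
def filter_clips(durations_s, min_s=3.0, max_s=8.0):
    """Keep clips whose duration (in seconds) falls in [min_s, max_s].

    Returns the kept durations and their total length in minutes,
    which is the number being compared in the observation above.
    """
    kept = [d for d in durations_s if min_s <= d <= max_s]
    total_minutes = sum(kept) / 60.0
    return kept, total_minutes

# Example: a 2 s and a 9 s clip are dropped, leaving 10 s of usable audio.
kept, minutes = filter_clips([2.0, 4.0, 6.0, 9.0])
```

Running the same filter over both datasets makes the ~40 min vs. ~7–8 min comparison apples-to-apples.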
Hypothesis
When data is too limited, the model seems to struggle with:
- alignment learning
- prosody modeling
Clarification request
- Is this expected behavior for small datasets?
- Are there recommended strategies (e.g., freezing layers, LoRA, data augmentation) for low-resource fine-tuning?
- Is there any guideline on minimum hours of data for stable results?
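To make the "freezing layers" strategy from the list above concrete, here is a minimal PyTorch sketch of freezing all but the last few blocks of a decoder stack so that only a small parameter budget is updated on a tiny dataset. The module layout is a toy stand-in, not MOSS-TTS's actual architecture, and `freeze_all_but_last` is a hypothetical helper.

```python
import torch.nn as nn

def freeze_all_but_last(model: nn.Module, num_trainable: int = 2) -> nn.Module:
    """Freeze every child block except the last `num_trainable`.

    Frozen parameters get requires_grad=False, so the optimizer only
    updates the top of the stack -- a common low-resource recipe.
    """
    blocks = list(model.children())
    for block in blocks[:-num_trainable]:
        for p in block.parameters():
            p.requires_grad = False
    return model

# Toy stand-in for a 6-block decoder.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(6)])
freeze_all_but_last(model, num_trainable=2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

With only the last two blocks trainable, the optimizer touches a third of the parameters here; on a 7–8 minute dataset this kind of restriction (or LoRA adapters, which are similar in spirit) tends to reduce overfitting.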
I believe this would be very helpful for others trying to adapt the model to custom voices.
Thanks in advance!