Poor prosody and excessive pauses when fine-tuning with small dataset (5–10 minutes) #94

@Mrkomiljon

Description

Hi, thanks for the great work on MOSS-TTS!

I’m experimenting with single-speaker fine-tuning and noticed a significant difference in output quality depending on dataset size.

Observation

  • With ~40 minutes of clean, well-aligned data → results are natural and fluent

  • With ~7–8 minutes of data → the generated speech:

    • is very slow
    • contains unnatural pauses between words
    • lacks proper prosody

Question

Is there a recommended minimum dataset duration for stable fine-tuning?

Additional details

  • Data is clean, single speaker, properly transcribed
  • Audio is segmented into short clips (~3–8 seconds)
  • Same training configuration used in both cases
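As a side note for anyone reproducing this setup, a quick manifest sanity check can confirm that every clip really falls in the stated 3–8 second window and report the total dataset duration. This is a minimal sketch with an illustrative tuple layout, not the actual MOSS-TTS manifest format:

```python
# Hypothetical manifest check: verify clip lengths and total duration.
# The (path, duration_seconds, transcript) layout is an assumption for
# illustration, not the real MOSS-TTS data format.

def summarize_manifest(clips, min_s=3.0, max_s=8.0):
    """clips: list of (path, duration_seconds, transcript) tuples."""
    out_of_range = [path for path, dur, _ in clips
                    if not (min_s <= dur <= max_s)]
    total_minutes = sum(dur for _, dur, _ in clips) / 60.0
    return {
        "n_clips": len(clips),
        "total_minutes": round(total_minutes, 2),
        "out_of_range": out_of_range,
    }

clips = [
    ("utt_001.wav", 4.2, "..."),
    ("utt_002.wav", 7.9, "..."),
    ("utt_003.wav", 2.1, "..."),  # too short: flagged below
]
print(summarize_manifest(clips))
```

Clips outside the window (or a much smaller total than expected) would be the first thing to rule out before blaming dataset size alone.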

Hypothesis

When the data is too limited, the model seems to struggle with:

  • alignment learning
  • prosody modeling

Clarification request

  • Is this expected behavior for small datasets?
  • Are there recommended strategies (e.g., freezing layers, LoRA, data augmentation) for low-resource fine-tuning?
  • Is there any guideline on minimum hours of data for stable results?

I believe this would be very helpful for others trying to adapt the model to custom voices.

Thanks in advance!
