Fine-tuning a new language: Tokenizer compatibility and dataset requirements #107

@izyspania

Description

I am interested in fine-tuning the model for the Romanian language, and I have a few technical questions about the adaptation process:

Fine-tuning Feasibility: Is it possible to fine-tune the current model specifically for Romanian phonetics? Does the model already have latent support for it?

Dataset Size: What is the recommended size of the dataset (audio-text pairs) required for a high-quality Romanian fine-tune?

Tokenizer Support: Does the underlying text tokenizer natively support Romanian diacritics? Specifically, does it correctly handle the following characters: Ă, ă, Â, â, Î, î, Ș, ș, Ț, ț?
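For context on the comma-below characters listed above: a common pitfall, independent of any particular tokenizer, is that Romanian text in the wild often uses the legacy cedilla codepoints (Ş/ş/Ţ/ţ) in place of the correct comma-below ones (Ș/ș/Ț/ț), which splits the vocabulary across two variants of the same letter. A minimal normalization sketch in plain Python (no model-specific assumptions):

```python
# Map legacy cedilla forms to the correct Romanian comma-below forms,
# so both spellings tokenize identically after normalization.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş (S-cedilla)  -> Ș (S-comma)
    "\u015F": "\u0219",  # ş (s-cedilla)  -> ș (s-comma)
    "\u0162": "\u021A",  # Ţ (T-cedilla)  -> Ț (T-comma)
    "\u0163": "\u021B",  # ţ (t-cedilla)  -> ț (t-comma)
})

def normalize_romanian(text: str) -> str:
    """Replace legacy cedilla diacritics with comma-below equivalents."""
    return text.translate(CEDILLA_TO_COMMA)
```

Running dataset text through such a pass before tokenization (and checking that the tokenizer does not fall back to unknown/byte tokens for Ă, ă, Â, â, Î, î, Ș, ș, Ț, ț) would surface coverage gaps early.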

Audio Tokenizer: Will the current audio tokenizer handle the specific vowel sounds and "closed" phonemes of Romanian without additional training?

Do you have any plans for adding Romanian yourself?

Thank you for your work.
