Fine-tuning a new language: Tokenizer compatibility and dataset requirements #107

@izyspania

Description

I am interested in fine-tuning the model for the Romanian language, and I have a few technical questions about the adaptation process:

Fine-tuning Feasibility: Is it possible to fine-tune the current model specifically for Romanian phonetics? Does the model already have latent support for it?

Dataset Size: What is the recommended size of the dataset (audio-text pairs) required for a high-quality Romanian fine-tune?

Tokenizer Support: Does the underlying text tokenizer natively support Romanian diacritics? Specifically, does it correctly handle the following characters: Ă, ă, Â, â, Î, î, Ș, ș, Ț, ț?
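For context on the comma-below characters listed above: a common pitfall, independent of any particular tokenizer, is that Romanian text in the wild often uses the legacy cedilla codepoints (Ş/ş/Ţ/ţ) in place of the correct comma-below ones (Ș/ș/Ț/ț), which splits the vocabulary across two variants of the same letter. A minimal normalization sketch in plain Python (no model-specific assumptions):

```python
# Map legacy cedilla forms to the correct Romanian comma-below forms,
# so both spellings tokenize identically after normalization.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş (S-cedilla)  -> Ș (S-comma)
    "\u015F": "\u0219",  # ş (s-cedilla)  -> ș (s-comma)
    "\u0162": "\u021A",  # Ţ (T-cedilla)  -> Ț (T-comma)
    "\u0163": "\u021B",  # ţ (t-cedilla)  -> ț (t-comma)
})

def normalize_romanian(text: str) -> str:
    """Replace legacy cedilla diacritics with comma-below equivalents."""
    return text.translate(CEDILLA_TO_COMMA)
```

Running dataset text through such a pass before tokenization (and checking that the tokenizer does not fall back to unknown/byte tokens for Ă, ă, Â, â, Î, î, Ș, ș, Ț, ț) would surface coverage gaps early.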

Audio Tokenizer: Will the current audio tokenizer handle the specific vowel sounds and "closed" phonemes of Romanian without additional training?

Do you have any plans for adding Romanian yourself?

Thank you for your work.
