README.md: 36 additions & 36 deletions
@@ -30,34 +30,34 @@ Les versions anteriors a la 0.4.0 feien servir un alfabet sense vocals accentuad
Note: For version 0.6.0 of the model I combined the complete CommonVoice corpus (train, dev and test) with the [ParlamentParlaClean](https://collectivat.cat/asr) corpus, then shuffled it and split it into three sets: train (75%), dev (20%) and test (5%). This increased the amount of training data. Because the resulting test set contains CommonVoice data that may have been used to train the other models, all models were evaluated exclusively on 1713 sentences that no model has ever seen (all from the ParlamentParlaClean corpus).
| Model | Corpus | Augmented data? | WER | CER | Loss |
This evaluation shows that the models do not generalize very well.
The FestCat corpus has greater variability in the number of words per sentence, with 90% of its sentences containing between 2 and 23 words, whereas most sentences in the CommonVoice corpus contain between 3 and 16 words.
As expected, evaluating the models only on the sentences of the evaluation corpus that contain 4 or more words improves the results:
| Model | Corpus | Augmented data? | WER | CER | Loss |
@@ -88,32 +88,32 @@ What follows is a comparison of the different published model versions, the data
Note: For version 0.6.0, the whole CommonVoice dataset (train, dev and test files) was combined with the clean ParlamentParla dataset, shuffled, and split into train/dev/test files using a 75/20/5 ratio. Because the resulting test set may contain CommonVoice data that the other models saw during training, the models can only be compared on 1713 sentences from the ParlamentParla dataset that no model saw during training.
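The combine, shuffle and split step described in the note can be sketched roughly as follows. This is a minimal illustration and not the script used for the published models: the function name `combine_and_split`, the fixed seed and the in-memory lists of samples are all assumptions made for the example.

```python
import random


def combine_and_split(corpora, ratios=(0.75, 0.20, 0.05), seed=42):
    """Merge several lists of samples, shuffle them, and split into train/dev/test.

    `corpora` is a list of lists (hypothetical stand-ins for the CommonVoice
    and ParlamentParla sample lists). The seed is an assumption, used only
    to make the split reproducible.
    """
    samples = [sample for corpus in corpora for sample in corpus]
    rng = random.Random(seed)
    rng.shuffle(samples)

    n_train = int(len(samples) * ratios[0])
    n_dev = int(len(samples) * ratios[1])

    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]  # whatever remains, roughly 5%
    return train, dev, test


# Placeholder file names, not real corpus entries.
common_voice = [f"cv_{i}.wav" for i in range(60)]
parlament_parla = [f"pp_{i}.wav" for i in range(40)]
train, dev, test = combine_and_split([common_voice, parlament_parla])
print(len(train), len(dev), len(test))
```

Because the shuffle mixes both corpora, sentences from either source can end up in any of the three sets, which is why a separate held-out set is needed to compare against earlier models.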
| Model | Corpus | Augmentation | WER | CER | Loss |
Validating the models against the FestCat dataset shows that the models do not generalize well. This corpus has a higher variability in the word count of the test sentences, with 90% of the sentences containing an evenly distributed number of words between 2 and 23, whilst most of the sentences in the CommonVoice corpus contain between 3 and 16 words.
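One way to obtain word-count ranges like the ones quoted above is to take the central share of the sorted per-sentence word counts. The sketch below assumes whitespace tokenization and a symmetric central interval; the helper name `word_count_range` is hypothetical, and the README does not state how the original figures were measured.

```python
def word_count_range(sentences, coverage=0.90):
    """Return (low, high) word counts bounding the central `coverage` share of sentences.

    Assumes words are separated by whitespace and that the interval is
    symmetric (equal tails cut off on both sides).
    """
    counts = sorted(len(sentence.split()) for sentence in sentences)
    if not counts:
        raise ValueError("no sentences given")

    tail = (1.0 - coverage) / 2.0
    last = len(counts) - 1
    low_index = round(tail * last)
    high_index = round((1.0 - tail) * last)
    return counts[low_index], counts[high_index]
```

Calling `word_count_range` on the transcripts of each corpus gives a quick check of how differently the sentence lengths are distributed before comparing error rates across corpora.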
As expected, validating the models against a test set containing only sentences with 4 or more words improves accuracy:
| Model | Corpus | Augmentation | WER | CER | Loss |
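
Restricting the evaluation to sentences with a minimum number of words can be sketched as follows. This is an illustrative implementation of word error rate (total word-level edit distance divided by total reference words) and may differ from the evaluation code used to produce the reported figures. The function names and the sample pairs are made up for the example and are not model output.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs, min_words=1):
    """WER over (reference, hypothesis) pairs whose reference has at least `min_words` words."""
    errors = 0
    total = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        if len(ref_words) < min_words:
            continue  # skip short sentences, as in the filtered evaluation
        errors += edit_distance(ref_words, hypothesis.split())
        total += len(ref_words)
    return errors / total if total else 0.0


# Invented (reference, hypothesis) pairs for illustration only.
pairs = [
    ("sí", "si"),
    ("bon dia a tothom", "bon dia a tothom"),
    ("el parlament aprova la llei", "el parlament aprova llei"),
]
print(corpus_wer(pairs))               # all sentences
print(corpus_wer(pairs, min_words=4))  # only sentences with 4 or more words
```

A single wrong word in a one-word sentence counts as a 100% error for that sentence, so very short sentences can weigh heavily on the overall score; filtering them out is one way to see that effect.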
## Extended scorer built from several sources, with all diacritical accents
- Test, dev and train files of the Common Voice dataset (10/12/2019)
- Sentences from the Crowdsourced high-quality Catalan speech data set (https://www.openslr.org/69/)
- Sentences from the Ancora dataset, taken from the Universal Dependencies collection (https://github.com/UniversalDependencies/UD_Catalan-AnCora)
- Wikipedia sentences collected by the WikiAnn project (https://elisa-ie.github.io/wikiann/)
- 12,795,447 sentences extracted after validating and normalizing the OSCAR corpus (https://oscar-corpus.com/)

Due to its final size, the scorer is not part of the repository, but you can download it [here](https://github.com/ccoreilly/deepspeech-catala/releases/download/0.4.0/kenlm.scorer)
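
Validating and normalizing sentences before building a scorer could look roughly like the sketch below. The allowed alphabet, the punctuation that is stripped and the function names are all assumptions made for illustration; the actual rules applied to the OSCAR corpus are not described here.

```python
import re
import unicodedata

# Assumed alphabet: lowercase Latin letters, Catalan accented vowels, ç,
# the middle dot (as in "l·l"), apostrophe, hyphen and space.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzàèéíïòóúüç·'- ")


def normalize(sentence):
    """Lowercase, compose accents (NFC), unify apostrophes, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFC", sentence).lower()
    text = text.replace("’", "'")
    text = re.sub(r'[.,;:!?¿¡"()«»]', " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_valid(sentence):
    """Accept only non-empty sentences made entirely of allowed characters."""
    return bool(sentence) and all(ch in ALLOWED for ch in sentence)


def clean_corpus(lines):
    """Yield the normalized form of every line that passes validation."""
    for line in lines:
        sentence = normalize(line)
        if is_valid(sentence):
            yield sentence
```

Sentences containing digits, symbols or characters outside the assumed alphabet are dropped rather than repaired, which keeps the sketch simple at the cost of discarding some usable text.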