
Commit 3493093

Updated the READMEs
1 parent defb953 commit 3493093

File tree

3 files changed (+43 lines, -38 lines)


README.md

Lines changed: 36 additions & 36 deletions
@@ -30,34 +30,34 @@ Versions prior to 0.4.0 used an alphabet without accented vowels

Note: For version 0.6.0 of the model I combined the full CommonVoice corpus (train, dev and test) with the [ParlamentParlaClean](https://collectivat.cat/asr) corpus, then shuffled it and split it into three sets: train (75%), dev (20%) and test (5%). This increased the amount of training data. Since the resulting test set therefore contains CommonVoice data that could have been used to train the other models, all models were evaluated exclusively on 1713 sentences that no model has ever seen (all of them from the ParlamentParlaClean corpus).

-| Model | Corpus | Augmented data? | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 30,16% | 13,79% | 112,96 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 29,66% | 13,84% | 108,52 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |
+| Model | Corpus | Augmented data? | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 30,16% | 13,79% | 112,96 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 29,66% | 13,84% | 108,52 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |

### Evaluation corpus: [FestCat](http://festcat.talp.cat/devel.php)

-| Model | Corpus | Augmented data? | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 77,60% | 65,62% | 243,25 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 78,12% | 65,61% | 235,60 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |
+| Model | Corpus | Augmented data? | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 77,60% | 65,62% | 243,25 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 78,12% | 65,61% | 235,60 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |

This evaluation shows that the models do not generalize very well.

The FestCat corpus has greater variability in the number of words per sentence, with 90% of its sentences containing between 2 and 23 words, whereas most sentences in the CommonVoice corpus contain between 3 and 16 words.

As expected, evaluating the models only on the sentences of the evaluation corpus that contain 4 or more words improves the results:

-| Model | Corpus | Augmented data? | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 58,78% | 46,61% | 193,85 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 58,94% | 46,47% | 188,42 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |
+| Model | Corpus | Augmented data? | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 58,78% | 46,61% | 193,85 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 58,94% | 46,47% | 188,42 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |

## Possible next steps

@@ -88,32 +88,32 @@ What follows is a comparison of the different published model versions, the data

Note: For version 0.6.0 the whole CommonVoice dataset (train, dev and test files) was combined with the clean dataset of ParlamentParla, shuffled and split into train/dev/test sets using a 75/20/5 ratio. Because of this, a comparison between the models can only be made using 1713 sentences from the ParlamentParla dataset that were not seen by any model during training.

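A minimal sketch of that shuffle-and-split step, assuming pandas and DeepSpeech-style transcript CSVs (wav_filename, wav_filesize, transcript) under a hypothetical `corpora/` folder:

```python
# Hypothetical sketch: combine the CommonVoice and ParlamentParlaClean
# transcript CSVs, shuffle them and split them 75/20/5 into train/dev/test.
import glob

import pandas as pd

# Assumed location of the source CSVs.
frames = [pd.read_csv(path) for path in glob.glob("corpora/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Shuffle reproducibly before splitting.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)

n = len(combined)
train_end = int(n * 0.75)
dev_end = train_end + int(n * 0.20)

combined.iloc[:train_end].to_csv("train.csv", index=False)
combined.iloc[train_end:dev_end].to_csv("dev.csv", index=False)
combined.iloc[dev_end:].to_csv("test.csv", index=False)
```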

-| Model | Corpus | Augmentation | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 30,16% | 13,79% | 112,96 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 29,66% | 13,84% | 108,52 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |
+| Model | Corpus | Augmentation | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 30,16% | 13,79% | 112,96 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 29,66% | 13,84% | 108,52 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |

### Test corpus from the [FestCat](http://festcat.talp.cat/devel.php) dataset

-| Model | Corpus | Augmentation | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 77,60% | 65,62% | 243,25 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 78,12% | 65,61% | 235,60 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |
+| Model | Corpus | Augmentation | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 77,60% | 65,62% | 243,25 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 78,12% | 65,61% | 235,60 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |

Validating the models against the FestCat dataset shows that the models do not generalize well. This corpus has a higher variability in the word count of the test sentences, with 90% of the sentences containing an evenly distributed number of words between 2 and 23, whilst most of the sentences in the CommonVoice corpus contain between 3 and 16 words.

As expected, validating the models against a test set containing only sentences with 4 or more words improves accuracy:

-| Model | Corpus | Augmentation | WER | CER | Loss |
-| --------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
-| deepspeech-catala@0.4.0 | CommonVoice | No | 58,78% | 46,61% | 193,85 |
-| deepspeech-catala@0.5.0 | CommonVoice | | 58,94% | 46,47% | 188,42 |
-| deepspeech-catala@0.6.0 | CommonVoice+ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
-| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |
+| Model | Corpus | Augmentation | WER | CER | Loss |
+| --------------------------------------------------------------------- | --------------------------------- | ------------ | ------ | ------ | ------ |
+| deepspeech-catala@0.4.0 | CommonVoice | No | 58,78% | 46,61% | 193,85 |
+| deepspeech-catala@0.5.0 | CommonVoice | | 58,94% | 46,47% | 188,42 |
+| deepspeech-catala@0.6.0 | CommonVoice + ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
+| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |

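A minimal sketch of that word-count filter, assuming a DeepSpeech-style test CSV with a `transcript` column and hypothetical file names:

```python
# Hypothetical sketch: keep only test sentences with 4 or more words
# before re-running the evaluation.
import pandas as pd

test = pd.read_csv("festcat_test.csv")  # assumed path to the FestCat test CSV
filtered = test[test["transcript"].str.split().str.len() >= 4]
filtered.to_csv("festcat_test_4plus.csv", index=False)
print(f"Kept {len(filtered)} of {len(test)} sentences")
```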

## Possible next steps

lm/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Language model
+
+The currently recommended model is the one found in the `ext-diacritics` folder.

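A minimal usage sketch showing how such a scorer plugs into inference, assuming a DeepSpeech 0.7+ runtime and a hypothetical acoustic model file name (`output_graph.pbmm`) alongside a 16 kHz mono WAV file:

```python
# Hypothetical sketch: load the acoustic model, enable the recommended
# external scorer and transcribe a WAV file.
import wave

import numpy as np
from deepspeech import Model

ds = Model("output_graph.pbmm")  # assumed acoustic model file name
ds.enableExternalScorer("lm/ext-diacritics/kenlm.scorer")

with wave.open("audio.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```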
lm/ext-diacritics/README.md

Lines changed: 4 additions & 2 deletions
@@ -1,7 +1,9 @@
## Extended scorer built from various sources, with all diacritical accents

- Test, dev and train files from the Common Voice dataset (10/12/2019)
- Sentences from the Crowdsourced high-quality Catalan speech data set (https://www.openslr.org/69/)
- Sentences from the AnCora dataset, taken from the Universal Dependencies collection (https://github.com/UniversalDependencies/UD_Catalan-AnCora)
- Sentences from Wikipedia collected by the WikiAnn project (https://elisa-ie.github.io/wikiann/)
- 12.795.447 sentences extracted after validating and normalizing the OSCAR corpus (https://oscar-corpus.com/)
+
+Due to the final size of the scorer it is not part of the repository, but you can download it [here](https://github.com/ccoreilly/deepspeech-catala/releases/download/0.4.0/kenlm.scorer)

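The exact validation and normalization applied to the OSCAR sentences is not detailed here; the following is only an assumed sketch that keeps one lowercase sentence per line, restricted to a Catalan character set with its diacritics:

```python
# Hypothetical normalization sketch for language-model text; the allowed
# character set and the cleaning steps are assumptions, not the actual pipeline.
import re

ALLOWED = re.compile(r"^[a-zàèéíïòóúüç·' -]+$")   # assumed Catalan alphabet
PUNCT = re.compile(r'[,.;:!?¿¡"«»()\[\]0-9]')     # punctuation and digits to drop

def normalize(line: str) -> str:
    text = PUNCT.sub("", line.strip().lower())
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text if ALLOWED.match(text) else ""

with open("oscar_ca.txt", encoding="utf-8") as src, \
     open("lm_sentences.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        sentence = normalize(raw)
        if sentence:
            dst.write(sentence + "\n")
```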