Commit 8f277f7

Added scorer with diacritic accents and sentences from the OSCAR corpus
1 parent a111469 commit 8f277f7

10 files changed: 823432 additions, 0 deletions

lm/common-voice/alphabet.txt

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
+# associated with a numeric label.
+# A line that starts with # is a comment. You can escape it with \# if you wish
+# to use '#' as a label.
+
+a
+b
+c
+ç
+d
+e
+f
+g
+h
+i
+ï
+j
+k
+l
+m
+n
+o
+p
+q
+r
+s
+t
+u
+ü
+v
+w
+x
+y
+z
+'
+-
+·
+# The last (non-comment) line needs to end with a newline.
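
The header comments describe the file format: one label per line, with the label's position giving its numeric id, `#` opening a comment, and `\#` escaping a literal hash. Below is a minimal Python sketch of a reader that follows only those stated rules. The function name `parse_alphabet` is hypothetical, and skipping empty lines is an assumption rather than something the file specifies.

```python
def parse_alphabet(text):
    """Return the labels in file order; a label's list index is its numeric id."""
    labels = []
    for line in text.splitlines():
        if line.startswith("\\#"):
            # An escaped hash is a real label, not a comment.
            labels.append("#")
        elif line.startswith("#"):
            # Comment line: carries no label.
            continue
        elif line:
            # Assumption: empty lines are skipped rather than treated as labels.
            labels.append(line)
    return labels


# Small in-memory sample in the same format as alphabet.txt.
sample = "# comment line\na\nb\nç\n\\#\n·\n"
labels = parse_alphabet(sample)
label_to_id = {label: index for index, label in enumerate(labels)}
```

Building `label_to_id` from the list keeps the two directions of the mapping consistent: the list maps ids to labels and the dictionary maps labels back to ids.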

lm/ext-diacritics/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+frases.txt
+raw/oscar.txt
+lm.binary
+kenlm.scorer

lm/ext-diacritics/README.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+## Extended scorer built from several sources, with all diacritic accents
+
+- Test, dev and train files from the Common Voice dataset (10/12/2019)
+- Sentences from the Crowdsourced high-quality Catalan speech data set (https://www.openslr.org/69/)
+- Sentences from the AnCora dataset, taken from the Universal Dependencies collection (https://github.com/UniversalDependencies/UD_Catalan-AnCora)
+- Wikipedia sentences collected by the WikiAnn project (https://elisa-ie.github.io/wikiann/)
+- 12,795,447 sentences extracted after validating and normalising the OSCAR corpus (https://oscar-corpus.com/)
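
The README names several sentence sources and mentions that the OSCAR sentences were validated and normalised, but the commit does not show how. The following Python sketch is one plausible way to merge sources into a single sentence list. Everything here is an assumption for illustration: the names `normalise`, `is_valid` and `build_corpus`, the choice of lowercasing and NFC normalisation, treating a space as an allowed character, and dropping duplicates.

```python
import unicodedata

# Characters listed in lm/ext-diacritics/alphabet.txt.
ALPHABET = set("aàbcçdeèéfghiíïjklmnoòópqrstuúüvwxyz'-·")


def normalise(sentence):
    """Hypothetical normalisation: NFC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence).lower()
    return " ".join(text.split())


def is_valid(sentence, alphabet=ALPHABET):
    """Hypothetical validation: non-empty, only alphabet characters or spaces."""
    return bool(sentence) and all(ch == " " or ch in alphabet for ch in sentence)


def build_corpus(sources, alphabet=ALPHABET):
    """Merge iterables of raw lines, keeping each valid sentence once, in order."""
    seen = set()
    kept = []
    for lines in sources:
        for line in lines:
            sentence = normalise(line)
            if is_valid(sentence, alphabet) and sentence not in seen:
                seen.add(sentence)
                kept.append(sentence)
    return kept


# In-memory stand-ins for files such as raw/commonvoice.txt or raw/ancora.txt.
corpus = build_corpus([
    ["Bon  dia\n", "Hello, world!"],
    ["bon dia", "L'àvia col·labora"],
])
```

With real data, each element of `sources` could be an open file handle, so the sources are streamed line by line; the `seen` set still grows with the number of distinct sentences, which matters at the scale the README reports.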

lm/ext-diacritics/alphabet.txt

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
+# associated with a numeric label.
+# A line that starts with # is a comment. You can escape it with \# if you wish
+# to use '#' as a label.
+
+a
+à
+b
+c
+ç
+d
+e
+è
+é
+f
+g
+h
+i
+í
+ï
+j
+k
+l
+m
+n
+o
+ò
+ó
+p
+q
+r
+s
+t
+u
+ú
+ü
+v
+w
+x
+y
+z
+'
+-
+·
+# The last (non-comment) line needs to end with a newline.
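
Comparing this alphabet with `lm/common-voice/alphabet.txt` shows which labels the extended scorer adds. The Python sketch below hard-codes both label lists as they appear in this commit and computes the difference. The helper `unsupported` is hypothetical, and ignoring spaces in it is an assumption.

```python
# Labels from lm/common-voice/alphabet.txt, in file order.
BASE = list("abcçdefghiïjklmnopqrstuüvwxyz'-·")
# Labels from lm/ext-diacritics/alphabet.txt, in file order.
EXTENDED = list("aàbcçdeèéfghiíïjklmnoòópqrstuúüvwxyz'-·")

# Labels present only in the extended alphabet.
added = [ch for ch in EXTENDED if ch not in BASE]


def unsupported(sentence, alphabet):
    """Return the sorted characters of `sentence` missing from `alphabet`.

    Assumption: spaces are ignored rather than checked against the alphabet.
    """
    allowed = set(alphabet)
    return sorted({ch for ch in sentence if ch != " " and ch not in allowed})


missing_in_base = unsupported("això és català", BASE)
missing_in_extended = unsupported("això és català", EXTENDED)
```

A check like `unsupported` is a quick way to see whether a given sentence can be represented with one alphabet or the other before it is added to a corpus.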

lm/ext-diacritics/raw/ancora.txt

Lines changed: 16678 additions & 0 deletions
Large diffs are not rendered by default.

lm/ext-diacritics/raw/commonvoice.txt

Lines changed: 79633 additions & 0 deletions
Large diffs are not rendered by default.

lm/ext-diacritics/raw/crowdsourced.txt

Lines changed: 4240 additions & 0 deletions
Large diffs are not rendered by default.
