Unexpected language combinations

I just looked at http://data.statmt.org/cc-matrix/, and the availability/unavailability of certain language combinations is so unexpected that it almost certainly points to deep flaws in the pipeline that produced them.

For example, if we pick a random medium-resource language like Armenian, we find that there is no Armenian-English or Armenian-Russian (or -French, -German, -Spanish...), where we would expect a lot of parallel data.

But there are Armenian-Burmese and Armenian-Khmer, which seems unlikely given that Burmese and Khmer are very low-resource languages and have zero contact, cultural overlap or bureaucratic overlap with Armenian (e.g. Council of Europe, Eurovision, CIS...).

In fact, all the pairings of Burmese `my` and Khmer `km` are unlikely and suspicious.

<img width="779" alt="Screenshot 2021-07-13 at 12 28 23" src="https://user-images.githubusercontent.com/11457984/125436872-4d1bc6b8-a811-4332-a188-7b587b7a0298.png">
<img width="624" alt="Screenshot 2021-07-13 at 12 28 32" src="https://user-images.githubusercontent.com/11457984/125436884-8fe0a10b-b0e8-4baf-ba38-86bdde91107d.png">
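
A check like the one I did by eye could probably be automated. Here is a minimal sketch, assuming the list of available pairs has already been extracted from the matrix page into `(lang, lang)` tuples. The function name `suspicious_languages`, the two language sets and the sample `pairs` are all my own placeholders (`xx` is a made-up control code), not anything read from cc-matrix or its pipeline.

```python
# Hypothetical groupings, hand-typed from the observations above.
HIGH_RESOURCE = {"en", "ru", "fr", "de", "es"}
LOW_RESOURCE = {"my", "km"}


def suspicious_languages(pairs, high=HIGH_RESOURCE, low=LOW_RESOURCE):
    """Return languages that are paired with a low-resource language
    but with none of the high-resource ones."""
    # Build an undirected map: language -> set of languages it is paired with.
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    return sorted(
        lang
        for lang, others in partners.items()
        if lang not in high
        and lang not in low
        and others & low          # has at least one low-resource partner
        and not others & high     # but no high-resource partner at all
    )


# Illustrative input only: the two Armenian pairs noted above,
# plus a made-up control language "xx" that does have an English pair.
pairs = [("hy", "my"), ("hy", "km"), ("xx", "en"), ("xx", "my")]
print(suspicious_languages(pairs))
```

With this sample input only `hy` is flagged, because `xx` also has an English pair. Which languages count as high- or low-resource is a judgment call, so the two sets would need adjusting.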


