Clarify LANGUAGES and ALL_LANGUAGES settings #38036

@bradenmacdonald

openedx-platform has two very similar language settings, and this is my attempt to define and document the difference. I think we need to clarify the differences and perhaps make them more consistent. Also, it's unclear if ALL_LANGUAGES needs to be a setting at all.

LANGUAGES:

Should represent the languages supported by the Open edX platform (i.e. available localizations of the Open edX platform user interface), but it seems to list too many languages.

# Sourced from http://www.localeplanet.com/icu/ and wikipedia
LANGUAGES = [
('en', 'English'),
('rtl', 'Right-to-Left Test Language'),
('eo', 'Dummy Language (Esperanto)'), # Dummy language used for testing
('am', 'አማርኛ'), # Amharic
('ar', 'العربية'), # Arabic
('az', 'azərbaycanca'), # Azerbaijani
('bg-bg', 'български (България)'), # Bulgarian (Bulgaria)
('bn-bd', 'বাংলা (বাংলাদেশ)'), # Bengali (Bangladesh)
('bn-in', 'বাংলা (ভারত)'), # Bengali (India)
('bs', 'bosanski'), # Bosnian
('ca', 'Català'), # Catalan
('ca@valencia', 'Català (València)'), # Catalan (Valencia)
('cs', 'Čeština'), # Czech
('cy', 'Cymraeg'), # Welsh
('da', 'dansk'), # Danish
('de-de', 'Deutsch (Deutschland)'), # German (Germany)
('el', 'Ελληνικά'), # Greek
('en-uk', 'English (United Kingdom)'), # English (United Kingdom)
('en@lolcat', 'LOLCAT English'), # LOLCAT English
('en@pirate', 'Pirate English'), # Pirate English
('es-419', 'Español (Latinoamérica)'), # Spanish (Latin America)

  • This is a standard Django setting
  • Full language names are currently specified using the local name (endonym) like "Deutsch" for German; this seems worse than the Django recommendation of naming them in English and marking them for translation, e.g. ("de", _("German")),
  • All lowercase, separated by hyphens.
  • Has non-standard "languages" used for development and testing purposes:
    • rtl Right-to-Left Test Language
    • eo Dummy Language for coverage testing (docs)
    • en@lolcat LOLCAT English 😸 (why do we have this 🤔)
    • en@pirate Pirate English 🏴‍☠️ (why do we have this 🤔)
  • Weirdly uses an @ sign (ca@valencia) as the code for "Catalan (Valencia)", which is an old GNU libc / gettext practice and not usually used for internet localization purposes. ca-es-valencia or ca-valencia would be more common.
  • Is pretty inconsistent with locale vs. language codes.
    • Uses it-it, ja-jp, tr-tr, fi-fi instead of just it, ja, tr, fi for Italian, Japanese, Turkish, Finnish, etc., where there is one dominant country that uses each language
    • But uses just en for English, fr for French, and ru for Russian, all of which are spoken in a very wide variety of countries with many regional differences; for example, fr-fr (France French) and fr-ca (Canadian French) is an important distinction for a lot of Open edX users.
  • Three Chinese language codes: zh-cn (Mandarin/Mainland China/simplified), zh-hk (Cantonese/Hong Kong/traditional), and zh-tw (Chinese-Taiwan). This seems correct, but as you'll see, other parts of the platform use totally different codes. (Django upstream uses two, zh-hans for simplified and zh-hant for traditional, which some argue is technically more correct but seems to not really be used in practice; as I understand it, browsers typically send/expect zh-cn etc.)
  • Is copied into a dict as settings.LANGUAGE_DICT
  • Use case: a subset of LANGUAGES is used by the lang_pref API and powers the "Site Language" setting of the Accounts MFE.
  • Use case: Used for the "Languages" taxonomy that's managed by the system and can be used to apply language tags to content. (This seems to be an oversight - ALL_LANGUAGES is likely a better fit, see below)
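
To make the mix of code shapes described above easier to see, here is a minimal sketch that sorts a few of the quoted entries by shape. The `classify` helper, the `TEST_CODES` set, and the regular expressions are assumptions made for illustration only; they are not part of the platform.

```python
import re

# A small excerpt of the LANGUAGES entries quoted above.
LANGUAGES = [
    ("en", "English"),
    ("rtl", "Right-to-Left Test Language"),
    ("eo", "Dummy Language (Esperanto)"),
    ("ca@valencia", "Català (València)"),
    ("de-de", "Deutsch (Deutschland)"),
    ("en@pirate", "Pirate English"),
    ("es-419", "Español (Latinoamérica)"),
]

# Assumption: the codes that exist only for development and testing.
TEST_CODES = {"rtl", "eo", "en@lolcat", "en@pirate"}


def classify(code):
    """Return a rough label describing the shape of a language code."""
    if code in TEST_CODES:
        return "test"
    if "@" in code:
        # gettext-style modifier, e.g. ca@valencia
        return "gettext-variant"
    if re.fullmatch(r"[a-z]{2,3}", code):
        return "language"
    if re.fullmatch(r"[a-z]{2,3}-[a-z0-9]{2,4}", code):
        return "language-region"
    return "other"


# Same idea as settings.LANGUAGE_DICT: a code -> name lookup.
LANGUAGE_DICT = dict(LANGUAGES)
shapes = {code: classify(code) for code in LANGUAGE_DICT}
```

Running something like this over the full list would give a quick count of how many entries fall into each bucket, which could help scope a cleanup.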

ALL_LANGUAGES:

Intended to represent "all" languages, regardless of whether or not you can use the Open edX platform in this language.

# Source:
# http://loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt according to http://en.wikipedia.org/wiki/ISO_639-1
# Note that this is used as the set of choices to the `code` field of the `LanguageProficiency` model.
ALL_LANGUAGES = [
["aa", "Afar"],
["ab", "Abkhazian"],
["af", "Afrikaans"],
["ak", "Akan"],
["sq", "Albanian"],
["am", "Amharic"],
["ar", "Arabic"],

  • Not a standard Django setting, and the code doesn't explain why it exists other than to be used for LanguageProficiency
  • Full language names are specified in English only and not translated
  • No fake/development/weird languages, but does have eo "Esperanto", which is the language code we use as a dummy language for coverage testing (see above)
  • Pretty consistently has only language codes without locale suffix; only one entry es "Spanish" for example, whereas LANGUAGES has six different Spanish locales.
  • Two Chinese language codes: zh_HANS (Simplified Chinese / Mandarin) and zh_HANT (Traditional Chinese / Cantonese). These introduce UPPERCASE and _ (underscore), inconsistent with the lowercase-hyphenated format of LANGUAGES and inconsistent with the ISO 639-1 standard and all the other codes in the list, which have only two letters.
  • Use case: Used for the "Course language" setting on Studio's "Schedule & Details" page
  • Use case: Used to define the choices of the LanguageProficiency model, part of the user's public profile (different from their platform language setting). Because it's used in a model's choices field, changing this setting will result in a new migration needing to be created. I guess the thought here was that users may want to list languages on their profile even if those languages are not supported by the system, hence ALL_LANGUAGES needed to be different from LANGUAGES?
  • Use case: Used as the list of languages for picking a transcript language in the legacy video editor
  • Use case: used to define the choices of the language field of CourseTeam
  • Use case: Used in transcript_utils to get the name of a language, if it can't find the name in LANGUAGES/LANGUAGE_DICT.
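
The fallback lookup in the last use case can be sketched roughly as follows. This is not the actual transcript_utils code; `get_language_name` and `ALL_LANGUAGES_DICT` are hypothetical names, and the two lists are short excerpts of the settings quoted above.

```python
# Short excerpts of the two settings quoted in this issue.
LANGUAGES = [
    ("en", "English"),
    ("am", "አማርኛ"),
    ("de-de", "Deutsch (Deutschland)"),
]
ALL_LANGUAGES = [
    ["aa", "Afar"],
    ["ab", "Abkhazian"],
    ["sq", "Albanian"],
    ["am", "Amharic"],
]

LANGUAGE_DICT = dict(LANGUAGES)
# Hypothetical helper dict; ALL_LANGUAGES uses two-item lists rather than tuples.
ALL_LANGUAGES_DICT = {code: name for code, name in ALL_LANGUAGES}


def get_language_name(code):
    """Prefer the name from LANGUAGES, then fall back to ALL_LANGUAGES.

    If neither setting knows the code, return the code unchanged.
    """
    if code in LANGUAGE_DICT:
        return LANGUAGE_DICT[code]
    return ALL_LANGUAGES_DICT.get(code, code)
```

Note that a lookup like this returns an endonym ("አማርኛ") when the code is in LANGUAGES and an English name ("Albanian") when it is only in ALL_LANGUAGES, which is one visible effect of the two lists using different naming conventions.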

Other notes

  • There is a management command, migrate_user_profile_langs, that can migrate users' language preferences to help clean this sort of thing up. The example given is going from zh-cn (old) to zh-hans (new), even though zh-cn is the correct one according to the current LANGUAGES setting values.
  • There is a setting called EXTENDED_VIDEO_TRANSCRIPT_LANGUAGES - "Additional languages that should be supported for video transcripts, not included in ALL_LANGUAGES" - which seems to go against the spirit of "all" languages already being included in ALL_LANGUAGES 😛. There wasn't really any explanation for this.
  • The standard browser Intl API prefers mixed capitalization with hyphens for locale codes, but understands the lowercase version that matches our backend settings.LANGUAGES. It doesn't recognize zh_HANS with underscores, as seen in the ALL_LANGUAGES setting.
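
If we wanted the backend codes in the mixed-case hyphenated form, a conversion could look something like the sketch below. The `to_bcp47_casing` function is hypothetical and only handles casing and the underscore separator, following the usual BCP 47 convention (lowercase language, title-case script, uppercase region). It deliberately rejects gettext-style `@` codes, since mapping those needs a decision rather than a mechanical rule.

```python
def to_bcp47_casing(code):
    """Convert codes like 'zh_HANS' or 'zh-cn' to conventional BCP 47 casing."""
    if "@" in code:
        # e.g. ca@valencia: no mechanical mapping, so refuse instead of guessing.
        raise ValueError(f"gettext-style code needs a manual mapping: {code}")
    subtags = code.replace("_", "-").split("-")
    result = [subtags[0].lower()]  # primary language subtag is lowercase
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())  # script, e.g. Hans
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())  # region, e.g. CN
        else:
            result.append(subtag.lower())  # numeric regions, variants, etc.
    return "-".join(result)
```

For example, `to_bcp47_casing("zh_HANS")` gives `"zh-Hans"` and `to_bcp47_casing("zh-cn")` gives `"zh-CN"`.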

Thoughts

The LANGUAGES setting is not really useful on its own - it has too many languages, so you have to use this API to merge it with DarkLangConfig to get the actual list of supported languages. So it seems to me that LANGUAGES is already somewhat playing the role of ALL_LANGUAGES, and with a bit more cleanup we could merge them, such that LANGUAGES is "all languages", and the subset of LANGUAGES+DarkLangConfig represents the available languages of the UI. Cleaning this up will be a fair amount of work, but it would be good to resolve the many inconsistencies between the two lists.
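
As a rough illustration of the "LANGUAGES + DarkLangConfig" subset idea, here is a minimal sketch. It does not reproduce the real API: `supported_languages` and `released_codes` are hypothetical, with `released_codes` standing in for whatever DarkLangConfig stores, and always including the default language is an assumption.

```python
# Excerpt of LANGUAGES, playing the role of "all languages" here.
LANGUAGES = [
    ("en", "English"),
    ("de-de", "Deutsch (Deutschland)"),
    ("es-419", "Español (Latinoamérica)"),
    ("fr", "Français"),
]

# Assumption: stands in for the released language codes kept in DarkLangConfig.
released_codes = ["es-419", "fr"]


def supported_languages(languages, released, default_code="en"):
    """Return the (code, name) pairs that are released, plus the default language.

    The order of the `languages` list is preserved.
    """
    allowed = set(released) | {default_code}
    return [(code, name) for code, name in languages if code in allowed]


ui_languages = supported_languages(LANGUAGES, released_codes)
```

With this shape, the full list never has to match what the UI actually offers; only the filtered result does.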

Or, if we need to keep ALL_LANGUAGES, can it be moved out of settings? Does anyone ever override it?
