Repository with code and data for the paper "Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks", presented at EMNLP 2025.
You can also find a poster and a short video presentation of this work.
- Download the initial MLAMA benchmark from the GitHub repo.
- To reproduce the ChatGPT few-shot prompt generation, you will need to generate an OpenAI token. You can generate it here and paste it into the `openai-api-key.txt` file (see the usage sketch below).
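For reference, here is a minimal sketch of how such a key file can be used with the `openai` Python client. The model name and the one-off prompt are illustrative placeholders; the actual few-shot prompts are in `chatgpt_prompts`:

```python
# Minimal sketch: read the API key from the key file and request a translation
# from ChatGPT. Model and prompt below are placeholders, not the paper's setup.
from openai import OpenAI

with open("openai-api-key.txt") as f:
    client = OpenAI(api_key=f.read().strip())

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; see chatgpt_prompts for the real few-shot prompts
    messages=[{"role": "user",
               "content": "Translate into Russian: Sofya Kovalevskaya died in Stockholm."}],
)
print(response.choices[0].message.content)
```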
The facts in the MLAMA benchmark (Kassner et al., 2021) are translated in a templated manner: for example, to translate R = `[X] died in [Y]` with X = `Sofya Kovalevskaya` and Y = `Stockholm`, the relation R is translated with Google Translate once for all X-Y pairs, and the X-Y translations are retrieved from Wikidata. This leads to ungrammatical or wrong translations of the templates (because the NMT system has not seen the named entities within the sentence): for example, for Russian this yields `Софья Ковалевская умер в Стокгольм`, where the verb `умер` is in the wrong gender and the noun `Стокгольм` is in the wrong case.
To address this, we modified a subpart of this benchmark with a simple trick: first, we fill the X-Y pairs into the initial English prompts, and then translate the full sentences with Google Translate (and optionally ChatGPT).
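The idea in miniature (a sketch, not the repository code; the actual pipeline lives in `modification.ipynb` and `datamodifier.py`, and `translate_sentence` is a hypothetical stand-in for the MT call):

```python
# Sketch of the fill-then-translate trick: substitute the X-Y pair into the
# English template first, so the MT system sees the named entities in context.

def fill_template(template: str, x: str, y: str) -> str:
    """Insert subject X and object Y into an English MLAMA-style template."""
    return template.replace("[X]", x).replace("[Y]", y)

def translate_sentence(sentence: str, target_lang: str) -> str:
    """Hypothetical stand-in for the Google Translate / ChatGPT call."""
    raise NotImplementedError("plug in your MT backend here")

sentence = fill_template("[X] died in [Y].", "Sofya Kovalevskaya", "Stockholm")
# -> "Sofya Kovalevskaya died in Stockholm."
# Translating this full sentence lets the MT system inflect everything correctly,
# e.g. "Софья Ковалевская умерла в Стокгольме." instead of the templated output.
```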
Code and data:

- `modification.ipynb` - Jupyter notebook with step-by-step generation of the translations and their preparation for MLAMA evaluation.
- `datamodifier.py` - source code for the notebook.
- `fixed_data` - results of the code for 9 languages and the subset of relations covered in the paper (unzip into the folder locally).
- `chatgpt_prompts` - few-shot prompts needed to generate the ChatGPT translations (for Experiment 1 with Slavic languages).
- `reader.py` - code for the mLAMA reader (borrowed from the initial MLAMA dataset).
Now we want to test our hypothesis: does prompting an LLM with fluent sentences help retrieve more facts compared to the disfluent templated translations?
We do this in the following manner:

- for each fact (e.g. `Sofya Kovalevskaya died in Stockholm`), we have a templated translation and a full-sentence translation;
- in the dataset (Part 1), we have already split the "prompt base" (`Sofya Kovalevskaya died in`) from the object of interest (`Stockholm`);
- we take a prompt base, concatenate it with the correct object and with a set of incorrect objects ("distractors", e.g. `Moscow`, `London`, `Paris`), and feed it to an LLM (in our case, `meta-llama/Llama-2-7b-chat-hf`);
- we get the log-probability score of each object (both the correct one and the distractors) and rank the objects by log-probability. If the model ranks the correct object high enough (for example, in the top-3 options), we say that it knows the fact. A minimal sketch of this scoring step follows the list.
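A sketch of how this scoring-and-ranking step can be done with `transformers` (an assumption about the approach; the repository's actual implementation is in `processor.py`):

```python
# Sketch: score each candidate object by the log-probability the LLM assigns
# to it after the prompt base, then rank the candidates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def object_logprob(prompt_base: str, obj: str) -> float:
    """Sum of token log-probabilities of `obj` conditioned on `prompt_base`.

    Assumes the tokenization of the base is a prefix of the tokenization of
    the full sentence, which usually holds for space-separated continuations.
    """
    base_len = tokenizer(prompt_base, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt_base + " " + obj, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # logits at position i predict the token at position i + 1,
    # so each object token is scored from the preceding position
    return sum(log_probs[0, pos - 1, full_ids[0, pos]].item()
               for pos in range(base_len, full_ids.shape[1]))

candidates = ["Stockholm", "Moscow", "London", "Paris"]  # gold object + distractors
ranked = sorted(candidates,
                key=lambda o: object_logprob("Sofya Kovalevskaya died in", o),
                reverse=True)
print(ranked)  # the fact counts as known if the gold object lands in, e.g., the top-3
```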
Code and data:

- `processor.py` - module used for prompting an LLM.
- `script.py` - Python script to run the evaluation (for a given language and a range of relations). For example: `python script.py -l ru -r P101,P103,P108,P127,P1376,P159,P19,P20,P36,P364,P407,P449,P463,P495,P740`
- `stats` - resulting data with the ranks and log-probabilities of the objects.
- `stats_wide` - distributions for the mode where aliases are counted as correct objects.
Finally, we run a statistical analysis of the results. All descriptions and discussion can be found in the paper; here we just provide the source code (plus a small illustrative snippet after the list below).
Code and data:

- `stats.py` - class with most of the graph and table visualizations.
- `grammeval.py` - additional class which compares the QE quality (an approximation of fluency) with the increase in fact retrieval.
- `statistics.ipynb` - notebook which shows the generation of every table and its reference in the paper.
- `graphs` - folder with all generated graphs.
- `tables` - folder with all generated LaTeX tables.
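As an illustration, the "model knows the fact" metric can be computed from the rank data along these lines; the file name and the `rank` column are assumptions about the layout of the `stats` files, made for illustration only:

```python
# Sketch: top-k fact-retrieval accuracy from the rank data.
# File path and column name are hypothetical, not the repository's exact schema.
import pandas as pd

def top_k_accuracy(df: pd.DataFrame, k: int = 3) -> float:
    """Share of facts whose correct object is ranked within the top k."""
    return (df["rank"] <= k).mean()

df = pd.read_csv("stats/ru_P19.csv")  # hypothetical file name
print(f"P@3 = {top_k_accuracy(df, k=3):.3f}")
```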
This research was funded by NCCR Evolving Language, Swiss National Science Foundation Agreement 51NF40_180888. We also thank the following people for helping with the preparation of the few-shot prompts and qualitative analysis of the larger sample of languages:
- Yulia Alpaieva (Charles University) for Ukrainian,
- Michelle Wastl (UZH) for Croatian,
- Nam Luu (University of the Basque Country) for Vietnamese,
- Sophia Conrad (UZH) for Danish,
- Polina Nalsedskova (HSE University) for Indonesian.