CCMACLRL_COM231ML_PROJECT

Cross-Linguistic Fake News Detection Using Machine Learning

This research investigates how machine learning models trained on English-language datasets perform when tested on Filipino-language fake news data, and whether retraining them on Filipino datasets restores performance. The study aims to determine whether performance drops are caused by language mismatch rather than model inefficiency.

Research Description

Fake news continues to pose a global challenge, influencing public opinion and distorting democratic discourse. While existing fake news detectors perform well on English data, their effectiveness often declines when applied to other languages such as Filipino due to linguistic and contextual differences.

This study evaluates the cross-linguistic generalizability of four traditional machine learning algorithms:

Naive Bayes
Logistic Regression
Support Vector Machine (SVM)
Random Forest

The models were trained and tested in three phases:

Phase 1 – English-to-English Evaluation
Models were trained and evaluated using English fake and real news datasets to establish baseline performance.
Phase 2 – Cross-Linguistic Evaluation
English-trained models were tested on Filipino-language data to assess how linguistic mismatch affects model accuracy and F1-scores.
Phase 3 – Filipino-to-Filipino Evaluation
Models were retrained using the Filipino dataset and re-evaluated to determine whether native-language retraining restores performance.
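
The three phases can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: `TinyNaiveBayes`, `tokenize`, and `accuracy` are hypothetical helpers written for this sketch, the whitespace tokenizer is a simplification, and the toy sentences are invented for illustration rather than drawn from either dataset.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, texts):
        total_docs = sum(self.doc_counts.values())
        predictions = []
        for text in texts:
            scores = {}
            for c in self.classes:
                total_words = sum(self.word_counts[c].values())
                score = math.log(self.doc_counts[c] / total_docs)
                for word in tokenize(text):
                    if word in self.vocab:  # words unseen in training add no evidence
                        count = self.word_counts[c][word]
                        score += math.log((count + 1) / (total_words + len(self.vocab)))
                scores[c] = score
            # Ties resolve to the first class in sorted order.
            predictions.append(max(scores, key=scores.get))
        return predictions


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    predictions = model.predict(texts)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy sentences invented for this sketch; they are NOT taken from the datasets.
en_train = [
    "shocking secret cure doctors hate",
    "miracle pill melts fat overnight",
    "senate passes annual budget bill",
    "central bank holds interest rates",
]
en_train_labels = ["fake", "fake", "real", "real"]
en_test = ["secret miracle cure", "senate budget bill"]
en_test_labels = ["fake", "real"]

fil_train = [
    "nakakagulat lihim na gamot nagpapagaling agad",
    "himala tableta nagpapapayat sa isang gabi",
    "inihayag ng pamahalaan ang bagong badyet",
    "pinanatili ng bangko sentral ang interes",
]
fil_train_labels = ["fake", "fake", "real", "real"]
fil_test = ["lihim na gamot", "bagong badyet ng pamahalaan"]
fil_test_labels = ["fake", "real"]

english_model = TinyNaiveBayes().fit(en_train, en_train_labels)
filipino_model = TinyNaiveBayes().fit(fil_train, fil_train_labels)

phase1 = accuracy(english_model, en_test, en_test_labels)     # Phase 1: English -> English
phase2 = accuracy(english_model, fil_test, fil_test_labels)   # Phase 2: English -> Filipino
phase3 = accuracy(filipino_model, fil_test, fil_test_labels)  # Phase 3: Filipino -> Filipino
print(phase1, phase2, phase3)
```

In this toy setup the English-trained model recognizes none of the Filipino test words, so in Phase 2 it falls back to its class priors and labels every article the same way. The numbers it prints illustrate the mechanism only and are not comparable to the scores reported for the study.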

Key Findings

English-trained models achieved near-perfect scores (F1 ≈ 0.99) when evaluated on English data.
When tested on Filipino data, F1-scores dropped sharply (≈ 0.31–0.67), showing that the models struggled with language and contextual transfer.
Retraining with Filipino datasets restored high accuracy and F1-scores (≈ 0.93–0.97), indicating that language-specific data is essential for reliable fake news detection.
Because the same algorithms recovered once retrained on Filipino data, the results attribute the Phase 2 degradation to language mismatch, not model inefficiency.
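
For readers unfamiliar with the metrics above, the following sketch shows how accuracy and F1 are computed from true and predicted labels. The function name `binary_scores` is hypothetical, and the README does not say how F1 was averaged (binary, macro, or weighted), so the single-positive-class form used here is an assumption.

```python
def binary_scores(y_true, y_pred, positive="fake"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only, not results from the study.
balanced = binary_scores(
    ["fake", "fake", "fake", "real", "real", "real"],
    ["fake", "fake", "real", "fake", "real", "real"],
)

# A model that never predicts "fake" on an imbalanced set.
skewed = binary_scores(["fake"] + ["real"] * 9, ["real"] * 10)
```

The second call shows why F1 is reported alongside accuracy: a model that labels everything as real scores 0.9 accuracy on that skewed set while its F1 for the fake class is 0.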

Research Contribution

This study contributes to the growing field of multilingual fake news detection by:

Empirically showing the limitations of English-trained models on Filipino text.
Demonstrating that retraining with localized data restores performance.
Providing benchmark results for future cross-linguistic research in low-resource languages like Filipino.

Datasets

Kaggle (English Dataset): https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
Hugging Face (Filipino Dataset): https://huggingface.co/datasets/jcblaise/fake_news_filipino
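
A minimal sketch of turning the English data into labeled examples, using only the Python standard library. It assumes the Kaggle download unpacks into two CSV files, one of fake and one of real articles (referred to here as `Fake.csv` and `True.csv`), each with a `text` column; check the actual file and column names after downloading. The helper names `load_labeled_rows` and `build_english_dataset` are hypothetical.

```python
import csv


def load_labeled_rows(csv_path, label, text_column="text"):
    """Read one CSV file and attach the same label to every row's text."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            text = (record.get(text_column) or "").strip()
            if text:  # skip rows whose text field is missing or blank
                rows.append((text, label))
    return rows


def build_english_dataset(fake_path, real_path):
    """Merge the fake and real files into one list of (text, label) pairs."""
    return load_labeled_rows(fake_path, "fake") + load_labeled_rows(real_path, "real")
```

Under those assumptions, `build_english_dataset("Fake.csv", "True.csv")` returns a single list that can be shuffled and split into training and test sets.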

Keywords

Fake News Detection · Machine Learning · Cross-Linguistic Evaluation · Language Mismatch · Filipino Dataset · SVM · Naive Bayes · Logistic Regression · Random Forest

Research Paper:
Cross-Linguistic Fake News Detection Using Machine Learning: A Comparative Study on English and Filipino News Datasets
Authored by:

Chris Lawrence De Vera, National University – Manila (2025)
Lovely Joy Reyes, National University – Manila (2025)
Jude Renwell Prodigalidad, National University – Manila (2025)
