This organization hosts all repositories related to my PhD research in AI and NLP, including projects on robust neural machine translation (NMT), lexical normalization, sentence embeddings, and data augmentation.
⚠️ Work in progress: still migrating repositories from my lab's private GitLab.
🎓 Read the full thesis here: Robust Neural Machine Translation of User-Generated Content.
For an overview of my personal projects, contributions, and pinned repositories, visit my personal GitHub: github.com/lydianish
This repository contains the complete research code and experiments from my PhD work on making sentence embeddings robust to user-generated content (UGC). It includes the training pipelines for RoLASER and RoSONAR, covering synthetic UGC generation, teacher–student training, and evaluation on both natural and artificial non-standard text. It is intended for researchers interested in UGC robustness, sentence embeddings, and multilingual NLP.
🔹 RoLASER
A demo-focused version of the RoLASER model for quick exploration. It provides pre-trained models, example scripts, and visualisations that show how token-level and character-level student encoders align standard and non-standard sentences in the LASER embedding space. It is well suited to testing and educational use.
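To give a rough idea of what "aligning" standard and non-standard sentences means, here is a minimal sketch that compares sentence embeddings with cosine distance. It does not use the repository's code or models: the function name `cosine_distance` and the three short vectors are made-up placeholders, and the choice of cosine distance as the measure is an assumption for illustration. Real embeddings would come from the pre-trained encoders and have many more dimensions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy embeddings (NOT produced by any real model).
standard_embedding = [0.2, 0.7, 0.1]         # standard sentence
ugc_embedding_baseline = [0.9, 0.1, 0.4]     # non-standard sentence, non-robust encoder
ugc_embedding_student = [0.25, 0.65, 0.12]   # non-standard sentence, robust student encoder

baseline_gap = cosine_distance(standard_embedding, ugc_embedding_baseline)
student_gap = cosine_distance(standard_embedding, ugc_embedding_student)

print(f"baseline gap: {baseline_gap:.3f}")
print(f"student gap:  {student_gap:.3f}")
```

In this toy setup, a smaller distance means the non-standard sentence lands closer to its standard counterpart, which is the behaviour a robust student encoder is meant to show.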