This repository contains scripts for the two Python-based modelling projects (Projects 1 and 2), in which semantic models (Word2Vec) and recurrent neural networks (simple recurrent networks (SRNs) and long short-term memory (LSTM) models) were trained to simulate and interpret human reading.
Impact: Modelling how humans process and interpret language provides valuable insights into neuro-cognitive behaviour. This could help neuropsychologists better understand language processing impairments, such as those in aphasia or autism spectrum disorders. In practical applications, understanding human reading performance via modelling can improve language learning technologies (e.g., Duolingo, Babbel) by enabling adaptive learning systems tailored to real cognitive patterns.
- Word2Vec models effectively capture human-like word semantics, especially with larger hidden layers and smaller context windows, mirroring cognitive constraints like limited working memory
- LSTMs outperform SRNs in modelling human sensitivity to syntactic ambiguity, despite similar performance in capturing general language statistics (perplexity)
- Lower perplexity ≠ better cognitive modelling: statistical accuracy alone is insufficient for predicting human-like language behaviour
- Together, these findings highlight the importance of aligning computational models not just with linguistic data, but also with psycholinguistic phenomena
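The semantic-similarity comparisons underlying these findings can be illustrated with a minimal cosine-similarity sketch. The vectors below are hypothetical toy examples, not values from the trained models:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional embeddings (real models use hundreds of dimensions).
vec_cat = [0.8, 0.1, 0.3, 0.0]
vec_dog = [0.7, 0.2, 0.4, 0.1]
vec_car = [0.0, 0.9, 0.1, 0.8]

# Semantically related words should score higher than unrelated ones,
# which is the basis for predicting semantic-priming effects.
assert cosine_similarity(vec_cat, vec_dog) > cosine_similarity(vec_cat, vec_car)
```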
Project 1: Word2Vec in capturing human word semantics (categorisation and semantic priming)
- Training: CBOW Word2Vec models on a large English corpus (ENCOW) with 16 billion words; hidden layer size and context window varied
- Testing: The models' ability to predict humans' semantics-based word processing (categorisation and semantic priming)
- Findings:
- Distributional word vectors from Word2Vec generally capture human word semantics well.
- The larger the hidden layer, the better a Word2Vec model predicts human word processing.
- However, models with a smaller context window tend to predict human behaviour better, suggesting a limited working-memory capacity in human word processing.
Project 2: SRN vs LSTM in characterising the statistical structure of language and syntactic ambiguity (garden-path sentences)
- Training: SRN vs LSTM models on a large English corpus with 8.7 billion words; training data size varied
- Testing: The models' ability to characterise language statistics (perplexity) and predict human performance (sensitivity to syntactic ambiguity)
- Findings:
- With large training sets, SRNs and LSTMs achieve almost equally low perplexity, meaning that both capture the statistical structure of language well.
- However, LSTMs show higher sensitivity to syntactically ambiguous sentences than SRNs do.
- Characterising linguistic statistics well therefore does not necessarily indicate good prediction of human sentence processing.
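The two measures used in Project 2 can be sketched in a few lines: perplexity summarises how well a model fits language statistics, while per-word surprisal is a common proxy for human processing difficulty at the disambiguating word of a garden-path sentence. The probabilities below are hypothetical, not model outputs:

```python
import math

def perplexity(probs):
    """Perplexity of a sentence given per-word probabilities:
    exp of the average negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def surprisal(p):
    """Surprisal (in bits) of a single word: -log2(p)."""
    return -math.log2(p)

# Hypothetical per-word probabilities for the classic garden-path
# sentence "the horse raced past the barn fell".
probs = [0.20, 0.05, 0.08, 0.10, 0.25, 0.12, 0.001]

print(round(perplexity(probs), 1))
# A human-like model should assign high surprisal at the disambiguating
# word "fell", mirroring readers' garden-path difficulty.
print(round(surprisal(probs[-1]), 1))  # -log2(0.001) ≈ 10.0
```

Note that two models can yield similar sentence-level perplexity while differing sharply in where they place surprisal, which is exactly the dissociation the findings above describe.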