Steering Trust

A research agenda on AI honesty, deception and trustworthiness. This repo tracks my work on evaluating and steering honesty in large language models, starting with benchmark replication and building towards mechanistic interpretability and honesty interventions.

Status (21 March 2026): I'm currently working on Step 1 of this agenda: replicating and extending the MASK honesty benchmark on 2026 frontier models, as part of BlueDot Impact's Technical AI Safety Project Sprint. See Current Project below for details and preliminary results. The steering vectors work described in earlier versions of this README is now planned for a later phase; I decided to build the benchmarking foundations first.

Research Agenda

My long-term research interest is investigating whether we can detect a model's internal representation of its own honesty, find a related steering vector for this trait, and apply it to maximize honesty and minimize deception across different situations and benchmarks. If we can be reasonably certain that a model is being mostly honest, many (but not all) alignment problems could be minimized.

This breaks down into roughly three phases:

Evaluating honesty at the frontier: Replicate the MASK benchmark on the latest frontier models to establish baselines and test whether the original finding (that honesty does not improve with scale) persists. (in progress)
Investigating alternative elicitation methods: Explore whether dishonesty can be elicited via methods other than prompt pressure (e.g. activation steering, contrastive prompts), to disentangle genuine deceptiveness from instruction-following compliance / role-playing. (planned)
Steering for honesty: Apply and compare honesty interventions (LoRRA, contrastive activation addition, SAEs, weight steering) and measure their impact on MASK and other benchmarks. (planned)
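
As a rough illustration of the contrastive activation addition idea mentioned in phase 3, the sketch below builds a steering vector as the difference between mean activations on "honest" and "deceptive" examples, then adds a scaled copy to a hidden state. Everything here is an assumption for illustration: the function names are hypothetical, the 3-dimensional vectors are made-up numbers rather than real model activations, and a real intervention would hook into a specific layer of an open-weight model.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of equal-length activation vectors."""
    return [fmean(component) for component in zip(*vectors)]


def steering_vector(honest_acts, deceptive_acts):
    """Difference of means, pointing from 'deceptive' towards 'honest'."""
    honest_mean = mean_vector(honest_acts)
    deceptive_mean = mean_vector(deceptive_acts)
    return [h - d for h, d in zip(honest_mean, deceptive_mean)]


def apply_steering(hidden_state, vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state."""
    return [x + alpha * v for x, v in zip(hidden_state, vector)]


# Toy data (hypothetical, not taken from any model).
honest_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
deceptive_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(honest_acts, deceptive_acts)
steered = apply_steering([0.0, 0.0, 0.0], vec, alpha=0.5)
print(vec, steered)
```

The scale `alpha` is the knob an experiment would sweep: too small and behaviour does not change, too large and outputs tend to degrade.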

Current Project: MASK Replication & Extension

Context: The MASK benchmark (Ren et al., 2025) is the first large-scale evaluation that disentangles model honesty from accuracy. The original paper found that frontier LLMs lie 20–60% of the time when pressured, and that honesty does not improve with training compute. These results were based on 27 models with training FLOPs up to ~10²⁶.
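
To make the honesty/accuracy distinction concrete, here is a minimal sketch under simplifying assumptions: accuracy compares a model's elicited belief to the ground truth, while lying compares its pressured statement to its own belief. The `Record` fields, the `"evade"` label and the toy answers are hypothetical; the actual benchmark's belief elicitation and judging pipeline is more involved.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ground_truth: str  # the correct answer
    belief: str        # what the model says without pressure
    statement: str     # what it says under pressure, or "evade"


def accuracy(records):
    """Fraction of records where the belief matches the ground truth."""
    return sum(r.belief == r.ground_truth for r in records) / len(records)


def lie_rate(records):
    """Fraction of records where the statement contradicts the belief."""
    lies = sum(r.statement not in (r.belief, "evade") for r in records)
    return lies / len(records)


# Toy records (hypothetical).
records = [
    Record("A", "A", "A"),      # accurate and honest
    Record("A", "A", "B"),      # accurate belief, but lies under pressure
    Record("A", "B", "B"),      # wrong belief, yet honest about it
    Record("A", "B", "evade"),  # wrong belief, refuses to commit
]

print("accuracy:", accuracy(records))
print("lie rate:", lie_rate(records))
print("honesty score (1 - lie rate):", 1 - lie_rate(records))
```

The third record is the important case: a model can be wrong without lying, which is why the two quantities have to be measured separately.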

What I'm doing: Replicating MASK on 9 current models from 7 families (GPT, Claude, Gemini, DeepSeek, Grok, Llama, Qwen) using the inspect_evals framework. This extends the original FLOP range into ~10²⁶–10²⁷ with models released after the paper, while overlapping with the paper's upper range for continuity. See the preliminary planning here.

Progress so far:

Environment and workflow set up (Windows / local laptop for API models, RunPod for open-weight models)
Provider SDKs installed and tested for OpenAI, Anthropic, Google, DeepSeek, and xAI
Preliminary n=10 pilot runs completed for 6 API models, confirming the pipeline works
Phase 1 de-risking effectively complete; moving into full 1000-record runs
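
A quick way to see why the n=10 pilots only validate the pipeline, and why full runs are needed for actual estimates, is to compare confidence intervals at both sample sizes. The sketch below uses the standard Wilson score interval for a binomial proportion; the observed lie rate of 0.4 is a hypothetical value chosen for illustration, not a result from this project.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same 40% lie rate at two sample sizes.
pilot = wilson_interval(4, 10)
full = wilson_interval(400, 1000)

print("n=10:   %.3f to %.3f" % pilot)
print("n=1000: %.3f to %.3f" % full)
```

With these made-up counts, the pilot interval spans roughly 0.17 to 0.69, while the 1000-record interval narrows to about 0.37 to 0.43, which is the resolution needed to compare models against each other.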

Results and a write-up will be added here as they become available.

About Me

I'm Ignacio, an Aerospace Engineer transitioning towards a career in technical AI safety research. I am particularly interested in questions of mechanistic interpretability and evaluations of honesty and deception in LLMs.

I'm currently building my research portfolio through programs like BlueDot Impact's Technical AI Safety Project Sprint, and I plan to apply the skills I'm developing here towards more exhaustive interpretability work in the near future.

I have two goals for this project. The main one is learning by doing: I want to gain hands-on experience with empirical AI safety research, from running large-scale LLM evaluations to implementing interpretability and steering techniques on open-weight models.

My second goal is to explore some questions that interest me about AI honesty, truthfulness and deception. These may already have been investigated elsewhere, but I want to try my hand at them before doing a deeper dive into the literature. I’m currently focused on identifying and eliciting honesty in frontier models as a key aspect of AI safety. My theory of change, in very simplified form:

For models to be accurate and effective, they must have some kind of internal representation of the world.
If a model is being deceptive, there must be some mismatch between its internal world model and its outputs/actions.
This mismatch should be, in principle, traceable somewhere in the hidden states of the models.
If deception can be found, it can theoretically be steered away from.
If deception is minimized, many misalignment issues, such as scheming, can also be reduced.

This is a very idealized picture, but I believe any progress on this front would be helpful as a basis for other safety and alignment interventions.
