Personal research conducted with the rigor of an institutional lab.
The pipeline is: read paper -> run experiment -> build project -> iterate.
```
/
├── papers/                     # Paper notes: summary, core ideas, implementation angles
├── comprehensive-experiments/  # Controlled pattern comparison experiments
└── research-projects/          # Projects justified by experiment findings
```
Start here before opening any paper folder.
1. **CoALA**: Establish design vocabulary and architecture taxonomy first
2. **Attention Is All You Need**: Understand the substrate everything runs on
3. **CoT -> Self-Consistency -> ToT -> LATS**: Trace the full reasoning evolution in sequence
4. **ReAct -> Reflexion -> FuseMind**: Core agent loop design
5. **AutoGen + MetaGPT**: Most-referenced frameworks during implementation
6. **MemGPT + CoALA (revisit)**: Memory layer design decisions
7. **Self-RAG + RAPTOR**: Selective retrieval and long-document indexing
8. **Toolformer -> Gorilla -> ToolBench**: Tool use from first principles to benchmark
9. **vLLM + FlashAttention 2**: Inference economics, because every agent runs on a serving stack
10. **SWE-bench + LLM-as-Judge**: Set measurement baselines before designing experiments
11. **Constitutional AI + Prompt Injection**: Safety is an architectural constraint, not a post-hoc filter
12. **DSPy**: Rethink how you approach prompt engineering entirely
13. **2024-2025 papers**: Close the gap to the current frontier
The map of what components exist, what control-flow patterns are available, and how agents are deployed in real software systems.
| Paper | Role |
|---|---|
| LLM-Enabled Multi-Agent Systems (Survey) | Component taxonomy and control-flow pattern overview |
| Multi-Agent Collaboration Mechanisms: A Survey of LLMs | Collaboration mechanism classification |
| LLM-Powered Multi-Agent Systems: A Technical Framework (IEEE) | Technical framework and architectural perspective |
| LLM-Based Multi-Agent Systems for Software Engineering (ACM) | Application to real software engineering workflows |
| NVIDIA: Smaller LMs for Agents (Nemotron / SLM) | Small model roles in agent systems, inference cost perspective |
Reading ReAct without its reasoning lineage yields only a partial understanding. Read these in order: each paper is a direct response to the limitations of the one before it.
| Paper | Role |
|---|---|
| Chain-of-Thought Prompting (Wei et al., 2022) | The root of all agent reasoning. Explicit step-by-step reasoning elicitation |
| Self-Consistency (Wang et al., 2023) | Multiple reasoning paths + majority vote = reliability improvement. Directly feeds into aggregation experiments |
| Tree of Thoughts (Yao et al., 2023) | Breaks linear reasoning into tree search. Enables backtracking and exploration. Direct predecessor to LATS |
| LATS: Language Agent Tree Search (2023) | Integrates ToT + MCTS + ReAct. Currently the strongest single-agent planning architecture |
Experiments: `experiments/reasoning-evolution`
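The self-consistency step in this lineage is simple enough to sketch directly: sample several independent reasoning paths and keep the majority final answer. Everything below (`self_consistency`, the `sample_answer` callable) is an illustrative stand-in rather than code from the paper; a real version would draw temperature-sampled CoT completions from a model.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_paths: int = 5) -> str:
    """Sample n independent reasoning paths, keep the majority final answer."""
    answers = [sample_answer() for _ in range(n_paths)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Deterministic stand-in for five temperature-sampled CoT runs:
fake_samples = iter(["42", "41", "42", "42", "7"])
result = self_consistency(lambda: next(fake_samples), n_paths=5)
# result == "42": three of the five sampled paths agree
```

ToT and LATS replace this flat sampling with tree search over partial reasoning states, which is exactly the cost/accuracy step the experiment measures.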
The canonical agent loop and its direct extensions.
| Paper | Role |
|---|---|
| ReAct: Synergizing Reasoning and Acting (Yao et al., 2022) | The origin of the Reason + Act loop. Reference point for all agent loop design |
| Reflexion (Shinn et al., 2023) | Post-task verbal reflection for self-improvement without gradient updates |
| ReflAct | Combines Reflexion + ReAct into a unified structure |
| FuseMind: Fusing Reflection and Prediction in Agents | Adds next-action prediction to reflection. Efficiency improvement through anticipation |
Experiments: `experiments/react-vs-cot`, `experiments/react-vs-reflexion-vs-fusemind`
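The canonical loop itself fits in a few lines. This is a hypothetical minimal sketch, not any framework's API: the `llm` callable stands in for a model call, and the action format (`ACT <tool> <arg>` / `FINISH <answer>`) is invented for illustration.

```python
from typing import Callable, Dict

def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               task: str,
               max_steps: int = 5) -> str:
    """Reason + Act: the model emits either 'ACT <tool> <arg>' or
    'FINISH <answer>'; each observation is appended to the scratchpad."""
    scratchpad = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)
        if step.startswith("FINISH"):
            return step.removeprefix("FINISH").strip()
        _act, tool, arg = step.split(" ", 2)
        observation = tools[tool](arg)
        scratchpad += f"{step}\nObservation: {observation}\n"
    return "max steps exceeded"

# Scripted "model" standing in for two turns of a real LLM:
script = iter(["ACT lookup capital-of-france", "FINISH Paris"])
answer = react_loop(lambda _: next(script),
                    {"lookup": lambda q: "Paris"},
                    task="What is the capital of France?")
# answer == "Paris"
```

Reflexion and FuseMind wrap extra phases around this same loop (post-task reflection, next-action prediction) rather than replacing it.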
Tool calling is the agent's interface to the world. Without this lineage, half of agent system design is missing.
| Paper | Role |
|---|---|
| Toolformer (Schick et al., 2023) | First work on LLMs learning where to insert tool calls via self-supervised training |
| Gorilla (Patil et al., 2023) | API-call-specialized LLM. Core reference for how tool specifications are injected into context |
| ToolBench / ToolLLM (Qin et al., 2023) | 16,000-API benchmark + DFS decision tree for tool selection. Largest tool-use evaluation suite |
Experiments: `experiments/tool-use-strategies`
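As a concrete picture of what "tool spec injection into context" means, here is a minimal hypothetical sketch; the function name, spec schema, and rendering format are assumptions for illustration, not Gorilla's actual format.

```python
import json

def render_tool_specs(tools: list) -> str:
    """Render tool specs into the prompt context, one JSON line per tool.
    Structured specs vs. free-form prose is one axis the experiment varies."""
    lines = ["You may call these tools:"]
    for spec in tools:
        lines.append(json.dumps({"name": spec["name"],
                                 "args": spec["args"],
                                 "description": spec["description"]}))
    return "\n".join(lines)

specs = [{"name": "search",
          "args": {"query": "str"},
          "description": "Web search; returns the top snippet."}]
prompt_block = render_tool_specs(specs)
```

Description density and example count then become knobs on this rendering step, which is what `experiments/tool-use-strategies` sweeps.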
Not surveys. These are the papers that actually defined the implementation patterns in common use today.
| Paper | Role |
|---|---|
| AutoGen (Wu et al., Microsoft, 2023) | Closest to an industry standard for multi-agent conversation. Conversable agent abstraction |
| MetaGPT (Hong et al., 2023) | Role-based agents (PM / Engineer / QA). Encodes SOPs into agent collaboration structure |
| CAMEL (Li et al., 2023) | Role-playing for autonomous agent-to-agent collaboration. Inception prompting concept |
| HuggingGPT / JARVIS (Shen et al., 2023) | LLM as controller, external specialized models as tools. Canonical orchestration pattern |
Experiments: `experiments/multi-agent-patterns`
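The planner-worker pattern these frameworks share can be reduced to a two-callable skeleton. `planner_worker`, `plan`, and `work` are illustrative names; in a real AutoGen- or MetaGPT-style setup each callable would be a separately role-prompted model.

```python
from typing import Callable, List

def planner_worker(plan: Callable[[str], List[str]],
                   work: Callable[[str], str],
                   task: str) -> List[str]:
    """Planner decomposes the task; a worker handles each subtask in turn."""
    return [work(subtask) for subtask in plan(task)]

# Toy stand-ins for the two model roles:
outputs = planner_worker(
    plan=lambda t: [f"{t}: design", f"{t}: implement"],
    work=lambda s: s.upper(),
    task="cache layer")
# one worker output per planned subtask
```

The frameworks above differ mainly in what they layer on top of this skeleton: conversation protocols (AutoGen), SOP-encoded roles (MetaGPT), or role-playing prompts (CAMEL).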
RAG alone is not a memory architecture. This section covers the full stack from naive retrieval to cognitive memory design.
| Paper | Role |
|---|---|
| Retrieval-Augmented Generation (Lewis et al., 2020) | Dense Passage Retriever + joint fine-tuning. The baseline for all agent knowledge layers |
| MemGPT (Packer et al., 2023) | Applies OS virtual memory concepts to LLMs. The reference design for infinite-context agents |
| CoALA: Cognitive Architectures for Language Agents (2023) | Classifies agent memory into Working / Episodic / Semantic / Procedural. The most rigorous architectural taxonomy available. Without this, design language is unstable |
| Self-RAG (Asai et al., 2023) | Model decides whether retrieval is needed at all. Defines the always-retrieve vs. selective-retrieve tradeoff |
| RAPTOR (Sarthi et al., 2024) | Recursive document summarization into tree-indexed structures. Current best practice for long-document RAG |
Experiments: `experiments/rag-comparison`, `experiments/rag-retriever-strategies`, `experiments/self-rag-vs-naive-rag`, `experiments/memory-architecture`, `experiments/raptor-vs-flat-rag`
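CoALA's four-way memory split is worth pinning down as a naming sketch before any retrieval machinery exists. The `AgentMemory` class below is an assumed illustration of the taxonomy, not an implementation from the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentMemory:
    """CoALA's taxonomy as plain containers (a naming sketch only)."""
    working: List[str] = field(default_factory=list)          # current context
    episodic: List[str] = field(default_factory=list)         # past episodes
    semantic: Dict[str, str] = field(default_factory=dict)    # stable facts
    procedural: Dict[str, str] = field(default_factory=dict)  # skills / rules

mem = AgentMemory()
mem.working.append("user asked about KV cache sizing")
mem.semantic["kv_cache"] = "grows linearly with context length"
```

The `memory-architecture` experiment compares exactly this layered layout against a flat conversation history.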
How to aggregate agent outputs to improve reliability, and how real deployments fail.
| Paper | Role |
|---|---|
| Reliable Decision-Making for Multi-Agent LLM Systems | Aggregation and ensemble methods for reliability improvement across agent runs |
| AI Agents in 2025: Expectations vs. Reality (IBM) | Realistic operational limits and deployment failure modes |
| SWE-bench (Jimenez et al., 2024) | Real GitHub issues as agent tasks. De facto standard benchmark for software agents |
| AgentBench (Liu et al., 2023) | Comprehensive agent evaluation across 8 environments: web, database, OS, etc. |
| WebArena (Zhou et al., 2023) | Realistic web task automation benchmark with live environments |
| LLM-as-a-Judge (Zheng et al., 2023) | Using LLMs to evaluate LLM outputs. Foundation for automated evaluation pipelines |
Experiments: `experiments/aggregation-reliability`, `experiments/monitoring-ops`, `experiments/llm-as-judge-pipeline`
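One of the aggregation strategies the reliability experiment compares, confidence-weighted voting, can be sketched in a few lines. `weighted_vote` and its toy inputs are illustrative assumptions; real confidence weights would come from model logprobs or a critic agent.

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(candidates: List[Tuple[str, float]]) -> str:
    """Sum each answer's confidence weights and return the heaviest answer."""
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer] += confidence
    return max(scores, key=scores.get)

best = weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])
# "B" wins with 1.1 total weight against 0.9 for "A"
```

Plain majority vote is the special case where every weight is 1.0; critic-agent selection replaces the sum with a separate judging call.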
Once an agent takes real actions, safety is an architectural constraint, not an add-on.
| Paper | Role |
|---|---|
| Constitutional AI (Anthropic, 2022) | Model-level safety principles. Thinking framework for designing guard agents |
| R-Judge / Agent Safety Bench (2024) | Benchmark for detecting dangerous actions in agent environments |
| Prompt Injection Attacks on LLM Agents (2023+) | Instruction injection from external data. The primary attack surface for RAG + tool agents |
Experiments: `experiments/guard-agent`, `experiments/prompt-injection-defense`
An agent architect who does not understand what is happening inside the model is essentially an API wrapper engineer. These are non-negotiable.
| Paper | Role |
|---|---|
| Attention Is All You Need (Vaswani et al., 2017) | Mechanical understanding of transformers. Required to reason about latency, context window cost, and orchestration overhead |
| Scaling Laws for Neural Language Models (Kaplan et al., 2020) | Informs every decision about model size selection for specific agent roles |
| GPT-4 Technical Report (OpenAI, 2023) | Understanding the current capability frontier you are architecting on top of |
| LLaMA / LLaMA 2 / LLaMA 3 (Meta) | The open-weight foundation. Most production inference infrastructure is built around these |
| Mistral 7B / Mixtral 8x7B | MoE architecture. Directly relevant to efficient agent inference and role-specialized routing |
This is the category most agent builders ignore, and the one NVIDIA operates at. An architect who cannot identify where the compute bottleneck is in their system is not an architect.
| Paper / Resource | Role |
|---|---|
| FlashAttention 1 and 2 (Dao et al., 2022 / 2023) | Why long-context agents are expensive and how it is being solved at the hardware level |
| Orca: Continuous Batching (Yu et al., 2022) | How inference servers handle concurrent agent requests. The foundation for vLLM |
| vLLM: PagedAttention (Kwon et al., 2023) | The actual system running most production agent backends today. KV cache paging |
| Speculative Decoding (Leviathan et al., 2023) | How fast token generation is achieved. Critical for latency-sensitive agent loops |
| SGLang (Zheng et al., 2024) | Structured generation and agent-specific inference optimization. Radix attention for prefix caching |
| TensorRT-LLM (NVIDIA documentation) | If you are building on NVIDIA hardware, this is not optional reading |
Note: Experiments for this section require dedicated server infrastructure and are deferred. Folder scaffolds will be added when the environment is available.
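Even without the server infrastructure, the core inference-economics arithmetic can be done on paper: the per-sequence KV cache is two tensors (K and V) per layer. The helper below is a back-of-envelope sketch using LLaMA-2-7B-shaped numbers (32 layers, 32 KV heads, head dim 128, fp16); it shows why long-context agents are memory-bound and why PagedAttention exists.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Per-sequence KV cache: K and V tensors (factor 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# LLaMA-2-7B shape, 4k context, fp16:
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB per sequence")  # 2.0 GiB
```

At 2 GiB per 4k-token sequence, a handful of concurrent agents saturates a GPU's memory before compute becomes the bottleneck, which is the batching tradeoff vLLM and Orca address.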
What separates someone who builds agents from someone who architects agent systems is the ability to reason about control flow at the structural level.
| Paper / Resource | Role |
|---|---|
| Executable Code Actions Elevate LLM Agents (Wang et al., 2024) | Code as action space vs. JSON as action space. One of the most consequential architectural decisions in agent design |
| LangGraph design documentation and LCEL | Control flow patterns for stateful agent graphs. Cyclic vs. DAG structures and why it matters |
| Flows: Building Blocks for Multi-Agent Systems (EPFL, 2024) | Formal treatment of agent composition. Most rigorous available framework for agent interface design |
| DSPy (Khattab et al., 2023 / 2024) | Programmatic prompt optimization. Changes how you think about prompt engineering: from manual tuning to compiled programs |
Experiments: `experiments/code-vs-json-action-space`, `experiments/dspy-vs-manual-prompting`
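The action-space decision is easiest to see with both representations of the same action side by side. This is a hypothetical sketch (tool names and dispatch logic invented for illustration): the JSON path needs an interpreter with a fixed schema, while the code path hands the model composition and control flow for free.

```python
import json

# The same action in the two action spaces the experiment compares.
json_action = json.dumps({"tool": "search", "args": {"query": "RAPTOR paper"}})
code_action = 'results = search(query="RAPTOR paper")\nprint(results[:3])'

def execute_json_action(action: str, tools: dict):
    """JSON path: dispatch through a fixed schema and a tool registry."""
    call = json.loads(action)
    return tools[call["tool"]](**call["args"])

out = execute_json_action(json_action, {"search": lambda query: [query]})
# The code path would instead exec() code_action in a sandboxed namespace,
# gaining loops, composition, and error recovery at the cost of isolation work.
```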
The earlier sections skew heavily toward 2023. These close the gap to the current frontier.
| Paper | Role |
|---|---|
| Agent Workflow Memory (AWM, 2024) | Agents that learn and reuse workflow patterns from experience, beyond episodic memory |
| OpenDevin / SWE-agent (2024) | Current state of the art for coding agents. SWE-bench is the benchmark; these are the actual systems |
| AgentScope (Alibaba, 2024) | Production-grade multi-agent framework with serious fault-tolerance and scheduling design |
| LLM Agent Survey 2024 (Xi et al.) | Most comprehensive and current survey. Should supplement or replace the 2023 surveys |
| Anthropic Model Specification (2024) | How alignment is operationalized at the model level. Required reading for safety agent design |
| Resource | Use |
|---|---|
| Multi-Agent-Papers GitHub Collection | Starting point for new paper discovery in the multi-agent space |
Every experiment follows the same measurement structure: identical task, varying architecture or strategy, evaluated across performance / latency / token cost / failure rate.
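That shared measurement structure can be fixed as a record type so every experiment reports the same four metrics. The `ExperimentResult` name and fields are an assumed sketch of the convention described above, not an existing module in this repo:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    """One row of the shared measurement structure."""
    architecture: str     # e.g. "react" vs. "cot-only"
    success_rate: float   # fraction of tasks solved
    latency_s: float      # wall-clock seconds per task
    tokens: int           # total tokens consumed per task
    failure_rate: float   # fraction of runs ending in an error state

row = ExperimentResult("react", success_rate=0.8,
                       latency_s=12.5, tokens=3400, failure_rate=0.05)
```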
Run these first. They establish the empirical intuitions that everything else builds on.
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/react-vs-cot` | CoT only vs. ReAct + tool calls on identical tasks. Success rate, failure mode analysis | ReAct, Chain-of-Thought |
| `experiments/react-vs-reflexion-vs-fusemind` | ReAct vs. ReAct + Reflexion vs. FuseMind on multi-step reasoning. Accuracy, attempt count, token cost | Reflexion, ReflAct, FuseMind |
| `experiments/reasoning-evolution` | CoT -> Self-Consistency -> ToT -> LATS: accuracy and cost at each step of the evolution | CoT, Self-Consistency, ToT, LATS |
| `experiments/multi-agent-patterns` | Single LLM vs. Planner-Worker vs. 3-4 agent collaboration on a complex task. Performance, latency, cost | AutoGen, MetaGPT, CAMEL |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/rag-comparison` | No-RAG vs. naive RAG vs. task-specific structured RAG. Answer quality and hallucination rate | RAG original |
| `experiments/rag-retriever-strategies` | DPR dense retrieval vs. embedding search vs. keyword search. Quality and speed tradeoffs | RAG, RAPTOR |
| `experiments/self-rag-vs-naive-rag` | Always-retrieve vs. model-decides-when-to-retrieve. Accuracy, latency, unnecessary retrieval rate | Self-RAG |
| `experiments/memory-architecture` | No memory vs. flat conversation history vs. CoALA-style layered memory. Task coherence over long sessions | MemGPT, CoALA |
| `experiments/raptor-vs-flat-rag` | Flat chunk indexing vs. RAPTOR recursive tree indexing on long documents. Recall and coherence | RAPTOR |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/aggregation-reliability` | Single agent vs. majority vote vs. weighted vote vs. critic-agent final selection. Confidence calibration | Reliable Decision-Making, Self-Consistency |
| `experiments/llm-as-judge-pipeline` | Human evaluation vs. LLM-as-Judge correlation measurement. Failure mode analysis under adversarial outputs | LLM-as-Judge |
| `experiments/monitoring-ops` | Collect agent execution logs, cluster failure patterns, identify input types with high failure rates | AI Agents: Expectations vs. Reality |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/tool-use-strategies` | Tool spec injection methods: description density, example count, structured vs. free-form. DFS vs. greedy tool selection | Toolformer, Gorilla, ToolBench |
| `experiments/code-vs-json-action-space` | JSON action representation vs. Python code as actions. Task success rate, error recovery, generalization to unseen tasks | Executable Code Actions |
| `experiments/dspy-vs-manual-prompting` | Hand-tuned prompts vs. DSPy compiled prompts. Accuracy, iteration time, sensitivity to model version changes | DSPy |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/guard-agent` | No guard vs. rule-based filter vs. Constitutional AI critic. Dangerous action detection rate and false positive rate | Constitutional AI, R-Judge |
| `experiments/prompt-injection-defense` | Baseline agent vs. sanitized input agent vs. instruction-hierarchy agent. Injection success rate under adversarial inputs | Prompt Injection Attacks |
Projects originate from findings in comprehensive-experiments. No project is started before the experiment that justifies it is complete.
See: research-projects/README.md
What reading and completing this list covers:
- Architectural vocabulary and design patterns for single and multi-agent systems
- The full reasoning evolution from Chain-of-Thought to LATS
- Memory taxonomy and retrieval architecture from naive RAG to cognitive layering
- Tool use from first principles through large-scale benchmarking
- Production evaluation methodology
- Safety as a structural design constraint
What the experiments and systems projects are specifically designed to build, because reading alone does not produce it:
- Inference economics intuition: KV cache sizing, memory bandwidth ceilings, batching tradeoffs under real load
- The ability to profile a running agent system and identify where the actual bottleneck is
- An informed position on the 2024 architecture debates: single long-context model vs. multi-agent with smaller windows, code-as-action vs. JSON-as-action, static workflow vs. dynamic planning
- Enough benchmark familiarity to identify when a paper's claimed improvement is real vs. benchmark overfitting
- A realistic threat model for production agent systems, built from empirical failure analysis rather than theory
The systems-level projects in section E are the primary mechanism for closing that gap.