Personal research conducted with the rigor of an institutional lab.
The pipeline is: read paper -> run experiment -> build project -> iterate.
```
/
├── papers/                     # Paper notes: summary, core ideas, implementation angles
├── comprehensive-experiments/  # Controlled pattern comparison experiments
└── research-projects/          # Projects justified by experiment findings
```
Start here before opening any paper folder.
1. **CoALA**: Establish design vocabulary and architecture taxonomy first
2. **Attention Is All You Need**: Understand the substrate everything runs on
3. **CoT -> Self-Consistency -> ToT -> LATS**: Trace the full reasoning evolution in sequence
4. **ReAct -> Reflexion -> FuseMind**: Core agent loop design
5. **AutoGen + MetaGPT**: Most-referenced frameworks during implementation
6. **MemGPT + CoALA (revisit)**: Memory layer design decisions
7. **Self-RAG + RAPTOR**: Selective retrieval and long-document indexing
8. **Toolformer -> Gorilla -> ToolBench**: Tool use from first principles to benchmark
9. **vLLM + FlashAttention 2**: Inference economics, because every agent runs on a serving stack
10. **SWE-bench + LLM-as-Judge**: Set measurement baselines before designing experiments
11. **Constitutional AI + Prompt Injection**: Safety is an architectural constraint, not a post-hoc filter
12. **DSPy**: Rethink how you approach prompt engineering entirely
13. **2024-2025 papers**: Close the gap to the current frontier
The map of what components exist, what control-flow patterns are available, and how agents are deployed in real software systems.
| Paper | Role |
|---|---|
| LLM-Enabled Multi-Agent Systems (Survey) | Component taxonomy and control-flow pattern overview |
| Multi-Agent Collaboration Mechanisms: A Survey of LLMs | Collaboration mechanism classification |
| LLM-Powered Multi-Agent Systems: A Technical Framework (IEEE) | Technical framework and architectural perspective |
| LLM-Based Multi-Agent Systems for Software Engineering (ACM) | Application to real software engineering workflows |
| NVIDIA: Smaller LMs for Agents (Nemotron / SLM) | Small model roles in agent systems, inference cost perspective |
Reading ReAct without its reasoning lineage yields only a partial understanding. Read these in order: each paper is a direct response to the limitations of the one before it.
| Paper | Role |
|---|---|
| Chain-of-Thought Prompting (Wei et al., 2022) | The root of all agent reasoning. Explicit step-by-step reasoning elicitation |
| Self-Consistency (Wang et al., 2023) | Multiple reasoning paths + majority vote = reliability improvement. Directly feeds into aggregation experiments |
| Tree of Thoughts (Yao et al., 2023) | Breaks linear reasoning into tree search. Enables backtracking and exploration. Direct predecessor to LATS |
| LATS: Language Agent Tree Search (2023) | Integrates ToT + MCTS + ReAct. Currently the strongest single-agent planning architecture |
Experiments: `experiments/reasoning-evolution`
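The self-consistency step in this lineage is simple enough to sketch directly: sample several independent reasoning paths and keep the majority final answer. Everything below (`self_consistency`, the `sample_answer` callable) is an illustrative stand-in rather than code from the paper; a real version would draw temperature-sampled CoT completions from a model.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n_paths: int = 5) -> str:
    """Sample n independent reasoning paths, keep the majority final answer."""
    answers = [sample_answer() for _ in range(n_paths)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Deterministic stand-in for five temperature-sampled CoT runs:
fake_samples = iter(["42", "41", "42", "42", "7"])
result = self_consistency(lambda: next(fake_samples), n_paths=5)
# result == "42": three of the five sampled paths agree
```

ToT and LATS replace this flat sampling with tree search over partial reasoning states, which is exactly the cost/accuracy step the experiment measures.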
The canonical agent loop and its direct extensions.
| Paper | Role |
|---|---|
| ReAct: Synergizing Reasoning and Acting (Yao et al., 2022) | The origin of the Reason + Act loop. Reference point for all agent loop design |
| Reflexion (Shinn et al., 2023) | Post-task verbal reflection for self-improvement without gradient updates |
| ReflAct | Combines Reflexion + ReAct into a unified structure |
| FuseMind: Fusing Reflection and Prediction in Agents | Adds next-action prediction to reflection. Efficiency improvement through anticipation |
Experiments: `experiments/react-vs-cot`, `experiments/react-vs-reflexion-vs-fusemind`
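The canonical loop itself fits in a few lines. This is a hypothetical minimal sketch, not any framework's API: the `llm` callable stands in for a model call, and the action format (`ACT <tool> <arg>` / `FINISH <answer>`) is invented for illustration.

```python
from typing import Callable, Dict

def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               task: str,
               max_steps: int = 5) -> str:
    """Reason + Act: the model emits either 'ACT <tool> <arg>' or
    'FINISH <answer>'; each observation is appended to the scratchpad."""
    scratchpad = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)
        if step.startswith("FINISH"):
            return step.removeprefix("FINISH").strip()
        _act, tool, arg = step.split(" ", 2)
        observation = tools[tool](arg)
        scratchpad += f"{step}\nObservation: {observation}\n"
    return "max steps exceeded"

# Scripted "model" standing in for two turns of a real LLM:
script = iter(["ACT lookup capital-of-france", "FINISH Paris"])
answer = react_loop(lambda _: next(script),
                    {"lookup": lambda q: "Paris"},
                    task="What is the capital of France?")
# answer == "Paris"
```

Reflexion and FuseMind wrap extra phases around this same loop (post-task reflection, next-action prediction) rather than replacing it.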
Tool calling is the agent's interface to the world. Without this lineage, half of agent system design is missing.
| Paper | Role |
|---|---|
| Toolformer (Schick et al., 2023) | First work on LLMs learning where to insert tool calls via self-supervised training |
| Gorilla (Patil et al., 2023) | API-call-specialized LLM. Core reference for how tool specifications are injected into context |
| ToolBench / ToolLLM (Qin et al., 2023) | 16,000-API benchmark + DFS decision tree for tool selection. Largest tool-use evaluation suite |
Experiments: `experiments/tool-use-strategies`
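As a concrete picture of what "tool spec injection into context" means, here is a minimal hypothetical sketch; the function name, spec schema, and rendering format are assumptions for illustration, not Gorilla's actual format.

```python
import json

def render_tool_specs(tools: list) -> str:
    """Render tool specs into the prompt context, one JSON line per tool.
    Structured specs vs. free-form prose is one axis the experiment varies."""
    lines = ["You may call these tools:"]
    for spec in tools:
        lines.append(json.dumps({"name": spec["name"],
                                 "args": spec["args"],
                                 "description": spec["description"]}))
    return "\n".join(lines)

specs = [{"name": "search",
          "args": {"query": "str"},
          "description": "Web search; returns the top snippet."}]
prompt_block = render_tool_specs(specs)
```

Description density and example count then become knobs on this rendering step, which is what `experiments/tool-use-strategies` sweeps.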
Not surveys. These are the papers that actually defined the implementation patterns in common use today.
| Paper | Role |
|---|---|
| AutoGen (Wu et al., Microsoft, 2023) | Closest to an industry standard for multi-agent conversation. Conversable agent abstraction |
| MetaGPT (Hong et al., 2023) | Role-based agents (PM / Engineer / QA). Encodes SOPs into agent collaboration structure |
| CAMEL (Li et al., 2023) | Role-playing for autonomous agent-to-agent collaboration. Inception prompting concept |
| HuggingGPT / JARVIS (Shen et al., 2023) | LLM as controller, external specialized models as tools. Canonical orchestration pattern |
Experiments: `experiments/multi-agent-patterns`
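The planner-worker pattern these frameworks share can be reduced to a two-callable skeleton. `planner_worker`, `plan`, and `work` are illustrative names; in a real AutoGen- or MetaGPT-style setup each callable would be a separately role-prompted model.

```python
from typing import Callable, List

def planner_worker(plan: Callable[[str], List[str]],
                   work: Callable[[str], str],
                   task: str) -> List[str]:
    """Planner decomposes the task; a worker handles each subtask in turn."""
    return [work(subtask) for subtask in plan(task)]

# Toy stand-ins for the two model roles:
outputs = planner_worker(
    plan=lambda t: [f"{t}: design", f"{t}: implement"],
    work=lambda s: s.upper(),
    task="cache layer")
# one worker output per planned subtask
```

The frameworks above differ mainly in what they layer on top of this skeleton: conversation protocols (AutoGen), SOP-encoded roles (MetaGPT), or role-playing prompts (CAMEL).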
RAG alone is not a memory architecture. This section covers the full stack from naive retrieval to cognitive memory design.
| Paper | Role |
|---|---|
| Retrieval-Augmented Generation (Lewis et al., 2020) | Dense Passage Retriever + joint fine-tuning. The baseline for all agent knowledge layers |
| MemGPT (Packer et al., 2023) | Applies OS virtual memory concepts to LLMs. The reference design for infinite-context agents |
| CoALA: Cognitive Architectures for Language Agents (2023) | Classifies agent memory into Working / Episodic / Semantic / Procedural. The most rigorous architectural taxonomy available. Without this, design language is unstable |
| Self-RAG (Asai et al., 2023) | Model decides whether retrieval is needed at all. Defines the always-retrieve vs. selective-retrieve tradeoff |
| RAPTOR (Sarthi et al., 2024) | Recursive document summarization into tree-indexed structures. Current best practice for long-document RAG |
Experiments: `experiments/rag-comparison`, `experiments/rag-retriever-strategies`, `experiments/self-rag-vs-naive-rag`, `experiments/memory-architecture`, `experiments/raptor-vs-flat-rag`
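CoALA's four-way memory split is worth pinning down as a naming sketch before any retrieval machinery exists. The `AgentMemory` class below is an assumed illustration of the taxonomy, not an implementation from the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentMemory:
    """CoALA's taxonomy as plain containers (a naming sketch only)."""
    working: List[str] = field(default_factory=list)          # current context
    episodic: List[str] = field(default_factory=list)         # past episodes
    semantic: Dict[str, str] = field(default_factory=dict)    # stable facts
    procedural: Dict[str, str] = field(default_factory=dict)  # skills / rules

mem = AgentMemory()
mem.working.append("user asked about KV cache sizing")
mem.semantic["kv_cache"] = "grows linearly with context length"
```

The `memory-architecture` experiment compares exactly this layered layout against a flat conversation history.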
How to aggregate agent outputs to improve reliability, and how real deployments fail.
| Paper | Role |
|---|---|
| Reliable Decision-Making for Multi-Agent LLM Systems | Aggregation and ensemble methods for reliability improvement across agent runs |
| AI Agents in 2025: Expectations vs. Reality (IBM) | Realistic operational limits and deployment failure modes |
| SWE-bench (Jimenez et al., 2024) | Real GitHub issues as agent tasks. De facto standard benchmark for software agents |
| AgentBench (Liu et al., 2023) | Comprehensive agent evaluation across 8 environments: web, database, OS, etc. |
| WebArena (Zhou et al., 2023) | Realistic web task automation benchmark with live environments |
| LLM-as-a-Judge (Zheng et al., 2023) | Using LLMs to evaluate LLM outputs. Foundation for automated evaluation pipelines |
Experiments: `experiments/aggregation-reliability`, `experiments/monitoring-ops`, `experiments/llm-as-judge-pipeline`
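One of the aggregation strategies the reliability experiment compares, confidence-weighted voting, can be sketched in a few lines. `weighted_vote` and its toy inputs are illustrative assumptions; real confidence weights would come from model logprobs or a critic agent.

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(candidates: List[Tuple[str, float]]) -> str:
    """Sum each answer's confidence weights and return the heaviest answer."""
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer] += confidence
    return max(scores, key=scores.get)

best = weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])
# "B" wins with 1.1 total weight against 0.9 for "A"
```

Plain majority vote is the special case where every weight is 1.0; critic-agent selection replaces the sum with a separate judging call.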
Once an agent takes real actions, safety is an architectural constraint, not an add-on.
| Paper | Role |
|---|---|
| Constitutional AI (Anthropic, 2022) | Model-level safety principles. Thinking framework for designing guard agents |
| R-Judge / Agent Safety Bench (2024) | Benchmark for detecting dangerous actions in agent environments |
| Prompt Injection Attacks on LLM Agents (2023+) | Instruction injection from external data. The primary attack surface for RAG + tool agents |
Experiments: `experiments/guard-agent`, `experiments/prompt-injection-defense`
An agent architect who does not understand what is happening inside the model is essentially an API wrapper engineer. These are non-negotiable.
| Paper | Role |
|---|---|
| Attention Is All You Need (Vaswani et al., 2017) | Mechanical understanding of transformers. Required to reason about latency, context window cost, and orchestration overhead |
| Scaling Laws for Neural Language Models (Kaplan et al., 2020) | Informs every decision about model size selection for specific agent roles |
| GPT-4 Technical Report (OpenAI, 2023) | Understanding the current capability frontier you are architecting on top of |
| LLaMA / LLaMA 2 / LLaMA 3 (Meta) | The open-weight foundation. Most production inference infrastructure is built around these |
| Mistral 7B / Mixtral 8x7B | MoE architecture. Directly relevant to efficient agent inference and role-specialized routing |
This is the category most agent builders ignore, and the one NVIDIA operates at. An architect who cannot identify where the compute bottleneck is in their system is not an architect.
| Paper / Resource | Role |
|---|---|
| FlashAttention 1 and 2 (Dao et al., 2022 / 2023) | Why long-context agents are expensive and how it is being solved at the hardware level |
| Orca: Continuous Batching (Yu et al., 2022) | How inference servers handle concurrent agent requests. The foundation for vLLM |
| vLLM: PagedAttention (Kwon et al., 2023) | The actual system running most production agent backends today. KV cache paging |
| Speculative Decoding (Leviathan et al., 2023) | How fast token generation is achieved. Critical for latency-sensitive agent loops |
| SGLang (Zheng et al., 2024) | Structured generation and agent-specific inference optimization. Radix attention for prefix caching |
| TensorRT-LLM (NVIDIA documentation) | If you are building on NVIDIA hardware, this is not optional reading |
Note: Experiments for this section require dedicated server infrastructure and are deferred. Folder scaffolds will be added when the environment is available.
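Even without the server infrastructure, the core inference-economics arithmetic can be done on paper: the per-sequence KV cache is two tensors (K and V) per layer. The helper below is a back-of-envelope sketch using LLaMA-2-7B-shaped numbers (32 layers, 32 KV heads, head dim 128, fp16); it shows why long-context agents are memory-bound and why PagedAttention exists.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Per-sequence KV cache: K and V tensors (factor 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# LLaMA-2-7B shape, 4k context, fp16:
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB per sequence")  # 2.0 GiB
```

At 2 GiB per 4k-token sequence, a handful of concurrent agents saturates a GPU's memory before compute becomes the bottleneck, which is the batching tradeoff vLLM and Orca address.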
What separates someone who builds agents from someone who architects agent systems is the ability to reason about control flow at the structural level.
| Paper / Resource | Role |
|---|---|
| Executable Code Actions Elevate LLM Agents (Wang et al., 2024) | Code as action space vs. JSON as action space. One of the most consequential architectural decisions in agent design |
| LangGraph design documentation and LCEL | Control flow patterns for stateful agent graphs. Cyclic vs. DAG structures and why it matters |
| Flows: Building Blocks for Multi-Agent Systems (EPFL, 2024) | Formal treatment of agent composition. Most rigorous available framework for agent interface design |
| DSPy (Khattab et al., 2023 / 2024) | Programmatic prompt optimization. Changes how you think about prompt engineering: from manual tuning to compiled programs |
Experiments: `experiments/code-vs-json-action-space`, `experiments/dspy-vs-manual-prompting`
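The action-space decision is easiest to see with both representations of the same action side by side. This is a hypothetical sketch (tool names and dispatch logic invented for illustration): the JSON path needs an interpreter with a fixed schema, while the code path hands the model composition and control flow for free.

```python
import json

# The same action in the two action spaces the experiment compares.
json_action = json.dumps({"tool": "search", "args": {"query": "RAPTOR paper"}})
code_action = 'results = search(query="RAPTOR paper")\nprint(results[:3])'

def execute_json_action(action: str, tools: dict):
    """JSON path: dispatch through a fixed schema and a tool registry."""
    call = json.loads(action)
    return tools[call["tool"]](**call["args"])

out = execute_json_action(json_action, {"search": lambda query: [query]})
# The code path would instead exec() code_action in a sandboxed namespace,
# gaining loops, composition, and error recovery at the cost of isolation work.
```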
The earlier sections skew heavily toward 2023. These close the gap to the current frontier.
| Paper | Role |
|---|---|
| Agent Workflow Memory (AWM, 2024) | Agents that learn and reuse workflow patterns from experience, beyond episodic memory |
| OpenDevin / SWE-agent (2024) | Current state of the art for coding agents. SWE-bench is the benchmark; these are the actual systems |
| AgentScope (Alibaba, 2024) | Production-grade multi-agent framework with serious fault-tolerance and scheduling design |
| LLM Agent Survey 2024 (Xi et al.) | Most comprehensive and current survey. Should supplement or replace the 2023 surveys |
| Anthropic Model Specification (2024) | How alignment is operationalized at the model level. Required reading for safety agent design |
| Resource | Use |
|---|---|
| Multi-Agent-Papers GitHub Collection | Starting point for new paper discovery in the multi-agent space |
Every experiment follows the same measurement structure: identical task, varying architecture or strategy, evaluated across performance / latency / token cost / failure rate.
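That shared measurement structure can be fixed as a record type so every experiment reports the same four metrics. The `ExperimentResult` name and fields are an assumed sketch of the convention described above, not an existing module in this repo:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    """One row of the shared measurement structure."""
    architecture: str     # e.g. "react" vs. "cot-only"
    success_rate: float   # fraction of tasks solved
    latency_s: float      # wall-clock seconds per task
    tokens: int           # total tokens consumed per task
    failure_rate: float   # fraction of runs ending in an error state

row = ExperimentResult("react", success_rate=0.8,
                       latency_s=12.5, tokens=3400, failure_rate=0.05)
```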
Run these first. They establish the empirical intuitions that everything else builds on.
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/react-vs-cot` | CoT only vs. ReAct + tool calls on identical tasks. Success rate, failure mode analysis | ReAct, Chain-of-Thought |
| `experiments/react-vs-reflexion-vs-fusemind` | ReAct vs. ReAct + Reflexion vs. FuseMind on multi-step reasoning. Accuracy, attempt count, token cost | Reflexion, ReflAct, FuseMind |
| `experiments/reasoning-evolution` | CoT -> Self-Consistency -> ToT -> LATS: accuracy and cost at each step of the evolution | CoT, Self-Consistency, ToT, LATS |
| `experiments/multi-agent-patterns` | Single LLM vs. Planner-Worker vs. 3-4 agent collaboration on a complex task. Performance, latency, cost | AutoGen, MetaGPT, CAMEL |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/rag-comparison` | No-RAG vs. naive RAG vs. task-specific structured RAG. Answer quality and hallucination rate | RAG original |
| `experiments/rag-retriever-strategies` | DPR dense retrieval vs. embedding search vs. keyword search. Quality and speed tradeoffs | RAG, RAPTOR |
| `experiments/self-rag-vs-naive-rag` | Always-retrieve vs. model-decides-when-to-retrieve. Accuracy, latency, unnecessary retrieval rate | Self-RAG |
| `experiments/memory-architecture` | No memory vs. flat conversation history vs. CoALA-style layered memory. Task coherence over long sessions | MemGPT, CoALA |
| `experiments/raptor-vs-flat-rag` | Flat chunk indexing vs. RAPTOR recursive tree indexing on long documents. Recall and coherence | RAPTOR |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/aggregation-reliability` | Single agent vs. majority vote vs. weighted vote vs. critic-agent final selection. Confidence calibration | Reliable Decision-Making, Self-Consistency |
| `experiments/llm-as-judge-pipeline` | Human evaluation vs. LLM-as-Judge correlation measurement. Failure mode analysis under adversarial outputs | LLM-as-Judge |
| `experiments/monitoring-ops` | Collect agent execution logs, cluster failure patterns, identify input types with high failure rates | AI Agents: Expectations vs. Reality |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/tool-use-strategies` | Tool spec injection methods: description density, example count, structured vs. free-form. DFS vs. greedy tool selection | Toolformer, Gorilla, ToolBench |
| `experiments/code-vs-json-action-space` | JSON action representation vs. Python code as actions. Task success rate, error recovery, generalization to unseen tasks | Executable Code Actions |
| `experiments/dspy-vs-manual-prompting` | Hand-tuned prompts vs. DSPy compiled prompts. Accuracy, iteration time, sensitivity to model version changes | DSPy |
| Folder | What is being compared | Key papers |
|---|---|---|
| `experiments/guard-agent` | No guard vs. rule-based filter vs. Constitutional AI critic. Dangerous action detection rate and false positive rate | Constitutional AI, R-Judge |
| `experiments/prompt-injection-defense` | Baseline agent vs. sanitized input agent vs. instruction-hierarchy agent. Injection success rate under adversarial inputs | Prompt Injection Attacks |
Projects originate from findings in comprehensive-experiments. No project is started before the experiment that justifies it is complete.
See: research-projects/README.md
What reading and completing this list covers:
- Architectural vocabulary and design patterns for single and multi-agent systems
- The full reasoning evolution from Chain-of-Thought to LATS
- Memory taxonomy and retrieval architecture from naive RAG to cognitive layering
- Tool use from first principles through large-scale benchmarking
- Production evaluation methodology
- Safety as a structural design constraint
What the experiments and systems projects are specifically designed to build, because reading alone does not produce it:
- Inference economics intuition: KV cache sizing, memory bandwidth ceilings, batching tradeoffs under real load
- The ability to profile a running agent system and identify where the actual bottleneck is
- An informed position on the 2024 architecture debates: single long-context model vs. multi-agent with smaller windows, code-as-action vs. JSON-as-action, static workflow vs. dynamic planning
- Enough benchmark familiarity to identify when a paper's claimed improvement is real vs. benchmark overfitting
- A realistic threat model for production agent systems, built from empirical failure analysis rather than theory
The systems-level projects in section E are the primary mechanism for closing that gap.