Skip to content

ichbinlucaskim/research

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Agent AI Architecture — Solo Research Lab

Personal research conducted with the rigor of an institutional lab.

The pipeline is: read paper -> run experiment -> build project -> iterate.


Repository Structure

/
├── papers/                      # Paper notes: summary, core ideas, implementation angles
├── comprehensive-experiments/   # Controlled pattern comparison experiments
└── research-projects/           # Projects justified by experiment findings

Reading Priority

Start here before opening any paper folder.

1.  CoALA                                    Establish design vocabulary and architecture taxonomy first
2.  Attention Is All You Need                Understand the substrate everything runs on
3.  CoT -> Self-Consistency -> ToT -> LATS   Trace the full reasoning evolution in sequence
4.  ReAct -> Reflexion -> FuseMind           Core agent loop design
5.  AutoGen + MetaGPT                        Most-referenced frameworks during implementation
6.  MemGPT + CoALA (revisit)                 Memory layer design decisions
7.  Self-RAG + RAPTOR                        Selective retrieval and long-document indexing
8.  Toolformer -> Gorilla -> ToolBench       Tool use from first principles to benchmark
9.  vLLM + FlashAttention 2                  Inference economics: every agent runs on a serving stack
10. SWE-bench + LLM-as-Judge                 Set measurement baselines before designing experiments
11. Constitutional AI + Prompt Injection      Safety is an architectural constraint, not a post-hoc filter
12. DSPy                                     Rethink how you approach prompt engineering entirely
13. 2024-2025 papers                         Close the gap to the current frontier

Papers


A. Big Picture and Multi-Agent

The map of what components exist, what control-flow patterns are available, and how agents are deployed in real software systems.

Paper Role
LLM-Enabled Multi-Agent Systems (Survey) Component taxonomy and control-flow pattern overview
Multi-Agent Collaboration Mechanisms: A Survey of LLMs Collaboration mechanism classification
LLM-Powered Multi-Agent Systems: A Technical Framework (IEEE) Technical framework and architectural perspective
LLM-Based Multi-Agent Systems for Software Engineering (ACM) Application to real software engineering workflows
NVIDIA: Smaller LMs for Agents (Nemotron / SLM) Small model roles in agent systems, inference cost perspective

B. Reasoning Foundation

ReAct without understanding its reasoning lineage is a partial understanding. Read these in order. Each paper is a direct response to the limitations of the previous one.

Paper Role
Chain-of-Thought Prompting (Wei et al., 2022) The root of all agent reasoning. Explicit step-by-step reasoning elicitation
Self-Consistency (Wang et al., 2023) Multiple reasoning paths + majority vote = reliability improvement. Directly feeds into aggregation experiments
Tree of Thoughts (Yao et al., 2023) Breaks linear reasoning into tree search. Enables backtracking and exploration. Direct predecessor to LATS
LATS: Language Agent Tree Search (2023) Integrates ToT + MCTS + ReAct. Currently the strongest single-agent planning architecture

Experiments: experiments/reasoning-evolution


C. ReAct / Tool-Use / Reflection / Planning

The canonical agent loop and its direct extensions.

Paper Role
ReAct: Synergizing Reasoning and Acting (Yao et al., 2022) The origin of the Reason + Act loop. Reference point for all agent loop design
Reflexion (Shinn et al., 2023) Post-task verbal reflection for self-improvement without gradient updates
ReflAct Combines Reflexion + ReAct into a unified structure
FuseMind: Fusing Reflection and Prediction in Agents Adds next-action prediction to reflection. Efficiency improvement through anticipation

Experiments: experiments/react-vs-cot, experiments/react-vs-reflexion-vs-fusemind


D. Tool Use and Function Calling

Tool calling is the agent's interface to the world. Without this lineage, half of agent system design is missing.

Paper Role
Toolformer (Schick et al., 2023) First work on LLMs learning where to insert tool calls via self-supervised training
Gorilla (Patil et al., 2023) API-call-specialized LLM. Core reference for how tool specifications are injected into context
ToolBench / ToolLLM (Qin et al., 2023) 16,000-API benchmark + DFS decision tree for tool selection. Largest tool-use evaluation suite

Experiments: experiments/tool-use-strategies


E. Multi-Agent Framework Papers

Not surveys. These are the papers that actually defined the implementation patterns in common use today.

Paper Role
AutoGen (Wu et al., Microsoft, 2023) Closest to an industry standard for multi-agent conversation. Conversable agent abstraction
MetaGPT (Hong et al., 2023) Role-based agents (PM / Engineer / QA). Encodes SOPs into agent collaboration structure
CAMEL (Li et al., 2023) Role-playing for autonomous agent-to-agent collaboration. Inception prompting concept
HuggingGPT / JARVIS (Shen et al., 2023) LLM as controller, external specialized models as tools. Canonical orchestration pattern

Experiments: experiments/multi-agent-patterns


F. RAG / Memory / Context

RAG alone is not a memory architecture. This section covers the full stack from naive retrieval to cognitive memory design.

Paper Role
Retrieval-Augmented Generation (Lewis et al., 2020) Dense Passage Retriever + joint fine-tuning. The baseline for all agent knowledge layers
MemGPT (Packer et al., 2023) Applies OS virtual memory concepts to LLMs. The reference design for infinite-context agents
CoALA: Cognitive Architectures for Language Agents (2023) Classifies agent memory into Working / Episodic / Semantic / Procedural. The most rigorous architectural taxonomy available. Without this, design language is unstable
Self-RAG (Asai et al., 2023) Model decides whether retrieval is needed at all. Defines the always-retrieve vs. selective-retrieve tradeoff
RAPTOR (Sarthi et al., 2024) Recursive document summarization into tree-indexed structures. Current best practice for long-document RAG

Experiments: experiments/rag-comparison, experiments/rag-retriever-strategies, experiments/self-rag-vs-naive-rag, experiments/memory-architecture, experiments/raptor-vs-flat-rag


G. Reliability / Ops / Evaluation

How to aggregate agent outputs to improve reliability, and how real deployments fail.

Paper Role
Reliable Decision-Making for Multi-Agent LLM Systems Aggregation and ensemble methods for reliability improvement across agent runs
AI Agents in 2025: Expectations vs. Reality (IBM) Realistic operational limits and deployment failure modes
SWE-bench (Jimenez et al., 2024) Real GitHub issues as agent tasks. De facto standard benchmark for software agents
AgentBench (Liu et al., 2023) Comprehensive agent evaluation across 8 environments: web, database, OS, etc.
WebArena (Zhou et al., 2023) Realistic web task automation benchmark with live environments
LLM-as-a-Judge (Zheng et al., 2023) Using LLMs to evaluate LLM outputs. Foundation for automated evaluation pipelines

Experiments: experiments/aggregation-reliability, experiments/monitoring-ops, experiments/llm-as-judge-pipeline


H. Safety and Alignment

Once an agent takes real actions, safety is an architectural constraint, not an add-on.

Paper Role
Constitutional AI (Anthropic, 2022) Model-level safety principles. Thinking framework for designing guard agents
R-Judge / Agent Safety Bench (2024) Benchmark for detecting dangerous actions in agent environments
Prompt Injection Attacks on LLM Agents (2023+) Instruction injection from external data. The primary attack surface for RAG + tool agents

Experiments: experiments/guard-agent, experiments/prompt-injection-defense


I. LLM Foundations

An agent architect who does not understand what is happening inside the model is essentially an API wrapper engineer. These are non-negotiable.

Paper Role
Attention Is All You Need (Vaswani et al., 2017) Mechanical understanding of transformers. Required to reason about latency, context window cost, and orchestration overhead
Scaling Laws for Neural Language Models (Kaplan et al., 2020) Informs every decision about model size selection for specific agent roles
GPT-4 Technical Report (OpenAI, 2023) Understanding the current capability frontier you are architecting on top of
LLaMA / LLaMA 2 / LLaMA 3 (Meta) The open-weight foundation. Most production inference infrastructure is built around these
Mistral 7B / Mixtral 8x7B MoE architecture. Directly relevant to efficient agent inference and role-specialized routing

J. Inference and Systems

This is the category most agent builders ignore, and the one NVIDIA operates at. An architect who cannot identify where the compute bottleneck is in their system is not an architect.

Paper / Resource Role
FlashAttention 1 and 2 (Dao et al., 2022 / 2023) Why long-context agents are expensive and how it is being solved at the hardware level
Orca: Continuous Batching (Yu et al., 2022) How inference servers handle concurrent agent requests. The foundation for vLLM
vLLM: PagedAttention (Kwon et al., 2023) The actual system running most production agent backends today. KV cache paging
Speculative Decoding (Leviathan et al., 2023) How fast token generation is achieved. Critical for latency-sensitive agent loops
SGLang (Zheng et al., 2024) Structured generation and agent-specific inference optimization. Radix attention for prefix caching
TensorRT-LLM (NVIDIA documentation) If you are building on NVIDIA hardware, this is not optional reading

Note: Experiments for this section require dedicated server infrastructure and are deferred. Folder scaffolds will be added when the environment is available.


K. Control Flow and Compiler-Level Architecture

What separates someone who builds agents from someone who architects agent systems is the ability to reason about control flow at the structural level.

Paper / Resource Role
Executable Code Actions Elevate LLM Agents (Wang et al., 2024) Code as action space vs. JSON as action space. One of the most consequential architectural decisions in agent design
LangGraph design documentation and LCEL Control flow patterns for stateful agent graphs. Cyclic vs. DAG structures and why it matters
Flows: Building Blocks for Multi-Agent Systems (EPFL, 2024) Formal treatment of agent composition. Most rigorous available framework for agent interface design
DSPy (Khattab et al., 2023 / 2024) Programmatic prompt optimization. Changes how you think about prompt engineering: from manual tuning to compiled programs

Experiments: experiments/code-vs-json-action-space, experiments/dspy-vs-manual-prompting


L. 2024-2025 Frontier Papers

The earlier sections skew heavily toward 2023. These close the gap to the current frontier.

Paper Role
Agent Workflow Memory (AWM, 2024) Agents that learn and reuse workflow patterns from experience, beyond episodic memory
OpenDevin / SWE-agent (2024) Current state of the art for coding agents. SWE-bench is the benchmark; these are the actual systems
AgentScope (Alibaba, 2024) Production-grade multi-agent framework with serious fault-tolerance and scheduling design
LLM Agent Survey 2024 (Xi et al.) Most comprehensive and current survey. Should supplement or replace the 2023 surveys
Anthropic Model Specification (2024) How alignment is operationalized at the model level. Required reading for safety agent design

M. Reference Collections

Resource Use
Multi-Agent-Papers GitHub Collection Starting point for new paper discovery in the multi-agent space

Comprehensive Experiments

Every experiment follows the same measurement structure: identical task, varying architecture or strategy, evaluated across performance / latency / token cost / failure rate.


Experiment Group 1: Core Patterns

Run these first. They establish the empirical intuitions that everything else builds on.

Folder What is being compared Key papers
experiments/react-vs-cot CoT only vs. ReAct + tool calls on identical tasks. Success rate, failure mode analysis ReAct, Chain-of-Thought
experiments/react-vs-reflexion-vs-fusemind ReAct vs. ReAct + Reflection vs. FuseMind on multi-step reasoning. Accuracy, attempt count, token cost Reflexion, ReflAct, FuseMind
experiments/reasoning-evolution CoT -> Self-Consistency -> ToT -> LATS: accuracy and cost at each step of the evolution CoT, Self-Consistency, ToT, LATS
experiments/multi-agent-patterns Single LLM vs. Planner-Worker vs. 3-4 agent collaboration on a complex task. Performance, latency, cost AutoGen, MetaGPT, CAMEL

Experiment Group 2: Memory and Retrieval

Folder What is being compared Key papers
experiments/rag-comparison No-RAG vs. naive RAG vs. task-specific structured RAG. Answer quality and hallucination rate RAG original
experiments/rag-retriever-strategies DPR dense retrieval vs. embedding search vs. keyword search. Quality and speed tradeoffs RAG, RAPTOR
experiments/self-rag-vs-naive-rag Always-retrieve vs. model-decides-when-to-retrieve. Accuracy, latency, unnecessary retrieval rate Self-RAG
experiments/memory-architecture No memory vs. flat conversation history vs. CoALA-style layered memory. Task coherence over long sessions MemGPT, CoALA
experiments/raptor-vs-flat-rag Flat chunk indexing vs. RAPTOR recursive tree indexing on long documents. Recall and coherence RAPTOR

Experiment Group 3: Reliability and Aggregation

Folder What is being compared Key papers
experiments/aggregation-reliability Single agent vs. majority vote vs. weighted vote vs. critic-agent final selection. Confidence calibration Reliable Decision-Making, Self-Consistency
experiments/llm-as-judge-pipeline Human evaluation vs. LLM-as-Judge correlation measurement. Failure mode analysis under adversarial outputs LLM-as-Judge
experiments/monitoring-ops Collect agent execution logs, cluster failure patterns, identify input types with high failure rates AI Agents: Expectations vs. Reality

Experiment Group 4: Tool Use and Action Space

Folder What is being compared Key papers
experiments/tool-use-strategies Tool spec injection methods: description density, example count, structured vs. free-form. DFS vs. greedy tool selection Toolformer, Gorilla, ToolBench
experiments/code-vs-json-action-space JSON action representation vs. Python code as actions. Task success rate, error recovery, generalization to unseen tasks Executable Code Actions
experiments/dspy-vs-manual-prompting Hand-tuned prompts vs. DSPy compiled prompts. Accuracy, iteration time, sensitivity to model version changes DSPy


Experiment Group 6: Safety

Folder What is being compared Key papers
experiments/guard-agent No guard vs. rule-based filter vs. Constitutional AI critic. Dangerous action detection rate and false positive rate Constitutional AI, R-Judge
experiments/prompt-injection-defense Baseline agent vs. sanitized input agent vs. instruction-hierarchy agent. Injection success rate under adversarial inputs Prompt Injection Attacks

Research Projects

Projects originate from findings in comprehensive-experiments. No project is started before the experiment that justifies it is complete.

See: research-projects/README.md


Scope of This Curriculum and Expectations

What reading and completing this list covers:

  • Architectural vocabulary and design patterns for single and multi-agent systems
  • The full reasoning evolution from Chain-of-Thought to LATS
  • Memory taxonomy and retrieval architecture from naive RAG to cognitive layering
  • Tool use from first principles through large-scale benchmarking
  • Production evaluation methodology
  • Safety as a structural design constraint

What the experiments and systems projects are specifically designed to build, because reading alone does not produce it:

  • Inference economics intuition: KV cache sizing, memory bandwidth ceilings, batching tradeoffs under real load
  • The ability to profile a running agent system and identify where the actual bottleneck is
  • An informed position on the 2024 architecture debates: single long-context model vs. multi-agent with smaller windows, code-as-action vs. JSON-as-action, static workflow vs. dynamic planning
  • Enough benchmark familiarity to identify when a paper's claimed improvement is real vs. benchmark overfitting
  • A realistic threat model for production agent systems, built from empirical failure analysis rather than theory

The systems-level projects in section E are the primary mechanism for closing that gap.

About

Agentic AI research

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors