AI Agent: The Complete Guide (2026)

From simple LLM calls to multi-agent systems — everything about AI agents in 7 levels.

7 Agent Levels | 10+ Key Papers | 6 Frameworks

What is an AI Agent?

According to Lilian Weng (OpenAI), an AI Agent combines four core components:

Agent = LLM + Memory + Planning + Tools

Source: Lilian Weng, "LLM Powered Autonomous Agents" (June 2023)

LLM (Brain)

The core engine for reasoning and decision-making. Understands natural language, makes plans, and decides tool usage.

Core Engine

Memory

Short-term (context window) and long-term (vector DB) memory. Accumulates experience and references the past.

State Management

Planning

Task decomposition and self-reflection. Breaks complex goals into actionable steps.

Strategy

Tools

External APIs, search engines, code executors, etc. Extends LLM capabilities to the real world.

External Actions

Workflow vs Agent: Key Distinction

Anthropic's "Building Effective Agents" (2024) draws a clear line between Workflows and Agents:

Workflow

Deterministic
  • Execution flow is predefined in code
  • Same input = same path
  • Predictable and easy to debug
  • Suitable for most business problems
  • Predictable cost
Example: Document translation pipeline, email classification system
vs

Agent

Dynamic
  • LLM dynamically determines execution flow
  • Different paths possible even with same input
  • Requires observability tools
  • Strong on open-ended problems
  • Variable cost
Example: Code debugging agent, research agent

7 Levels of Agent Maturity

Exploring the structure and characteristics of each level, from simple LLM calls to multi-agent systems

L0

Simple LLM Call

No Tools · No Memory · Single Turn

The most basic form. A simple call where you input a prompt and get a response, relying solely on the LLM's trained knowledge. Without access to external tools, hallucination risk is highest.

User → Prompt → LLM → Response → Output

Use Cases

  • "Write me an email draft"
  • "Review this code"
  • "Create marketing copy"

Limitations

  • Cannot access latest information
  • Cannot reference internal data
  • Can fabricate information

Related Technologies

  • Chain-of-Thought (Wei et al., 2022)
  • Zero-shot / Few-shot Prompting
  • Basic ChatGPT / Claude calls
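The Level 0 shape can be captured as a single request object. A minimal sketch: `build_request` and the placeholder model name are illustrative, but the role/content message format mirrors the convention shared by ChatGPT- and Claude-style chat APIs.

```python
# Level 0: prompt in, response out. No tools, no memory, single turn.

def build_request(system: str, user_prompt: str) -> dict:
    """Assemble a single-turn chat request in the common role/content format."""
    return {
        "model": "<your-model>",  # placeholder, not a real model name
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_request(
    system="You are a helpful writing assistant.",
    user_prompt="Write me an email draft declining a meeting politely.",
)
# Every call is independent: nothing from previous turns carries over, and
# the model can only draw on its training data — hence the hallucination risk.
```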
L1

Augmented LLM (Tool Use)

Function Calling · RAG · Single Cycle

The LLM can call external tools once when needed. It decides whether to use a tool, receives the result, and generates the final response. However, it ends in a single cycle — no retries even if results are insufficient.

User → Query → LLM → Tool Call (API / DB / Search) → Response → Output

Tool Types

  • Search: RAG, Web Search
  • API: Weather, Stocks, DB Query
  • Execution: Code Interpreter, Calculator

Use Cases

  • "What's the weather in Seoul today?" → weather API
  • "Find Q3 sales data" → DB Query
  • Naive RAG: Vector Search → Generate Answer

Related Technologies

  • Toolformer (Schick et al., 2023)
  • MRKL Systems (Karpas et al., 2022)
  • Function Calling / Structured Output
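The single-cycle pattern can be sketched in a few lines. `fake_llm` and `get_weather` below are illustrative stubs standing in for a real model and a real weather API — the point is the control flow: at most one tool round-trip, then stop.

```python
# Level 1: the LLM may call a tool once, then must answer. No retries.

def get_weather(city):
    return f"{city}: 3°C, clear"          # stub for a real weather API

TOOLS = {"get_weather": get_weather}

def fake_llm(prompt, tool_result=None):
    # A real LLM decides this via function calling; the stub requests the
    # tool once, then answers when handed the result.
    if tool_result is None and "weather" in prompt.lower():
        return {"tool": "get_weather", "args": {"city": "Seoul"}}
    return {"answer": f"Based on the data: {tool_result}"}

def augmented_call(prompt):
    decision = fake_llm(prompt)
    if "tool" in decision:                 # exactly one tool round-trip...
        result = TOOLS[decision["tool"]](**decision["args"])
        decision = fake_llm(prompt, tool_result=result)
    return decision["answer"]              # ...then stop: single cycle
```

Even if the tool result were insufficient, this loop cannot try again — that capability only appears at Level 4.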
L2

Chained / Sequential Agent

Pipeline · Deterministic · Multi-Step

Executes multiple steps in a predefined order. The output of each step becomes the input for the next. Quality gates can be added for validation, but the overall flow is fixed in code.

Analyze → Process → Validate (quality gate) → Output

Document Translation Pipeline:
Analyze Source → Draft Translation → Validate Terms → Final Polish

Code Generation Pipeline:
Requirements → Code Generation → Lint/Test → Review/Fix

Features

  • Execution order is fixed in code (deterministic)
  • LLM operates at each step, but doesn't control the overall flow
  • Latency = sum of all steps

Limitations

  • No branching
  • Even simple requests must go through the full pipeline
  • Single step failure breaks everything
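A fixed pipeline is just a list of steps run in order. The step functions below are toy stubs for per-step LLM calls; the structure is what matters: output feeds input, the gate can fail, and the order never changes.

```python
# Level 2: deterministic chain — same input, same path, every time.

def analyze(text):   return {"source": text, "lang": "en"}
def translate(ctx):  return {**ctx, "draft": f"[translated] {ctx['source']}"}
def validate(ctx):
    if not ctx["draft"]:
        raise ValueError("quality gate failed: empty draft")  # gate, not a branch
    return ctx
def polish(ctx):     return ctx["draft"].strip()

PIPELINE = [analyze, translate, validate, polish]

def run(text):
    result = text
    for step in PIPELINE:        # order fixed in code, not chosen by the LLM
        result = step(result)    # one failed step breaks the whole chain
    return result
```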
L3

Router / Branching Agent

Dynamic Routing · Classifier · Parallel

Analyzes input and branches to the appropriate path. The LLM acts as a classifier, and each branch is an independent workflow. Resource-efficient with support for parallel execution.

Input → Router → one of:
  • Simple → Direct Answer
  • Data → Search Pipeline
  • Code → Code Generation

Customer Support System:
Customer Message → Router
  • "Refund Request" → Refund Processing Workflow
  • "Technical Inquiry" → Technical Support Workflow
  • "General Inquiry" → FAQ-based Response
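The customer-support routing above can be sketched directly. Here `classify` is a keyword stub standing in for an LLM classifier or semantic router; each branch is a fixed workflow, so only the dispatch is dynamic.

```python
# Level 3: LLM-as-classifier picks the branch; each branch stays deterministic.

def classify(message):
    # A real system would ask an LLM (or a semantic router) for this label.
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

ROUTES = {
    "refund":    lambda m: "Refund Processing Workflow",
    "technical": lambda m: "Technical Support Workflow",
    "general":   lambda m: "FAQ-based Response",
}

def route(message):
    return ROUTES[classify(message)](message)
```

Because branches are independent, they can also run in parallel when one input spawns several sub-requests.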
L4

ReAct / Loop Agent

True Agent · Self-Correction · Autonomous Loop

This is where true Agents begin.

The LLM loops autonomously — judging situations, selecting tools, evaluating results, and retrying when needed. The ReAct (Reasoning + Acting) pattern is core, and the LLM itself determines when to stop.

Loop: Thought (analyze & judge) → Action (select & execute tool) → Observe (review & evaluate) → back to Thought, or exit with Answer

ReAct Execution Example: Research Agent

Thought: The user is asking about South Korea's 2024 GDP. This requires a search for recent data.
Action: web_search("South Korea GDP 2024")
Observe: Search results only contain 2023 data. A more recent query is needed.
Thought: Let me change the query and search again.
Action: web_search("South Korea GDP 2024 IMF estimate")
Observe: Found $1.7 trillion per IMF estimate. Reliable source.
Answer: South Korea's 2024 GDP is approximately $1.7 trillion based on IMF estimates...

Key Features

  • LLM controls execution flow
  • Self-correction capability
  • LLM determines exit condition
  • max_iterations setting required
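The Thought → Action → Observe loop, including the max_iterations safety stop, can be sketched as follows. `llm_step` and `web_search` are deterministic stubs replaying the research example above; a real agent would call a model and a real search tool.

```python
# Level 4: the loop runs until the LLM decides to answer — bounded by
# max_iterations, because an autonomous loop must never be unbounded.

def web_search(query):                       # illustrative tool stub
    return "$1.7T (IMF estimate)" if "IMF" in query else "2023 data only"

def llm_step(history):
    # A real LLM would inspect the history and either act again or answer;
    # this stub hard-codes the retry-then-answer trajectory.
    last = history[-1] if history else ""
    if "IMF estimate" in last:
        return {"answer": "South Korea's 2024 GDP is about $1.7 trillion (IMF estimate)."}
    if "2023 data only" in last:
        return {"action": "South Korea GDP 2024 IMF estimate"}  # self-correction
    return {"action": "South Korea GDP 2024"}

def react(question, max_iterations=5):
    history = []                             # the question drives a real LLM;
    for _ in range(max_iterations):          # the stub ignores it
        step = llm_step(history)
        if "answer" in step:                 # the LLM decides it is done
            return step["answer"]
        history.append(web_search(step["action"]))   # Observe: feed result back
    return "stopped: iteration budget exhausted"
```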

Related Papers

  • ReAct (Yao et al., 2022) - ICLR 2023
  • Reflexion (Shinn et al., 2023)
  • Tree of Thoughts (Yao et al., 2023)
L5

Planning Agent

Task Decomposition · Adaptive Replanning · Reflection

Creates a complete plan before execution, executes step by step according to the plan, and dynamically modifies the plan as situations change. Capable of handling long-horizon tasks.

Plan (task decomposition: Task 1 → Task 2 → Task 3 → Task 4)
→ Execute (step-by-step ReAct execution)
→ Reflect (evaluate results & modify plan)
→ Re-plan as needed

Example: "Create a competitive analysis report"
1. Finalize competitor list
2. Collect financial data — cannot find Company B data
2'. Re-plan: search alternative sources — collection complete
3. SWOT Analysis
4. Write Report
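The Plan → Execute → Reflect cycle in the example can be sketched as below. `make_plan`, `execute`, and `reflect` are stubs for LLM calls; the simulated failure on step 2 triggers the same re-plan shown above.

```python
# Level 5: plan first, execute step by step, reflect and amend the plan.

def make_plan(goal):                         # LLM task decomposition (stubbed)
    return ["finalize competitor list", "collect financial data",
            "SWOT analysis", "write report"]

def execute(step):
    if step == "collect financial data":
        return None                          # simulated failure: no Company B data
    return f"done: {step}"

def reflect(step, result):
    # On failure, a real LLM would propose an amended step (re-plan).
    return f"{step} via alternative sources" if result is None else None

def planning_agent(goal):
    log = []
    for step in make_plan(goal):
        result = execute(step)
        amended = reflect(step, result)
        if amended:                          # re-plan and retry the step once
            result = f"done: {amended}"
        log.append(result)
    return log
```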
L6

Multi-Agent System

Collaboration · Specialization · Distributed

Multiple agents collaborate with their own roles, tools, and prompts. The Orchestrator distributes tasks and integrates results, while each agent maintains independent context to mitigate context window limitations.

Orchestrator
  • Researcher (Search, Fetch)
  • Coder (IDE, Terminal)
  • Reviewer (Lint, Test)
  • Writer (Docs, Format)

Orchestrator-Worker

A central manager distributes tasks and integrates results. The most common pattern.

Example: Claude Code's sub-agents

Evaluator-Optimizer

A Generator creates, an Evaluator gives feedback, and the cycle repeats for iterative improvement.

Example: Automated code review

Debate / Adversarial

Agent A argues, Agent B counters, and a Moderator makes the final judgment.

Example: Decision support system
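The Orchestrator-Worker pattern can be sketched in miniature. The workers below are one-line stubs for full sub-agents; the key property is that each worker call represents a fresh, independent context, with the orchestrator doing assignment and integration.

```python
# Level 6: orchestrator assigns tasks to specialized workers, then merges.

WORKERS = {
    "researcher": lambda task: f"notes on {task}",
    "coder":      lambda task: f"code for {task}",
    "reviewer":   lambda task: f"review of {task}",
}

def orchestrate(goal):
    # A real orchestrator would have an LLM produce this assignment list.
    assignments = [("researcher", goal), ("coder", goal), ("reviewer", goal)]
    results = {}
    for role, task in assignments:
        results[role] = WORKERS[role](task)   # each call = its own context window
    return " | ".join(results.values())       # integration step
```

Because each sub-agent carries only its own context, the pattern sidesteps the single-agent context-window ceiling.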

Level-by-Level Comparison

Level | Autonomy  | Tool Use    | Flow Control | Key Technology
L0    | None      | None        | Code         | Basic ChatGPT calls
L1    | Low       | Single Turn | Code         | RAG, Function Calling
L2    | Low       | Sequential  | Code         | LangChain Chains
L3    | Medium    | Branching   | Code + LLM   | Semantic Router
L4    | High      | Loop        | LLM          | ReAct, Claude Tool Use
L5    | High      | Plan + Loop | LLM          | Plan-and-Execute, ADK
L6    | Very High | Distributed | Multi-LLM    | CrewAI, AutoGen

Core Architecture Concepts

Core concepts that make up agent system internals

Agent Memory System

Core Component

Inspired by human memory systems, agent memory is divided into three types.

Short-term Memory

Working memory within the context window. Maintains the current conversation and immediately needed information.

Impl: Context Window, Working Memory

Long-term Memory

Memory persisting across sessions. Stores structured knowledge such as facts, definitions, and rules.

Impl: Vector DB, Knowledge Graph

Episodic Memory

Records past experiences and episodes, and retrieves them in similar situations.

Impl: Vector DB + Semantic Retrieval
Source: Park et al., "Generative Agents" (2023) | IBM, "AI Agent Memory" (2025)
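The three memory types can be sketched as one store. This is a toy: the rolling list stands in for context-window eviction, the dict for a knowledge store, and keyword overlap for vector similarity search.

```python
# Short-term, long-term, and episodic memory in one illustrative class.

class AgentMemory:
    def __init__(self, short_term_limit=5):
        self.limit = short_term_limit
        self.short_term = []     # context window (rolling, evicted oldest-first)
        self.long_term = {}      # facts/rules (a vector DB or KG in practice)
        self.episodic = []       # records of past experiences

    def observe(self, message):
        self.short_term.append(message)
        self.short_term = self.short_term[-self.limit:]   # window eviction

    def remember_fact(self, key, value):
        self.long_term[key] = value

    def log_episode(self, episode):
        self.episodic.append(episode)

    def recall_episodes(self, query):
        words = set(query.lower().split())
        return [e for e in self.episodic
                if words & set(e.lower().split())]  # stand-in for semantic retrieval
```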

Agentic RAG vs Traditional RAG

Evolution

Traditional RAG

Query → Vector Search → Retrieve Docs → Generate Answer

Single pass. No retry even if search results are insufficient — like borrowing one book from a library.

vs

Agentic RAG

Plan → Retrieve → Evaluate → (Re-retrieve / Tool Use, as needed) → Synthesize

Iterative search, evaluation, and re-search — like a research assistant consulting multiple sources and cross-verifying.
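The retrieve-evaluate-re-retrieve loop can be sketched as below. The tiny corpus, `retrieve`, and `good_enough` are stand-ins for a vector store, similarity search, and an LLM grading step.

```python
# Agentic RAG: keep retrieving until the evidence passes evaluation.

CORPUS = {
    "gdp 2024": "South Korea 2024 GDP was about $1.7T (IMF estimate)",
    "gdp":      "South Korea 2023 GDP figures ...",
}

def retrieve(query):                   # stand-in for vector search
    return CORPUS.get(query, "")

def good_enough(question, doc):        # stand-in for an LLM-as-grader step
    return "2024" in doc if "2024" in question else bool(doc)

def agentic_rag(question, max_rounds=3):
    query = "gdp"                      # initial, deliberately too-broad query
    for _ in range(max_rounds):
        doc = retrieve(query)
        if good_enough(question, doc):
            return f"Answer grounded in: {doc}"
        query = "gdp 2024"             # the agent rewrites its own query
    return "insufficient evidence"
```

Traditional RAG is this function with `max_rounds=1` and no query rewriting.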

Guardrails Architecture

Safety

Guardrails are designed on the principle of layered defense. No single guardrail can catch everything.

Input Guardrails: PII Detection · Prompt Injection Defense · Harmful Content Filtering
Agent Core (LLM + Tools)
Output Guardrails: Hallucination Detection · Content Review · PII Removal
Tool Guardrails: Pre-execution Validation · Permission Check · Human-in-the-Loop
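Layered defense can be sketched as nested checks around the agent core. The regex and keyword rules here are deliberately crude placeholders; production systems use dedicated classifiers per layer, but the composition is the point: any layer can independently block or sanitize.

```python
# Layered guardrails: input check -> agent core -> output check.
import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # crude PII pattern (toy)

def input_guard(text):
    if SSN_LIKE.search(text):
        raise ValueError("blocked: PII in input")
    if "ignore previous instructions" in text.lower():  # naive injection check
        raise ValueError("blocked: prompt injection")
    return text

def agent_core(text):
    return f"response to: {text}"                    # stand-in for LLM + tools

def output_guard(text):
    return SSN_LIKE.sub("[REDACTED]", text)          # PII removal on the way out

def guarded_call(user_input):
    return output_guard(agent_core(input_guard(user_input)))
```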

Cost Optimization Strategies

Production

Agent loops can consume 10-100x more tokens than a single call. Key optimization strategies:

Prompt Caching (60-80% savings)

Cached tokens are 75% cheaper. Reuse system prompts and tool schemas.

Multi-Model Routing (30-60% savings)

Use cheaper models for simple tasks; reserve advanced models for complex reasoning.

Batch Processing (~50% savings)

Apply discounts through async batch processing (OpenAI, Google, Mistral).

Prompt Engineering (15-40% savings)

Concise prompts, structured JSON output, and removing unused tools.
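Multi-model routing can be sketched as a cheap pre-check that picks the model tier. The heuristic, model names, and per-token prices below are placeholders, not real rates; in practice the difficulty check is itself a small, cheap model.

```python
# Multi-model routing: send easy requests to a cheap model.

MODELS = {                              # placeholder tiers and prices
    "cheap":    {"cost_per_1k_tokens": 0.15},
    "frontier": {"cost_per_1k_tokens": 15.0},
}

def pick_model(prompt):
    # Toy difficulty heuristic; a real router uses a small classifier model.
    hard = len(prompt) > 200 or any(
        kw in prompt.lower() for kw in ("prove", "debug", "architect"))
    return "frontier" if hard else "cheap"

def estimated_cost(prompt, output_tokens=500):
    model = pick_model(prompt)
    tokens = len(prompt) // 4 + output_tokens   # rough 4-chars-per-token estimate
    return model, tokens / 1000 * MODELS[model]["cost_per_1k_tokens"]
```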

Agent Communication Protocols

Two core protocols connecting the agent ecosystem

Model Context Protocol

by Anthropic (Nov 2024)
Vertical Agent ↔ Tools & Data

Standardizes how agents access external tools and data, reducing the N×M integration problem to N+M.

Tools Functions the LLM can call
Resources Accessible data sources
Prompts Templates for optimal usage
JSON-RPC 2.0 | stdio / HTTP+SSE
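Since MCP messages are JSON-RPC 2.0, a tool invocation is just a small envelope; the spec defines methods such as tools/list and tools/call. The sketch below builds the request only — the tool name and arguments are illustrative, and a real client would send this over stdio or HTTP.

```python
# Shape of an MCP tools/call request (JSON-RPC 2.0 envelope).
import json

def mcp_tool_call(request_id, tool_name, arguments):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool and arguments, for illustration only:
payload = json.dumps(mcp_tool_call(1, "web_search", {"query": "agent protocols"}))
```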

Agent-to-Agent Protocol

by Google (Apr 2025)
Horizontal Agent ↔ Agent

Standardizes how agents delegate tasks and exchange results. Advertises capabilities via Agent Cards.

Agent Cards Capability advertisement JSON
Tasks Task units & lifecycle
Messages Exchange context, results, artifacts
HTTP + JSON | SSE Streaming | Apache 2.0
MCP covers the vertical axis (Agent ↔ Tools); A2A covers the horizontal one (Agent ↔ Agent). The two protocols are complementary, not competitive: in December 2025, OpenAI, Anthropic, Google, Microsoft, and AWS joined the Linux Foundation's AAIF (Agentic AI Foundation) for joint governance.

Practical Guide

Practical guidance for successfully building and operating agents

Framework Comparison

LangGraph

Graph-based

Graph-based workflows. Nodes define actions, edges define flow. Centralized state management.

Durable Execution · Human-in-the-Loop · Conditional Branching
Best for: Production systems requiring complex workflows

CrewAI

Role-based

Role-based multi-agent. Assigns Role, Goal, and Backstory to each agent.

Hierarchical · Role Specialization · Task Delegation
Best for: Team simulations, tasks requiring diverse perspectives

AutoGen

Event-driven

Async event-driven architecture. Actor model-based message exchange.

Cross-language · Distributed Network · OpenTelemetry
Best for: Enterprise-grade distributed agent systems

Google ADK

Code-first

Code-first development. Runner-centric design with event streaming.

Model-agnostic · Built-in Eval · Vertex AI Deploy
Best for: Google Cloud environments, streaming-critical apps

OpenAI Agents SDK

Minimal

Intentionally minimal Python-native approach. Three primitive types.

Handoffs · Guardrails · Built-in Tracing
Best for: Rapid prototyping, simple agent systems

Claude Code

Terminal Agent

Single-threaded master loop + parallel sub-agent execution. ~40 tools, permission gate.

1M Context · Permission Gate · Sub-agents
Best for: Codebase work, complex multi-file changes

Top 10 Common Mistakes in Agent Development

88% of AI agent projects fail before reaching production. Key causes:

01
Over-Engineering

Introducing complex multi-agent frameworks when simple LLM + prompting would suffice

02
Ignoring Data Quality

Building agents on incomplete data pipelines

03
No Evaluation Framework

Only 15% of AI teams perform comprehensive evaluation

04
Missing Observability

Only 5% of production agents have mature monitoring

05
Treating Like RPA

"Build-deploy-forget" approach fails. Continuous improvement needed

06
Tool Overload

Every tool definition consumes tokens. Remove unused tools

07
No Human-in-the-Loop

Fully automating critical decisions increases accident risk

08
Cost Management Failure

Agent loops can consume 10-100x tokens of a single call

09
Poor Tool Documentation

Tool descriptions are as important as UX design (Anthropic recommendation)

10
No Exit Criteria

Autonomous agents without exit criteria can enter infinite loops

Key Statistics

88%
Agent projects fail before production
1,445%
Multi-agent inquiry growth rate (Gartner, Q1'24→Q2'25)
85%
Developers using AI coding tools (2025)
$2.1M
Avg. cost savings with AI security controls
80.9%
SWE-bench Verified top score (Claude Opus)
33%
Enterprise apps with agentic AI by 2028 (Gartner)

Practical Recommendations

1

Simple First

Both Anthropic and OpenAI recommend the same: start simple. Level 2-3 can solve most problems. Level 4+ is only needed for truly complex open-ended tasks.

2

Evaluate Early

Build the evaluation framework first. Combine LLM-as-Judge, automated benchmarks, and A/B testing. You can't improve what you can't measure.

3

Human-in-the-Loop

Always include human approval for critical decisions. Gradually expand autonomy as trust builds. Don't aim for full automation from the start.

4

Observe Everything

Track every agent action with LangSmith, Braintrust, or OpenTelemetry. 62% of production agents cite observability improvement as their top priority.

Key Papers & Resources

Essential papers and resources in the agent field

Foundational

ReAct: Synergizing Reasoning and Acting

Yao et al. (Princeton, Google) | ICLR 2023

Foundation of agent loops. Proposed the Thought-Action-Observation pattern. Overcame hallucination on HotpotQA and improved ALFWorld success rate by 34%.

Level 4 Agent Loop
Foundational

Chain-of-Thought Prompting

Wei et al. (Google) | NeurIPS 2022

The beginning of step-by-step reasoning. Achieved GSM8K SOTA with 8 CoT examples on a 540B model. An emergent capability at 100B+ parameters.

Level 0 Reasoning
Foundational

Toolformer

Schick et al. (Meta AI) | Feb 2023

LLMs learn tool use through self-supervision. Autonomously decides which API to call, when, and with what arguments.

Level 1 Tool Use
Advanced

Generative Agents: Interactive Simulacra

Park et al. (Stanford) | UIST 2023

25 agents living in a Sims-like village. Demonstrated human-like social behavior through Observation-Reflection-Retrieval architecture.

Level 5-6 Memory Social
Advanced

Reflexion

Shinn et al. | NeurIPS 2023

Learning through verbal self-reflection. Learns from trial and error without weight updates. Achieved HumanEval 67%→88% pass@1.

Level 4 Self-Improvement
Advanced

Tree of Thoughts

Yao et al. (Princeton) | NeurIPS 2023

Search-based reasoning generalizing CoT. Explores thought trees via BFS/DFS. Game of 24: CoT 4% → ToT 74%.

Level 2-3 Planning
System Design

MRKL Systems

Karpas et al. (AI21 Labs) | May 2022

Theoretical foundation for neuro-symbolic architecture. Router forwards input to appropriate modules (LLM, calculator, DB, API).

Level 1-2 Router
System Design

HuggingGPT

Shen et al. | NeurIPS 2023

Uses ChatGPT as controller to orchestrate Hugging Face's specialized models. Pioneer of multimodal task processing.

Level 5 Orchestration
Industry

Building Effective Agents

Schluntz & Zhang (Anthropic) | Dec 2024

The most influential practical guide. Six composable patterns and the "start simple" philosophy.

All Levels Best Practice
Industry

LLM Powered Autonomous Agents

Lilian Weng (OpenAI) | Jun 2023

Agent = LLM + Memory + Planning + Tools. The de facto standard reference for agent architecture.

All Levels Architecture

Key Benchmarks

SWE-bench Verified

Evaluates real GitHub issue resolution. The key metric for production coding agents.

Top: ~80.9%

GAIA

Tasks easy for humans but requiring multimodal tool use for AI.

Tool Use + Reasoning

AgentBench

Evaluates agent task performance in simulated OS environments.

Multi-Environment