From simple LLM calls to multi-agent systems
Everything about AI Agents in 7 levels
According to Lilian Weng (OpenAI), an AI Agent is a combination of four core components
Source: Lilian Weng, "LLM Powered Autonomous Agents" (June 2023)
The core engine for reasoning and decision-making. Understands natural language, makes plans, and decides tool usage.
Short-term (context window) and long-term (vector DB) memory. Accumulates experience and references the past.
Task decomposition and self-reflection. Breaks complex goals into actionable steps.
External APIs, search engines, code executors, etc. Extends LLM capabilities to the real world.
Anthropic's "Building Effective Agents" (2024) clearly distinguishes between Workflows and Agents
Exploring the structure and characteristics of each level, from simple LLM calls to multi-agent systems
The most basic form. A simple call where you input a prompt and get a response, relying solely on the LLM's trained knowledge. Without access to external tools, hallucination risk is highest.
The LLM can call external tools once when needed. It decides whether to use a tool, receives the result, and generates the final response. However, it ends in a single cycle — no retries even if results are insufficient.
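The single-cycle shape can be sketched in a few lines of Python. The LLM and the tool are stand-in stubs here (all names hypothetical); a real system would replace them with an actual model API and real tools:

```python
# Hypothetical stubs: a real system would call an LLM API and real tools here.
def fake_llm_decide(prompt):
    """Pretend the model decides whether a tool is needed (stub)."""
    if "weather" in prompt.lower():
        return {"tool": "get_weather", "args": {"city": "Seoul"}}
    return {"tool": None}

def get_weather(city):
    return f"Sunny in {city}"  # stub tool result

def single_turn_tool_call(prompt):
    decision = fake_llm_decide(prompt)
    if decision["tool"] == "get_weather":
        result = get_weather(**decision["args"])
        # One cycle only: the tool result is folded into the final answer,
        # with no retry even if the result is insufficient.
        return f"Answer based on tool result: {result}"
    return "Answer from model knowledge alone"
```

The key property is structural: there is exactly one decide → execute → respond pass, never a loop.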
Executes multiple steps in a predefined order. The output of each step becomes the input for the next. Quality gates can be added for validation, but the overall flow is fixed in code.
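A minimal chaining sketch, with stub steps and a trivial quality gate (all functions hypothetical stand-ins for LLM calls):

```python
def step_outline(topic):
    return f"Outline for {topic}"        # stub: first LLM call

def step_draft(outline):
    return f"Draft based on: {outline}"  # stub: second LLM call

def quality_gate(text):
    return len(text) > 10                # stub validation check

def chain(topic):
    outline = step_outline(topic)
    if not quality_gate(outline):        # gate between fixed steps
        raise ValueError("outline failed quality gate")
    return step_draft(outline)           # output of step 1 feeds step 2
```

Note that the flow (outline → gate → draft) is hard-coded; the LLM fills in content but never changes the sequence.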
Analyzes input and branches to the appropriate path. The LLM acts as a classifier, and each branch is an independent workflow. Resource-efficient with support for parallel execution.
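A routing sketch with a stub classifier in place of the LLM (names and routes hypothetical):

```python
def classify(query):
    """Stub for the LLM-as-classifier step."""
    return "billing" if "refund" in query.lower() else "general"

# Each branch is an independent workflow (stubbed as lambdas here)
ROUTES = {
    "billing": lambda q: f"[billing workflow] {q}",
    "general": lambda q: f"[general workflow] {q}",
}

def route(query):
    return ROUTES[classify(query)](query)
```

Because each branch is independent, cheap models and short prompts can serve simple routes while complex routes get heavier treatment.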
The LLM loops autonomously — judging situations, selecting tools, evaluating results, and retrying when needed. The ReAct (Reasoning + Acting) pattern is core, and the LLM itself determines when to stop.
Analyze & Judge
Select & Execute Tool
Review & Evaluate
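The three loop stages above can be sketched as a minimal ReAct-style loop. The reasoning and tools are stubs (all names hypothetical); the structure — loop until the model decides it is done, with a step budget as backstop — is the point:

```python
# Stubs standing in for real tools and a real LLM (hypothetical).
TOOLS = {"search": lambda q: f"search results for '{q}'"}

def llm_think(task, observations):
    """Pretend reasoning: search once, then answer from the observation."""
    if not observations:
        return {"done": False, "tool": "search", "input": task}
    return {"done": True, "answer": f"Answer derived from: {observations[-1]}"}

def react_loop(task, max_steps=5):
    observations = []
    for _ in range(max_steps):                          # LLM decides when to stop
        thought = llm_think(task, observations)         # Analyze & Judge
        if thought["done"]:
            return thought["answer"]
        obs = TOOLS[thought["tool"]](thought["input"])  # Select & Execute Tool
        observations.append(obs)                        # Review & Evaluate
    return "stopped: step budget exhausted"
```

The `max_steps` guard matters: without it, an autonomous loop has no hard exit criterion.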
Example of query refinement inside the loop: web_search("2024 South Korea GDP") → web_search("South Korea GDP 2024 IMF estimate")
Creates a complete plan before execution, executes step by step according to the plan, and dynamically modifies the plan as situations change. Capable of handling long-horizon tasks.
Task Decomposition
Step-by-step ReAct Execution
Evaluate Results & Modify Plan
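The plan-execute-revise cycle above, sketched with stub planner and executor (hypothetical names; a real system would use LLM calls for both):

```python
def make_plan(goal):
    """Stub planner: decompose the goal into steps up front."""
    return [f"research {goal}", f"summarize {goal}"]

def execute_step(step):
    """Stub for a per-step ReAct execution."""
    return f"done: {step}"

def plan_and_execute(goal):
    plan = make_plan(goal)                    # Task Decomposition
    results = []
    while plan:
        step = plan.pop(0)
        results.append(execute_step(step))    # Step-by-step execution
        # Evaluate Results & Modify Plan: a real agent could
        # re-plan here, inserting or dropping remaining steps.
    return results
```

The difference from a plain ReAct loop is that the plan exists as explicit, mutable state, which is what makes long-horizon tasks tractable.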
Multiple agents collaborate with their own roles, tools, and prompts. The Orchestrator distributes tasks and integrates results, while each agent maintains independent context to mitigate context window limitations.
Central manager distributes tasks and integrates results. The most common pattern.
Generator creates, Evaluator provides feedback. Iterative improvement.
Agent A argues, Agent B counters. Moderator makes final judgment.
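The orchestrator pattern (the most common of the three) can be sketched with stub worker agents, each with its own role (all names hypothetical):

```python
# Each worker has its own role and would, in a real system,
# have its own prompt, tools, and independent context window.
WORKERS = {
    "research": lambda task: f"research notes on {task}",
    "writing":  lambda notes: f"report from: {notes}",
}

def orchestrator(task):
    """Central manager: distribute subtasks, then integrate results."""
    notes = WORKERS["research"](task)
    report = WORKERS["writing"](notes)
    return report
```

Keeping each worker's context separate is what mitigates the context-window pressure a single monolithic agent would face.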
| Level | Autonomy | Tool Use | Flow Control | Key Technology |
|---|---|---|---|---|
| L0 | None | None | Code | Basic ChatGPT Calls |
| L1 | Low | Single Turn | Code | RAG, Function Calling |
| L2 | Low | Sequential | Code | LangChain Chains |
| L3 | Medium | Branching | Code+LLM | Semantic Router |
| L4 | High | Loop | LLM | ReAct, Claude Tool Use |
| L5 | High | Plan+Loop | LLM | Plan-and-Execute, ADK |
| L6 | Very High | Distributed | Multi LLM | CrewAI, AutoGen |
Core concepts that make up agent system internals
Inspired by human memory systems, agent memory is divided into three types.
Working memory within the context window. Maintains current conversation and immediately needed information.
Memory persisting across sessions. Stores structured knowledge like facts, definitions, and rules.
Records past experiences and episodes. References past experiences in similar situations.
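The three memory types above can be sketched as one structure (illustrative only; the semantic store would be a vector DB in practice, not a dict):

```python
from collections import deque

class AgentMemory:
    """Sketch of the three memory types (hypothetical structure)."""
    def __init__(self, short_term_limit=10):
        # Short-term: bounded like a context window; old turns fall out
        self.short_term = deque(maxlen=short_term_limit)
        # Semantic: long-term facts, definitions, rules
        self.semantic = {}
        # Episodic: records of past experiences
        self.episodic = []

    def remember_turn(self, turn):
        self.short_term.append(turn)

    def store_fact(self, key, value):
        self.semantic[key] = value

    def log_episode(self, episode):
        self.episodic.append(episode)
```

The bounded `deque` mirrors the defining trait of short-term memory: it forgets automatically, which is exactly why the other two stores exist.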
Single pass. No retry even if search results are insufficient. Like borrowing one book from a library.
Iterative search, evaluation, re-search. Like a research assistant finding multiple sources and cross-verifying.
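The search-evaluate-re-search cycle, sketched with stub retrieval and a stub sufficiency check (hypothetical names):

```python
def search(query):
    """Stub retriever returning one document per query."""
    return [f"doc about {query}"]

def good_enough(docs):
    """Stub relevance/coverage check (a real agent would use the LLM)."""
    return len(docs) >= 2

def agentic_retrieval(query, max_rounds=3):
    docs, q = [], query
    for _ in range(max_rounds):
        docs += search(q)            # search
        if good_enough(docs):        # evaluate
            break
        q = q + " (refined)"         # re-search with a refined query
    return docs
```

Contrast with the single-pass version above: the loop trades extra tokens for cross-verified coverage, like the research assistant in the analogy.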
Guardrails are designed on the principle of layered defense. No single guardrail can catch everything.
Agent loops can consume 10-100x more tokens than a single call. Key optimization strategies:
Cached tokens are 75% cheaper. Reuse system prompts and tool schemas
Use cheaper models for simple tasks, advanced models only for complex reasoning
Apply discounts through async batch processing (OpenAI, Google, Mistral)
Concise prompts, structured JSON output, remove unused tools
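The caching arithmetic above can be made concrete. Assuming the 75% cached-token discount stated here (actual pricing varies by provider), a hypothetical cost helper:

```python
def call_cost(prompt_tokens, cached_tokens, price_per_token=1.0,
              cache_discount=0.75):
    """Cost of one call, with cached tokens billed at a 75% discount.
    price_per_token is a relative unit, not a real price."""
    fresh = prompt_tokens - cached_tokens
    return (fresh * price_per_token
            + cached_tokens * price_per_token * (1 - cache_discount))

# A 1,000-token prompt where an 800-token system prompt + tool schema
# prefix is cached: 200 * 1.0 + 800 * 0.25 = 400.0 units
# versus 1,000.0 units uncached — a 60% saving on this call.
```

This is why reusing a stable prefix (system prompt, tool schemas) matters so much in multi-step loops: the prefix is re-sent on every iteration.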
Two core protocols connecting the agent ecosystem
Standardizes how agents access external tools and data. Reduces the N × M integration problem to N + M.
Standardizes how agents delegate tasks and exchange results. Advertises capabilities via Agent Cards.
These two protocols are complementary, not competitive. In December 2025, OpenAI, Anthropic, Google, Microsoft, and AWS joined the Linux Foundation's AAIF (Agentic AI Foundation) for joint governance.
Practical guidance for successfully building and operating agents
Graph-based workflows. Nodes define actions, edges define flow. Centralized state management.
Role-based multi-agent. Assigns Role, Goal, and Backstory to each agent.
Async event-driven architecture. Actor model-based message exchange.
Code-first development. Runner-centric design with event streaming.
Intentionally minimal Python-native approach. Three primitive types.
Single-threaded master loop + parallel sub-agent execution. ~40 tools, permission gate.
88% of AI agent projects fail before reaching production. Key causes:
Introducing complex multi-agent frameworks when simple LLM + prompting would suffice
Building agents on incomplete data pipelines
Only 15% of AI teams perform comprehensive evaluation
Only 5% of production agents have mature monitoring
"Build-deploy-forget" approach fails. Continuous improvement needed
Every tool definition consumes tokens. Remove unused tools
Fully automating critical decisions increases accident risk
Agent loops can consume 10-100x tokens of a single call
Tool descriptions are as important as UX design (Anthropic recommendation)
Autonomous agents without exit criteria can enter infinite loops
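Several of the pitfalls above (runaway loops, 10-100x token burn) share one mitigation: hard exit criteria around the agent loop. A sketch with a stub agent step (hypothetical interface):

```python
def run_with_budget(agent_step, max_steps=10, max_tokens=50_000):
    """Wrap an agent loop with hard exit criteria against runaway loops.
    agent_step is assumed to return (done, tokens_used_this_step)."""
    tokens_used = 0
    for _ in range(max_steps):
        done, tokens = agent_step()
        tokens_used += tokens
        if done:
            return "completed"
        if tokens_used > max_tokens:
            return "aborted: token budget exceeded"
    return "aborted: step limit reached"

def make_stub_agent(finish_after):
    """Stub agent that finishes after a fixed number of steps."""
    calls = {"n": 0}
    def step():
        calls["n"] += 1
        return calls["n"] >= finish_after, 1_000  # (done?, tokens used)
    return step
```

The same wrapper is a natural place to add a human-approval gate before irreversible actions.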
Both Anthropic and OpenAI recommend the same: start simple. Level 2-3 can solve most problems. Level 4+ is only needed for truly complex open-ended tasks.
Build the evaluation framework first. Combine LLM-as-Judge, automated benchmarks, and A/B testing. You can't improve what you can't measure.
Always include human approval for critical decisions. Gradually expand autonomy as trust builds. Don't aim for full automation from the start.
Track every agent action with LangSmith, Braintrust, or OpenTelemetry. 62% of production agents cite observability improvement as their top priority.
Essential papers and resources in the agent field
Foundation of agent loops. Proposed the Thought-Action-Observation pattern. Reduced hallucination on HotpotQA and improved ALFWorld success rate by 34%.
The beginning of step-by-step reasoning. Achieved GSM8K SOTA with 8 CoT examples on a 540B model. An emergent capability at 100B+ parameters.
LLMs learn tool use through self-supervision. Autonomously decides which API to call, when, and with what arguments.
25 agents living in a Sims-like village. Demonstrated human-like social behavior through Observation-Reflection-Retrieval architecture.
Learning through verbal self-reflection. Learns from trial and error without weight updates. Achieved HumanEval 67%→88% pass@1.
Search-based reasoning generalizing CoT. Explores thought trees via BFS/DFS. Game of 24: CoT 4% → ToT 74%.
Theoretical foundation for neuro-symbolic architecture. Router forwards input to appropriate modules (LLM, calculator, DB, API).
Uses ChatGPT as controller to orchestrate Hugging Face's specialized models. Pioneer of multimodal task processing.
The most influential practical guide. Six composable patterns and the "start simple" philosophy.
Agent = LLM + Memory + Planning + Tools. The de facto standard reference for agent architecture.
Evaluates real GitHub issue resolution. The key metric for production coding agents.
Tasks easy for humans but requiring multimodal tool use for AI.
Evaluates agent task performance in simulated OS environments.