From simple LLM calls to multi-agent systems
Everything about AI Agents in 7 levels
According to Lilian Weng (OpenAI), an AI Agent is a combination of four core components
Source: Lilian Weng, "LLM Powered Autonomous Agents" (June 2023)
The core engine for reasoning and decision-making. Understands natural language, makes plans, and decides tool usage.
Short-term (context window) and long-term (vector DB) memory. Accumulates experience and references the past.
Task decomposition and self-reflection. Breaks complex goals into actionable steps.
External APIs, search engines, code executors, etc. Extends LLM capabilities to the real world.
Anthropic's "Building Effective Agents" (2024) clearly distinguishes between Workflows and Agents
Exploring the structure and characteristics of each level, from simple LLM calls to multi-agent systems
The most basic form. A simple call where you input a prompt and get a response, relying solely on the LLM's trained knowledge. Without access to external tools, hallucination risk is highest.
The LLM can call external tools once when needed. It decides whether to use a tool, receives the result, and generates the final response. However, it ends in a single cycle — no retries even if results are insufficient.
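The single-cycle shape can be sketched in a few lines of Python. The LLM and the tool are stand-in stubs here (all names hypothetical); a real system would replace them with an actual model API and real tools:

```python
# Hypothetical stubs: a real system would call an LLM API and real tools here.
def fake_llm_decide(prompt):
    """Pretend the model decides whether a tool is needed (stub)."""
    if "weather" in prompt.lower():
        return {"tool": "get_weather", "args": {"city": "Seoul"}}
    return {"tool": None}

def get_weather(city):
    return f"Sunny in {city}"  # stub tool result

def single_turn_tool_call(prompt):
    decision = fake_llm_decide(prompt)
    if decision["tool"] == "get_weather":
        result = get_weather(**decision["args"])
        # One cycle only: the tool result is folded into the final answer,
        # with no retry even if the result is insufficient.
        return f"Answer based on tool result: {result}"
    return "Answer from model knowledge alone"
```

The key property is structural: there is exactly one decide → execute → respond pass, never a loop.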
Executes multiple steps in a predefined order. The output of each step becomes the input for the next. Quality gates can be added for validation, but the overall flow is fixed in code.
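A minimal chaining sketch, with stub steps and a trivial quality gate (all functions hypothetical stand-ins for LLM calls):

```python
def step_outline(topic):
    return f"Outline for {topic}"        # stub: first LLM call

def step_draft(outline):
    return f"Draft based on: {outline}"  # stub: second LLM call

def quality_gate(text):
    return len(text) > 10                # stub validation check

def chain(topic):
    outline = step_outline(topic)
    if not quality_gate(outline):        # gate between fixed steps
        raise ValueError("outline failed quality gate")
    return step_draft(outline)           # output of step 1 feeds step 2
```

Note that the flow (outline → gate → draft) is hard-coded; the LLM fills in content but never changes the sequence.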
Analyzes input and branches to the appropriate path. The LLM acts as a classifier, and each branch is an independent workflow. Resource-efficient with support for parallel execution.
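A routing sketch with a stub classifier in place of the LLM (names and routes hypothetical):

```python
def classify(query):
    """Stub for the LLM-as-classifier step."""
    return "billing" if "refund" in query.lower() else "general"

# Each branch is an independent workflow (stubbed as lambdas here)
ROUTES = {
    "billing": lambda q: f"[billing workflow] {q}",
    "general": lambda q: f"[general workflow] {q}",
}

def route(query):
    return ROUTES[classify(query)](query)
```

Because each branch is independent, cheap models and short prompts can serve simple routes while complex routes get heavier treatment.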
The LLM loops autonomously — judging situations, selecting tools, evaluating results, and retrying when needed. The ReAct (Reasoning + Acting) pattern is core, and the LLM itself determines when to stop.
Analyze & Judge
Select & Execute Tool
Review & Evaluate
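The three loop stages above can be sketched as a minimal ReAct-style loop. The reasoning and tools are stubs (all names hypothetical); the structure — loop until the model decides it is done, with a step budget as backstop — is the point:

```python
# Stubs standing in for real tools and a real LLM (hypothetical).
TOOLS = {"search": lambda q: f"search results for '{q}'"}

def llm_think(task, observations):
    """Pretend reasoning: search once, then answer from the observation."""
    if not observations:
        return {"done": False, "tool": "search", "input": task}
    return {"done": True, "answer": f"Answer derived from: {observations[-1]}"}

def react_loop(task, max_steps=5):
    observations = []
    for _ in range(max_steps):                          # LLM decides when to stop
        thought = llm_think(task, observations)         # Analyze & Judge
        if thought["done"]:
            return thought["answer"]
        obs = TOOLS[thought["tool"]](thought["input"])  # Select & Execute Tool
        observations.append(obs)                        # Review & Evaluate
    return "stopped: step budget exhausted"
```

The `max_steps` guard matters: without it, an autonomous loop has no hard exit criterion.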
Example of query refinement inside the loop: web_search("2024 South Korea GDP") → web_search("South Korea GDP 2024 IMF estimate")
Creates a complete plan before execution, executes step by step according to the plan, and dynamically modifies the plan as situations change. Capable of handling long-horizon tasks.
Task Decomposition
Step-by-step ReAct Execution
Evaluate Results & Modify Plan
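The plan-execute-revise cycle above, sketched with stub planner and executor (hypothetical names; a real system would use LLM calls for both):

```python
def make_plan(goal):
    """Stub planner: decompose the goal into steps up front."""
    return [f"research {goal}", f"summarize {goal}"]

def execute_step(step):
    """Stub for a per-step ReAct execution."""
    return f"done: {step}"

def plan_and_execute(goal):
    plan = make_plan(goal)                    # Task Decomposition
    results = []
    while plan:
        step = plan.pop(0)
        results.append(execute_step(step))    # Step-by-step execution
        # Evaluate Results & Modify Plan: a real agent could
        # re-plan here, inserting or dropping remaining steps.
    return results
```

The difference from a plain ReAct loop is that the plan exists as explicit, mutable state, which is what makes long-horizon tasks tractable.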
Multiple agents collaborate with their own roles, tools, and prompts. The Orchestrator distributes tasks and integrates results, while each agent maintains independent context to mitigate context window limitations.
Central manager distributes tasks and integrates results. The most common pattern.
Generator creates, Evaluator provides feedback. Iterative improvement.
Agent A argues, Agent B counters. Moderator makes final judgment.
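The orchestrator pattern (the most common of the three) can be sketched with stub worker agents, each with its own role (all names hypothetical):

```python
# Each worker has its own role and would, in a real system,
# have its own prompt, tools, and independent context window.
WORKERS = {
    "research": lambda task: f"research notes on {task}",
    "writing":  lambda notes: f"report from: {notes}",
}

def orchestrator(task):
    """Central manager: distribute subtasks, then integrate results."""
    notes = WORKERS["research"](task)
    report = WORKERS["writing"](notes)
    return report
```

Keeping each worker's context separate is what mitigates the context-window pressure a single monolithic agent would face.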
| Level | Autonomy | Tool Use | Flow Control | Key Technology |
|---|---|---|---|---|
| L0 | None | None | Code | Basic ChatGPT Calls |
| L1 | Low | Single Turn | Code | RAG, Function Calling |
| L2 | Low | Sequential | Code | LangChain Chains |
| L3 | Medium | Branching | Code+LLM | Semantic Router |
| L4 | High | Loop | LLM | ReAct, Claude Tool Use |
| L5 | High | Plan+Loop | LLM | Plan-and-Execute, ADK |
| L6 | Very High | Distributed | Multi LLM | CrewAI, AutoGen |
Core concepts that make up agent system internals
Inspired by human memory systems, agent memory is divided into three types.
Working memory within the context window. Maintains current conversation and immediately needed information.
Memory persisting across sessions. Stores structured knowledge like facts, definitions, and rules.
Records past experiences and episodes. References past experiences in similar situations.
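The three memory types above can be sketched as one structure (illustrative only; the semantic store would be a vector DB in practice, not a dict):

```python
from collections import deque

class AgentMemory:
    """Sketch of the three memory types (hypothetical structure)."""
    def __init__(self, short_term_limit=10):
        # Short-term: bounded like a context window; old turns fall out
        self.short_term = deque(maxlen=short_term_limit)
        # Semantic: long-term facts, definitions, rules
        self.semantic = {}
        # Episodic: records of past experiences
        self.episodic = []

    def remember_turn(self, turn):
        self.short_term.append(turn)

    def store_fact(self, key, value):
        self.semantic[key] = value

    def log_episode(self, episode):
        self.episodic.append(episode)
```

The bounded `deque` mirrors the defining trait of short-term memory: it forgets automatically, which is exactly why the other two stores exist.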
Single pass. No retry even if search results are insufficient. Like borrowing one book from a library.
Iterative search, evaluation, re-search. Like a research assistant finding multiple sources and cross-verifying.
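The search-evaluate-re-search cycle, sketched with stub retrieval and a stub sufficiency check (hypothetical names):

```python
def search(query):
    """Stub retriever returning one document per query."""
    return [f"doc about {query}"]

def good_enough(docs):
    """Stub relevance/coverage check (a real agent would use the LLM)."""
    return len(docs) >= 2

def agentic_retrieval(query, max_rounds=3):
    docs, q = [], query
    for _ in range(max_rounds):
        docs += search(q)            # search
        if good_enough(docs):        # evaluate
            break
        q = q + " (refined)"         # re-search with a refined query
    return docs
```

Contrast with the single-pass version above: the loop trades extra tokens for cross-verified coverage, like the research assistant in the analogy.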
Guardrails are designed on the principle of layered defense. No single guardrail can catch everything.
Agent loops can consume 10-100x more tokens than a single call. Key optimization strategies:
Cached tokens are 75% cheaper. Reuse system prompts and tool schemas
Use cheaper models for simple tasks, advanced models only for complex reasoning
Apply discounts through async batch processing (OpenAI, Google, Mistral)
Concise prompts, structured JSON output, remove unused tools
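The caching arithmetic above can be made concrete. Assuming the 75% cached-token discount stated here (actual pricing varies by provider), a hypothetical cost helper:

```python
def call_cost(prompt_tokens, cached_tokens, price_per_token=1.0,
              cache_discount=0.75):
    """Cost of one call, with cached tokens billed at a 75% discount.
    price_per_token is a relative unit, not a real price."""
    fresh = prompt_tokens - cached_tokens
    return (fresh * price_per_token
            + cached_tokens * price_per_token * (1 - cache_discount))

# A 1,000-token prompt where an 800-token system prompt + tool schema
# prefix is cached: 200 * 1.0 + 800 * 0.25 = 400.0 units
# versus 1,000.0 units uncached — a 60% saving on this call.
```

This is why reusing a stable prefix (system prompt, tool schemas) matters so much in multi-step loops: the prefix is re-sent on every iteration.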
Two core protocols connecting the agent ecosystem
Standardizes how agents access external tools and data. Reduces the N × M integration problem to N + M.
Standardizes how agents delegate tasks and exchange results. Advertises capabilities via Agent Cards.
These two protocols are complementary, not competitive. In December 2025, OpenAI, Anthropic, Google, Microsoft, and AWS joined the Linux Foundation's AAIF (Agentic AI Foundation) for joint governance.
Practical guidance for successfully building and operating agents
Graph-based workflows. Nodes define actions, edges define flow. Centralized state management.
Role-based multi-agent. Assigns Role, Goal, and Backstory to each agent.
Async event-driven architecture. Actor model-based message exchange.
Code-first development. Runner-centric design with event streaming.
Intentionally minimal Python-native approach. Three primitive types.
Single-threaded master loop + parallel sub-agent execution. ~40 tools, permission gate.
88% of AI agent projects fail before reaching production. Key causes:
Introducing complex multi-agent frameworks when simple LLM + prompting would suffice
Building agents on incomplete data pipelines
Only 15% of AI teams perform comprehensive evaluation
Only 5% of production agents have mature monitoring
"Build-deploy-forget" approach fails. Continuous improvement needed
Every tool definition consumes tokens. Remove unused tools
Fully automating critical decisions increases accident risk
Agent loops can consume 10-100x tokens of a single call
Tool descriptions are as important as UX design (Anthropic recommendation)
Autonomous agents without exit criteria can enter infinite loops
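Several of the pitfalls above (runaway loops, 10-100x token burn) share one mitigation: hard exit criteria around the agent loop. A sketch with a stub agent step (hypothetical interface):

```python
def run_with_budget(agent_step, max_steps=10, max_tokens=50_000):
    """Wrap an agent loop with hard exit criteria against runaway loops.
    agent_step is assumed to return (done, tokens_used_this_step)."""
    tokens_used = 0
    for _ in range(max_steps):
        done, tokens = agent_step()
        tokens_used += tokens
        if done:
            return "completed"
        if tokens_used > max_tokens:
            return "aborted: token budget exceeded"
    return "aborted: step limit reached"

def make_stub_agent(finish_after):
    """Stub agent that finishes after a fixed number of steps."""
    calls = {"n": 0}
    def step():
        calls["n"] += 1
        return calls["n"] >= finish_after, 1_000  # (done?, tokens used)
    return step
```

The same wrapper is a natural place to add a human-approval gate before irreversible actions.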
Both Anthropic and OpenAI recommend the same: start simple. Level 2-3 can solve most problems. Level 4+ is only needed for truly complex open-ended tasks.
Build the evaluation framework first. Combine LLM-as-Judge, automated benchmarks, and A/B testing. You can't improve what you can't measure.
Always include human approval for critical decisions. Gradually expand autonomy as trust builds. Don't aim for full automation from the start.
Track every agent action with LangSmith, Braintrust, or OpenTelemetry. 62% of production agents cite observability improvement as their top priority.
Essential papers and resources in the agent field
Foundation of agent loops. Proposed the Thought-Action-Observation pattern. Reduced hallucination on HotpotQA and improved ALFWorld success rate by 34%.
The beginning of step-by-step reasoning. Achieved GSM8K SOTA with 8 CoT examples on a 540B model. An emergent capability at 100B+ parameters.
LLMs learn tool use through self-supervision. Autonomously decides which API to call, when, and with what arguments.
25 agents living in a Sims-like village. Demonstrated human-like social behavior through Observation-Reflection-Retrieval architecture.
Learning through verbal self-reflection. Learns from trial and error without weight updates. Achieved HumanEval 67%→88% pass@1.
Search-based reasoning generalizing CoT. Explores thought trees via BFS/DFS. Game of 24: CoT 4% → ToT 74%.
Theoretical foundation for neuro-symbolic architecture. Router forwards input to appropriate modules (LLM, calculator, DB, API).
Uses ChatGPT as controller to orchestrate Hugging Face's specialized models. Pioneer of multimodal task processing.
The most influential practical guide. Six composable patterns and the "start simple" philosophy.
Agent = LLM + Memory + Planning + Tools. The de facto standard reference for agent architecture.
Evaluates real GitHub issue resolution. The key metric for production coding agents.
Tasks easy for humans but requiring multimodal tool use for AI.
Evaluates agent task performance in simulated OS environments.