第2章 Agent的本质

你一定听过这句话:2025年是Agent元年。

但如果你去问十个人”Agent是什么”,你会得到不同的答案。有人说它是自动化的升级版,有人说它是能用工具的ChatBot,有人说它是数字员工。这些说法都对,也都不够精确。就像问”手机是什么”——有人说它是电话,有人说它是相机,有人说它是电脑。每种说法都抓住了一个侧面,但没触及本质。

这一章,我想用最直白的方式,把Agent的技术本质拆开给你看。不需要你懂编程,不需要你懂数学。你只需要带一个问题来:这个东西到底是怎么运转的?

读完这章,你会搞清楚Agent的基本公式、它怎么”想”的、它过去三年怎么进化的,还有最关键的问题——它能做什么,做不了什么。


2.0 从模型质变到 Agent 拐点

我认为 Agent 拐点,本质是任务链变长之后的新工作方式。

2025 年底前后,许多人观察到一种质变:大模型不再只是「你说了我才动一步」的实习生式助手,而是在编程、文档处理、数据分析等场景里,开始能接手更长链路、多步骤的任务。这不是「又快了一点」的渐进改良,而是可靠性跨过阈值之后的新工作方式——业界常称之为 Agent 拐点

这意味着:要理解 Agent,不能只看它「做了什么」,还要理解 Model 这颗大脑能稳定输出什么,以及 Harness 如何把它装备成能思考、能调用工具、能观察结果的系统。第 1 章讲后者之前的地基;本章起讲 Agent 本身。


2.1 Agent = Model + Harness

先给结论:Agent就是模型加上一套运行框架。

用公式写出来,就是 Agent = Model + Harness

看起来简单,但这两个词背后的含义值得展开讲。

Model:推理能力,Agent的大脑

Model指的是大语言模型(LLM),比如GPT-4、Claude、Gemini、DeepSeek。它是Agent的大脑,负责”想”。

这个”想”包括什么?

首先是理解。你给它一句话、一段文字、一张图、一份PDF,它能理解里面的意思。其次是推理。它能根据已知信息做判断、下结论、做选择。最后是生成。它能写出文字、代码、方案,输出你需要的结果。

你可以把Model类比成CPU。CPU是一台电脑的核心计算单元,但它单独存在是没有意义的。一块CPU放在桌子上,不能上网、不能打字、不能播放视频。它需要主板、内存、硬盘、操作系统,需要有人告诉它”先做这个,再做那个”。

Model也一样。单独一个大语言模型,只是一段能接受输入、产生输出的程序。它不知道该怎么使用工具,不知道该怎么记住之前的对话,不知道该怎么把复杂任务拆成步骤执行。

举个具体的例子。你打开ChatGPT的对话框,输入”帮我查一下今天北京的天气”,它会告诉你:“我无法查询实时天气信息,因为我没有联网能力。“它的推理能力足够理解你的需求,但它缺乏执行这个需求的手段。这就是Model的局限——它能想,但不能做。

所以你需要Harness。

Harness:操作系统,Agent的身体

Harness这个词直译是”挽具”——就是马身上那套缰绳、马鞍、笼头的总称。你骑马的时候,靠的不是你自己的腿在跑,而是通过Harness控制马的方向、速度和行为。

用在Agent上,Harness是一套运行框架,把模型的推理能力”装备”起来,让它能真正做事。

Harness包含六个组件:

编排逻辑(Orchestration):决定Agent的行动流程。先做什么、后做什么、遇到错误怎么办、什么时候该停下来请示人类。这是Agent的”做事方法论”。编排逻辑决定了Agent是走一步看一步(ReAct模式),还是先制定计划再执行(Plan-and-Execute模式),或者是在完成任务后自我反思(Reflexion模式)。没有编排逻辑,Agent就像一个没有工作流程的新人——有能力,但不知道怎么用。

工具(Tools):Agent可以使用的外部能力。搜索引擎、计算器、代码执行器、数据库查询、API调用——这些都是工具。模型本身不能上网,但Harness可以给它一个搜索引擎工具;模型本身不能执行代码,但Harness可以给它一个代码沙箱。工具是Agent与真实世界交互的桥梁。没有工具的Agent,就像一个被关在房间里的人——能思考,但无法获取外界信息,也无法对外界产生影响。工具的质量和数量直接影响Agent的能力边界。给Agent一个好的搜索引擎,它就能回答关于最新事件的问题;给它一个数据库查询工具,它就能分析企业数据;给它一个代码执行器,它就能进行数学计算和数据处理。

记忆(Memory):让Agent记住之前发生过什么。记忆分为两层。第一层是短期记忆,也就是当前对话的上下文——你前面说了什么,Agent前面做了什么,中间结果是什么。短期记忆存在上下文窗口里,受窗口大小限制,通常为几千到几十万个Token。第二层是长期记忆,跨对话的知识积累。比如你上周告诉Agent你的公司叫什么名字、主营业务是什么,这周再对话时它应该还记得。长期记忆通常存储在外部数据库中,通过检索的方式在需要时调入上下文。没有记忆的Agent,每次对话都像失忆了一样从头开始。这在很多场景下是不可接受的——你不会想每天跟你的”数字同事”重新做一遍自我介绍。

沙箱(Sandbox):安全的执行环境。当Agent需要运行代码、操作文件、访问系统时,沙箱确保这些操作在一个受控的范围内进行,不会搞坏你的电脑或泄露你的数据。你可以把沙箱想象成一个玻璃房间——Agent在里面做什么你都看得到,但它碰不到外面的东西。如果它在沙箱里运行了一个有问题的程序,崩溃的只是沙箱里的环境,你的电脑不受影响。这是Agent安全性的重要保障。一个没有沙箱的Agent,就像一个没有安全带的赛车——性能可能很好,但风险不可控。

状态管理(State):追踪任务的进度。现在做到哪一步了?之前的结果是什么?哪些子任务完成了,哪些还在进行中?哪些工具已经调用过了,返回了什么结果?状态管理让Agent在多步骤任务中保持连贯性。没有状态管理,Agent做一个复杂任务就会像一个没有笔记本的项目经理——做到一半就忘了自己在做什么,或者重复做已经完成的工作。

安全机制(Safety):防止Agent做不该做的事。权限控制、内容审核、行为限制——确保Agent不会越权操作,不会生成有害内容,不会在你没有授权的情况下访问敏感数据。安全机制包括多个层面:输入层过滤(防止恶意指令)、执行层控制(限制Agent的操作权限)、输出层审核(检查Agent的输出是否合规)。一个没有安全机制的Agent,就像一个没有规章制度的公司——短期内可能高效运转,但迟早会出大问题。

Harness组件的依赖关系:

这六个组件不是孤立的,它们相互依赖、协同工作。编排逻辑决定什么时候调用工具;工具返回的结果需要状态管理来追踪;记忆为编排逻辑提供历史上下文;沙箱为工具执行提供安全边界;安全机制在所有组件的边界上设防。

这六个组件组合在一起,就是Harness。它和Model的关系,就像操作系统和CPU的关系。Model提供计算能力,Harness提供运行环境、资源调度和行为规范。

为什么这个公式重要

Agent = Model + Harness,这个公式的价值在于它帮你建立了一个清晰的思考框架。

当你遇到一个Agent产品时,你可以用这个公式去拆解它:它的Model用的是什么?推理能力够不够强?它的Harness设计得好不好?工具是否充足?记忆是否可靠?安全机制是否完善?

很多看起来很炫的Agent产品,其实只是在Model上堆了一堆花哨的Harness组件,但Model本身的推理能力撑不起来。就像给一辆三轮车装上飞机的方向盘和仪表盘——看起来很专业,但它还是飞不起来。也有另一些产品,用的是顶级Model,但Harness设计得太粗糙,工具不够用,记忆丢失严重,安全形同虚设。这就像给一架飞机配了一个刚拿到驾照的司机——硬件没问题,但操控系统不匹配。

优秀的Agent产品,是Model和Harness的精心匹配。就像一辆好车,发动机马力要够,变速箱、底盘、悬挂、刹车也要跟得上。光有发动机不行,光有底盘也不行。

有个容易忽略的点:Model越强,Harness可以越简单。

Model越强,需要的Harness就越简单。一个推理能力极强的Model,可能只需要很少的工具就能完成任务,因为它的”想”的能力弥补了”做”的不足。反过来,如果Model的能力一般,就需要更复杂的Harness来弥补——更多的工具、更精细的编排逻辑、更严格的结果验证。

实际案例对比:

| 场景 | 弱Model + 复杂Harness | 强Model + 简单Harness | 效果对比 |
| --- | --- | --- | --- |
| 代码生成 | 需要10+个工具(语法检查、风格检查、测试运行、依赖分析),多层验证 | 只需要代码执行工具 + 基本沙箱 | 强Model方案开发速度快3倍,错误率更低 |
| 数据分析 | 需要预定义分析模板、数据清洗工具、可视化工具、统计检验工具 | 只需要Python执行环境 + 数据读取工具 | 强Model能处理更复杂的分析逻辑 |
| 客服对话 | 需要意图分类器、实体提取器、知识库检索、回复模板引擎 | 只需要知识库检索工具 | 强Model理解上下文更准确,回复更自然 |

Coinbase的案例很有说服力。他们使用Claude构建客服Agent,只给Agent提供了几个核心工具(查询账户信息、查询交易记录、搜索知识库、创建工单),但Claude强大的推理能力让它能灵活组合这些工具解决复杂问题。结果:每小时处理数千条消息,可用性99.99%,35-50个内部AI应用由此衍生。

Tines的案例则展示了Harness设计的精妙。他们的安全运维Agent使用Claude动态处理工作流逻辑,把复杂的多步骤安全操作压缩成单Agent操作,时间价值提升了100倍。关键不是工具多,而是编排逻辑让Agent能”聪明地”使用工具。

Anthropic在他们那篇被广泛引用的《Building Effective Agents》报告中,特别强调了一个原则:从最简单的方案开始

这意味着什么?很多人一上来就想构建一个全能的Agent——给它几十个工具、复杂的记忆系统、多层次的安全机制。但很多场景根本不需要这么复杂。一个强大的Model加上精心设计的几个工具,往往比一个平庸的Model加上五十个工具更有效。

报告中还有一个重要的区分:Workflow(工作流)和Agent(智能体)是两种不同的东西

Workflow是用预定义的代码路径来编排LLM的调用。开发者事先设计好流程——先调用模型做A,根据结果调用模型做B,再把B的结果传给C——整个流程是固定的、可预测的。这就像一条流水线:每个环节做什么、怎么做,都是提前设计好的。

Agent则是让LLM自己动态引导自己的过程。模型自己决定下一步做什么、用什么工具、什么时候停下来。这更像一个有自主权的员工:你给他一个目标,他用自己的判断力去完成。

关键的区别在于控制权。Workflow中,控制权在代码手里;Agent中,控制权在模型手里。

那么什么时候用Workflow,什么时候用Agent?

Anthropic的建议是:先用最简单的方法。如果一个纯Model(不加任何Harness组件)就能完成任务,那就不要加Harness。如果加一个工具就能搞定,就不要加十个。如果一个固定的Workflow就能满足需求,就不要做成自由度更高的Agent。只有当简单方案确实无法满足需求时,再逐步增加复杂性。

Workflow vs Agent 决策矩阵:

| 维度 | Workflow(工作流) | Agent(智能体) | 适用场景 |
| --- | --- | --- | --- |
| 控制权 | 代码控制,固定路径 | 模型控制,动态决策 | 流程明确→Workflow;需要灵活应变→Agent |
| 可预测性 | 高,每次执行路径相同 | 中低,可能走不同路径 | 金融合规、医疗诊断→Workflow;创意策划→Agent |
| 成本 | 低,调用次数固定 | 高,可能多次循环 | 成本敏感→Workflow;效果优先→Agent |
| 延迟 | 低,执行时间确定 | 高,不确定循环次数 | 实时响应→Workflow;异步处理→Agent |
| 错误恢复 | 需预定义所有分支 | 模型自动调整策略 | 异常情况多→Agent;标准流程→Workflow |
| 调试难度 | 低,路径可追踪 | 高,行为难复现 | 需要审计→Workflow;探索性任务→Agent |

实际案例:

  • 银行信用风险评估:用Workflow。流程固定:收集数据→计算指标→比对阈值→生成报告→人工复核。每一步都有明确规则,不需要模型”自由发挥”。某零售银行使用AI Agent后,信用风险备忘录创建效率提升20-60%,周转时间缩短30%。

  • 客服投诉处理:用Agent。客户投诉的原因千差万别,有的需要查询订单,有的需要技术支持,有的涉及退款,有的需要升级。固定Workflow无法覆盖所有情况,Agent可以根据对话内容动态选择处理路径。Gradient Labs的金融客服Agent实现了80-90%的问题解决率。

  • 代码生成:混合模式。整体框架用Workflow(需求分析→架构设计→代码生成→测试→部署),但每个步骤内部用Agent(如代码生成步骤让模型自主决定如何拆分模块、选择算法)。这种”Workflow骨架 + Agent血肉”的模式,是2026年最主流的企业级架构。

这个原则,贯穿了整个Agent设计哲学。


2.2 ReAct循环:推理→行动→观察

现在你知道Agent = Model + Harness。但你可能会好奇:Harness具体是怎么驱动Agent”做事”的?

答案是:通过一个叫ReAct的循环。

ReAct是Agent最核心的运行模式,全称是Reasoning + Acting(推理+行动)。理解了这个循环,你就理解了Agent的”思维方式”。

一个真实的例子

假设你让Agent帮你做这件事:“帮我查一下过去一周特斯拉的股价走势,分析一下原因,然后写一份简报。”

一个人类分析师会怎么做?

第一步,想:我需要先查股价数据。我可以用搜索引擎或者金融数据网站。

第二步,做:打开浏览器,搜索特斯拉股价。

第三步,看:看到了数据,过去一周从245美元跌到了231美元,周三有一次明显下跌。

第四步,想:周三下跌,可能和什么事件有关?我需要查一下那天的新闻。

第五步,做:搜索”特斯拉 5月14日 新闻”。

第六步,看:发现特斯拉那天发布了财报,营收低于预期。

第七步,想:现在我有足够的信息了,可以写简报了。

第八步,做:撰写简报。

你看,人类做事的过程就是:想→做→看→想→做→看……不断循环。ReAct把这个过程形式化了。

Agent做的和这个人类分析师做的事情几乎一模一样。唯一的区别是,Agent的每一步都更快——搜索只需要几秒,整理信息只需要几秒,写简报只需要几十秒。但思考的模式是一样的。

三步循环的内部机制

ReAct循环由三个步骤组成:

第一步:Reasoning(推理)

Agent先”想一想”。它会分析当前的情况,决定下一步该做什么,选择用什么工具。这一步完全由Model(大语言模型)完成。

Agent会生成一段”内心独白”,类似于:“我现在有了股价数据,但我还需要了解下跌的原因,所以我应该搜索相关新闻。“这段内心独白不是给人看的——它是Agent的推理过程,会作为下一步行动的依据。

这个推理过程的质量,直接决定了Agent的表现。如果推理能力强,Agent会做出正确的判断——选择对的工具、设定对的参数、制定对的策略。如果推理能力弱,Agent可能做出错误的选择——用错了工具、搜索了无关的关键词、遗漏了重要的信息。

第二步:Acting(行动)

Agent”动手做事”。它调用一个工具——比如搜索引擎、代码执行器、API——来获取信息或执行操作。这一步由Harness完成,因为Harness负责管理工具和执行环境。

行动这一步有一个重要的细节:参数生成。Agent不只是决定”我要搜索”,它还需要决定”搜索什么关键词”。这个参数由Model生成,然后传递给Harness管理的工具。工具执行完毕,返回结果。

所以你看,行动这一步其实是Model和Harness的协同:Model负责决策和参数生成,Harness负责工具管理和执行。

第三步:Observation(观察)

Agent”看看结果”。工具返回了什么?搜索到了什么?代码输出了什么?Agent把结果拿到,作为下一轮推理的输入。然后回到第一步,根据新的信息继续思考。

观察这一步的结果通常有两种情况:一种是正面的,工具返回了有用的信息,Agent可以基于这个信息继续推进。另一种是负面的,工具返回了错误、空结果或意外信息,Agent需要调整策略。

推理→行动→观察→推理→行动→观察……这个循环不断重复,直到Agent认为任务已经完成,或者遇到了无法自行解决的问题需要求助人类。

为什么ReAct有效

ReAct的精妙之处在于它把”思考”和”行动”紧密地交织在一起。

在ReAct出现之前,有两种主流方法:

一种是“纯思考”。给模型一个大任务,让它一次性想清楚所有步骤,然后一口气输出结果。问题是:模型的“上下文窗口”有限,它不可能在没有中间结果的情况下,一次性推理出完整答案。就像让你不打草稿、不查资料,一口气写出一份完整的行业分析报告,几乎不可能。而且,纯思考的方法容易产生“幻觉”:模型没有真实数据作为依据,只能靠“编造”来填充信息空白。

另一种是“纯行动”。写一个固定的程序,先调用这个API,再调用那个API,把结果拼在一起。问题是:程序是死的,它不会根据中间结果调整策略。如果股价数据里有异常值,程序不会“意识到”这个异常并去找原因。如果搜索结果里混入了无关信息,程序不会“意识到”需要换关键词重新搜索。

ReAct把两者结合了。每一步都先想再做,做完再想。这让Agent既能利用模型的推理能力来灵活决策,又能通过实际行动获取真实世界的反馈。这就像一个人在迷宫里走路——不是提前规划好每一步(因为迷宫可能随时变化),也不是无脑乱走(因为那样永远走不出去),而是走一步看一步,每走一步都根据看到的新情况调整方向。

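The Limitations of ReAct

ReAct is not a cure-all. It has several clear limitations.

First, cost and speed. Every step calls the large language model, and each call costs money. A 10-step ReAct loop might involve 10 model calls, 5 search calls, and 3 code executions. If one model call costs $0.05 and one search costs $0.01, the whole task might cost between $0.5 and $1. That sounds small, but for a company running thousands of such tasks a day it adds up quickly.

Second, error accumulation. If each step succeeds 95% of the time, the overall success rate after 10 steps is only about 60%. Errors in one step feed into the reasoning of every later step: if step 2 retrieves wrong information, all the analysis that follows rests on it. The more steps, the worse the accumulation.

Third, the context window. As the loop runs, the Agent has to keep track of more and more intermediate results and tool outputs. The context window is finite, so once it fills up, early information is either truncated (the Agent “forgets”) or compressed (details are lost).

The industry has settled on a rule of thumb: Agents that work well run short flows. Practical experience suggests an Agent should reach a human intervention or checkpoint within 10 steps. Beyond that, both the error probability and the cost climb steeply.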
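The truncation option described under the third limitation can be sketched as follows. This is a minimal illustration, not how any particular framework does it: `estimate_tokens` and `trim_history` are hypothetical names, and counting one token per whitespace-separated word is a deliberately crude assumption.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(history: List[str], budget: int) -> List[str]:
    """Keep the most recent entries whose combined size fits the token budget."""
    kept: List[str] = []
    used = 0
    for entry in reversed(history):          # walk from newest to oldest
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break                            # everything older is dropped ("forgotten")
        kept.append(entry)
        used += cost
    return list(reversed(kept))              # restore chronological order


history = [
    "step 1: searched stock price",
    "step 2: found a drop on Wednesday",
    "step 3: searched news for Wednesday",
]
print(trim_history(history, budget=13))
```

With a budget of 13 the oldest entry no longer fits and is dropped, which is exactly the "forgetting" the text warns about. A compression strategy would summarize the dropped entries instead of discarding them.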

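Cost analysis of the ReAct loop (using GPT-5.5 as the example):

| Loop steps | Model calls | Estimated tokens | Cost per task | Cumulative success rate (95% per step) |
| --- | --- | --- | --- | --- |
| 3 | 3 | ~6,000 | $0.012 | 85.7% |
| 5 | 5 | ~10,000 | $0.020 | 77.4% |
| 10 | 10 | ~20,000 | $0.040 | 59.9% |
| 20 | 20 | ~40,000 | $0.080 | 35.8% |

At 10 steps the success rate has fallen below 60%, while the cost is double that of a 5-step run. This is why production-grade Agents usually adopt a divide-and-conquer strategy: split a large task into several small ones, keep each small task within 5 steps, and insert human checkpoints in between.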
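The figures in the cost table can be reproduced with a few lines of arithmetic. The sketch below assumes roughly 2,000 tokens per step and a price of $2 per million tokens; both values are inferred from the table rows rather than stated in the text, and steps are assumed to fail independently.

```python
TOKENS_PER_STEP = 2_000          # assumption inferred from the table (6,000 tokens / 3 steps)
PRICE_PER_MILLION_TOKENS = 2.0   # assumption inferred from the table ($0.012 / 6,000 tokens)
STEP_SUCCESS_RATE = 0.95


def task_cost(steps: int) -> float:
    """Estimated dollar cost of one task with the given number of loop steps."""
    return steps * TOKENS_PER_STEP * PRICE_PER_MILLION_TOKENS / 1_000_000


def cumulative_success(steps: int, per_step: float = STEP_SUCCESS_RATE) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    return per_step ** steps


for steps in (3, 5, 10, 20):
    print(f"{steps:>2} steps  ${task_cost(steps):.3f}  {cumulative_success(steps):.1%}")
```

Cost grows linearly with the number of steps while the success rate decays exponentially, which is why splitting a long task into short segments with checkpoints pays off.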

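Anthropic also states plainly in “Building Effective Agents”: don’t use an Agent for the sake of using an Agent. In many scenarios a predefined workflow, meaning a fixed program flow that calls the model at a few key nodes, is more reliable, cheaper, and easier to control than a fully autonomous Agent.

ReAct is a powerful base pattern, but in practice it usually has to be combined with other design patterns. For example, you can use Plan-and-Execute to draw up an overall plan and then run a ReAct loop inside each subtask. Or you can use Reflexion to check and correct the result after the ReAct loop ends.

The next section covers these patterns.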


2.3 从ChatBot到Agent:三年进化路径

要真正理解Agent,最好的办法是看它是怎么来的。

过去三年,AI产品的形态经历了一条清晰的进化路径:Copilot → Agent → 数字劳动力。每一步都代表了能力的质变,而不只是量变。

第一阶段:Copilot(2023-2024)

2023年初,ChatGPT引爆了全球AI热潮。那时候大家谈论最多的是”AI助手”或”副驾驶”(Copilot)。

Copilot的核心模式是:人做主导,AI辅助

你写代码写到一半,卡住了,Copilot帮你补全后半段。你写邮件写到一半,不知道怎么措辞了,Copilot帮你润色一下。你做PPT不知道怎么组织内容,Copilot帮你列个大纲。

在Copilot模式下,AI不主动行动。它不会自己去搜索信息,不会自己去调用API,不会自己去做决策。它只是在你需要的时候,提供一个建议、一个草稿、一个参考。

这就像一个坐在你旁边的实习生。你问他问题,他回答。你说”帮我写个这个”,他写出来给你看。但他不会自己主动去做事情,不会自己安排工作优先级,不会自己去跟其他人协调。

Copilot的产品形态也很直观:一个对话框,你输入文字,AI输出文字。交互是线性的、单轮的、被动的。

这个阶段的标志性产品是GitHub Copilot(代码补全)、ChatGPT(对话助手)、Notion AI(写作辅助)。它们都有一个共同特征:等待用户发起请求,然后给出一个回应。没有用户的输入,它们什么都不会做。

这个阶段主要的技术挑战是模型本身聪不聪明——能不能给出高质量的回答。所有的创新都集中在Model层面:更大的参数量、更好的训练数据、更强的推理能力。Harness在这一阶段几乎不存在——最复杂的”Harness”可能就是一个对话历史管理功能。

第二阶段:Agent(2025-2026)

2024年底到2025年初,行业开始从Copilot向Agent转型。

Agent的模式变成了:AI做主导,人监督

区别在哪里?

Copilot是”你让它做什么,它做什么”。Agent是”你告诉它目标,它自己规划怎么做”。

举个例子。你跟Copilot说”帮我分析一下竞品”,它会给你一个分析框架或者一份模板,具体的事情还是你自己去做——搜索竞品信息、收集数据、整理对比表格、写分析文字。你跟Agent说同样的话,它会自己去搜索竞品信息、收集数据、整理对比表格、分析优劣势,最后把报告交给你。你可能只需要等几分钟,一份完整的竞品分析报告就出现在你面前。

从”建议者”到”执行者”,这是一个根本性的变化。

为了让AI能够”执行”而不只是”建议”,Agent需要具备Copilot所没有的能力:

  • 使用工具:能调用搜索引擎、数据库、API、代码执行器
  • 规划能力:能把一个大目标拆解成多个子步骤
  • 记忆能力:能记住之前的对话和中间结果
  • 自主决策:能根据中间结果调整策略
  • 错误恢复:遇到失败能换一种方式重试

这些能力加在一起,就是我们前面说的Harness。

这个阶段有五大设计模式,理解它们就理解了Agent的技术全景:

ReAct(推理+行动):上一节详细讲过。推理→行动→观察的循环。适合需要实时信息和灵活应变的任务。它的优势是灵活性——Agent可以根据每一步的结果动态调整策略。劣势是效率——每一步都需要调用模型,速度慢、成本高。

Plan-and-Execute(规划+执行):Agent先制定一个完整计划,然后逐步执行。适合复杂任务,因为规划阶段可以一次性考虑全局,避免ReAct的”走一步看一步”带来的短视问题。想象一下装修一套房子——你不会一边装一边想,而是先画好设计图,然后按图施工。Plan-and-Execute就是这个逻辑。它的优势是全局视野和执行效率。劣势是缺乏灵活性——如果执行过程中遇到了计划外的情况,可能需要重新规划。

Reflexion(反思):Agent在完成任务后,会自我反思——我做对了什么?做错了什么?下次怎么改进?这模仿了人类的学习方式,通过”复盘”来提升能力。一个使用Reflexion的Agent,写完代码后会自己检查一遍,如果发现逻辑错误就自己修正,而不是直接把有bug的代码交给你。这在编程和写作等需要质量控制的场景中特别有用。

Multi-Agent(多智能体):多个Agent协作完成任务。一个Agent负责搜索,一个负责分析,一个负责写作,一个负责审核。就像一个团队,每个成员有自己的专长。Multi-Agent系统的优势是分工带来的专业性和效率。每个Agent可以专注于自己擅长的事情,不需要一个Agent做所有事情。劣势是协调成本——Agent之间需要沟通、同步、解决冲突,这些都需要额外的机制来管理。

Tool Use(工具使用):Agent根据需要调用外部工具。这是最基础也最常用的能力,其他模式都依赖于它。一个没有Tool Use能力的Agent,就像一个被关在房间里的人——能思考,但无法与外界交互。Tool Use的核心挑战是”工具选择”——给定一个任务,Agent需要从可用工具中选择最合适的那个,并正确地生成调用参数。这取决于模型的推理能力,也取决于工具描述的质量。如果工具的说明书写得不清楚,Agent可能用错工具或传错参数。

这五种模式不互斥。一个实际的Agent产品,往往会组合使用多种模式。比如一个做深度研究的Agent,可能会用Plan-and-Execute来做总体规划,每个子任务用ReAct循环来执行,最后用Reflexion来检查报告质量。

第三阶段:数字劳动力(2026+)

正在发生、但还没有完全展开的第三个阶段,是Agent进化为”数字劳动力”。

这个阶段的特征是:AI不只是执行任务,而是承担角色

什么意思?

在Agent阶段,你给它一个具体任务,它去完成。任务是离散的、一次性的。做完一个,等下一个指令。在数字劳动力阶段,AI占据一个持续的角色——比如”客服专员""数据分析师""市场研究员”——它持续地在这个角色上工作,有明确的职责范围、工作流程和考核标准。

它不是你偶尔用一下的工具,而是一个”同事”。你不需要每次给它下达具体的指令,它知道自己该做什么,什么时候做,怎么做。它有自己的”工作日程”——早上检查邮件、上午处理客服工单、下午分析数据、傍晚生成报告。它能在多个任务之间切换,能在遇到不确定情况时主动请示,能在完成工作后主动汇报。

想象一下,你的公司有一个”数字市场研究员”。它每天早上自动浏览行业新闻,整理出对你公司可能有影响的动态;每周自动生成竞品分析报告;每月自动生成市场趋势洞察。你不需要每天给它下达指令——它知道自己的职责,并持续地执行。

这个阶段的技术难度比前两个大得多。Agent需要长期记忆、持续学习、跨任务协调、人际沟通等能力。安全性、可控性和可靠性的要求也跟着暴涨——你不能让一个”数字同事”做出未经授权的决策,不能让它在跟客户沟通时说出不合适的话,不能让它在处理财务数据时犯下不可挽回的错误。

业界有一个令人清醒的数据:88%的企业Agent项目卡在了生产环境,无法真正上线使用

为什么?因为从Demo到生产环境之间,有一条巨大的鸿沟。

Demo环境下的Agent,面对的是精心准备的输入、简单的任务、容忍错误的观众。你给它一个清晰的指令,它执行几个步骤,给出一个不错的结果。观众看了很高兴,觉得AI真的能做事了。

生产环境下的Agent,面对的是千变万化的用户输入(“用户说的话,一半是错别字,一半是省略句”)、复杂的边缘情况(“这个用户的数据格式和标准格式不一样”)、零容忍的错误(“你给客户发了一封语法错误的邮件?不能接受。”)。

而且,生产环境还有成本、延迟、并发、监控、审计等一系列工程问题。一个Demo级别的Agent,可能每执行一个任务花30秒、花1美元。但如果你的业务需要每天执行1万个这样的任务,成本就是每天1万美元、等效30万秒(将近3.5天的串行执行时间)。你需要考虑并行执行、缓存、降级策略、成本优化——这些都不是Agent”智能”的问题,而是工程的问题。

从Copilot到Agent到数字劳动力,这不是一条平滑的上升曲线,而是一段一段的阶梯。每上一阶,技术难度和工程复杂度都会成倍增长。


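2.4 Advanced Cognitive Architecture: Function Calling and the Four-Layer Model

Chapter 1 set out the Model’s capability boundaries. This section explains how the Harness organizes the model into an executable Agent system. It picks up from the ReAct loop in 2.2 and focuses on interfaces and layering.

From String Parsing to Native Function Calling

Early implementations often relied on string parsing: the model would output Action: search_web(query='...'), and a regular expression would extract the tool and its parameters. This was fragile and error-prone.

The key advance of 2024-2026 is native Function Calling: mainstream APIs support structured tool calls, so the model outputs JSON rather than free text.

Advantages:

  • Parsing success rate: from about 85% to 99%+
  • Type safety: parameters are validated at the API layer
  • Parallel calls: several independent tools can be called at once
  • Standardization: formats such as OpenAI Functions, MCP, and Anthropic Tools are converging

Design idea: Function Calling is a type-safe interface between the model and the outside world. The model decides which tool to call; the interface layer guarantees the format is correct and executes the call. When the underlying model is swapped, the upper-level Agent logic can stay largely unchanged.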
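The contrast between the two styles can be sketched with the standard library alone. Nothing here is a real vendor API: `TOOL_SCHEMAS`, `TOOL_IMPLS`, `parse_action_string`, and `call_tool` are hypothetical names, and the JSON shape (`name` plus `arguments`) is an assumption chosen for illustration.

```python
import json
import re
from typing import Any, Callable, Dict

# Hypothetical tool schema: parameter name -> expected Python type.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {"search_web": {"query": str, "max_results": int}}
TOOL_IMPLS: Dict[str, Callable[..., Any]] = {
    "search_web": lambda query, max_results: [f"result {i} for {query}" for i in range(max_results)],
}


def parse_action_string(text: str) -> Dict[str, str]:
    """Legacy style: pull the tool name and raw argument text out of free text with a regex."""
    match = re.search(r"Action:\s*(\w+)\((.*)\)", text)
    if match is None:
        raise ValueError("model output did not match the expected pattern")
    return {"name": match.group(1), "raw_args": match.group(2)}


def call_tool(payload: str) -> Any:
    """Structured style: the model emits JSON; arguments are checked before the tool runs."""
    call = json.loads(payload)
    schema = TOOL_SCHEMAS[call["name"]]
    args = call["arguments"]
    if set(args) != set(schema):
        raise ValueError(f"expected parameters {sorted(schema)}, got {sorted(args)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return TOOL_IMPLS[call["name"]](**args)


print(parse_action_string("Action: search_web(query='agents')"))
print(call_tool('{"name": "search_web", "arguments": {"query": "agents", "max_results": 2}}'))
```

The regex path leaves the arguments as an unparsed string and breaks as soon as the model phrases its output differently. The structured path rejects a malformed call before any tool runs, which is the type-safety benefit listed above.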

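The Four-Layer Model of Agent Architecture

| Layer | Responsibility | Key components | Representative technologies |
| --- | --- | --- | --- |
| Cognitive | Decide what to do | ReAct, planning, reasoning | ReAct, Plan-and-Execute, LLM Compiler |
| Interface | Standardize how to call | Function Calling, Tool Schema | OpenAI Functions, MCP |
| Execution | Actually run it | Tool implementations, sandbox | Docker, E2B |
| Orchestration | Manage flow and multiple Agents | State machines, workflows | LangGraph, CrewAI, AutoGen |

For simple tasks (such as checking the weather) the bottleneck is usually in the interface and execution layers; for complex tasks (such as an industry analysis report) it is usually in the cognitive layer. Identifying the bottleneck is what lets you direct optimization effort to the right place.

A 2026 survey also groups Agent reasoning algorithms into several lines (ReAct, CoT, ToT, Plan-and-Execute, Reflexion, and others, 17 paradigms in total), which can be read alongside 2.2 and later chapters.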


2.5 Agent能做什么、做不了什么

Agent的技术本质和进化路径聊完了,最后来聊一个最实际的问题:Agent到底能做什么?做不了什么?

这个问题很实际,因为现在市场上对Agent的期望有点飘。很多人觉得Agent无所不能,或者很快就会无所不能。抱着这种期望做产品,大概率会做出错误的决策。

反过来,也有人因为看到Agent的几次失败,就断言”Agent只是个噱头”。这种看法同样不准确。

直接评估一下。

Agent擅长什么

Agent最擅长的,是满足以下三个条件的任务:

第一,有明确目标。 Agent需要清楚地知道”成功”长什么样。“帮我写一份竞品分析报告”——目标明确,Agent知道要输出什么。“帮我理解一下市场”——目标模糊,Agent不知道做到什么程度算是”理解了”。目标越明确,Agent表现越好。

这不是Agent独有的问题。你让一个人类员工”帮我了解一下市场”,他也不知道该做到什么程度。但人类可以追问:“你想要什么方面的了解?深度还是广度?用来做什么决策?“Agent目前的追问能力还远不如人类,所以更依赖于目标的清晰度。

第二,可以分解为步骤。 Agent不是一次性想出答案的天才,而是一个”把大任务拆成小步骤、一步一步执行”的工人。如果一个任务可以拆成搜索→整理→分析→输出这样的流程,Agent通常能做得不错。

第三,每一步的结果可以验证。 Agent调用搜索工具后,能判断搜索结果是否有用。写完一段代码后,能运行它看有没有报错。每一步的结果可以被检查,是Agent自我纠错的基础。如果某一步的结果不对,Agent可以回到推理阶段,调整策略重试。

符合这三个条件的典型任务包括:

  • 信息收集和整理:从多个来源搜集信息,汇总成结构化的报告。比如”帮我收集过去一周AI行业的重大新闻”。这个任务目标明确(收集新闻)、可分解(搜索→筛选→整理→输出)、结果可验证(每条新闻是否相关,一目了然)。
  • 代码生成和调试:写代码、跑代码、看报错、修复——这个流程天然适合ReAct循环。Agent写一段代码,运行它,如果报错就分析错误信息,修改代码,再运行。这个”写→跑→改”的循环,是Agent表现最亮眼的领域之一。
  • 文档处理:读取PDF、提取表格数据、做格式转换。这类任务规则明确,步骤清晰,Agent执行起来很少出错。
  • 数据分析:查询数据库、计算指标、生成图表。给Agent一个数据库访问权限和一组分析需求,它通常能自动生成SQL查询、执行计算、输出可视化结果。
  • 内容创作:基于素材写文章、做翻译、改格式。注意,这里说的是”基于素材”的创作——给定原材料,Agent负责组织和表达。从零创造有深度的原创内容,目前还不是Agent的强项。
  • 流程自动化:按照规则处理邮件、填写表单、更新系统。这类任务的流程是固定的,Agent可以高效地批量执行。

这些任务的共同特点是:目标清晰、流程可拆分、结果可验证

Agent做不了什么

反过来,Agent在以下场景中表现不佳:

第一,开放式探索。 “帮我找一个商业机会”——这种任务没有明确的成功标准,Agent不知道什么时候算是”找到了”。它可能会给你列一堆方向,但每一个都浅尝辄止,缺乏真正的洞察。真正的商业洞察需要深度的行业理解、对市场的直觉判断、对机会成本的权衡——这些都是目前模型的弱项。

第二,需要深度领域知识。 Agent的推理能力来自模型。如果模型本身对某个专业领域(比如法律、医学、金融)的理解不够深,那么再好的Harness也弥补不了这个缺陷。你不能指望一个不懂医学的Agent去做疾病诊断,不能指望一个不懂中国税法的Agent去做税务筹划。模型的领域知识有明确的边界,超过这个边界,Agent的输出就变得不可靠。

第三,涉及真实世界的物理操作。 Agent可以帮你写一封邮件,但它不能帮你端一杯咖啡。它可以在虚拟世界里飞速运转——搜索信息、生成代码、分析数据——但物理世界是一个完全不同的挑战。这就是所谓的”具身智能”问题,目前远未解决。Agent的”手”是API调用,不是机械臂。

第四,需要长期一致性的角色。 让Agent执行一个任务容易,让它持续扮演一个角色、保持一致的行为模式和决策标准,非常困难。上下文窗口的限制意味着Agent在对话太长后会”忘记”之前的决定。它今天说”这个功能不应该加”,明天可能在新的上下文中又说”这个功能应该加”。人类可以保持长期一致的价值观和决策标准,Agent目前做不到这一点。长期记忆技术还在快速发展中,但距离真正可靠的”持续角色”还有距离。

第五,高风险决策。 涉及金钱、安全、法律责任的决策,不应该完全交给Agent。Agent可以提供建议、整理信息、模拟方案、计算概率,但最终的决策权必须留在人类手中。原因很简单:Agent会犯错,而且它犯错的方式可能和人类不一样——不是因为粗心,而是因为对情况的根本性误判。一个经验丰富的人类决策者犯错时,你知道他错在哪里,可以复盘和改进。一个Agent犯错时,它的”推理过程”可能是不透明的,你很难诊断错误的根本原因。

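Success Rates in Practice

Look at the numbers directly.

In coding, SWE-bench (a standardized benchmark for resolving code issues) shows that the best Agent systems can resolve about 50-60% of real GitHub issues. On relatively standardized programming tasks, that puts Agents roughly at the level of a mid-level engineer, but they remain stuck on the other 40-50% of issues.

In information retrieval and analysis, Agents usually outperform a bare model, because they can search the internet for current information instead of relying on static knowledge from training. Search result quality is uneven, though, and an Agent can be misled by low-quality sources.

In complex multi-step tasks, the success rate drops sharply as the number of steps grows. If each step succeeds 95% of the time, overall success after 10 steps is only about 60% (0.95^10 ≈ 0.60). After 20 steps it is down to 36%. Ask an Agent to finish a 50-step task at 95% per step and the overall success rate is under 8%.

So “good Agents run short flows” is not a preference; it follows from the math.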
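The same arithmetic can be turned around to ask how many steps a task can have before success falls below a chosen target. The worked form below assumes every step succeeds independently with the same probability p, which is a simplification; the symbols P, p, n, and τ are introduced here for illustration.

```latex
P(n) = p^{n}, \qquad
P(n) \ge \tau \iff n \le \frac{\ln \tau}{\ln p} \quad (0 < p < 1,\ 0 < \tau < 1)
```

With p = 0.95 and a target of τ = 0.5, this gives n ≤ 13.5, so at most 13 steps. Raising the target to τ = 0.8 gives n ≤ 4.35, so at most 4 steps.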

一个实用的判断框架

面对一个具体的场景,你可以用以下五个问题来判断Agent是否适合:

1. 这个任务能用10个步骤以内的流程描述吗?

如果不能,Agent可能不是最佳选择。超过10步的任务,成功率和成本都会急剧恶化。

2. 每一步的结果可以被检查或验证吗?

如果不能,Agent无法自我纠错,错误会累积。最好的Agent场景是每一步都有明确的”对/错”反馈。

3. 任务的容错率高吗?

如果犯一个错误就不可接受(比如发送了一封不该发的邮件),Agent目前还不够可靠。如果错误可以被修正(比如写错了一段代码,可以改),Agent就更适合。

4. 任务需要的领域知识,主流模型具备吗?

如果需要非常专业的知识(比如特定国家的法律条款),可能需要微调或RAG(检索增强生成)来补充模型的知识。

5. 人类介入的成本高吗?

如果每一步都需要人类审批,Agent的自动化优势就消失了。理想情况是人类只在关键节点介入,Agent自行处理大部分步骤。

如果这五个问题的回答都是肯定的,那么Agent是一个很好的选择。如果有三个以上回答否定,你需要认真考虑是否值得投入。

最后的话

Agent不是万能的,但它也不是一个空洞的概念。它是一种真实的、正在快速进化的技术范式。

理解Agent的本质——Model + Harness,理解它的运行方式——ReAct循环,理解它的进化路径——从Copilot到数字劳动力,理解它的能力边界——擅长什么、做不了什么,这四件事加在一起,构成了你判断一切Agent产品的认知基础。

下一章会深入Harness的具体设计,看那六个组件怎么协同工作。但不管后面技术细节多复杂,记住这个公式就够:Agent = Model + Harness

所有Agent产品的成败,都可以从这个公式的两个变量上找到原因。

Chapter 2: The Nature of Agents

You must have heard this saying: 2025 is the year of the Agent.

But if you ask ten people “what is an Agent,” you’ll get different answers. Some say it’s an upgraded version of automation, some say it’s a ChatBot that can use tools, some say it’s a digital employee. All these descriptions are correct, yet none are precise enough. It’s like asking “what is a smartphone” — some say it’s a phone, some say it’s a camera, some say it’s a computer. Each description captures one aspect, but none touch the essence.

In this chapter, I want to break open the technical nature of Agents in the simplest way possible. You don’t need to know programming, you don’t need to know math. You just need to bring one question: how exactly does this thing work?

After reading this chapter, you’ll figure out the Agent’s basic formula, how it “thinks,” how it evolved over the past three years, and most importantly — what it can do, and what it can’t.


2.0 From Model Shift to the Agent Inflection Point

Around late 2025, many observers noted a qualitative change: models were no longer “intern-style” assistants that only moved when you gave the next step. In programming, document work, and data analysis, they began handling longer, multi-step work reliably — an Agent inflection point, not just incremental speed or IQ gains.

To understand Agents, you need both what the Model can stably produce and how the Harness equips it to think, call tools, and observe results. Chapter 1 laid the model foundation; this chapter is about Agents themselves.


2.1 Agent = Model + Harness

Let’s start with the conclusion: an Agent is a model plus a runtime framework.

Written as a formula, it’s Agent = Model + Harness.

Sounds simple, but the meaning behind these two words deserves elaboration.

Model: Reasoning Capability, the Agent’s Brain

Model refers to the Large Language Model (LLM), such as GPT-4, Claude, Gemini, DeepSeek. It’s the Agent’s brain, responsible for “thinking.”

What does this “thinking” include?

First is understanding. You give it a sentence, a paragraph, an image, a PDF — it can understand the meaning within. Second is reasoning. It can make judgments, draw conclusions, and make choices based on known information. Finally is generation. It can write text, code, plans, and output the results you need.

You can think of the Model as a CPU. A CPU is the core computing unit of a computer, but on its own it is useless. A CPU sitting on a desk can’t access the internet, can’t type, can’t play video. It needs a motherboard, memory, a hard drive, and an operating system, and it needs someone to tell it “do this first, then do that.”

The Model is the same. A standalone large language model is just a program that can accept input and produce output. It doesn’t know how to use tools, doesn’t know how to remember previous conversations, doesn’t know how to break down complex tasks into steps for execution.

Here’s a concrete example. You open ChatGPT’s dialog box and type “help me check today’s weather in Beijing,” and it tells you: “I can’t query real-time weather information because I don’t have internet access.” Its reasoning capability is enough to understand your request, but it lacks the means to carry it out. This is the Model’s limitation: it can think, but it can’t act.

So you need a Harness.

Harness: Operating System, the Agent’s Body

A “harness” is literally horse tack: the collective term for the reins, saddle, and bridle on a horse. When you ride, it isn’t your own legs doing the running; you control the horse’s direction, speed, and behavior through the harness.

Applied to Agents, Harness is a runtime framework that “equips” the model’s reasoning capabilities, enabling it to actually do things.

Harness consists of six components:

Orchestration: Decides the Agent’s action flow: what to do first, what to do next, what to do when an error occurs, and when to stop and ask a human. This is the Agent’s “work methodology.” Orchestration determines whether the Agent takes it one step at a time (ReAct mode), makes a plan first and then executes it (Plan-and-Execute mode), or reflects on itself after completing a task (Reflexion mode). Without orchestration logic, an Agent is like a new employee without a workflow: capable, but with no idea how to apply that capability.

Tools: External capabilities the Agent can use. Search engines, calculators, code executors, database queries, API calls: these are all tools. The model itself can’t access the internet, but the Harness can give it a search engine tool; the model itself can’t execute code, but the Harness can give it a code sandbox. Tools are the bridge between the Agent and the real world. An Agent without tools is like a person locked in a room: able to think, but unable to get information from outside or affect anything outside. The quality and number of tools directly shape the boundary of what the Agent can do. Give an Agent a good search engine and it can answer questions about recent events; give it a database query tool and it can analyze company data; give it a code executor and it can do calculations and data processing.

Memory: Lets the Agent remember what happened before. Memory has two layers. The first is short-term memory, the context of the current conversation: what you said earlier, what the Agent did earlier, and what the intermediate results are. Short-term memory lives in the context window and is limited by its size, typically thousands to hundreds of thousands of tokens. The second is long-term memory, knowledge accumulated across conversations. For example, if you told the Agent last week what your company is called and what its main business is, it should still remember when you talk again this week. Long-term memory is usually stored in an external database and retrieved into the context when needed. An Agent without memory starts every conversation as if it had amnesia. That is unacceptable in many scenarios: you wouldn’t want to reintroduce yourself to your “digital colleague” every day.

Sandbox: A secure execution environment. When an Agent needs to run code, manipulate files, or access systems, the sandbox keeps these operations within a controlled scope so they can’t break your computer or leak your data. Picture the sandbox as a glass room: you can see everything the Agent does inside, but it can’t touch anything outside. If it runs a faulty program inside the sandbox, only the sandbox environment crashes; your computer is unaffected. This is an important safeguard for Agent security. An Agent without a sandbox is like a race car without seat belts: performance may be excellent, but the risk is uncontrollable.

State Management: Tracks task progress. Which step are we on? What were the previous results? Which subtasks are done and which are still in progress? Which tools have already been called, and what did they return? State management keeps the Agent coherent across multi-step tasks. Without it, an Agent on a complex task is like a project manager without a notebook: halfway through it forgets what it was doing, or repeats work that is already finished.

Safety Mechanisms: Prevent the Agent from doing things it shouldn’t. Access control, content moderation, and behavior restrictions ensure the Agent won’t overstep its authority, won’t generate harmful content, and won’t access sensitive data without your authorization. Safety mechanisms work at several layers: input-layer filtering (blocking malicious instructions), execution-layer controls (limiting the Agent’s operational permissions), and output-layer review (checking whether the Agent’s output is compliant). An Agent without safety mechanisms is like a company without rules and regulations: it may run efficiently in the short term, but sooner or later something will go badly wrong.

Dependencies between Harness components:

These six components are not isolated; they depend on one another and work together. Orchestration logic decides when to call tools; the results that tools return are tracked by state management; memory gives orchestration logic its historical context; the sandbox provides a security boundary for tool execution; safety mechanisms guard the boundaries of every component.

These six components combined together are Harness. Its relationship with the Model is like the relationship between an operating system and a CPU. The Model provides computing capability; Harness provides the runtime environment, resource scheduling, and behavior norms.
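To make the division of labor concrete, here is a minimal sketch of a harness that wires the six components around a single tool call. It is an illustration under stated assumptions, not a real framework: the `Harness` class, its field names, and the `calc` and `delete_files` tools are all hypothetical, and the "sandbox" is reduced to catching tool failures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Harness:
    """Hypothetical container for the six components described above."""
    tools: Dict[str, Callable[[str], str]]                    # Tools
    allowed_tools: Set[str]                                   # Safety: permission list
    short_term: List[str] = field(default_factory=list)       # Memory: current context
    long_term: Dict[str, str] = field(default_factory=dict)   # Memory: cross-session facts
    state: List[dict] = field(default_factory=list)           # State: log of finished steps
    max_steps: int = 10                                       # Orchestration: when to stop

    def run_tool(self, name: str, arg: str) -> str:
        # Orchestration: refuse to continue past the step budget.
        if len(self.state) >= self.max_steps:
            raise RuntimeError("step budget exhausted; hand over to a human")
        # Safety: block tools the user has not authorised.
        if name not in self.allowed_tools:
            raise PermissionError(f"tool {name!r} is not permitted")
        # Sandbox (stand-in): a failing tool must not crash the harness.
        try:
            result = self.tools[name](arg)
        except Exception as exc:
            result = f"error: {exc}"
        # State + short-term memory: record what happened for the next step.
        self.state.append({"tool": name, "arg": arg, "result": result})
        self.short_term.append(f"{name}({arg}) -> {result}")
        return result


harness = Harness(
    tools={"calc": lambda expr: str(sum(int(x) for x in expr.split("+"))),
           "delete_files": lambda path: "deleted"},
    allowed_tools={"calc"},
)
print(harness.run_tool("calc", "2+3"))      # permitted tool
print(harness.run_tool("calc", "2+x"))      # tool fails, error is contained
```

The model is deliberately absent from this sketch: everything shown is the environment the model would run inside. A production sandbox would isolate the process or container rather than just catching exceptions.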

Why This Formula Matters

Agent = Model + Harness — the value of this formula lies in it helping you establish a clear thinking framework.

When you encounter an Agent product, you can use this formula to deconstruct it: what Model does it use? Is the reasoning capability strong enough? Is its Harness well-designed? Are the tools sufficient? Is memory reliable? Are the safety mechanisms robust?

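Many Agent products that look flashy are really just a pile of fancy Harness components stacked on a Model whose reasoning capability can’t carry them. It’s like fitting an airplane’s yoke and instrument panel to a tricycle: it looks professional, but it still won’t fly. Other products use a top-tier Model but a crude Harness: too few tools, badly leaking memory, and safety in name only. That is like giving an airplane a driver who has only just passed a driving test: nothing is wrong with the hardware, but the control system doesn’t match it.

An excellent Agent product is a careful match of Model and Harness. It is like a good car: the engine needs enough horsepower, and the transmission, chassis, suspension, and brakes have to keep up. An engine alone won’t do, and neither will a chassis alone.

One point is easy to overlook: the stronger the Model, the simpler the Harness can be. A Model with very strong reasoning may need only a handful of tools to complete a task, because its ability to “think” makes up for gaps in “doing.” Conversely, if the Model’s capability is average, a more complex Harness is needed to compensate: more tools, finer-grained orchestration logic, and stricter result verification.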

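Practical case comparisons:

| Scenario | Weak Model + complex Harness | Strong Model + simple Harness | Comparison |
| --- | --- | --- | --- |
| Code generation | Needs 10+ tools (syntax checking, style checking, test running, dependency analysis) and multi-layer validation | Needs only a code execution tool and a basic sandbox | The strong-Model approach develops 3x faster with a lower error rate |
| Data analysis | Needs predefined analysis templates, data cleaning tools, visualization tools, and statistical testing tools | Needs only a Python execution environment and a data reading tool | The strong Model can handle more complex analysis logic |
| Customer service dialogue | Needs an intent classifier, entity extractor, knowledge base retrieval, and a reply template engine | Needs only a knowledge base retrieval tool | The strong Model understands context more accurately and replies more naturally |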

The Coinbase case is quite convincing. They used Claude to build a customer service Agent, giving the Agent only a few core tools (query account information, query transaction records, search knowledge base, create tickets), but Claude’s powerful reasoning capability allowed it to flexibly combine these tools to solve complex problems. Result: processing thousands of messages per hour, 99.99% availability, 35-50 internal AI applications derived from this.

The Tines case demonstrates the delicacy of Harness design. Their security operations Agent uses Claude to dynamically process workflow logic, compressing complex multi-step security operations into single Agent operations, improving time value by 100x. The key isn’t more tools, but the orchestration logic letting the Agent “smartly” use tools.

Anthropic, in their widely cited report “Building Effective Agents,” specifically emphasized one principle: start with the simplest solution.

What does this mean? Many people want to build an all-powerful Agent right from the start — giving it dozens of tools, complex memory systems, multi-layered safety mechanisms. But many scenarios simply don’t need that much complexity. A powerful Model plus a few carefully designed tools is often more effective than a mediocre Model plus fifty tools.

The report also has an important distinction: Workflow and Agent are two different things.

Workflow uses pre-defined code paths to orchestrate LLM calls. Developers design the process in advance — first call the model to do A, then call the model to do B based on results, then pass B’s results to C — the entire process is fixed and predictable. This is like an assembly line: what each step does and how it does it are all pre-designed.

An Agent lets the LLM dynamically guide its own process. The model itself decides what to do next, which tools to use, and when to stop. This is more like an employee with autonomy: you give them a goal, and they use their own judgment to complete it.

The key distinction lies in control. In Workflow, control is in the code’s hands; in Agent, control is in the model’s hands.
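The difference in who holds control can be shown in a few lines. In this sketch, `scripted_model` is a hypothetical stand-in for an LLM, and the two tools return canned strings; the point is only the shape of the control flow, not a real integration.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]
TOOLS: Dict[str, Tool] = {
    "lookup_order": lambda q: f"order status for {q}: shipped",
    "search_kb": lambda q: f"kb article about {q}",
}


def run_workflow(question: str) -> List[str]:
    """Workflow: the code fixes the path; every run calls the same tools in the same order."""
    return [TOOLS["search_kb"](question), TOOLS["lookup_order"](question)]


def run_agent(question: str, choose: Callable[[str, List[str]], str], max_steps: int = 5) -> List[str]:
    """Agent: `choose` (a stand-in for the model) picks the next tool or 'stop'."""
    results: List[str] = []
    for _ in range(max_steps):
        decision = choose(question, results)
        if decision == "stop":
            break
        results.append(TOOLS[decision](question))
    return results


def scripted_model(question: str, results: List[str]) -> str:
    """Hypothetical stand-in for an LLM: route on the question, stop after one tool call."""
    if results:
        return "stop"
    return "lookup_order" if "order" in question else "search_kb"


print(run_workflow("order 42"))
print(run_agent("order 42", scripted_model))
print(run_agent("reset password", scripted_model))
```

The workflow always runs both tools, whatever the question. The agent runs a different tool depending on the input, which is more flexible but also means the path can differ from run to run.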

So when to use Workflow, and when to use Agent?

Anthropic’s suggestion is: use the simplest method first. If a pure Model (without any Harness components) can complete the task, then don’t add Harness. If adding one tool can handle it, don’t add ten. If a fixed Workflow can meet the requirements, don’t make it a higher-autonomy Agent. Only when simple solutions truly can’t meet the requirements should you gradually increase complexity.

This principle runs through the entire Agent design philosophy.


2.2 The ReAct Loop: Reasoning → Action → Observation

Now you know Agent = Model + Harness. But you might be curious: how exactly does Harness drive the Agent to “do things”?

The answer is: through a loop called ReAct.

ReAct is the core execution mode of Agents; the name stands for Reasoning + Acting. Once you understand this loop, you understand how an Agent “thinks.”

A Real Example

Suppose you ask an Agent to help you do this: “Help me check Tesla’s stock price trend over the past week, analyze the reasons, and then write a brief report.”

How would a human analyst do this?

Step 1, think: I need to first check stock price data. I can use a search engine or financial data website.

Step 2, act: Open a browser, search Tesla stock price.

Step 3, observe: The data shows the price fell from $245 to $231 over the past week, with a sharp drop on Wednesday.

Step 4, think: What event might Wednesday’s drop be tied to? I need to check that day’s news.

Step 5, act: Search “Tesla May 14 news.”

Step 6, observe: Tesla released earnings that day, and revenue came in below expectations.

Step 7, think: Now I have sufficient information, I can write the brief.

Step 8, act: Write the brief.

You see, the process of human work is: think → act → observe → think → act → observe… constantly looping. ReAct formalizes this process.

What the Agent does is almost exactly the same as what this human analyst does. The only difference is that each step of the Agent is faster — searching only takes seconds, organizing information only takes seconds, writing the brief only takes tens of seconds. But the thinking pattern is the same.

Internal Mechanism of the Three-Step Loop

The ReAct loop consists of three steps:

Step 1: Reasoning

The Agent first “thinks.” It analyzes the current situation, decides what to do next, selects which tool to use. This step is entirely completed by the Model (Large Language Model).

The Agent generates a segment of “inner monologue,” like: “I now have the stock price data, but I still need to understand the reason for the drop, so I should search for related news.” This inner monologue is not for human eyes — it’s the Agent’s reasoning process, and will serve as the basis for the next action.

The quality of this reasoning process directly determines the Agent’s performance. If the reasoning capability is strong, the Agent will make correct judgments — selecting the right tool, setting the right parameters, formulating the right strategy. If the reasoning capability is weak, the Agent might make wrong choices — using the wrong tool, searching for irrelevant keywords, omitting important information.

Step 2: Acting

The Agent “acts.” It calls a tool — such as a search engine, code executor, API — to acquire information or execute an operation. This step is completed by Harness, because Harness is responsible for managing tools and the execution environment.

There’s an important detail in the action step: parameter generation. The Agent doesn’t just decide “I want to search,” it also needs to decide “what keyword to search for.” This parameter is generated by the Model, then passed to the tool managed by Harness. After the tool finishes executing, it returns a result.

So you see, the action step is actually a collaboration between Model and Harness: the Model is responsible for decision-making and parameter generation; Harness is responsible for tool management and execution.

Step 3: Observation

The Agent “looks at the result.” What did the tool return? What was searched? What did the code output? The Agent takes the result and uses it as input for the next round of reasoning. Then it returns to step 1, continuing to think based on new information.

The observation step typically ends in one of two ways. In the positive case, the tool returned useful information, and the Agent can keep pushing forward on that basis. In the negative case, the tool returned an error, an empty result, or unexpected information, and the Agent needs to adjust its strategy.

Reasoning → Action → Observation → Reasoning → Action → Observation… this loop repeats continuously until the Agent believes the task is completed, or encounters a problem it can’t solve on its own and needs to ask a human for help.
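The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real implementation: `fake_model` stands in for an LLM call and returns canned output, and `search_news`, `TOOLS`, and `react_loop` are hypothetical names introduced here.

```python
from typing import Callable


def search_news(query: str) -> str:
    # Hypothetical tool stub; a real Harness would call a search API here.
    return f"3 articles found for '{query}'"


# Tool registry managed by the Harness (assumed structure).
TOOLS: dict[str, Callable[[str], str]] = {"search_news": search_news}


def fake_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: returns a thought plus a tool call or a final answer."""
    if not any(line.startswith("Observation:") for line in history):
        return {
            "thought": "I still need the reason for the drop, so I should search the news.",
            "tool": "search_news",
            "args": "ACME stock drop",  # parameter generated by the Model
        }
    return {
        "thought": "I have enough information.",
        "final": "The drop follows weak earnings news.",  # canned answer for the demo
    }


def react_loop(task: str, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = fake_model(history)                   # Step 1: reasoning (Model)
        history.append(f"Thought: {step['thought']}")
        if "final" in step:                          # the Model judges the task complete
            return step["final"]
        result = TOOLS[step["tool"]](step["args"])   # Step 2: acting (Harness runs the tool)
        history.append(f"Observation: {result}")     # Step 3: observation feeds the next round
    return "Stopped: step limit reached, asking a human for help."


print(react_loop("Why did ACME stock drop?"))
```

The `max_steps` cap is one simple way to make the loop stop and hand control back to a human when it cannot finish on its own.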

Why ReAct Works

The brilliance of ReAct lies in how it tightly interweaves “thinking” and “acting.”

Before ReAct emerged, there were two mainstream methods:

One was “pure thinking.” Give the model a large task, let it figure out all the steps at once, then output the result in one breath. The problem is: the model’s “context window” is limited; it’s impossible for it to reason out a complete answer without intermediate results. It’s like asking you to write a complete industry analysis report without drafts or reference materials — almost impossible. Moreover, the pure thinking method easily produces “hallucinations” — the model has no real data as basis, and can only “fabricate” to fill information gaps.

The other was “pure acting.” Write a fixed program that calls this API first, then calls that API, and pieces the results together. The problem is: the program is dead; it won’t adjust strategy based on intermediate results. If there’s an anomaly in the stock price data, the program won’t “realize” this anomaly and go look for the cause. If the search results mix in irrelevant information, the program won’t “realize” it needs to change keywords and re-search.

ReAct combines both. Each step thinks first, then acts, then thinks after finishing. This allows the Agent to both utilize the model’s reasoning capability for flexible decision-making, and acquire real-world feedback through actual actions. This is like a person walking in a maze — not planning every step in advance (because the maze might change at any time), nor walking blindly (because that way you’ll never get out), but walking one step, looking at the situation, then adjusting direction based on the new situation seen at each step.

Limitations of ReAct

But ReAct isn’t omnipotent. It has several obvious limitations.

First, cost and speed. Each step requires one call to the large language model, and every call costs both money and time. A 10-step ReAct loop might call the model 10 times, a search engine 5 times, and a code executor 3 times. If one model call costs $0.05 and one search costs $0.01, a single task might cost between $0.50 and $1. This doesn’t sound like much, but it adds up quickly when the same task runs thousands of times a day.

Second, error accumulation. If each step succeeds 95% of the time, the overall success rate after 10 steps is only about 60%. Errors at each step affect the reasoning of every later step: if the second step retrieves wrong information, all subsequent analysis is built on a wrong foundation. The more steps, the more serious the cumulative error.

Third, context window limitations. As the loop progresses, the Agent needs to remember more and more intermediate results and tool return information. The context window is limited; when there’s too much information, either early information gets truncated (causing “forgetting”), or information gets compressed (causing loss of details).
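The “forgetting” caused by truncation can be illustrated with a small sketch. It assumes a deliberately crude token count (one token per whitespace-separated word) and a hypothetical `trim_history` helper; real systems use the model's own tokenizer and often summarize old entries instead of dropping them.

```python
def count_tokens(text: str) -> int:
    # Crude assumption for illustration: one token per whitespace-separated word.
    return len(text.split())


def trim_history(history: list[str], budget: int) -> list[str]:
    """Keep the first entry (the task) plus the newest entries that fit the budget."""
    if not history:
        return []
    task, rest = history[0], history[1:]
    remaining = budget - count_tokens(task)
    kept: list[str] = []
    for entry in reversed(rest):      # walk from newest to oldest
        cost = count_tokens(entry)
        if cost > remaining:
            break                     # everything older is dropped ("forgotten")
        kept.append(entry)
        remaining -= cost
    return [task] + kept[::-1]        # restore chronological order


history = [
    "Task: analyze stock",
    "Observation: one two three",
    "Observation: four five",
    "Thought: done now",
]
print(trim_history(history, budget=8))
```

With a budget of 8 tokens, only the task and the most recent thought survive; the two earlier observations are lost, which is exactly the trade-off described above.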

The industry has settled on a rule of thumb: good Agents are all short flows. Practical experience suggests that an Agent should reach a human intervention point or checkpoint within 10 steps. Beyond that, both error probability and cost rise sharply.

Cost analysis of the ReAct loop (using GPT-5.5 as example):

| Loop steps | Model calls | Estimated token consumption | Cost per task | Cumulative success rate (95% per step) |
| --- | --- | --- | --- | --- |
| 3 | 3 | ~6,000 | $0.012 | 85.7% |
| 5 | 5 | ~10,000 | $0.020 | 77.4% |
| 10 | 10 | ~20,000 | $0.040 | 59.9% |
| 20 | 20 | ~40,000 | $0.080 | 35.8% |

As the table shows, at 10 steps the success rate falls just below 60%, while the cost is double that of a 5-step run. This is why production-grade Agents typically adopt a “divide and conquer” strategy: break a large task into several small tasks, each completed within 5 steps, with human checkpoints in between.
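The numbers in the table can be reproduced with a short calculation. The sketch below assumes that every step succeeds independently with the same probability, and it uses the per-step token count (2,000) and price ($0.002 per 1,000 tokens) implied by the table's rows; these are illustrative figures, not real pricing.

```python
def cumulative_success(per_step: float, steps: int) -> float:
    """Overall success probability when every step must succeed independently."""
    return per_step ** steps


def task_cost(steps: int, tokens_per_step: int = 2_000,
              usd_per_1k_tokens: float = 0.002) -> float:
    """Cost of one task; the defaults are the values implied by the table."""
    return steps * tokens_per_step / 1_000 * usd_per_1k_tokens


def table_rows(step_counts=(3, 5, 10, 20), per_step: float = 0.95):
    """Return (steps, tokens, cost in USD, success rate in %) for each loop length."""
    return [
        (n, n * 2_000, round(task_cost(n), 3),
         round(cumulative_success(per_step, n) * 100, 1))
        for n in step_counts
    ]


for steps, tokens, cost, success in table_rows():
    print(f"{steps:>2} steps  ~{tokens:>6,} tokens  ${cost:.3f}  {success}%")
```

Changing `per_step` shows how sensitive long loops are to per-step reliability: small improvements per step compound across the whole chain.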

Anthropic also clearly stated in “Building Effective Agents”: don’t use an Agent for the sake of using an Agent. In many scenarios, a pre-defined workflow — that is, a fixed program flow, plus model calls at a few key nodes — is more reliable, cheaper, and more controllable than a fully autonomous Agent.
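To make the contrast concrete, here is a minimal sketch of a pre-defined workflow: the sequence of steps is fixed in code, and the model is called at exactly one node. The names `fetch_prices`, `format_report`, and `run_workflow` are hypothetical, the price data is canned, and `summarize` is a placeholder for whatever model call you would plug in.

```python
from typing import Callable


def fetch_prices(ticker: str) -> list[float]:
    # Stub for a fixed API call; returns canned data for illustration.
    return [102.0, 101.5, 96.0]


def format_report(ticker: str, summary: str) -> str:
    return f"[{ticker}] {summary}"


def run_workflow(ticker: str, summarize: Callable[[str], str]) -> str:
    prices = fetch_prices(ticker)                        # fixed step 1: always runs
    change = (prices[-1] - prices[0]) / prices[0] * 100  # fixed step 2: plain arithmetic
    prompt = f"{ticker} moved {change:.1f}% over the period. Summarize in one sentence."
    summary = summarize(prompt)                          # the only model call, at one key node
    return format_report(ticker, summary)                # fixed step 3: always runs


# A trivial stand-in for the model so the sketch runs without any API.
print(run_workflow("ACME", lambda prompt: "Prices fell over the period."))
```

Because the control flow never depends on model output, this design has one model call per task and no loop that can wander, which is where its reliability and cost advantages come from.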

ReAct is a powerful foundational pattern, but in practical applications, it often needs to be combined with other design patterns. For example, you can use the Plan-and-Execute pattern to first formulate an overall plan, then use ReAct loops within each subtask. Or, you can use the Reflexion pattern to perform self-checking and correction after the ReAct loop finishes.

This is what the next section will cover.


2.3 From ChatBot to Agent: Three-Year Evolution Path

To truly understand Agents, the best way is to look at how they came to be.

Over the past three years, the form of AI products has undergone a clear evolution path: Copilot → Agent → Digital Labor. Each step represents a qualitative change in capability, not just quantitative.

Stage 1: Copilot (2023-2024)

In early 2023, ChatGPT ignited the global AI boom. Back then, what everyone talked about most was “AI assistant” or “copilot.”

The core pattern of Copilot is: human leads, AI assists.

You’re writing code and get stuck halfway, Copilot helps you complete the second half. You’re writing an email and don’t know how to phrase it, Copilot helps you polish it. You’re making a PPT and don’t know how to organize the content, Copilot helps you outline it.

In Copilot mode, AI doesn’t take initiative. It won’t go search for information on its own, won’t call APIs on its own, won’t make decisions on its own. It just provides a suggestion, a draft, a reference when you need it.

This is like an intern sitting next to you. You ask them questions, they answer. You say “help me write this,” they write it and show it to you. But they won’t proactively do things on their own, won’t arrange work priorities on their own, won’t coordinate with others on their own.

The product form of Copilot is also very intuitive: a dialog box, you type text, AI outputs text. The interaction is linear, single-turn, passive.

The iconic products of this stage are GitHub Copilot (code completion), ChatGPT (conversation assistant), Notion AI (writing assistance). They all share one characteristic: waiting for the user to initiate a request, then giving a response. Without user input, they do nothing.

The main technical challenge of this stage is whether the model itself is smart — can it give high-quality answers. All innovation was concentrated at the Model level: larger parameter counts, better training data, stronger reasoning capabilities. Harness barely existed at this stage — the most complex “Harness” might just be a conversation history management feature.

Stage 2: Agent (2025-2026)

From late 2024 to early 2025, the industry began shifting from Copilot to Agent.

The Agent’s pattern became: AI leads, human supervises.

What’s the difference?

Copilot is “you tell it what to do, it does it.” Agent is “you tell it the goal, it plans how to do it itself.”

Here’s an example. You say to Copilot: “help me analyze competing products,” it will give you an analysis framework or a template, but the specific things are still for you to do yourself — search for competitor information, collect data, organize comparison tables, write analysis text. You say the same thing to an Agent, and it will go search for competitor information, collect data, organize comparison tables, analyze strengths and weaknesses, and finally hand the report to you. You might only need to wait a few minutes, and a complete competitor analysis report appears in front of you.

From “advisor” to “executor,” this is a fundamental change.

To enable AI to “execute” rather than just “suggest,” Agents need capabilities that Copilots don’t have:

  • Using tools: Can call search engines, databases, APIs, code executors
  • Planning capability: Can break down a large goal into multiple sub-steps
  • Memory capability: Can remember previous conversations and intermediate results
  • Autonomous decision-making: Can adjust strategies based on intermediate results
  • Error recovery: Can try a different approach when encountering failure

These capabilities combined together are what we earlier called Harness.

This stage has five major design patterns. Understanding them helps you understand the technical panorama of Agents:

ReAct (Reasoning + Acting): Discussed in detail in the previous section. It is the loop of reasoning → action → observation, suited to tasks that require real-time information and flexible adaptation. Its advantage is flexibility: the Agent can dynamically adjust its strategy based on the result of each step. Its disadvantage is efficiency: every step requires a model call, which makes it slow and costly.

Plan-and-Execute (Planning + Execution): The Agent first formulates a complete plan, then executes it step by step. Suitable for complex tasks, because the planning stage can consider the global picture all at once, avoiding the short-sightedness brought by ReAct’s “one step at a time” approach. Imagine renovating a house — you don’t design as you go; you draw the design first, then construct according to the drawing. Plan-and-Execute is this logic. Its advantage is global vision and execution efficiency. The disadvantage is lack of flexibility — if unexpected situations arise during execution, re-planning might be needed.

Reflexion (Reflection): After completing a task, the Agent will self-reflect — what did I do right? What did I do wrong? How to improve next time? This mimics human learning, improving capability through “review.” An Agent using Reflexion will check its own code after writing it; if it finds logic errors, it will correct them itself, rather than directly handing you code with bugs. This is particularly useful in scenarios requiring quality control like programming and writing.

Multi-Agent: Multiple Agents collaborate to complete tasks. One Agent is responsible for searching, one for analysis, one for writing, one for review. Like a team, each member has their own expertise. The advantage of Multi-Agent systems is professionalism and efficiency brought by division of labor. Each Agent can focus on what it’s good at, without needing one Agent to do everything. The disadvantage is coordination cost — Agents need to communicate, synchronize, resolve conflicts, all of which require additional mechanisms to manage.

Tool Use: Agents call external tools as needed. This is the most basic and most frequently used capability; all other patterns depend on it. An Agent without Tool Use capability is like a person locked in a room: able to think, but unable to obtain information from the outside world or to affect it.

These five patterns are not mutually exclusive. An actual Agent product often combines several of them. For example, an Agent doing deep research might use Plan-and-Execute for overall planning, ReAct loops to execute each subtask, and finally Reflexion to check the quality of the report.
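One way such a combination could look is sketched below: plan first, execute each step, then review the result and feed the critique into the next plan. All names (`plan_execute_reflect`, `planner`, `executor`, `critic`) are hypothetical, and the three stubs return canned values in place of real model and tool calls.

```python
from typing import Callable, Optional


def plan_execute_reflect(
    goal: str,
    planner: Callable[[str, list[str]], list[str]],
    executor: Callable[[str], str],
    critic: Callable[[list[str]], Optional[str]],
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        steps = planner(goal, feedback)             # Plan-and-Execute: plan the whole task first
        results = [executor(s) for s in steps]      # execute each step (could be a ReAct loop)
        problem = critic(results)                   # Reflexion: review the finished work
        if problem is None:
            return results, round_no
        feedback.append(problem)                    # the critique shapes the next plan
    raise RuntimeError(f"no acceptable result after {max_rounds} rounds")


def planner(goal: str, feedback: list[str]) -> list[str]:
    # Stand-in for a planning call to the model.
    if feedback:
        return ["search competitors", "collect pricing", "write report"]
    return ["search competitors", "write report"]


def executor(step: str) -> str:
    return f"done: {step}"                          # stand-in for real execution


def critic(results: list[str]) -> Optional[str]:
    # Stand-in for a self-review call: returns a problem description or None.
    if not any("pricing" in r for r in results):
        return "report lacks pricing data"
    return None


results, rounds = plan_execute_reflect("competitor analysis", planner, executor, critic)
print(rounds, results)
```

In this toy run the first plan omits pricing, the review catches it, and the second plan adds the missing step.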

Stage 3: Digital Labor (2026+)

The third stage, which is happening but hasn’t fully unfolded, is the evolution of Agents into “digital labor.”

The characteristic of this stage is: AI doesn’t just execute tasks, but assumes roles.

What does this mean?

In the Agent stage, you give it a specific task, and it goes to complete it. Tasks are discrete, one-time. After completing one, it waits for the next instruction. In the digital labor stage, AI occupies a continuous role — such as “customer service specialist,” “data analyst,” “market researcher” — it continuously works in this role, with clear responsibility scope, work processes, and evaluation criteria.

It’s not a tool you occasionally use, but a “colleague.” You don’t need to give it specific instructions each time; it knows what to do, when to do it, and how to do it. It has its own “work schedule”: check emails first thing in the morning, process customer service tickets later in the morning, analyze data in the afternoon, and generate reports in the evening. It can switch between multiple tasks, proactively ask for instructions when a situation is uncertain, and proactively report once the work is done.

Imagine your company has a “digital market researcher.” It automatically browses industry news every morning, organizing developments that might affect your company; automatically generates competitor analysis reports every week; automatically generates market trend insights every month. You don’t need to give it instructions every day — it knows its responsibilities and continuously executes.

The technical difficulty of this stage is much greater than the previous two. Agents need capabilities like long-term memory, continuous learning, cross-task coordination, interpersonal communication. Requirements for security, controllability, and reliability also surge — you can’t let a “digital colleague” make unauthorized decisions, can’t let it say inappropriate things when communicating with customers, can’t let it make irrecoverable errors when processing financial data.

The industry has a sobering statistic: 88% of enterprise Agent projects stall on the way to production and never truly go live.

Why? Because there’s a huge gap between Demo and production environment.

Agents in Demo environments face carefully prepared inputs, simple tasks, and error-tolerant audiences. You give it a clear instruction, it executes a few steps, gives a good result. The audience is happy, feels that AI can really do things.

Agents in production environments face ever-changing user inputs (“half of what users say are typos, half are omitted sentences”), complex edge cases (“this user’s data format is different from the standard format”), zero-tolerance errors (“you sent a grammatically incorrect email to a customer? Unacceptable.”).

Production environments also bring a series of engineering issues: cost, latency, concurrency, monitoring, and auditing. A Demo-level Agent might spend 30 seconds and $1 per task. But if your business needs to run 10,000 such tasks per day, that is $10,000 per day in cost and 300,000 seconds of execution time (nearly 3.5 days if run serially). You need to consider parallel execution, caching, degradation strategies, and cost optimization. None of these are problems of Agent “intelligence”; they are engineering problems.
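The arithmetic behind these figures is simple enough to check in code. The sketch below uses the numbers from the paragraph above; `daily_load` is a hypothetical helper, and the worker estimate assumes perfect scheduling with no queueing or retries, so it is a lower bound.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def daily_load(tasks_per_day: int, seconds_per_task: float, usd_per_task: float) -> dict:
    """Estimate daily cost and the serial time needed to run every task."""
    total_seconds = tasks_per_day * seconds_per_task
    return {
        "cost_usd": tasks_per_day * usd_per_task,
        "total_seconds": total_seconds,
        "serial_days": total_seconds / SECONDS_PER_DAY,
        # Lower bound on parallel workers needed to finish within one day.
        "min_parallel_workers": math.ceil(total_seconds / SECONDS_PER_DAY),
    }


load = daily_load(tasks_per_day=10_000, seconds_per_task=30, usd_per_task=1.0)
print(load)
```

Under these assumptions a single worker would take about 3.47 days to process one day of tasks, so at least 4 workers must run in parallel just to keep up.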

From Copilot to Agent to digital labor, this is not a smooth upward curve, but a step-by-step ladder. Each step up, technical difficulty and engineering complexity multiply.


2.4 Cognitive Architecture: Function Calling and the Four Layers

Chapter 1 covered model limits; this section shows how the Harness turns the model into an executable Agent system (complements the ReAct loop in 2.2, with emphasis on interfaces and layering).

From String Parsing to Native Function Calling

Early Agent stacks often extracted tool calls by parsing free-form strings in the model’s output, which was fragile and error-prone. Native Function Calling (2024-2026) has the model return structured JSON through the API instead. This brings a higher parse success rate, type checking, parallel calls, and converging standards (OpenAI Functions, MCP, Anthropic Tools).

Function Calling is a type-safe bridge between the model and the outside world: the model picks tools; the interface layer validates and runs them — upper Agent logic can stay stable when you swap models.
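The “validate, then run” role of the interface layer can be sketched as follows. This is an illustration only: the schema format here is a simplified, hypothetical one (parameter name mapped to a Python type), not the actual OpenAI or MCP schema format, and `get_weather` is a stub.

```python
import json

# Hypothetical, simplified schemas: parameter name -> expected Python type.
TOOL_SCHEMAS = {"get_weather": {"city": str, "days": int}}


def get_weather(city: str, days: int) -> str:
    return f"{days}-day forecast for {city}: sunny"   # stub result


TOOL_IMPLS = {"get_weather": get_weather}


def dispatch(raw_call: str) -> str:
    """Validate a structured tool call produced by the model, then run the tool."""
    call = json.loads(raw_call)                       # structured JSON, not free text
    name, args = call["name"], call["arguments"]
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"unknown tool: {name}")
    if set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):       # basic type check before execution
            raise TypeError(f"{key} must be {expected.__name__}")
    return TOOL_IMPLS[name](**args)


model_output = '{"name": "get_weather", "arguments": {"city": "Beijing", "days": 3}}'
print(dispatch(model_output))
```

Because validation happens in `dispatch` rather than in the model, the same tool definitions can sit behind different models, which is the stability benefit described above.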

Four-Layer Agent Architecture

| Layer | Responsibility | Key components | Examples |
| --- | --- | --- | --- |
| Cognitive | What to do | ReAct, planning | Plan-and-Execute, LLM Compiler |
| Interface | How to call | Function Calling, schemas | OpenAI Functions, MCP |
| Execution | Run safely | Tools, sandbox | Docker, E2B |
| Orchestration | Flow & multi-agent | State, workflows | LangGraph, CrewAI, AutoGen |

Simple tasks, such as a weather lookup, usually bottleneck at the interface or execution layer; complex tasks, such as an industry report, bottleneck at the cognitive layer. Surveys from 2026 also catalog roughly 17 reasoning paradigms (ReAct, CoT, ToT, Reflexion, etc.); see 2.2 and later chapters.


2.5 What Agents Can Do, What They Can’t

Having discussed the technical nature and evolution path of Agents, let’s finally discuss the most practical question: what exactly can Agents do? What can’t they do?

This question is practical, because currently market expectations for Agents are a bit inflated. Many people feel Agents can do anything, or will soon be able to do anything. Building products with such expectations will likely lead to wrong decisions.

Conversely, some people, after seeing a few failures of Agents, assert that “Agents are just hype.” This view is also inaccurate.

Let me evaluate directly.

What Agents Are Good At

What Agents are best at are tasks that satisfy these three conditions:

First, having a clear goal. Agents need to clearly know what “success” looks like. “Help me write a competitor analysis report” — the goal is clear, the Agent knows what to output. “Help me understand the market” — the goal is vague, the Agent doesn’t know to what extent “understanding” counts as “understood.” The clearer the goal, the better the Agent performs.

This is not a problem unique to Agents. Ask a human employee to “help me understand the market,” and they won’t know how far to go either. But a human can ask: “What aspect do you want to understand? Depth or breadth? For what decision?” Agents are currently far worse than humans at asking clarifying questions, so they depend more on the goal being clear.

Second, can be decomposed into steps. Agents aren’t geniuses who figure out answers in one shot, but “workers” who “break large tasks into small steps, execute step by step.” If a task can be broken into a process like search → organize → analyze → output, Agents can usually do quite well.

Third, results of each step can be verified. After calling a search tool, the Agent can judge whether the search results are useful. After writing a piece of code, it can run it to see if there are errors. Results of each step can be checked, which is the foundation of Agent self-correction. If a certain step’s result is wrong, the Agent can go back to the reasoning stage, adjust strategy and retry.

Typical tasks that meet these three conditions include:

  • Information collection and organization: Collect information from multiple sources, compile into a structured report. For example, “help me collect major news from the AI industry in the past week.” This task has a clear goal (collect news), can be decomposed (search → filter → organize → output), and results are verifiable (whether each news item is relevant is clear at a glance).

  • Code generation and debugging: Write code, run code, look at errors, fix — this process is naturally suitable for the ReAct loop. The Agent writes a piece of code, runs it, if there’s an error, analyzes the error message, modifies the code, then runs it again. This “write → run → fix” loop is one of the areas where Agents perform most impressively.

  • Document processing: Read PDFs, extract table data, do format conversion. Tasks of this type have clear rules, clear steps, and Agents rarely make mistakes when executing them.

  • Data analysis: Query databases, calculate metrics, generate charts. Give an Agent a database access permission and a set of analysis requirements, and it can usually automatically generate SQL queries, execute calculations, output visualization results.

  • Content creation: Write articles based on materials, do translation, modify formatting. Note that this refers to “material-based” creation — given raw materials, the Agent is responsible for organization and expression. Creating in-depth original content from scratch is not yet the Agent’s strong suit.

  • Process automation: Process emails according to rules, fill out forms, update systems. The process for tasks of this type is fixed, and Agents can execute them efficiently in batches.

The common characteristic of these tasks is: clear goals, decomposable processes, verifiable results.

What Agents Can’t Do

Conversely, Agents perform poorly in the following scenarios:

First, open-ended exploration. “Help me find a business opportunity” — tasks of this type have no clear success criteria, and the Agent doesn’t know when it counts as “found it.” It might give you a list of directions, but each one only scratches the surface, lacking true insight. Real business insight requires deep industry understanding, intuitive judgment of markets, trade-off analysis of opportunity costs — these are all currently the model’s weak points.

Second, requiring deep domain knowledge. The Agent’s reasoning capability comes from the model. If the model itself doesn’t have sufficiently deep understanding of a certain professional domain (such as law, medicine, finance), then no matter how good the Harness is, it can’t compensate for this deficiency. You can’t expect an Agent that doesn’t understand medicine to do disease diagnosis, can’t expect an Agent that doesn’t understand Chinese tax law to do tax planning. The model’s domain knowledge has clear boundaries; beyond these boundaries, the Agent’s output becomes unreliable.

Third, involving physical operations in the real world. Agents can help you write an email, but they can’t help you bring a cup of coffee. They can operate at high speed in the virtual world — searching for information, generating code, analyzing data — but the physical world is a completely different challenge. This is the so-called “embodied intelligence” problem, which is far from solved. The Agent’s “hands” are API calls, not mechanical arms.

Fourth, requiring long-term consistent roles. It’s easy to have an Agent execute a task, but having it continuously play a role, maintain consistent behavior patterns and decision criteria, is very difficult. The limitation of the context window means Agents will “forget” previous decisions after conversations get too long. It might say “this feature shouldn’t be added” today, and might say “this feature should be added” in a new context tomorrow. Humans can maintain long-term consistent values and decision criteria; Agents currently can’t do this. Long-term memory technology is still rapidly developing, but there’s still distance from truly reliable “continuous roles.”

Fifth, high-risk decisions. Decisions involving money, safety, legal liability shouldn’t be fully entrusted to Agents. Agents can provide suggestions, organize information, simulate scenarios, calculate probabilities, but the final decision-making power must remain in human hands. The reason is simple: Agents make mistakes, and the way they make mistakes might be different from humans — not because of carelessness, but because of fundamental misjudgment of situations. When an experienced human decision-maker makes a mistake, you know where they went wrong, and can review and improve. When an Agent makes a mistake, its “reasoning process” might be opaque, and you can hardly diagnose the root cause of the error.

Real Data on Success Rates

Let’s look directly at the data.

In the coding domain, SWE-bench (a standardized code problem-solving benchmark) shows that the best Agent systems can solve about 50-60% of real GitHub problems. This means that in relatively standardized programming tasks, Agents have reached the capability of mid-level engineers — but for the other 40-50% of problems, they remain at a loss.

In the information retrieval and analysis domain, Agents typically perform better than pure models, because they can actually search the internet for real-time information, rather than relying on static knowledge from model training. But the quality of search results varies, and Agents might be misled by low-quality information.

In complex multi-step tasks, success rate drops sharply as the number of steps increases. If the success rate of each step is 95%, then the overall success rate after 10 steps is only about 60% (0.95^10 ≈ 0.60). After 20 steps, this number drops to 36%. If you require an Agent to complete a 50-step task, with 95% success rate per step, the overall success rate is less than 8%.

So “good Agents are all short flows” isn’t a preference, it’s determined by math.
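The arithmetic behind this rule can be written out. This is a sketch under the simplifying assumption, implicit in the numbers above, that every step succeeds independently with the same probability; the symbols p (per-step success rate), n (number of steps), and P_min (lowest acceptable overall success rate) are introduced here for illustration.

```latex
P(n) = p^{n}, \qquad
P(n) \ge P_{\min}
\;\Longleftrightarrow\;
n \le \frac{\ln P_{\min}}{\ln p}
\quad (0 < p < 1,\ \text{so } \ln p < 0 \text{ flips the inequality})

p = 0.95,\; P_{\min} = 0.6:\quad
n \le \frac{\ln 0.6}{\ln 0.95}
\approx \frac{-0.5108}{-0.0513}
\approx 9.96
\;\Rightarrow\; n \le 9
```

In other words, under these assumptions, keeping the overall success rate at or above 60% allows at most 9 steps, which lines up with the 10-step rule of thumb.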

A Practical Judgment Framework

Facing a specific scenario, you can use the following five questions to judge whether an Agent is suitable:

1. Can this task be described in a process within 10 steps?

If not, Agent might not be the best choice. For tasks exceeding 10 steps, success rate and cost both deteriorate sharply.

2. Can the results of each step be checked or verified?

If not, the Agent can’t self-correct, and errors accumulate. The best Agent scenarios are those where each step has clear “right/wrong” feedback.

3. Is the task’s fault tolerance high?

If making one mistake is unacceptable (such as sending an email that shouldn’t be sent), Agents currently aren’t reliable enough. If errors can be corrected (such as writing a wrong piece of code, which can be fixed), Agents are more suitable.

4. Does the mainstream model possess the domain knowledge required by the task?

If very professional knowledge is needed (such as legal clauses of specific countries), fine-tuning or RAG (Retrieval-Augmented Generation) might be needed to supplement the model’s knowledge.

5. Can human intervention be limited to a few key nodes?

If every step requires human approval, the Agent’s automation advantage disappears. The ideal situation is that humans only intervene at key nodes, and the Agent handles most steps on its own.

If the answers to all five questions are affirmative, then Agent is a very good choice. If more than three answers are negative, you need to seriously consider whether it’s worth investing.
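The five questions can be turned into a small checklist. In the sketch below each field is phrased so that True is the favorable answer; the class and function names are hypothetical, and the “mixed” outcome for one to three negative answers is an assumption added here, since the text only defines the two extremes.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentFit:
    fits_in_10_steps: bool                 # question 1
    steps_verifiable: bool                 # question 2
    errors_tolerable: bool                 # question 3
    model_has_domain_knowledge: bool       # question 4
    human_review_only_at_key_nodes: bool   # question 5


def recommend(fit: AgentFit) -> str:
    negatives = sum(1 for f in fields(fit) if not getattr(fit, f.name))
    if negatives == 0:
        return "good fit"
    if negatives > 3:
        return "reconsider the investment"
    return "mixed: review the failing questions"   # assumption: not defined in the text


print(recommend(AgentFit(True, True, True, True, True)))
print(recommend(AgentFit(True, False, False, False, False)))
```

A checklist like this is only a starting point for discussion; it makes the reasoning explicit, but the weight of each question still depends on the specific business scenario.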

Final Words

Agents aren’t omnipotent, but they also aren’t an empty concept. They are a real, rapidly evolving technological paradigm.

Understanding the nature of Agents — Model + Harness, understanding how they operate — the ReAct loop, understanding their evolution path — from Copilot to digital labor, understanding their capability boundaries — what they’re good at, what they can’t do — these four things combined together constitute the cognitive foundation for judging all Agent products.

The next chapter will dive into the specific design of Harness, looking at how those six components work together. But no matter how complex the technical details later become, remembering this formula is enough: Agent = Model + Harness.

The success or failure of all Agent products can be traced to the two variables of this formula.