第5章工具（Tools）— Agent的外部能力

# 第5章工具（Tools）— Agent的外部能力

模型能想，但不能做。

它不能上网搜索，不能执行代码，不能发邮件，不能查数据库。它的全部能力就是：接收一段文字，输出一段文字。

这是模型的天然局限。就像一个天才分析师，坐在办公室里什么都知道，但一出门就什么都干不了——因为他没有手没有脚。

工具就是给Agent装上手脚。

5.1 Function Calling的原理

我选工具时会先问：少接一个，核心流程还成立吗。

模型怎么”调用”工具

大语言模型本身不能调用任何东西。它只能生成文字。

Function Calling的原理是这样的：

第一步：定义工具。你告诉模型有哪些工具可用，每个工具的参数是什么。比如”搜索工具：接受一个query参数，返回搜索结果”。

第二步：模型决策。模型分析用户的请求，判断需要用哪个工具，然后生成一段结构化的JSON文本，描述它想调用什么工具、传什么参数。

第三步：系统执行。Harness读取模型生成的JSON，实际调用工具，拿到结果。

第四步：结果返回。工具的执行结果被拼回上下文，模型根据结果继续推理。

所以，模型并没有真正”调用”工具。它只是输出了”我想调用这个工具”的指令，由Harness来实际执行。

这就像一个导演说”我要这个场景用暖光”——导演不自己打光，灯光师来执行。模型是导演，Harness是灯光师。

为什么工具定义很重要

工具定义的质量直接影响Agent的表现。

你给Agent一个搜索工具，描述写的是”搜索互联网”。这个描述太模糊了。Agent可能在该搜索的时候不搜索，不该搜索的时候乱搜索。

更好的描述是：“当用户的问题需要最新信息、或者你不确定答案时，使用这个工具搜索互联网。不要用于已知事实的查询。”

描述越精确，Agent越不容易用错工具。

严格工具契约

Anthropic有一个重要建议：给工具定义严格的契约。

窄范围。每个工具只做一件事。搜索工具只负责搜索，不要同时承担”搜索+过滤+排序”的功能。功能越单一，模型越不容易用错。

类型化参数。用Pydantic Schema定义参数类型。query是字符串，limit是整数，filters是字典。类型越明确，模型生成的参数越不容易出错。

清晰描述。每个工具的描述要写清楚它能做什么、不能做什么、什么时候该用它。

反面教材是宽泛工具，比如execute_python——让模型写任意Python代码执行。这种工具太危险了，模型可能写出删除文件的代码，也可能写出死循环。宁可提供十个窄范围的专用工具，也不要一个万能的宽泛工具。

怎么实现：用Pydantic定义工具参数：

工具定义的最佳实践：

原则	反面教材	正面教材
范围窄	`execute_python`: “执行任意Python代码”	`calculate`: “执行数学计算表达式”
描述清晰	`search`: “搜索互联网”	`search`: “当需要最新信息或你不确定答案时，搜索互联网。不要用于已知事实。“
参数类型化	`query: any`	`query: str`, `limit: int (1-20)`
有边界	`read_file`: “读取文件”	`read_file`: “读取指定路径的文本文件，不支持二进制文件，最大1MB”
有示例	无示例	`query示例: "Python asyncio教程"`

5.2 MCP：Agent连接工具的USB接口

一个真实的痛点

2025年之前，每个Agent框架都有自己的工具集成方式。LangChain有LangChain的写法，AutoGen有AutoGen的写法，CrewAI有CrewAI的写法。

同一个工具——比如一个搜索API——要为不同框架写不同的适配器。你有10个工具、3个框架，就要写30个适配器。这太荒谬了。

就像USB出现之前，每个设备都有自己的接口。鼠标是圆口，键盘是PS/2，打印机是并口。你换一台电脑，所有线都要换。

MCP就是Agent世界的USB。

MCP怎么解决这个问题

MCP（Model Context Protocol）由Anthropic在2025年提出，2026年已捐赠给Linux Foundation，成为行业事实标准。

MCP定义了一套协议：工具怎么描述自己、Agent怎么发现工具、调用请求怎么传输、结果怎么返回。

有了MCP，工具开发者只需要写一次适配器，所有支持MCP的Agent框架都能直接使用。

核心概念

Server：提供工具的一方。比如一个搜索服务、一个数据库服务、一个文件系统服务。

Client：使用工具的一方。通常是Agent框架。

Tools：Server暴露的具体能力。每个Tool有名称、描述、输入参数的Schema。

Resources：Server可以提供的数据。比如文件内容、数据库记录。

MCP的2026现状

截至2026年5月，MCP生态已经非常成熟：

数千个活跃Server，覆盖了几乎所有常见工具——数据库、搜索引擎、文件系统、SaaS API、代码仓库。

Claude、Cursor、VS Code、Google Gemini全面支持MCP。你在Cursor里写代码，可以直接调用MCP Server提供的工具，不需要额外配置。

怎么实现：如果你要用MCP，最简单的方式是在Cursor或Claude Code中直接使用已有的MCP Server。如果你想自己开发MCP Server，Anthropic提供了Python和TypeScript的SDK：

MCP Server配置（Cursor/Claude Code中使用）：

2026年MCP生态热门Server：

Server	功能	使用场景
filesystem	文件系统操作	读写本地文件
postgres/sqlite	数据库查询	结构化数据访问
brave-search	网络搜索	获取最新信息
github	GitHub操作	代码仓库管理
slack	Slack消息	团队协作
puppeteer	浏览器自动化	网页抓取、测试
fetch	HTTP请求	调用任意API

5.3 A2A协议：Agent之间的名片交换

MCP vs A2A

MCP解决的是Agent与工具之间的连接问题。A2A（Agent-to-Agent）解决的是Agent与Agent之间的通信问题。

一个Agent想让另一个Agent帮忙做事，它们之间怎么沟通？

Agent Card

A2A协议的核心概念是Agent Card。每个Agent发布一个JSON格式的卡片，描述自己能做什么、怎么调用、需要什么参数。

其他Agent看到这个卡片后，就知道怎么和它协作。

这就像一张名片。你拿到名片，就知道对方是做什么的、怎么联系、能提供什么服务。

与MCP的配合

MCP和A2A不是竞争关系，而是互补关系。

MCP管Agent与工具的连接：搜索、数据库、文件系统。 A2A管Agent与Agent的协作：任务委托、结果传递、能力发现。

一个完整的Agent系统，通常同时需要MCP和A2A。

A2A Agent Card示例：

A2A与MCP的协作流程：

5.4 Computer Use：Agent操作计算机

概念

Computer Use让Agent能像人一样操作计算机——看屏幕、移动鼠标、点击按钮、输入文字。

Agent看到屏幕截图，用视觉模型理解界面上有什么元素，然后决定下一步操作。每一步操作后，屏幕会变化。Agent看到新的截图，继续决策。

什么时候用

Computer Use适合操作没有API的软件。比如操作一个老旧的企业系统、操作一个没有开放API的网站。

如果软件有API，直接调API更可靠。Computer Use是最后的手段，不是首选。

局限

Computer Use的速度比API调用慢很多——每一步都需要截图、识别、决策。一个API调用可能只要200毫秒，一次屏幕操作可能要2-5秒。

准确率也不如API。模型可能看错按钮的位置，可能误读界面上的文字。

所以，能用API就别用Computer Use。

5.5 工具编排：单工具、链式、并行

单工具调用

最简单的场景：Agent调用一个工具，拿到结果，继续推理。

比如用户问”今天北京天气怎么样”，Agent调用天气API，拿到结果，回答用户。

链式调用

多个工具按顺序调用，前一个的输出是后一个的输入。

比如：搜索”特斯拉最新财报”→拿到财报链接→读取PDF内容→提取关键数据→生成分析报告。

并行调用

多个工具同时调用，互不依赖。

比如：同时搜索三个不同来源的信息。搜索A、搜索B、搜索C同时进行，结果汇总后一起分析。

并行调用能把执行时间缩短不少。原来需要3次串行搜索（每次2秒，共6秒），并行后只要2秒。

怎么实现：在LangGraph中，并行调用通过同一个super-step内的多个节点实现：

本章小结

工具是Agent的外部能力。核心要点：

Function Calling：模型生成工具调用指令，Harness实际执行。
MCP：Agent连接工具的USB接口。一次适配，处处可用。2026年已成为行业标准。
A2A：Agent之间通信的名片交换协议。
Computer Use：操作计算机的备选方案。能用API就别用它。
工具编排：单工具、链式、并行，按任务结构选择。

下一章讲Harness的第三个子系统：记忆（Memory）。Agent怎么记住之前发生过什么。

# Ch05 Tools — Agent's External Capabilities

Models can think, but they cannot act.

They cannot search the internet, execute code, send emails, or query databases. Their entire capability is: receive text, output text.

This is the fundamental limitation of models. Like a genius analyst who knows everything while sitting in their office, but can’t do anything once they step outside — because they have no hands or feet.

Tools are what give Agent hands and feet.

5.1 How Function Calling Works

How Models “Call” Tools

Large language models cannot call anything by themselves. They can only generate text.

The principle of Function Calling is as follows:

Step 1: Define tools. You tell the model which tools are available and what parameters each tool accepts. For example, “Search tool: accepts a query parameter, returns search results.”

Step 2: Model decision. The model analyzes the user’s request, determines which tool to use, and then generates a structured JSON text describing which tool it wants to call and what parameters to pass.

Step 3: System execution. The Harness reads the JSON generated by the model, actually calls the tool, and gets the results.

Step 4: Return results. The tool execution results are appended back to the context, and the model continues reasoning based on the results.

So, the model doesn’t actually “call” the tool. It only outputs instructions saying “I want to call this tool”, and the Harness actually executes it.

This is like a director saying “I want warm lighting for this scene” — the director doesn’t set up the lights themselves; the lighting technician does it. The model is the director, the Harness is the lighting technician.

Why Tool Definitions Matter

The quality of tool definitions directly affects Agent performance.

If you give an Agent a search tool with the description “Search the internet”, this description is too vague. The Agent might not search when it should, or search recklessly when it shouldn’t.

A better description would be: “Use this tool to search the internet when the user’s question requires the latest information or when you’re uncertain about the answer. Do not use for queries about known facts.”

The more precise the description, the less likely the Agent is to misuse the tool.

Strict Tool Contracts

Anthropic has an important recommendation: define strict contracts for tools.

Narrow scope. Each tool does only one thing. A search tool is only responsible for searching, don’t let it also handle “search + filter + sort” functionality. The more singular the function, the less likely the model is to misuse it.

Typed parameters. Use Pydantic Schema to define parameter types. query is a string, limit is an integer, filters is a dictionary. The clearer the types, the less likely the model is to generate incorrect parameters.

Clear description. Each tool’s description should clearly state what it can do, what it cannot do, and when it should be used.

A counterexample is broad tools, like execute_python — letting the model write and execute arbitrary Python code. This type of tool is too dangerous; the model might write code that deletes files or creates infinite loops. Better to provide ten narrow-purpose specialized tools than one万能的 broad tool.

Implementation: Define tool parameters with Pydantic:

Best practices for tool definitions:

Principle	Anti-pattern	Best Practice
Narrow scope	`execute_python`: “Execute arbitrary Python code”	`calculate`: “Execute mathematical calculation expressions”
Clear description	`search`: “Search the internet”	`search`: “Search the internet when you need latest info or are unsure of answer. Don’t use for known facts.”
Typed parameters	`query: any`	`query: str`, `limit: int (1-20)`
Has boundaries	`read_file`: “Read file”	`read_file`: “Read text file at specified path, no binary files, max 1MB”
Has examples	No examples	`query example: "Python asyncio tutorial"`

5.2 MCP: The USB Interface for Agent Tool Connections

A Real Pain Point

Before 2025, each Agent framework had its own way of integrating tools. LangChain had its way, AutoGen had its way, CrewAI had its way.

The same tool — say, a search API — had to have different adapters for different frameworks. If you had 10 tools and 3 frameworks, you’d need to write 30 adapters. This was absurd.

Just like before USB existed, each device had its own interface. Mice used round ports, keyboards used PS/2, printers used parallel ports. When you switched computers, all the cables had to be changed.

MCP is the USB of the Agent world.

How MCP Solves This Problem

MCP (Model Context Protocol) was proposed by Anthropic in 2025, and donated to the Linux Foundation in 2026, becoming the industry de facto standard.

MCP defines a set of protocols: how tools describe themselves, how Agents discover tools, how invocation requests are transmitted, and how results are returned.

With MCP, tool developers only need to write an adapter once, and all Agent frameworks that support MCP can use it directly.

Core Concepts

Server: The party providing tools. For example, a search service, a database service, a file system service.

Client: The party using tools. Usually an Agent framework.

Tools: Specific capabilities exposed by the Server. Each Tool has a name, description, and input parameter Schema.

Resources: Data that the Server can provide. For example, file content, database records.

MCP in 2026

As of May 2026, the MCP ecosystem is very mature:

Thousands of active Servers, covering almost all common tools — databases, search engines, file systems, SaaS APIs, code repositories.

Claude, Cursor, VS Code, and Google Gemini fully support MCP. When you write code in Cursor, you can directly call tools provided by MCP Servers without additional configuration.

Implementation: If you want to use MCP, the simplest way is to directly use existing MCP Servers in Cursor or Claude Code. If you want to develop your own MCP Server, Anthropic provides Python and TypeScript SDKs:

MCP Server configuration (used in Cursor/Claude Code):

Popular MCP Servers in the 2026 ecosystem:

Server	Function	Use Case
filesystem	File system operations	Read/write local files
postgres/sqlite	Database queries	Structured data access
brave-search	Web search	Get latest information
github	GitHub operations	Code repository management
slack	Slack messages	Team collaboration
puppeteer	Browser automation	Web scraping, testing
fetch	HTTP requests	Call any API

5.3 A2A Protocol: Business Card Exchange Between Agents

MCP vs A2A

MCP solves the connection problem between Agents and tools. A2A (Agent-to-Agent) solves the communication problem between Agents.

When one Agent wants another Agent to help with a task, how do they communicate?

Agent Card

The core concept of the A2A protocol is Agent Card. Each Agent publishes a JSON-format card describing what it can do, how to call it, and what parameters it needs.

When other Agents see this card, they know how to collaborate with it.

This is like a business card. When you get a business card, you know what the other person does, how to contact them, and what services they can provide.

Cooperation with MCP

MCP and A2A are not competitive, but complementary.

MCP manages Agent-tool connections: search, databases, file systems. A2A manages Agent-Agent collaboration: task delegation, result passing, capability discovery.

A complete Agent system usually needs both MCP and A2A.

A2A Agent Card example:

A2A and MCP collaboration flow:

5.4 Computer Use: Agent Operating a Computer

Concept

Computer Use lets Agents operate computers like humans — see the screen, move the mouse, click buttons, type text.

The Agent sees a screenshot, uses a vision model to understand what elements are on the interface, and then decides the next action. After each operation, the screen changes. The Agent sees the new screenshot and continues making decisions.

When to Use

Computer Use is suitable for operating software that doesn’t have an API. For example, operating an old enterprise system, or operating a website that doesn’t have an open API.

If the software has an API, calling the API directly is more reliable. Computer Use is the last resort, not the first choice.

Limitations

Computer Use is much slower than API calls — each step requires screenshot, recognition, and decision-making. An API call might take only 200ms, but a screen operation might take 2-5 seconds.

Accuracy is also not as good as API. The model might misread the position of a button, or misinterpret text on the interface.

So, if you can use an API, don’t use Computer Use.

5.5 Tool Orchestration: Single Tool, Chained, Parallel

Single Tool Call

The simplest scenario: Agent calls one tool, gets the result, and continues reasoning.

For example, when a user asks “What’s the weather like in Beijing today”, the Agent calls the weather API, gets the result, and answers the user.

Chained Calls

Multiple tools are called in sequence, where the output of the previous one is the input of the next.

For example: search “Tesla latest financial report” → get the financial report link → read PDF content → extract key data → generate analysis report.

Parallel Calls

Multiple tools are called simultaneously, independent of each other.

For example: searching three different sources simultaneously. Search A, Search B, and Search C run in parallel, and the results are aggregated and analyzed together.

Parallel calls can significantly reduce execution time. Originally needing 3 serial searches (2 seconds each, 6 seconds total), after parallelization it only takes 2 seconds.

Implementation: In LangGraph, parallel calls are implemented through multiple nodes within the same super-step:

Chapter Summary

Tools are the Agent’s external capabilities. Key points:

Function Calling: The model generates tool call instructions, the Harness actually executes them.
MCP: The USB interface for Agent tool connections. One adaptation, usable everywhere. Became an industry standard in 2026.
A2A: The business card exchange protocol for communication between Agents.
Computer Use: Alternative for operating computers. If you can use an API, don’t use this.
Tool orchestration: Single tool, chained, parallel — choose according to task structure.

The next chapter covers the Harness’s third subsystem: Memory. How does an Agent remember what happened before.