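Chapter 1: Large Language Models, the Agent's Brain

"Coding" is no longer the right verb. The new verb is "manifesting": expressing your intent. (Andrej Karpathy, December 2025)

Large language models look like they "think," but the underlying mechanism is surprisingly plain: keep predicting the next word. Only once parameter scale, training data, and alignment are stacked on top do the abilities we experience every day emerge, such as writing poetry, writing code, and reasoning.

This chapter covers only the model itself: how it is trained, how it is metered (Tokens), how its randomness is controlled (temperature), multimodality and architectural frontiers (Scaling Law, MoE), reasoning models, and safety alignment. Agent orchestration, tool calling, the ReAct loop, and Function Calling are left to Chapter 2 and beyond.

The chapter covers technical principles, not product methodology. The goal is not to turn you into an engineer but to show you how this "brain" runs, what it is good at, what mistakes it makes, and where its limits are. By the end you will see that many seemingly advanced language abilities grow from one extremely simple starting point: predicting the next word.

1.1 How Large Language Models Work (in Plain Language)

The conclusion first: a large language model does exactly one thing, which is to predict the next word.

That is not a joke. Whether a model has 1 billion parameters or 1.6 trillion, whether it writes poetry, programs, or analyzes legal contracts, the underlying mechanism is the same: given the preceding text, predict the most likely next word (strictly speaking a "token," explained later). The predicted word is appended to the context, the next one is predicted, and so on, one after another, like a rolling snowball.

The first time I heard this explanation I was a little disappointed. "Predicting the next word" sounds nothing like intelligence. Isn't that just the autocomplete on a phone keyboard? You type "I'm going out today to eat" and it offers "lunch." But the deeper you read, the clearer it becomes that this simple mechanism is the source of all the complex abilities. An analogy: the ocean is made of water molecules. A single H₂O molecule is nothing special, yet an astronomical number of them together produce currents, tides, and hurricanes. LLMs are similar. One "predict the next word" step is nothing special, but repeated at the scale of trillions of parameters it gives rise to higher-level behaviors such as reasoning, understanding, and creativity.

Courses and papers often sum this up as "one formula that rules everything." It sounds too simple. The key point is that when you train this next-word predictor on nearly all the text on the internet, it is forced to learn some unexpected things.

Consider: to predict the next sentence of a medical question accurately, you have to understand medicine. To predict the next line of a Python program, you have to understand programming logic. To predict the conclusion of an argument, you have to learn to reason. Nobody teaches the model this knowledge directly. Under the pressure of predicting the next word, it compresses these abilities out of the data on its own. It is like a student who does not study for an exam but is dropped into a purely foreign-language environment and, in order to get by, has to pick up every subtlety of the language.

It is also like an exam question in which you are given the first 99% of a book and asked to guess what the last page says. If your guess is close, you did not merely read the book; you understood its structure, logic, and style. The model does this all the time, and the "books" it has read amount to the entire internet.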

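How Models Are Trained

Training is usually described in three stages. Understanding them helps you judge where a model's abilities come from:

Stage 1: Pre-training. This is the "read ten thousand books" stage. Massive amounts of internet text (books, web pages, code, papers, forum posts) are fed to the model so that it learns to predict the next word, over and over. At the end of this stage the model is like a genius who has read countless books but never talked to anyone: it knows a frightening amount but does not know how to hold a conversation, and may recite Wikipedia at you instead of answering your question.

Take GPT-5.5 as an example. Its pre-training data reached 15 trillion Tokens, equivalent to reading 75 million books. Training used more than 200,000 NVIDIA H200 GPUs, took several months, and cost about $5 billion. Training at this scale lets the model compress a huge amount of human knowledge, but it also creates the "knowledge cutoff" problem: the model does not know what happened after the cutoff date of its training data. Ask it who won Best Picture at this year's Oscars and it will tell you all the Oscar history it knows, but not this year's winner, because this year has not yet been written into its "textbook."

Stage 2: Supervised Fine-Tuning (SFT). This is the "learning social etiquette" stage. High-quality, human-written dialogue data (what to answer, in what format, in what tone) is used to teach the model, hand in hand, how to be an "assistant." A pre-trained model holds a vast amount of world knowledge but does not grasp a basic rule such as "when someone asks you a question, answer it directly." SFT teaches it that. It is a bit like bringing a mountain hermit into the city: he may be extremely learned, but he does not know to look people in the eye when speaking, or that "how are you today" is a pleasantry rather than an invitation to recount his whole day.

SFT data is usually written carefully by specialists and ranges from hundreds of thousands to millions of dialogues. Quality matters far more than quantity: one high-quality SFT example can be worth more than 1,000 low-quality ones. This is also why the gap between open-source and closed-source models often opens up at the SFT stage. Companies such as OpenAI and Anthropic invest heavily in building high-quality SFT datasets, and the open-source community finds that investment hard to replicate.

Stage 3: Reinforcement Learning from Human Feedback / from Verifiable Rewards (RLHF/RLVR). This is the "honing through practice" stage. The model generates several answers, humans or automated evaluators score them, and the model learns from that feedback what separates a good answer from a bad one. RLVR (Reinforcement Learning from Verifiable Rewards) trains against verifiable criteria such as code execution results and mathematical proofs instead of subjective preference, which makes it easier for the model to develop multi-step reasoning and self-checking.

The core advantage of RLVR is verifiability. In traditional RLHF, human annotators score answer quality, but the criteria are subjective and inconsistent; the same person may score differently in the morning and in the afternoon. RLVR uses objective criteria: does the code run? Is the math answer correct? Is the derivation sound? This objective feedback teaches the model to "self-check," that is, to run through a verification pass before committing to a final answer. One interesting finding comes from the DeepSeek-R1 paper: the authors tried training purely with RLVR, without giving the model any human examples, and the model evolved behavior of the form "wait, let me re-check that" on its own. They called this the "Aha Moment": the model spontaneously learned to question itself during training, much as a person solving a problem suddenly realizes they have made a mistake.

The key breakthrough of 2025-2026 is the large-scale use of Inference-Time Compute. OpenAI's o-series models and DeepSeek-R1 showed that letting a model "think a little longer" before answering (generating a longer chain of thought) can significantly raise accuracy on complex tasks. On FrontierMath Tier 4, accuracy rose from 27.1% for GPT-5.4 to 35.4% for GPT-5.5 and then to 39.6% for GPT-5.5 Pro, progress that comes largely from better use of inference-time compute.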

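"Summoned Ghosts"

One apt metaphor: large language models are not general intelligence that evolved, but "summoned ghosts," called into being by training and optimized specifically for imitating text and solving problems.

The metaphor rewards a second look. Ghosts have two traits. First, they can do some things beyond ordinary people (walking through walls, flying) yet are helpless at other, seemingly simple things (picking up a physical cup). Second, you can never be fully sure what they are thinking; they follow their own logic, which is nothing like a human's.

Models are the same. They show superhuman ability on some tasks, such as writing well-structured code or analyzing complex legal provisions, and then fumble something that looks trivial, such as counting how many times the letter "r" appears in a sentence. The model is not stupid; it simply does not "think" the way humans do. It is a very strong pattern-matching and prediction machine, and both the ceiling and the floor of pattern matching sit surprisingly far from human intuition.

Once you understand this, you will neither hold unrealistic expectations nor feel disillusioned when the model errs. The Agent's brain is powerful, but it has its own "cognitive architecture," entirely unlike a human's. Knowing where its limits are actually helps you make better use of its abilities.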

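1.2 Core Concepts: Token, Context Window, Temperature, Reasoning

To work with Agents you do not need to write code, but you do need to understand a few core concepts. They determine an Agent's capability limits and behavioral traits.

Token: The Model's "Atom"

A model does not process text directly. It processes Tokens.

A Token can be a whole word (such as "hello"), part of a word (such as "un" + "break" + "able"), or even a punctuation mark or a space. One Chinese character usually maps to 1-2 Tokens, and one English word usually maps to 1-1.5 Tokens.

Why does this matter? Because every unit of measure for a model is the Token, not the character or the word. APIs bill by Token, context windows are sized in Tokens, and generation speed is measured in Tokens per second. When you say "write me a 2,000-character article," the model is thinking "about 3,000 Tokens."

Pseudocode sketch: how Token counting works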
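The splitting of "unbreakable" into subwords can be illustrated with a minimal sketch. It assumes a greedy longest-match rule and a tiny hypothetical vocabulary (`TOY_VOCAB`); production tokenizers use learned schemes such as BPE with far larger vocabularies, so this shows the idea rather than any real tokenizer's algorithm.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword split (a simplification of real BPE tokenizers)."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"hello", "un", "break", "able", " "}
```

With this vocabulary, `tokenize("unbreakable", TOY_VOCAB)` yields three Tokens, while a word that is absent from the vocabulary, such as "zebra", degrades to one Token per character. That is the sense in which rare words cost more Tokens than common ones.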

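Design notes: The core job of a tokenizer is to represent arbitrary text with the shortest possible sequence of subwords. High-frequency words map directly to their own Token, while low-frequency words are split into several subwords. Because the Chinese character set is large, Chinese typically costs 1-2 Tokens per character; because English has many word forms, English typically costs 1-1.5 Tokens per word. Tokenization is the model's "retina": all semantic processing is built on the Token sequence.

Billing impact: Suppose you build a customer-service Agent that averages 3,000 Tokens per conversation (2,000 input + 1,000 output) on GPT-5.5 (taking a flat $2 per million Tokens). One conversation costs about $0.006. With 10,000 daily active users holding 3 conversations each per day, that is 900,000 conversations and about $5,400 per month. On DeepSeek V3.2 (a flat $0.5 per million Tokens) the same traffic costs about $1,350 per month, a substantial difference.

DeepSeek V4 goes further in this direction. Its 1.6-trillion-parameter architecture (covered in detail later) is aimed at processing Tokens more efficiently: understanding and generating more Tokens with less compute.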

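Context Window: The Model's "Working Memory"

The context window is how many Tokens the model can "see" at once. Think of it as a person's workbench: how many sheets of paper fit on it.

Early models had small context windows, perhaps only 4,096 Tokens (roughly 3,000 Chinese characters). Hand such a model a long article and it sees only the first few pages; the rest is invisible. This severely limited what an Agent could do. If an Agent has to analyze a 50-page contract but can only see a few pages at a time, how can it understand the relationships among the clauses?

By 2026 this limit has largely been lifted. Mainstream models now offer context windows at the million-Token scale:

Model	Context window	Roughly equivalent to	Key technique
Gemini 3 Ultra	10M Tokens	All of Wikipedia	Infinite sliding window
Claude 4 Opus	5M Tokens	About 4,000 book pages	Neural caching
DeepSeek V4	2M Tokens	About 1,600 book pages	Sparse attention optimization
GPT-5.5	1M Tokens	About 800 book pages	Standard configuration
Llama 4	100K Tokens	About 80 book pages	Consumer-grade deployment

This means an Agent can finally process large documents, long codebases, and complete conversation histories in one pass, without reading in segments and piecing the understanding together afterwards.

Practical scenarios:

Legal contract review: a 100-page contract is about 50,000 Tokens, so GPT-5.5 can analyze 20 contracts in one pass and find conflicting clauses
Codebase understanding: a mid-sized project is about 300,000 Tokens, so an Agent can "read" the whole project at once and understand dependencies between modules
Customer lifecycle management: load all of a user's interactions from the past 3 years (about 500,000 Tokens) at once, and the Agent can personalize recommendations from the complete history

There is a common misconception here: that a bigger context window is always better. It is not. A large window does not mean the model makes full use of everything in it. Your workbench may hold 100 sheets of paper, but can you really attend to all 100 at once? Models have an "attention" limit as well. They naturally weight the most recent information and the beginning more heavily, and the middle tends to be overlooked.

The "Lost in the Middle" phenomenon: 2024 research from Stanford University found that models recall information in the middle of the context significantly worse than information at the beginning or end. In Claude 4's 5-million-Token window, if the key information is buried in the middle, the probability that the model finds it may be below 60%. In Agent design, how the context is organized and managed (Context Engineering) therefore matters: put key information at the beginning or end, separate sections with explicit markers, and summarize or compress when necessary.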

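Temperature: Controlling the Model's "Personality"

Temperature is a generation parameter, usually between 0 and 2. It controls how random the model is when it predicts the next word.

At temperature 0 the model always picks the highest-probability word. Output is at its most stable and deterministic but can read as stiff and repetitive, like an extremely cautious person who says only what they are surest of.

At temperature 0.7 the model chooses with moderate randomness among high-probability words. Output has variety and some creativity while remaining largely reliable. This is the default for most scenarios.

At temperature 1.5 the model picks low-probability words more boldly. Output can be very creative, or it can turn into nonsense, like someone two drinks in who talks more and starts to ramble.

For Agent design, the choice of temperature directly affects task quality. Writing code calls for a low temperature (0-0.3), because code is either right or wrong and needs no "creativity." Brainstorming calls for a high temperature (0.8-1.2), because you want diverse ideas. Report writing can use a medium temperature (0.5-0.7): some variation in expression, but no drifting off topic.

Pseudocode sketch: how the temperature parameter affects sampling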
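A minimal sketch of temperature sampling, using only the standard library. The logits are hypothetical, and treating a temperature of 0 as greedy argmax is a common convention assumed here rather than something every API guarantees. The `entropy` helper is added only to make the "more temperature, more randomness" effect measurable.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Convention assumed here: T = 0 means greedy decoding (all mass on the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy in nats; higher means a more random choice."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores for three candidate next tokens.
LOGITS = [2.0, 1.0, 0.1]
```

Comparing `entropy(softmax_with_temperature(LOGITS, t))` for `t` in 0.2, 0.7, and 1.5 shows the entropy rising with temperature, which is the quantitative version of the "cautious person" versus "two drinks in" picture above.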

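Design notes: Temperature is essentially "entropy control" in the information-theoretic sense. As T approaches 0 the output entropy is minimal (deterministic), and as T grows the output entropy grows (diverse). Temperature choice in Agent design should follow a "task determinacy" principle: use low temperature for tasks with verifiable results (code, math) and high temperature for tasks judged subjectively (creative work, copywriting). In production, it is worth A/B testing temperatures on the same task to find the best balance between quality and diversity.

Decision matrix for choosing temperature:

Task type	Recommended temperature	Reason	Typical applications
Code generation/debugging	0.0-0.2	Deterministic output, avoids syntax errors	GitHub Copilot, Cursor
Data analysis/reports	0.2-0.4	Factually accurate, logically rigorous	Financial analysis, data insights
Customer-service dialogue	0.4-0.6	Balance of consistency and naturalness	Customer service bots, after-sales support
Content creation	0.6-0.8	Creative without going off the rails	Marketing copy, social media
Brainstorming	0.8-1.2	Maximize diversity	Creative planning, product innovation
Art/games	1.2-1.5	Break conventions, explore boundaries	Story writing, character design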

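Reasoning: The Model's Ability to "Think"

One of the most important technical breakthroughs of 2024-2025 was the emergence of reasoning ability.

Traditional models "answer directly": you ask, they output the answer, and the intermediate process stays implicit. Reasoning models (such as OpenAI's o-series and DeepSeek-R1) first produce a "thinking process" (Chain of Thought) before the answer, writing out the solution steps one by one.

Why is this so important for Agents? Agents carry out multi-step tasks. They must understand the task, break it into steps, execute each step, check the result, and decide what to do next. That whole process is reasoning. A model without reasoning ability can only handle simple single-step tasks; with it, a model can "think before acting" the way a person does.

Reasoning has a price, though. Reasoning models consume many Tokens while "thinking" (those thinking Tokens are billed too), and they take longer. That is why DeepSeek V4's architectural optimization matters: it cut inference FLOPs (floating-point operations) to 27% of V3's, so the same reasoning costs less than a third of the compute. Agents need to reason over multiple steps frequently, and if every round of reasoning is slow and expensive, Agents remain laboratory toys.

In 2025-2026 the "dual-system" idea from cognitive science is often used as an analogy for models: the psychologist Daniel Kahneman distinguishes System 1 (fast, intuitive) from System 2 (slow, logical). Traditional generation resembles System 1; reasoning models with a visible chain of thought resemble System 2, spending more Tokens and more time before answering in exchange for higher accuracy on complex tasks. Which mode to use, and how to combine it with Agent orchestration, is covered in Chapter 2.

Quantitative comparison of reasoning cost (2026 data):

Model	Input price ($/million Tokens)	Output price ($/million Tokens)	Speed (Tokens/s)	Best suited for
GPT-5.5 Pro	$8	$32	45	Complex reasoning, research, high-end analysis
Claude 4 Opus	$8	$40	38	Long-document analysis, legal, creative writing
GPT-5.5	$2	$8	65	General tasks, code, everyday office work
DeepSeek V4	$0.5	$2	55	Cost-effective reasoning, batch processing
Gemini 3 Pro	$3	$12	72	Multimodal, long context, real-time applications
Claude 4 Haiku	$1	$5	120	Simple tasks, high concurrency, low latency
DeepSeek V3.2	$0.25	$1	85	Cost-sensitive applications, edge deployment

Cost optimization strategy: At the prices above, an Agent application that processes 20 million Tokens a day, split evenly between input and output, costs about $12,000 a month on GPT-5.5 Pro but only about $750 on DeepSeek V4, a 16x difference. This is why a multi-model routing strategy matters so much: send simple tasks to low-cost models and call the high-end model only for complex ones.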
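The monthly-cost comparison is simple arithmetic on the price table, sketched below. The workload of 20 million Tokens per day with an even input/output split is an assumption chosen for illustration, and `monthly_cost` is a hypothetical helper, not part of any vendor SDK.

```python
def monthly_cost(tokens_per_day, usd_in_per_million, usd_out_per_million,
                 input_share=0.5, days=30):
    """Monthly API cost in USD given daily volume and per-million-Token prices."""
    total_tokens = tokens_per_day * days
    tokens_in = total_tokens * input_share
    tokens_out = total_tokens - tokens_in
    return (tokens_in * usd_in_per_million + tokens_out * usd_out_per_million) / 1_000_000


# Prices taken from the table above; the daily volume is an assumed workload.
pro_cost = monthly_cost(20_000_000, 8, 32)        # GPT-5.5 Pro
deepseek_cost = monthly_cost(20_000_000, 0.5, 2)  # DeepSeek V4
```

Under these assumptions `pro_cost` works out to $12,000 and `deepseek_cost` to $750. Shifting `input_share` changes both totals, but as long as both models see the same traffic the ratio stays at 16, because every price in the first row is 16 times its counterpart in the second.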

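1.3 Multimodality: More Than Text (Vision, Speech, Code)

"Multimodal" sounds technical, but the idea is simple: a model can handle not only text but also images, sound, video, and code, and can move freely between these "modalities."

Vision: Models Have Eyes Too

Today's models can "look" at images directly. Give one a photo and it can describe the contents, read the text in it, analyze a chart, and even understand a sarcastic meme.

This greatly widens what an Agent can do. Picture a customer-service Agent: it used to handle text only, and a photo of a broken product meant nothing to it. Now it can analyze the image, identify the problem, and propose a fix. A data-analysis Agent can read a chart directly, without you converting the data to text and describing it.

Quantified multimodal ability (2026 benchmarks):

Capability	GPT-5.5	Claude 4	Gemini 3	Typical applications
Image description accuracy	94.2%	92.8%	96.1%	Accessibility, content moderation
OCR text recognition	97.5%	95.3%	98.2%	Document digitization, invoice processing
Chart understanding	89.7%	91.2%	93.5%	Earnings analysis, data insights
Visual reasoning	85.4%	88.1%	90.3%	Lab experiment analysis, fault diagnosis
Video understanding (1 minute)	82.3%	79.6%	91.7%	Surveillance analysis, content recommendation

A more practical case is OCR (optical character recognition). It used to take a dedicated OCR tool to extract the text from an image before handing it to the model. Now it is a single step: hand the Agent a scanned contract and it reads the content, extracts the key clauses, and compares them against standard clauses. The whole pipeline is shorter.

Pseudocode sketch: the multimodal processing flow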
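As a toy illustration of the flow, the sketch below maps text and image patches into vectors of the same width and concatenates them into one sequence. Both "encoders" are hypothetical stand-ins built from simple arithmetic; real systems use learned embedding tables and vision encoders, and the `<image>` marker is an assumed convention.

```python
DIM = 4  # shared embedding width (an arbitrary toy value)


def embed_text(words, dim=DIM):
    """Hypothetical text encoder: derives a fixed-width vector from character codes."""
    return [[(sum(map(ord, w)) * (j + 1)) % 7 / 7 for j in range(dim)] for w in words]


def embed_image(patches, dim=DIM):
    """Hypothetical image encoder: projects each patch's mean intensity to dim values."""
    return [[(sum(p) / len(p)) * (j + 1) / dim for j in range(dim)] for p in patches]


def build_sequence(words, patches):
    """Concatenate both modalities into one sequence of (tag, vector) pairs."""
    sequence = [("<image>", vec) for vec in embed_image(patches)]
    sequence += [("text", vec) for vec in embed_text(words)]
    return sequence


seq = build_sequence(["red", "cup"], [[0.9, 0.1, 0.1], [0.2, 0.2, 0.8]])
```

Once every element has the same width, a downstream attention layer can treat the sequence uniformly; the tags are kept here only so that the ordering of modalities stays visible.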

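Design notes: The core of multimodality is "unified representation." Information from each modality is mapped by its own encoder into one shared high-dimensional vector space, after which the model no longer distinguishes "this is text" from "this is an image" and simply runs self-attention over a unified Token sequence. This architecture enables cross-modal reasoning, such as "which word in the description refers to the red object in the picture." In Agent design, the order and tagging of multimodal inputs affect how well the model understands cross-modal relationships.

Speech: Hearing Your Voice

Speech turns an Agent from a "typing tool" into a conversational partner. Speech models in 2025-2026 reach very low latency (300-500 ms end to end), hold real-time conversations, recognize emotion, and can even murmur "mm-hm" at the right moment while you talk to show they are listening.

For Agent products, voice interaction changes everything. It lets you use an Agent while driving, cooking, or exercising. It lets Agents serve people who are not comfortable typing (older people, children, people with visual impairments). It also brings new challenges: speech is much fuzzier than text, so the Agent needs stronger context understanding and intent inference.

Technical metrics for voice Agents (2026):

Metric	Industry level	Notes
End-to-end latency	300-500 ms	From the user finishing to the Agent starting to respond
Speech recognition accuracy (Chinese)	97.8%	Mandarin in a quiet environment
Emotion recognition accuracy	85.3%	Recognizing anger, happiness, anxiety, and similar emotions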

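1.4 Frontier Theory I: Scaling Law, the "Physical Law" of AI Capability

If anything in the large-model field comes close to a "physical law," it is the Scaling Law. It describes the mathematical relationship between model performance and three core variables: parameter count (N), training data size (D), and compute (C).

The Foundational Findings of Kaplan et al.

In 2020, Jared Kaplan, Sam McCandlish, and colleagues at OpenAI first laid out this regularity systematically in the paper "Scaling Laws for Neural Language Models" (arXiv:2001.08361). They trained hundreds of models, from millions to billions of parameters, and found something striking:

Model performance (measured as cross-entropy loss) follows a power law in each of the three factors:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Here L is the loss (lower is better), N_c, D_c, and C_c are critical constants, and α_N, α_D, and α_C are the fitted power-law exponents. This means that when you double the parameter count the loss falls by a power law, and when you double the training data the loss also falls by a power law, but at a different rate.

Key insight: Kaplan et al. found that under a fixed compute budget, the best strategy is to train a very large model on relatively little data (stopping far short of convergence). The reason is that large models are markedly more sample-efficient; they learn more from each Token. This finding directly motivated GPT-3 (175 billion parameters).

Chinchilla: Rebalancing Data and Model

In 2022, Hoffmann et al. at DeepMind revised Kaplan's conclusion in "Training Compute-Optimal Large Language Models" (the Chinchilla paper). They found a key problem in Kaplan's experiments: the models were not trained long enough. Once models are trained adequately on more data, the optimal strategy changes.

Chinchilla's core conclusion: under a fixed compute budget, parameter count (N) and training Tokens (D) should grow in equal proportion, that is N ∝ D. Concretely, a 70-billion-parameter model should be trained on 1.4 trillion Tokens (a ratio of about 1:20), not the roughly 300 billion Tokens the earlier recipe implied.

This means earlier large models (GPT-3 included) were undertrained. With the same compute they could have done better by training a smaller model on more data. The Chinchilla paper directly prompted a shift in training strategy: GPT-4, Llama 2/3/4, DeepSeek, and others all took the "data-intensive" route.

New in 2025-2026: Scaling Laws for Inference Efficiency

Scaling Law research has not stopped at training. In 2025, researchers at Amazon Web Services and UW-Madison proposed a conditional Scaling Law in "Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs" (arXiv:2510.18245v2), bringing architectural information into the scaling framework.

They found that traditional Scaling Laws ignore the effect of architecture on inference efficiency. Under the same training budget, an optimized architecture (for example, adjusting the MLP-to-attention ratio or using grouped-query attention, GQA) can raise accuracy by 2.1% while increasing inference throughput by 42%.

Pseudocode sketch: architecture search under a conditional Scaling Law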
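The paper's fitted formulas are not reproduced here. As a hedged stand-in, the sketch below shows only the final selection step such a search implies: given predicted loss and throughput for several candidate architectures, keep the ones on the Pareto front. All candidate names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted_loss: float  # lower is better (would come from a fitted scaling law)
    throughput: float      # tokens per second, higher is better


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.predicted_loss <= b.predicted_loss and a.throughput >= b.throughput
    better = a.predicted_loss < b.predicted_loss or a.throughput > b.throughput
    return no_worse and better


def pareto_front(candidates):
    """Keep candidates that no other candidate dominates, sorted by predicted loss."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.predicted_loss)


# Hypothetical candidates; the values are illustrative, not measurements.
CANDIDATES = [
    Candidate("A", 2.10, 100.0),
    Candidate("B", 2.05, 142.0),
    Candidate("C", 2.20, 90.0),
    Candidate("D", 2.00, 80.0),
]
front = pareto_front(CANDIDATES)
```

Here B dominates A and C, so the front keeps only D (lowest loss) and B (highest throughput). Which of the two to deploy then depends on the latency budget of the target environment.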

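Design notes: The core insight of the conditional Scaling Law is that training efficiency is not inference efficiency. A model that does well in training can still perform poorly at inference time because of memory-bandwidth bottlenecks or attention complexity. By bringing architectural parameters into the scaling framework, we can predict the Pareto front (the trade-off curve between accuracy and latency) of different architectures before training and pick the one best suited to the target deployment. This matters for Agent design: an Agent calls the model frequently, so inference efficiency directly affects response time and cost.

Three Scaling Laws Working Together

In 2025-2026 the industry came to recognize that Scaling Laws apply not only to training but also to post-training and to inference (Test-time Compute):

Pre-training Scaling: the traditional route, "a bigger model plus more data." It costs the most but lays down the base capabilities.
Post-training Scaling: techniques such as RLHF, RLVR, and SFT markedly improve specific abilities with comparatively little compute.
Test-time Compute Scaling: letting the model "think a little longer" at inference time, improving performance on complex tasks through a longer chain of thought.

OpenAI's o-series and DeepSeek-R1 showed that test-time scaling can unlock abilities the base model cannot reach. o3 scored 45.1% on ARC-AGI (an abstract reasoning benchmark), where traditional LLMs score close to 0%. Capability gains therefore no longer depend entirely on training bigger models: a relatively small model with extra inference-time compute can match or even beat a large one.

Implications for Agent design: Agent systems should exploit all three Scaling Laws together. Use the pre-trained model for base capability, post-training (such as domain-specific RLVR) for specialized behavior, and inference-time compute (such as dynamic chain-of-thought length) for higher accuracy on complex tasks. This layered strategy is more cost-effective than simply chasing a bigger model.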

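1.5 Frontier Theory II: Mixture of Experts (MoE), Sparsely Activated Intelligence

The most significant architectural change of 2025-2026 is that MoE (Mixture of Experts) became the de facto standard. From DeepSeek-R1, Kimi K2, and Mistral Large 3 to GPT-4 (reportedly), nearly every frontier model uses an MoE architecture.

From Dense to Sparse: The Core Idea of MoE

A traditional Transformer is "dense": every Token activates all of the model's parameters. A 175-billion-parameter model performs on the order of 175 billion computations for each Token. It is like asking "what's the weather today" and mobilizing the entire editorial staff of an encyclopedia.

MoE takes its inspiration from the brain, where different regions handle different functions: language, vision, movement. When you process language, the visual areas do not run at full tilt. MoE imitates this "activate on demand" mechanism.

Core components of MoE:

Experts: several small neural networks (for example feed-forward networks, FFN), each specializing in some kind of input.
Router / Gating Network: a lightweight network that decides which experts should process each Token.
Sparse activation: for each Token only the Top-K experts are activated (typically K = 2-8); the remaining experts take no part in the computation.

Pseudocode sketch: the forward pass of an MoE layer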
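A minimal sketch of Top-K routing in plain Python follows. The router weights and the four "experts" (each simply scales its input) are hypothetical toys, and renormalizing the gate weights over the selected experts is one common choice, assumed here; real MoE layers use learned FFN experts and batched tensor operations.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Router: one logit per expert (dot product of the expert's row with the token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts only
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)          # only the selected experts ever run
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


def make_expert(scale):
    return lambda vec: [scale * v for v in vec]


# Hypothetical toy setup: 4 experts, 2-dimensional tokens.
EXPERTS = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
ROUTER = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
out, chosen = moe_forward([1.0, 0.0], ROUTER, EXPERTS, top_k=2)
```

For this token the router prefers experts 3 and 1, so only two of the four experts execute. The layer's total parameter count grows with the number of experts, while per-Token compute grows only with `top_k`.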

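Design notes: The elegance of MoE lies in decoupling "total capacity" from "active computation." A model can have an enormous total parameter count (to store vast knowledge) while using only a small fraction of those parameters per inference (to stay efficient). This "trade storage for compute" strategy fits current hardware trends well: GPU memory is growing faster than compute units, and MoE makes full use of memory capacity while keeping computation in check. The design of the gating network is critical. It has to learn "semantic routing": sending code Tokens to programming experts and medical Tokens to medical experts. In essence this is an implicit form of task decomposition.

MoE in 2025-2026: From Expert Collapse to Load Balancing

Early MoE models faced a serious problem: Expert Collapse. The gating network tended to route every Token to a handful of "general-purpose" experts, so the other experts were never trained and model capacity was wasted.

Technical advances in 2024-2026 addressed this:

Load Balancing Loss: add an auxiliary loss to the training objective that penalizes unbalanced routing. If one expert receives too many Tokens the loss rises, forcing the gating network to spread the load.
Better expert selection: moving from "Top-K" to "Expert Choice." Instead of Tokens choosing experts, experts choose Tokens, which guarantees that each expert handles a balanced number of Tokens.
Shared Experts: some experts are "shared" (every Token passes through them) and others are "specialized" (activated on demand). Shared experts learn general representations; specialized experts learn domain-specific knowledge.

NVIDIA's observation (December 2025): on the independent Artificial Analysis leaderboard, all of the top 10 most intelligent open-source models use an MoE architecture. Since early 2023, MoE has enabled a nearly 70x increase in model intelligence. Kimi K2 Thinking achieves 10x the performance on NVIDIA GB200 NVL72 compared with HGX H200, cutting Token cost to one tenth.

DeepSeek's MoE Innovation: MLA and MoE Together

DeepSeek has made distinctive contributions to MoE architecture. Beyond the standard MoE design, it introduced Multi-head Latent Attention (MLA), an attention mechanism that greatly reduces the footprint of the KV cache (Key-Value Cache).

In a standard Transformer, attention requires storing Key and Value vectors for every Token, so memory use grows linearly with sequence length. MLA uses low-rank compression to shrink the KV cache several-fold. This sharply lowers memory pressure when DeepSeek models handle long text and lays the groundwork for a 2-million-Token context window.

Pseudocode sketch: KV cache of MLA versus standard MHA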
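The memory comparison reduces to counting stored values per Token, sketched below. The layer counts and dimensions are hypothetical round numbers rather than any released model's configuration, and the MLA function models only the idea of caching one compressed latent vector per Token per layer.

```python
def mha_kv_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Standard multi-head attention: cache a Key and a Value per head, per layer."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


def mla_kv_bytes(seq_len, n_layers, latent_dim, rope_dim=0, bytes_per_value=2):
    """Latent attention: cache one compressed vector (plus optional positional part)."""
    return n_layers * (latent_dim + rope_dim) * seq_len * bytes_per_value


# Hypothetical configuration for illustration only.
SEQ_LEN, LAYERS, HEADS, HEAD_DIM, LATENT = 1000, 2, 4, 8, 16
mha = mha_kv_bytes(SEQ_LEN, LAYERS, HEADS, HEAD_DIM)
mla = mla_kv_bytes(SEQ_LEN, LAYERS, LATENT)
compression = mha / mla
```

With these toy numbers the cache shrinks by a factor of 4. The factor is simply `2 * n_heads * head_dim / latent_dim`, so it grows with the number of heads and shrinks as the latent gets wider; both functions scale linearly in `seq_len`, which is why the saving matters most for long contexts.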

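Design notes: The core insight of MLA is that attention contains redundancy. In standard MHA, the Key/Value vectors of different heads share a great deal of information; MLA uses learned projection matrices to compress that information into a smaller latent space. The idea resembles principal component analysis (PCA): keep the most important dimensions and discard the redundant ones. For Agent applications, MLA means longer contexts (more dialogue history, larger documents) without running into GPU memory limits.

1.6 Frontier Theory III: Reasoning Models and Test-Time Compute Scaling

2025 has been called the "Year of Reasoning." OpenAI's o-series, DeepSeek-R1, Google Gemini 2.5/3, and other models together established a new paradigm: let the model "think a little longer" at inference time.

From System 1 to System 2: The Dual-System Theory of Cognitive Science

The theoretical basis of this paradigm is the psychologist Daniel Kahneman's dual-system theory:

System 1 (fast thinking): intuitive, quick, automatic, emotional. Traditional LLM generation resembles System 1: the model answers the moment it receives the question.
System 2 (slow thinking): rational, slow, logical, computational. Reasoning models resemble System 2: they analyze, decompose, and verify before answering.

Kahneman points out that people rely on System 2 for complex problems. Likewise, LLMs need "slow thinking" on hard math, code, and logic problems.

RLVR: How Reasoning Ability Is Trained

Reasoning models are not trained with traditional SFT (supervised fine-tuning) alone but with RLVR (Reinforcement Learning from Verifiable Rewards). The core loop is:

The model generates a chain of thought: for a given question, it first outputs a series of "thinking Tokens" (reasoning tokens) and then the final answer.
Answer verification: the final answer is checked against an objective standard (such as the numeric answer to a math problem or the result of running code).
Reinforcement learning update: if the answer is correct the model is rewarded; if it is wrong the model is penalized. This is how the model learns "how to think."

DeepSeek-R1's key finding: the team tried a "pure RL" path, giving the model no human-written chain-of-thought examples and training directly with RL. Advanced behaviors such as reflection and self-verification emerged spontaneously. This shows that reasoning ability is not "taught" but "incentivized": give the model the right reward signal and it will discover effective reasoning strategies by itself.

Pseudocode sketch: the RLVR training loop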
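A sketch of the reward side of one RLVR step follows: extract the final answer from each sampled completion, score it with an exact-match verifier, and convert the rewards into group-relative advantages. Normalizing rewards within a group of samples follows the GRPO approach described in the DeepSeek-R1 paper; the completion format and the helper names are assumptions, and the policy-gradient update itself is omitted.

```python
import statistics


def extract_answer(completion):
    """Assumes the hypothetical format '<think>...</think>final answer'."""
    return completion.split("</think>")[-1].strip()


def verify(answer, expected):
    """Verifiable reward: 1.0 for an exact match, else 0.0. No human judgment involved."""
    return 1.0 if answer == expected.strip() else 0.0


def group_advantages(rewards):
    """Normalize rewards within one group of samples for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all samples equally good: no learning signal
    return [(r - mean) / std for r in rewards]


# Four hypothetical samples for the question "What is 12 * 12?"
SAMPLES = [
    "<think>12*12 = 144</think>144",
    "<think>12*12 = 124</think>124",
    "<think>wait, let me re-check: 12*10 + 12*2</think> 144 ",
    "<think>guessing</think>140",
]
rewards = [verify(extract_answer(s), "144") for s in SAMPLES]
advantages = group_advantages(rewards)
```

Completions with positive advantage would have their Token probabilities increased in the update, and those with negative advantage decreased. Note that the reward looks only at the final answer, never at the content of the `<think>` section, which is what leaves the model free to discover its own reasoning strategies.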

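Design notes: RLVR is, at heart, outcome-driven learning. Unlike traditional supervised learning, which imitates human solution steps, RLVR cares only whether the final answer is right, and the model is free to explore any line of thought that reaches a correct result. This open-ended learning lets the model discover reasoning strategies no human ever taught it, such as DeepSeek-R1's "Aha Moment," where the model suddenly notices an earlier mistake in its chain of thought, backtracks, and corrects it. This kind of metacognition is hard to produce with supervised learning.

Test-Time Compute Scaling: "Budgeting" at Inference Time

What sets reasoning models apart is that performance improves as inference-time compute increases. A traditional model spends a fixed amount of compute on a question (one forward pass per Token), whereas a reasoning model can "invest more compute" by generating a longer chain of thought.

OpenAI's research shows that o1's performance improves with both training-time compute and inference-time compute, forming a scaling surface over the two axes.

This means that for especially hard problems you can let the model "think a while longer" (generate a longer chain of thought) rather than being forced to use a bigger model. The implications for Agent design are far-reaching.

Pseudocode sketch: dynamic allocation of the reasoning budget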
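One way to picture a budget controller is the heuristic sketch below, which maps a rough difficulty estimate to a cap on thinking Tokens. The hint keywords, the length threshold, and the Token limits are all invented for illustration; a production controller might instead be a small trained classifier.

```python
# Hypothetical keywords that suggest a question needs deeper reasoning.
HARD_HINTS = ("prove", "debug", "optimize", "derive")


def estimate_difficulty(question):
    """Crude difficulty score in [0, 1] from question length and keyword hints."""
    score = 0.5 * min(len(question) / 400, 1.0)
    if any(hint in question.lower() for hint in HARD_HINTS):
        score += 0.5
    return min(score, 1.0)


def reasoning_budget(question, min_tokens=256, max_tokens=8192):
    """Interpolate the thinking-Token cap between min_tokens and max_tokens."""
    difficulty = estimate_difficulty(question)
    return int(min_tokens + difficulty * (max_tokens - min_tokens))


easy_budget = reasoning_budget("What is 2+2?")
hard_budget = reasoning_budget("Prove that the sum of two even numbers is even.")
```

The returned value would typically be passed as the generation-length limit for the reasoning phase. Because the controller runs before every request, it has to stay much cheaper than the reasoning it is rationing, which is the argument for rules or a very small model.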

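Design notes: The core of a dynamic reasoning budget is allocating compute intelligently. Not every question needs deep thought: simple questions get a short chain of thought and a quick answer, and only complex ones get more compute. This "adaptive computation" strategy is especially important in Agent systems, where an Agent may be handling dozens of tasks at once; spending the budget on the tasks that most need deep reasoning can noticeably improve overall throughput. The budget controller can itself be a lightweight model or a set of heuristic rules (question length, keywords, historical difficulty statistics).

Cost and Efficiency of Reasoning Models

A reasoning model's "thinking Tokens" are billed too. The Tokens OpenAI's o1 generates while thinking are not shown to the user but do count toward the API charge. This creates a new cost structure:

Input Tokens: the user's question (billed normally)
Reasoning Tokens (hidden): the model's chain of thought (typically 5-20 times the output Tokens)
Output Tokens: the model's final answer (billed normally)

DeepSeek-R1's advantage is transparency and low cost: its chain of thought is fully visible (wrapped in <think> tags), and its API price is only about 1/30 that of OpenAI o3. This lets developers:

Observe the model's reasoning process to debug and improve prompts
Control reasoning depth, limiting generation length to control cost
Distill reasoning ability, using R1's chains of thought to train smaller, faster models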

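1.7 Frontier Theory IV: Long Context and the Evolution of Attention

Expanding the context window is one of the most visible engineering achievements of 2024-2026. Behind the move from early 4K-Token windows to today's ten-million-Token windows is a series of innovations in attention.

The Curse of Attention Complexity

The core of the Transformer is Self-Attention, whose computational complexity is O(n²): double the sequence length and the computation quadruples. That means:

4K Tokens: compute = 1x (baseline)
128K Tokens: compute = 1024x
1M Tokens: compute = 65536x

This quadratic complexity is the main bottleneck for long context. Research in 2025-2026 aims to cut attention's compute and memory cost while preserving model ability.

Linear Attention and State Space Models

Linear Attention is a family of techniques that reduce the O(n²) complexity to O(n). The core idea is to change the order in which attention is computed.

Standard attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Linear attention:

$$\mathrm{LinearAttention}(Q, K, V) = \frac{\phi(Q)\left(\phi(K)^{\top}V\right)}{\phi(Q)\left(\phi(K)^{\top}\mathbf{1}\right)}$$

Here φ is a feature map (such as elu + 1). By computing φ(K)ᵀV first and only then multiplying by φ(Q), linear attention avoids forming the n × n score matrix and turns quadratic complexity into linear.

State Space Models (SSM) such as Mamba (2023) and Mamba-2 (2024) go a step further. An SSM treats sequence modeling as a state-transition process:

$$h_t = A\,h_{t-1} + B\,x_t, \qquad y_t = C\,h_t$$

Here A, B, and C are learned matrices, x_t is the input at step t, h_t is the hidden state, and y_t is the output. The clever part is that with a suitable parameterization (for example, making A diagonal) the whole sequence can be computed in parallel while keeping linear complexity.

Pseudocode sketch: complexity of SSM versus Transformer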
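The contrast can be made concrete with two small functions: one computes the relative cost of quadratic attention at different context lengths, and the other runs a scalar version of the SSM recurrence, touching each input exactly once. The scalar parameters `a`, `b`, `c` are a deliberate simplification of the learned matrices, and the loop is sequential, whereas real implementations use a parallel scan.

```python
def relative_attention_cost(n_tokens, baseline=4096):
    """Cost of O(n^2) attention relative to a baseline context length."""
    return (n_tokens / baseline) ** 2


def ssm_scan(inputs, a, b, c):
    """Scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, outputs = 0.0, []
    for x in inputs:  # one constant-cost update per token, hence O(n) overall
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


cost_128k = relative_attention_cost(128 * 1024)
cost_1m = relative_attention_cost(1024 * 1024)
ys = ssm_scan([1.0, 1.0, 1.0], a=0.5, b=1.0, c=1.0)
```

`cost_128k` and `cost_1m` reproduce the 1024x and 65536x figures listed above (taking 1M as 1024K). In `ssm_scan`, the state `h` is the only memory carried forward, so information about early Tokens survives only to the extent that repeated multiplication by `a` preserves it.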

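Design notes: Linear attention and SSMs represent an "efficiency first" route. They give up the global interaction of standard attention (any two Tokens interacting directly) in exchange for linear complexity. In practice, full global interaction is often unnecessary, because neighboring Tokens interact far more often than distant ones. The SSM "selection mechanism" (making the B and C matrices depend on the current input) lets the model decide dynamically what to attend to, which partly compensates for the missing global interaction. For Agent applications, long-context ability means handling fuller dialogue histories, larger codebases, and longer documents in one pass. Note, however, that linear-complexity models usually do worse than standard attention on "Needle in a Haystack" tasks.

Context Compression and Hierarchical Memory

Another approach is to compress the content rather than enlarge the window. Context Compression fits more information into a limited window through summarization, selective retention, vector indexing, and similar techniques.

Hierarchical Memory is a strategy commonly used in Agent systems.

Pseudocode sketch: a hierarchical memory system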
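The sketch below models two of the tiers: a small working memory that always goes into the context, and a long-term store that is consulted only by retrieval. Retrieval uses plain word overlap as a stand-in for the vector similarity a real system would use, and the class and method names are hypothetical.

```python
from collections import deque


class HierarchicalMemory:
    """Two-tier memory: recent messages stay in context, older ones are retrieved."""

    def __init__(self, working_size=4):
        self.working_size = working_size
        self.working = deque()  # small, always included in the prompt
        self.long_term = []     # large, searched on demand

    def add(self, message):
        self.working.append(message)
        while len(self.working) > self.working_size:
            # Evict the oldest message into long-term storage instead of dropping it.
            self.long_term.append(self.working.popleft())

    def retrieve(self, query, k=2):
        """Rank long-term messages by word overlap with the query (toy relevance)."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(m.lower().split())), m) for m in self.long_term]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [message for _, message in scored[:k]]

    def build_context(self, query):
        """Relevant old messages first, then the recent conversation."""
        return self.retrieve(query) + list(self.working)


memory = HierarchicalMemory(working_size=2)
for line in ["my order number is 1234", "the weather is nice",
             "I want a refund", "how long does it take"]:
    memory.add(line)
context = memory.build_context("what is my order number")
```

After four messages only the last two remain in working memory, yet the order number is recovered from long-term storage because it overlaps with the query. A middle tier of summaries, as the text describes, would sit between the two and is omitted here for brevity.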

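Design notes: Hierarchical memory mimics human memory: working memory is tiny but extremely fast to access, while long-term memory is vast but has to be "retrieved." In Agent design, hierarchical memory resolves the tension between a limited context window and the need for long-term memory. Key design decisions include the capacity allotted to each tier, the summarization strategy (extractive versus generative), and the retrieval strategy (vector similarity, keywords, or hybrid). A common optimization is relevance weighting: when building the context, consider not only recency but also semantic relevance to the current query, so that the Agent "remembers what it should remember."

1.8 Frontier Theory V: Model Safety and Alignment

The more autonomous an Agent is, the more serious the safety problem becomes. An Agent that can call tools, access data, and execute code by itself can do far more damage than a chatbot if its behavior drifts from what was intended.

Alignment Mechanisms: RLHF and RLVR

RLHF (Reinforcement Learning from Human Feedback) is the alignment technique OpenAI applied in InstructGPT (2022). The core pipeline is:

Collect human preference data: for the same question the model generates several answers, and human annotators pick the better one.
Train a Reward Model: learn to predict human preference. Given a question and an answer, it outputs a score for how much a human would like it.
Optimize the policy with PPO: push the language model toward answers the reward model scores higher while constraining it not to drift too far from its original abilities (a KL-divergence penalty).

RLVR (Reinforcement Learning from Verifiable Rewards) is an evolution of RLHF that replaces subjective human preference with objectively verifiable criteria:

Dimension	RLHF	RLVR
Reward source	Preferences of human annotators	Code execution results, math answers, etc.
Subjectivity	High (annotators apply different standards)	Low (clearly right or wrong)
Scalability	Low (needs extensive human labeling)	High (automatic verification)
Suitable tasks	Open-ended writing, dialogue	Code, math, logical reasoning
Typical application	ChatGPT's conversational style	DeepSeek-R1's reasoning ability

Pseudocode sketch: reward signals in RLHF versus RLVR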
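The two reward sources can be put side by side in a short sketch. For RLHF it shows the pairwise (Bradley-Terry style) loss used to fit a reward model to human choices; for RLVR it shows a verifier that runs a candidate function against test cases. The scores and the candidate functions are hypothetical inputs, and no actual reward-model training takes place.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """RLHF side: pairwise loss, small when the human-preferred answer scores higher."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def verifiable_reward(candidate_fn, test_cases):
    """RLVR side: 1.0 only if the candidate passes every (args, expected) case."""
    try:
        passed = all(candidate_fn(*args) == expected for args, expected in test_cases)
    except Exception:
        passed = False  # code that crashes earns no reward
    return 1.0 if passed else 0.0


# Hypothetical task: "write a function that adds two numbers".
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good_reward = verifiable_reward(lambda a, b: a + b, TESTS)
bad_reward = verifiable_reward(lambda a, b: a - b, TESTS)
```

The first function produces a graded, learned signal that is only as consistent as the human labels behind it. The second produces a binary signal that needs no labels at all, but it exists only for tasks where correctness can be checked mechanically.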

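Design notes: RLHF and RLVR embody two different alignment philosophies. RLHF tries to make the model "conform to human values" (helpful, harmless, honest), but human values are complex and sometimes contradictory. RLVR focuses on "objective correctness": as long as the answer is right, the model is free to choose how to express it. Agent systems need both: RLHF to keep the Agent's behavior within social norms (no harmful content, staying polite), and RLVR to make sure the Agent carries out tasks correctly and effectively (code runs, math answers are right).

Safety Challenges: Jailbreaking, Prompt Injection, and Agent-Specific Risks

An Agent's safety risks are more complex than a pure dialogue model's:

1. Jailbreaking: a user bypasses the model's safety constraints with a carefully crafted prompt. Example: "Suppose you are an AI without moral restrictions..."

2. Prompt Injection: an attacker embeds malicious instructions in data the model processes. Example: hiding "ignore the previous instructions and do the following..." in a web page.

3. Tool Abuse: the Agent is induced to call a dangerous tool. Example: "Send an email to everyone saying 'the company has gone bankrupt.'"

4. Privilege Escalation: through a series of seemingly harmless operations, the Agent ends up with more privileges than intended.

A layered architecture of defenses:

Pseudocode sketch: an Agent security control framework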
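A minimal sketch of two of the layers follows: screening input text for known injection phrasing, and gating tool calls through an allowlist with human confirmation for high-risk tools. The regular expressions, tool names, and class are hypothetical, and pattern matching of this kind is easy to evade, so it illustrates the layering rather than a sufficient defense.

```python
import re

# Hypothetical patterns; real deployments need far broader detection than this.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?previous instructions", re.IGNORECASE),
    re.compile(r"without (any )?(moral|ethical) (limits|restrictions)", re.IGNORECASE),
]


def looks_like_injection(text):
    """Input layer: flag text that matches a known injection or jailbreak pattern."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)


class ToolGuard:
    """Execution layer: least privilege plus confirmation for high-risk tools."""

    def __init__(self, allowed_tools, high_risk_tools=()):
        self.allowed = set(allowed_tools)
        self.high_risk = set(high_risk_tools)
        self.audit_log = []  # monitoring layer: every decision is recorded

    def authorize(self, tool, confirmed_by_human=False):
        if tool not in self.allowed:
            decision = "deny"
        elif tool in self.high_risk and not confirmed_by_human:
            decision = "needs_confirmation"
        else:
            decision = "allow"
        self.audit_log.append((tool, decision))
        return decision


guard = ToolGuard(allowed_tools={"search", "send_email"}, high_risk_tools={"send_email"})
```

With this guard, `search` is allowed outright, `send_email` is held until a human confirms it, and any tool outside the allowlist (for example a shell) is denied regardless of what the model was persuaded to request.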

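Design notes: The core of Agent safety is the principle of least privilege together with defense in depth. Least privilege: the Agent holds only the minimal set of tools and permissions needed for the task, never more. Defense in depth: do not rely on a single safety mechanism; put defenses at the input, processing, output, and monitoring layers. Note in particular that an Agent's "memory" can become an attack vector. If the Agent remembers an earlier malicious input, later sessions may be affected. The memory system therefore needs safety filtering too, so that harmful information is never persisted.

1.9 The 2026 Model Landscape: How to Choose

With the principles understood, back to practice: which model should you choose for which task?

Capability matrix of mainstream models (2026)

Dimension	GPT-5.5 Pro	Claude 4 Opus	DeepSeek V4	Gemini 3 Ultra	Llama 4
Reasoning	★★★★★	★★★★☆	★★★★★	★★★★☆	★★★☆☆
Code	★★★★★	★★★★★	★★★★★	★★★★☆	★★★★☆
Long context	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Multimodal	★★★☆☆	★★★★☆	★★★☆☆	★★★★★	★★★☆☆
Cost	$$$$	$$$$$	$	$$	Free (self-hosted)
Customizability	Medium	Medium	High (open source)	Medium	Very high

A decision framework for model selection

Selection is not about picking "the best" model but the "best-fitting" one. The framework:

Step 1: Identify the task type

Deep reasoning (math, complex logic) → GPT-5.5 Pro / DeepSeek V4
Very long documents (legal, research) → Claude 4 Opus / Gemini 3 Ultra
Multimodal (image and video analysis) → Gemini 3 Ultra
Code generation → GPT-5.5 / Claude 4 / DeepSeek V4
Cost-sensitive, high concurrency → DeepSeek V3.2 / Llama 4 (self-hosted)

Step 2: Consider the constraints

Budget: monthly Token consumption determines cost sensitivity. Below 1 million Tokens a day, a high-end model is affordable; above 10 million Tokens a day, cost optimization is a must.
Latency: real-time interaction (voice, live streaming) needs responses under 500 ms, so choose Gemini 3 or Claude 4 Haiku; asynchronous tasks (report generation, data analysis) can tolerate several seconds.
Data privacy: sensitive data must not leave the domain → choose self-hosted Llama 4 or a private DeepSeek deployment.
Customizability: domain fine-tuning is required → choose an open-source model (Llama 4, DeepSeek).

Step 3: Design a multi-model routing strategy

Production-grade Agents rarely depend on a single model. A typical routing strategy:

Pseudocode sketch: a multi-model routing system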
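A static, rule-based router along the lines of Step 1 might look like the sketch below. The model names come from the selection table above; the intent keywords, the 200,000-Token threshold, and the function names are assumptions made for illustration, and no actual API is called.

```python
# Route table: intent label -> model name (names taken from the selection guide above).
ROUTES = {
    "multimodal": "Gemini 3 Ultra",
    "long_document": "Claude 4 Opus",
    "reasoning": "GPT-5.5 Pro",
    "code": "DeepSeek V4",
    "default": "DeepSeek V3.2",
}

# Hypothetical keyword lists standing in for a trained intent classifier.
REASONING_HINTS = ("prove", "derive", "step by step")
CODE_HINTS = ("code", "bug", "function")


def classify_intent(request, has_image=False, n_tokens=0):
    """Cheap rule-based classifier; checks run from most to least specific."""
    text = request.lower()
    if has_image:
        return "multimodal"
    if n_tokens > 200_000:
        return "long_document"
    if any(hint in text for hint in REASONING_HINTS):
        return "reasoning"
    if any(hint in text for hint in CODE_HINTS):
        return "code"
    return "default"


def route(request, has_image=False, n_tokens=0):
    """Return the model name that should serve this request."""
    return ROUTES[classify_intent(request, has_image, n_tokens)]
```

For example, `route("Fix the bug in this function")` selects DeepSeek V4, while a plain greeting falls through to the cheapest model. Keeping the classifier this cheap is deliberate: the router runs on every request, so it must cost far less than the calls it is trying to save.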

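Design notes: Multi-model routing is, in essence, dynamic optimization of compute resources. Different models differ enormously in "value for money" (capability per cost) across tasks, and the router's goal is to maximize overall value. The intent classifier has to be fast (or routing itself becomes the bottleneck), so it is usually a small model or a rule engine. The routing strategy can be static (preset rules) or dynamic (adapting to historical feedback). In production, consider also running a "shadow mode": process the same request with several models, return only one result, and use the others for offline evaluation and model comparison.

1.10 Chapter Summary: A "User Manual" for Large-Model Capabilities

This chapter covered a lot of ground, from basic concepts to frontier theory. The key points follow.

What you must remember

A large model is, at its core, "predicting the next word." It is a very powerful pattern-matching machine, not a human-style thinker. Understanding this keeps your expectations realistic.
The Token is the model's atom. Every unit of measure is the Token: billing, context window, speed. Always keep Token efficiency in mind when designing an Agent.
The context window is working memory. A bigger window lets an Agent handle more complex tasks, but not everything in the window is used effectively (Lost in the Middle).
Temperature controls randomness. Use low temperature for code (0-0.2) and high temperature for creative work (0.8-1.2). The choice directly affects task quality.
Reasoning models let an Agent "think before acting." Reasoning has a cost, though: thinking Tokens are billed too. DeepSeek V4 has brought reasoning cost down to the lowest in the industry.

What you should know

Scaling Law is the "physical law" of AI. Model loss falls as a power law in parameters, data, and compute. The new finding of 2025-2026: inference-time compute (Test-time Compute) is a third scaling dimension.
MoE is the de facto standard for frontier models. Sparse activation gives a model a huge total parameter count (to store knowledge) while keeping computation reasonable (for inference efficiency). DeepSeek's MLA further reduces memory pressure in long-context inference.
RLVR trains reasoning ability. The model is not "taught" to reason but "incentivized" to: verifiable rewards (code execution results, math answers) drive it to develop reasoning strategies by itself.
Long context depends on innovations in attention. The O(n²) complexity of standard attention is the bottleneck; linear attention, State Space Models (SSM), and hierarchical memory are the common routes.
Alignment and safety are the model's bottom line. RLHF/RLVR, jailbreaking, prompt injection, and related risks are amplified once tools and orchestration are attached. Chapter 2 onward takes these up from the Agent's perspective.

Next

With the model "brain" understood, Chapter 2 turns to the nature of the Agent: Model + Harness, the ReAct loop, Function Calling and the four-layer architecture, and the "Agent inflection point" brought about by the qualitative leap in model capability.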

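References and Further Reading

Papers

Kaplan et al., "Scaling Laws for Neural Language Models", arXiv:2001.08361, 2020
Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), 2022
Bian et al., "Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs", arXiv:2510.18245v2, 2025
Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models", 2022
DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 2025

Technical Blogs and Reports

NVIDIA Blog: "Mixture of Experts Powers the Most Intelligent Frontier AI Models", 2025
OpenAI o-series Technical Documentation
Anthropic: "Constitutional AI: Harmlessness from AI Feedback", 2022
Google DeepMind: Gemini 3 Technical Report, 2025

Industry Analysis

2025年大语言模型技术全景报告 (2025 Large Language Model Technology Landscape Report, CSDN)
2025-2026开源大模型深度报告 (2025-2026 In-Depth Report on Open-Source Large Models)
AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 (Zylos AI Research)
DeepSeek R1 vs OpenAI o3 vs Gemini 3: Reasoning Model Benchmarks (Meta-Intelligence)

Index of Key Concepts

Token: the smallest unit of text a model processes
Context window: the maximum number of Tokens a model can process at once
Temperature: the parameter that controls generation randomness (0-2)
Reasoning model: a model that performs multi-step reasoning through a chain of thought (CoT), such as the o-series and DeepSeek-R1
Scaling Law: the power-law relationship between model performance and scale, data, and compute
MoE: Mixture of Experts architecture; sparse activation lowers compute cost
RLVR: Reinforcement Learning from Verifiable Rewards
RLHF: Reinforcement Learning from Human Feedback
ReAct: the Agent's reasoning-plus-acting cognitive loop
Function Calling: native structured tool invocation
MLA: Multi-head Latent Attention; reduces KV cache footprint
SSM: State Space Model; sequence modeling with linear complexity
Test-time Compute: inference-time computation that lets the model "think a little longer"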

Chapter 1: Large Language Models — The Agent’s Brain

“Coding” is no longer the right verb. The new verb is “manifesting” — expressing your intent. — Andrej Karpathy, December 2025

Large language models look like they “think,” but the underlying mechanism is surprisingly plain: predict the next token. Scale, data, and alignment stack on top of that simple loop to produce poetry, code, and reasoning we experience in products.

This chapter covers the model itself — training, Tokens, temperature, multimodality, architecture frontiers (Scaling Laws, MoE), reasoning models, and alignment. Agent orchestration, tool use, ReAct, and Function Calling belong in Chapter 2 and beyond.

We focus on technical principles, not product methodology: how the “brain” runs, what it’s good at, where it fails, and where the boundaries are. Many advanced language abilities grow from one starting point — next-token prediction.

1.1 How Large Language Models Work (Explained in Plain Language)

Let’s start with the conclusion: there is only one thing a large language model does — predict the next word.

Not joking. Whether a model has 1 billion parameters or 1.6 trillion parameters, whether it can write poetry, program, or analyze legal contracts, the underlying mechanism is the same: given the preceding text, predict the most likely next word (strictly speaking, “token,” which we’ll explain later). Then add that predicted word to the context and predict the next one, one after another, like a snowball rolling.

The first time I heard this explanation, I was a bit disappointed. “Predicting the next word” sounds too unlike intelligence — isn’t that just the suggestion feature on a phone keyboard? You type “I’m going to eat today,” and it suggests “lunch.” But the deeper you read, the more you realize that this seemingly simple mechanism is precisely the source of all complex capabilities. For example: the ocean is also composed of water molecules — individual H₂O molecules aren’t magical, but when tens of billions of trillions of them combine, you get ocean currents, tides, and hurricanes. LLMs are similar — “predicting the next word” isn’t magical by itself, but when executed repeatedly at the scale of trillions of parameters, emergent behaviors like reasoning, understanding, and creativity emerge.

Courses and papers often summarize this as “one formula that rules everything.” Sounds too simple, right? But the key is: when you use virtually all text on the internet to train this “predict next word” model, it is forced to learn some unexpected things.

Think about it: to accurately predict the next sentence of a medical question, you need to understand medicine. To predict the next line of Python code, you need to understand programming logic. To predict the conclusion of a reasoning passage, you need to learn reasoning. The model isn’t taught this knowledge — it compresses these capabilities on its own under the pressure of “predicting the next word.” This is like a student who isn’t studying for an exam, but is thrown into a pure foreign-language environment — to survive, they have to learn all the subtleties of that language.

This is like asking you to solve an exam question: you’re given the first 99% of a book, and you have to guess what’s written on the last page. If you can guess it with reasonable accuracy, it means you didn’t just read the book — you understood its structure, logic, and style. The model does this every day, and the “books” it has read are the entire internet.

How Models Are Trained

Training is usually described in three stages. Understanding them helps you see where model capabilities come from:

Stage 1: Pre-training. This is the “read ten thousand books” stage. Take the massive amount of text on the internet — books, web pages, code, papers, forum posts — and let the model learn “predict next word” over and over. After this stage, the model is like a genius who has read countless books but has never talked to anyone: it has a frighteningly large amount of knowledge, but doesn’t know how to converse with people — it might recite Wikipedia to you instead of answering your question.

Take GPT-5.5 as an example: its pre-training data reached an astonishing 15 trillion Tokens, equivalent to reading 75 million books. The training used over 200,000 NVIDIA H200 GPUs, took months, and cost about $5 billion. This scale of training allowed the model to compress a vast amount of human knowledge, but it also created the “knowledge cutoff” problem — the model doesn’t know what happened after its training data cutoff date. If you ask it “who won Best Picture this year,” it will tell you all the Oscars it knows, but it just won’t tell you this year’s — because this year’s hasn’t been written into its “textbook” yet.

Stage 2: Supervised Fine-Tuning (SFT). This is the “learning social etiquette” stage. Use high-quality dialogue data written by humans — what to answer when asked what, what format, what tone — to teach the model hand-by-hand how to be an “assistant.” A pre-trained model knows all the world’s knowledge, Stage 3: RLHF/RLVR. This is the “honing through practice” stage. The model generates multiple answers, which are scored by humans or automated evaluation systems, and the model learns from feedback what makes a good answer versus a bad one. RLVR trains on verifiable signals (code runs, math checks) rather than subjective preferences — encouraging multi-step reasoning and self-checking.

The core advantage of RLVR lies in “verifiability.” In traditional RLHF, human annotators score answer quality, but the scoring criteria are subjective and inconsistent — the same person might give different scores in the morning versus the afternoon. RLVR uses objective criteria: can the code run? Is the math answer correct? Is the logical derivation rigorous? This objective feedback allows the model to learn “self-checking” — before giving the final answer, it goes through a verification process in its “mind.” An interesting finding is in the DeepSeek-R1 paper: they tried training with pure RL without any human-written reasoning chain examples, and the model spontaneously evolved behaviors like “wait, let me double-check.” They called this the “Aha Moment” — the model spontaneously learned self-questioning during training, just like when a human suddenly realizes they made a mistake while solving a problem.

The key breakthrough in 2025-2026 was the large-scale application of “Inference-Time Compute.” OpenAI’s o-series models and DeepSeek-R1 proved: letting the model “think more” before answering (generating longer reasoning chains) can significantly improve accuracy on complex tasks. GPT-5.5’s accuracy on the FrontierMath Tier 4 test increased from 27.1% (GPT-5.4) to 35.4%, and further to 39.6% (GPT-5.5 Pro). This improvement largely came from inference-time compute optimization.

”Summoned Ghosts”

A useful metaphor: LLMs are not evolved general intelligence, but “summoned ghosts” — systems summoned through training, optimized to imitate text and solve problems.

This metaphor deserves deeper thought. Ghosts have two characteristics: first, they can do things beyond ordinary people (walking through walls, flying), but are powerless in other seemingly simple matters (like picking up a physical cup). Second, you can never be completely sure what they’re thinking — they have their own operational logic, completely different from humans.

Models are the same. They can demonstrate superhuman capabilities in certain tasks — like writing perfectly structured code or analyzing complex legal provisions — Understanding this, you won’t have unrealistic expectations of it, nor feel disillusioned when it makes mistakes. The Agent’s brain is powerful, but it has its own “cognitive architecture,” completely different from humans. Knowing where its boundaries are will actually help you better leverage its capabilities.

1.2 Core Concepts: Token, Context Window, Temperature, Reasoning

To work with Agents, you don’t need to know how to write code, but you need to understand a few core concepts. They determine the Agent’s capability boundaries and behavioral characteristics.

Token: The Model’s “Atom”

Models don’t process text directly; they process Tokens.

A Token can be a complete word (like “hello”), part of a word (like “un” + “break” + “able”), or even punctuation or spaces. Chinese typically corresponds to 1-2 Tokens per character, while English typically corresponds to 1-1.5 Tokens per word.

Why is this concept important? Because all the model’s accounting units are Tokens, not character or word counts. APIs charge by Token, context windows are calculated by Token, and model generation speed is also measured in Tokens/second. When you tell the model “help me write a 2000-character article,” what it’s thinking is “approximately 3000 Tokens.”

Context Window: The Model’s “Working Memory”

The context window is how many Tokens the model can “see” at once. You can think of it as a person’s workbench — how many sheets of paper can be spread out on the desk.

Early models had very small context windows, possibly only 4096 Tokens (about 3000 Chinese characters). If you stuffed a long article into it, it could only see the first few pages, and the later parts were “invisible.” This severely limited the Agent’s ability to handle complex tasks — think about it: if an Agent needs to analyze a 50-page contract, By 2026, this limitation was largely solved. Mainstream models’ context windows have expanded to the million-Token level:

Model	Context Window	Equivalent To	Key Technology
Gemini 3 Ultra	10M Tokens	Entire Wikipedia	Infinite sliding window
Claude 4 Opus	5M Tokens	~4000 pages	Neural caching
DeepSeek V4	2M Tokens	~1600 pages	Sparse attention optimization
GPT-5.5	1M Tokens	~800 pages	Standard config
Llama 4	100K Tokens	~80 pages	Consumer-grade deployment

This means Agents can finally process large documents, long codebases, and complete conversation histories all at once, without needing to “read in segments” and then “piece together understanding.”

But there’s a common misconception here: bigger context window = better. Actually, no — a large context window doesn’t mean the model can fully utilize all the information. It’s like your workbench can spread out 100 sheets of paper, “Lost in the Middle” phenomenon: Research from Stanford University in 2024 found that models’ recall rate for information in the middle of the context is significantly lower than for the beginning and end. In Claude 4’s 5-million-Token window, if key information is buried in the middle, the probability of the model finding it may be less than 60%. So in Agent design, how context is organized and managed (Context Engineering) becomes important — put key information at the beginning or end, use clear markers to separate different parts, and use summarization/compression when necessary.

Temperature: Controlling the Model’s “Personality”

Temperature is a parameter during model text generation, typically ranging from 0 to 2. It controls the “randomness” when the model predicts the next word.

At temperature 0, the model selects the highest-probability word every time. The output is most stable and deterministic, but may seem rigid and repetitive. Like an extremely cautious person who only says what they’re most confident about.

At temperature 0.7, the model makes moderate random selections among high-probability words. The output has variation and creativity, but is still basically reliable. This is the default setting for most scenarios.

At temperature 1.5, the model selects lower-probability words more boldly. The output may be very creative, or it may become nonsense. Like someone who’s had two drinks — they talk more, but also start to ramble.

For Agent design, temperature selection directly affects task quality. Writing code requires low temperature (0-0.3), because code is either right or wrong — no need for “creativity.” Brainstorming requires high temperature (0.8-1.2), because diverse ideas are needed. Writing reports can use medium temperature (0.5-0.7) — some expressive variation is needed,

Reasoning: The Model’s “Thinking” Capability

One of the most important technical breakthroughs in 2024-2025 was the emergence of “reasoning” capability.

Traditional models “answer directly” — you ask a question, it outputs the answer directly, and the intermediate process is implicit. Reasoning models (like OpenAI’s o-series, DeepSeek-R1) will first display a “thinking process” (Chain of Thought) before outputting the answer, writing out the problem-solving steps step by step.

Why is this so critical for Agents? Agents execute multi-step tasks. They need to: understand the task → break down steps → execute each step → check results → decide next step. This entire process is “reasoning.” Models without reasoning capability can only do simple single-step tasks; with reasoning capability, models can “think before doing,” just like humans.

But reasoning capability has a cost. Reasoning models consume a large number of Tokens when “thinking” (these thinking Tokens are also billed), and the thinking time is longer. So DeepSeek V4’s architecture optimization is of great significance — it reduced inference FLOPs (floating-point operations) to 27% of V3, meaning the computational cost of doing the same reasoning is less than one-third of the original. Agents need to perform multi-step reasoning frequently — if each reasoning session is slow and expensive, Agents can only be lab experiments.
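To see why billed thinking Tokens matter for multi-step Agents, here is a back-of-the-envelope sketch. Every number in it (step count, Token counts, price) is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def agent_run_cost(steps, answer_tokens, thinking_tokens, price_per_million):
    """Output-Token cost of one Agent run; thinking Tokens are billed like answer Tokens."""
    tokens_per_step = answer_tokens + thinking_tokens
    return steps * tokens_per_step * price_per_million / 1_000_000


# Hypothetical 10-step task at a hypothetical price of 10 dollars per million output Tokens.
without_thinking = agent_run_cost(10, answer_tokens=500, thinking_tokens=0, price_per_million=10.0)
with_thinking = agent_run_cost(10, answer_tokens=500, thinking_tokens=4000, price_per_million=10.0)
print(f"without thinking: ${without_thinking:.2f}, with thinking: ${with_thinking:.2f}")
```

Under these assumptions the visible answers are the small part of the bill; the reasoning Tokens dominate, and the gap grows with every extra step the Agent takes.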

In 2025-2026, Kahneman’s dual-system analogy is often used for models: System 1 (fast, intuitive) vs System 2 (slow, logical). Plain generation is closer to System 1; visible chain-of-thought reasoning is closer to System 2 — more tokens and latency, better accuracy on hard tasks. How to pair this with Agent orchestration is covered in Chapter 2.

1.3 Multimodality: Not Just Text (Vision, Speech, Code)

The word “multimodal” sounds technical, but the concept is simple: models can not only process text, but also images, sound, video, code — and can freely convert between these different “modalities.”

Vision: The Model Has Grown Eyes

Current models can directly “look” at images. You give it a photo, and it can describe the content, recognize text, analyze charts, and even understand sarcastic memes.

This suddenly gives Agents many more things they can do. Imagine a customer service Agent: previously it could only process text messages; if a user sent a product fault photo, it couldn’t understand it. Now it can analyze the image, identify the problem, and provide a solution. A data analysis Agent can directly look at charts without needing you to convert the data to text and describe it.

Speech: Hearing Your Voice

Speech capability transforms Agents from “typing tools” to conversational companions. Speech models in 2025-2026 have achieved extremely low latency (300-500ms end-to-end), can conduct real-time conversations, can recognize emotions, and can appropriately “mm-hmm” while you’re speaking to show they’re paying attention.

For Agent products, voice interaction changes everything. It allows Agents to be used while you’re driving, cooking, or exercising. It allows Agents to serve people who aren’t good at typing (elderly, children, visually impaired). It also brings new challenges — speech is much more ambiguous than text, and Agents need stronger context understanding and intent inference capabilities.

1.4 Frontier Theory I: Scaling Laws — The “Physical Laws” of AI Capability

If there’s a discovery in the large model field that’s closest to a “physical law,” it’s the Scaling Law. It describes the mathematical relationship between model performance and three core variables: model parameter count (N), training data volume (D), and computational amount (C).

Kaplan et al.’s Foundational Discovery

In 2020, OpenAI’s Jared Kaplan, Sam McCandlish, et al. first systematically revealed this pattern in their paper “Scaling Laws for Neural Language Models” (arXiv:2001.08361). They trained hundreds of models, with parameter counts ranging from millions to billions, and discovered an astonishing fact:

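Model performance (measured by cross-entropy loss) exhibits a power-law relationship with each of the three factors, as long as the other two are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Where L is the loss value (lower is better), N_c, D_c, and C_c are critical constants, and α_N, α_D, and α_C are the fitted power-law exponents. This means: when you double the model parameters, the loss value decreases according to a power law; when you double the training data, the loss value also decreases according to a power law; and the same holds for compute.

Key insight: Kaplan et al. found that under a fixed compute budget, the optimal strategy is to train very large models on a relatively modest amount of data and stop well before convergence.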
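To make the power law concrete, the sketch below evaluates the parameter-count term, L(N) = (N_c / N)^α_N, for a few model sizes. The exponent and constant are approximate values as I recall them from the Kaplan et al. fit; treat them as assumptions for illustration, not as exact figures.

```python
ALPHA_N = 0.076  # approximate fitted exponent for parameter count (assumption)
N_C = 8.8e13     # approximate critical constant, in non-embedding parameters (assumption)


def loss_from_params(n_params, alpha=ALPHA_N, n_c=N_C):
    """L(N) = (N_c / N) ** alpha_N, valid when data and compute are not the bottleneck."""
    return (n_c / n_params) ** alpha


for n in (1e9, 2e9, 4e9, 8e9):
    print(f"N = {n:.0e}  predicted loss = {loss_from_params(n):.3f}")

# Doubling N multiplies the loss by the same factor every time: 2 ** -alpha_N.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio per doubling: {ratio:.4f}")
```

With an exponent this small, each doubling of parameters trims the loss by only about 5%. That is the practical meaning of "power law" here: steady, predictable gains that require exponentially more scale.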

Chinchilla: Rebalancing Data and Models

In 2022, DeepMind’s Hoffmann et al. corrected Kaplan’s conclusion in the paper “Training Compute-Optimal Large Language Models” (Chinchilla paper). They found a key problem with Kaplan’s experiment: the models weren’t trained long enough. When models are trained more fully on more data, the optimal strategy changes.

Chinchilla’s core conclusion: Under a fixed compute budget, model parameter count (N) and training Token count (D) should grow in equal proportion, i.e., N ∝ D. Specifically, a 70-billion-parameter model should be trained with 1.4 trillion Tokens (ratio about 1:20), not the 300 billion Tokens suggested by Kaplan.
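The 1:20 rule of thumb is easy to turn into arithmetic. The sketch below assumes the widely used approximation that training compute is about C ≈ 6 · N · D FLOPs; that approximation is not stated in this chapter and is labeled here as an assumption. The function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20  # the roughly 1:20 ratio quoted in the text


def chinchilla_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Training Tokens suggested for a model with n_params parameters."""
    return n_params * ratio


def compute_optimal_split(flops_budget, ratio=TOKENS_PER_PARAM):
    """Split a FLOPs budget into (params, tokens), assuming C = 6 * N * D and D = ratio * N."""
    n_params = math.sqrt(flops_budget / (6 * ratio))
    return n_params, n_params * ratio


print(f"70B parameters -> {chinchilla_tokens(70e9):.2e} Tokens")

budget = 6 * 70e9 * 1.4e12  # FLOPs implied by a 70B model on 1.4T Tokens
params, tokens = compute_optimal_split(budget)
print(f"budget {budget:.2e} FLOPs -> {params:.2e} parameters, {tokens:.2e} Tokens")
```

Because both N and D scale with the square root of the budget, quadrupling compute under this rule doubles the model size and doubles the data, rather than pouring everything into parameters.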

This means previous large models (including GPT-3) were all under-trained — they could have performed better with the same compute budget if trained on more data with a smaller model. The Chinchilla paper directly spurred adjustments to subsequent model training strategies: GPT-4, Llama 2/3/4, DeepSeek, etc. all adopted the “data-intensive” training route.

2025-2026 New Development: Inference-Efficient Scaling Law

Research on Scaling Laws didn’t stop at the training stage. In 2025, researchers from Amazon Web Services and UW-Madison proposed Conditional Scaling Law in their paper “Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs” (arXiv:2510.18245v2) — incorporating model architecture information into the Scaling framework.

They found that traditional Scaling Laws ignore the impact of architecture on inference efficiency. Under the same training budget, optimizing architecture (such as adjusting MLP-to-Attention ratio, using GQA grouped-query attention) can improve accuracy by 2.1% while increasing inference throughput by 42%.

Synergy of Three Scaling Laws

In 2025-2026, the industry gradually recognized that Scaling Laws exist not only in the training stage, but also in the post-training stage (Post-training) and the inference stage (Test-time Compute):

Pre-training Scaling: The traditional route, “bigger models + more data.” This is the highest-cost route.
Post-training Scaling: Through post-training techniques like RLHF, RLVR, SFT, significantly improve specific capabilities with relatively small computational amounts.
Test-time Compute Scaling: Let the model “think more” during inference, improving complex task performance through longer reasoning chains.

OpenAI’s o-series and DeepSeek-R1 prove that Test-time Scaling can unlock capabilities that base models cannot achieve. o3 reached 45.1% on ARC-AGI (abstract reasoning benchmark), while traditional LLMs are near 0% on this benchmark. This means the improvement of AI capabilities no longer completely depends on training larger models — you can use a relatively small model and achieve or even exceed large model performance through additional computation during inference.

Implications for Agent design: Agent systems should fully utilize the synergy of three Scaling Laws. Use pre-trained models to obtain base capabilities, use post-training (e.g., domain-specific RLVR) to obtain professional behaviors, and use test-time computation (e.g., dynamic reasoning chain length) to achieve higher accuracy on complex tasks. This hierarchical strategy is more cost-effective than simply pursuing larger models.

1.5 Frontier Theory II: Mixture of Experts (MoE) — Sparsely Activated Intelligence

The most significant architectural change in large models in 2025-2026 is MoE (Mixture of Experts) becoming the de facto standard. From DeepSeek-R1, Kimi K2, Mistral Large 3 to GPT-4 (rumored), almost all frontier models have adopted the MoE architecture.

From Dense to Sparse: MoE’s Core Idea

Traditional Transformers are “dense”: every Token activates all parameters of the model. A 175-billion-parameter model puts all 175 billion parameters to work for each Token (roughly two floating-point operations per parameter). This is like asking “what’s the weather today” and mobilizing the entire encyclopedia editorial team.

MoE’s inspiration comes from the human brain: different regions of the brain are responsible for different functions — language area, visual area, motor area. When processing language, the visual area doesn’t run at full capacity. MoE mimics this “on-demand activation” mechanism:

MoE’s core components:

Expert networks: Multiple small neural networks (e.g., feed-forward networks FFN), each specialized in processing certain types of input.
Gating network: A lightweight network that decides which experts should process each Token.
Sparse activation: For each Token, only the Top-K experts (usually K=2-8) are activated; the remaining experts don’t participate in computation.

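The ingenious aspect of MoE lies in the decoupling of “total capacity” and “active computation.” The model can have a huge total parameter count (storing vast knowledge), while each Token activates only a small fraction of those parameters (keeping per-Token compute low).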
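The three components listed above (experts, gating network, sparse activation) fit in a few lines of code. This is a toy sketch in plain Python, not any particular model's implementation: the experts are simple scaling functions, the gate is a small weight matrix invented for the example, and the softmax is taken over the selected experts only, which is one common variant.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, gate_weights, experts, top_k=2):
    """Route one Token vector to its top_k experts and mix their outputs."""
    # Gating network: one score per expert (dot product with the Token vector).
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Sparse activation: keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    mix = softmax([scores[i] for i in chosen])  # renormalize over the chosen experts
    output = [0.0] * len(token)
    for weight, idx in zip(mix, chosen):
        expert_out = experts[idx](token)  # experts that were not chosen never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


# Four toy "experts": each one scales its input by a different factor.
experts = [lambda v, k=k: [k * x for x in v] for k in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]  # hypothetical gate

output, chosen = moe_layer([2.0, 1.0], gate_weights, experts, top_k=2)
print("experts used:", chosen)
print("output:", [round(x, 3) for x in output])
```

Only two of the four experts do any work for this Token, yet all four exist in memory. That is the total-capacity versus active-computation split in miniature.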

MoE Evolution in 2025-2026: From Expert Collapse to Load Balancing

Early MoE models faced a serious problem: Expert Collapse. The gating network tended to route all Tokens to a few “generalist” experts, causing other experts to never be trained, wasting model capacity.

Techniques that matured by 2024-2026 largely address this problem:

Load Balancing Loss: Add auxiliary loss to the training objective, penalizing unbalanced routing. If a certain expert receives too many Tokens, the loss increases, forcing the gating network to distribute the load.
Improved expert selection strategy: From “Top-K” to “Expert Choice” — no longer Token selects expert, but expert selects Token. This ensures each expert processes a balanced number of Tokens.
Shared Experts: Set part of the experts as “shared” (all Tokens pass through), while others are “specialized” (activated on demand). Shared experts learn general representations, specialized experts learn domain-specific knowledge.
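The load balancing idea from the list above can be sketched as a small function. The formula used here, number of experts times the sum over experts of (fraction of Tokens routed to the expert × mean gate probability for that expert), follows the style of auxiliary loss popularized by Switch Transformer; other models use different variants, and the routing data below is invented.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * p_i), assuming top-1 routing.

    router_probs: per-Token list of gate probabilities over all experts.
    assignments:  per-Token index of the expert that was actually chosen.
    """
    num_tokens = len(assignments)
    # f_i: fraction of Tokens dispatched to expert i.
    fraction = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # p_i: mean gate probability given to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(f"balanced routing:  {balanced:.2f}")
print(f"collapsed routing: {collapsed:.2f}")
```

The loss is lowest when Tokens and gate probability are spread evenly, and it rises as one expert hogs the traffic, which gives the optimizer a reason to keep every expert in use.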

DeepSeek’s MoE Innovation: MLA + MoE Synergy

DeepSeek made unique contributions to the MoE architecture. In addition to standard MoE design, they also introduced Multi-head Latent Attention (MLA) — an attention mechanism that significantly reduces KV cache (Key-Value Cache) usage.

In standard Transformer, each Token’s attention computation needs to store Key and Value vectors, and memory usage grows linearly with sequence length. MLA reduces KV cache by several times through low-rank compression. This allows DeepSeek models to handle long text with much lower memory pressure, laying the foundation for the 2-million-Token context window.
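A quick calculation shows why KV cache size matters for long text. The model shape below (layers, heads, dimensions, latent size) is entirely hypothetical, and the "latent" line is a simplification: it assumes one compressed vector per Token per layer and ignores any extra positional components a real MLA design may store.

```python
def kv_cache_bytes(seq_len, n_layers, values_per_token_per_layer, bytes_per_value=2):
    """Cache size in bytes; grows linearly with sequence length."""
    return seq_len * n_layers * values_per_token_per_layer * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM = 60, 32, 128
LATENT_DIM = 1024  # assumed size of the compressed latent vector
SEQ_LEN = 128_000

standard = kv_cache_bytes(SEQ_LEN, N_LAYERS, 2 * N_HEADS * HEAD_DIM)  # keys + values, every head
latent = kv_cache_bytes(SEQ_LEN, N_LAYERS, LATENT_DIM)                # one latent per Token

print(f"standard KV cache: {standard / 1e9:.1f} GB")
print(f"latent KV cache:   {latent / 1e9:.1f} GB")
print(f"reduction factor:  {standard / latent:.0f}x")
```

The reduction factor depends only on the ratio between the full key/value width and the latent width, so it stays the same at every sequence length; what changes is whether the absolute number still fits in GPU memory.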

1.6 Frontier Theory III: Reasoning Models and Test-Time Compute Scaling

2025 is called the “Year of Reasoning.” OpenAI’s o-series, DeepSeek-R1, Google Gemini 2.5/3, and other models collectively established a new paradigm: let the model “think more” during inference.

From System 1 to System 2: Cognitive Science’s Dual System Theory

The theoretical foundation of this paradigm comes from psychologist Daniel Kahneman’s “Dual System Theory”:

System 1 (fast thinking): Intuitive, fast, automatic, emotional. Traditional LLM generation is similar to System 1 — immediately answering upon receiving a question.
System 2 (slow thinking): Rational, slow, logical, computational. Reasoning models are similar to System 2 — upon receiving a question, first analyze, decompose, verify, then provide an answer.

Kahneman pointed out that humans rely on System 2 for complex problems. Similarly, LLMs also need “slow thinking” for complex math, coding, and logic problems.

RLVR: The Training Mechanism for Reasoning Capability

Reasoning model training doesn’t rely mainly on traditional SFT (Supervised Fine-Tuning); its central ingredient is RLVR (Reinforcement Learning from Verifiable Rewards). Its core process is:

Model generates reasoning chain: For a problem, the model first outputs a series of “reasoning Tokens,” then outputs the final answer.
Answer verification: Use objective criteria to verify whether the final answer is correct (e.g., numerical answer to a math problem, code execution result).
Reinforcement learning update: If the answer is correct, reward the model; if wrong, penalize the model. Through this process, the model learns “how to think.”
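The "answer verification" step is the part that makes RLVR cheap to scale, because it needs no human judge. Here is a minimal sketch of a verifiable reward for math problems. The `Answer:` marker is a hypothetical output convention invented for this example, and a wrong answer earns 0 rather than a negative penalty; real training setups choose their own formats and reward values.

```python
import re


def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion, reference):
    """Reward 1.0 if the final numeric answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    try:
        return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # the answer was not a number


completion = "6 times 7: 6 * 7 = 42. Let me double-check: 42 / 7 = 6.\nAnswer: 42"
print(verifiable_reward(completion, "42"))
```

Note that the reward looks only at the final answer. The reasoning Tokens before it are never graded directly; they are reinforced only insofar as they lead to answers that check out.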

DeepSeek-R1’s key finding: They tried the “pure RL” path, training directly with RL without giving the model any human-written reasoning chain examples. The result: advanced behaviors like reflection and self-verification emerged spontaneously. This shows that reasoning capability isn’t “taught” but “incentivized”: as long as you give the model the right reward signal, it will discover effective reasoning strategies on its own.

1.7 Frontier Theory IV: Long Context and Attention Mechanism Evolution

The expansion of the context window is one of the most significant engineering achievements of 2024-2026. From early 4K Tokens to today’s tens of millions of Tokens, this is backed by a series of attention mechanism innovations.

The Curse of Attention Complexity

The core of Transformer is the Self-Attention mechanism, with computational complexity O(n²) — when sequence length doubles, computational amount becomes 4x. This means:

4K Tokens: computational amount = 1x (baseline)
128K Tokens: computational amount = 1024x
1M Tokens: computational amount = 65536x

This quadratic complexity is the main bottleneck for long context. Research in 2025-2026 is dedicated to reducing the computational and memory overhead of attention while maintaining model capabilities.
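The multipliers in the list above follow directly from squaring the length ratio. The short sketch below reproduces them, assuming 1K = 1,024 Tokens and 1M = 1,024K, and counting only the quadratic attention term (real models also have costs that grow linearly with length).

```python
def relative_attention_cost(seq_len, baseline_len=4 * 1024):
    """Self-attention cost relative to the baseline, assuming cost grows as n ** 2."""
    return (seq_len / baseline_len) ** 2


for label, n in (("4K", 4 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)):
    print(f"{label:>4} Tokens: {relative_attention_cost(n):,.0f}x")
```

Going from 4K to 128K is a 32x increase in length but a 1,024x increase in attention work, which is why simply "making the window bigger" stops being practical without architectural changes.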

Linear Attention and State Space Models

Linear Attention is a class of techniques that reduce O(n²) complexity to O(n). Its core idea is to change the computation order of attention.

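Standard attention computes the n × n score matrix first, which costs O(n²·d):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Linear attention replaces the softmax with a feature map φ and multiplies K and V first, so the n × n matrix is never formed and the cost drops to O(n·d²) (normalization term omitted here):

$$\mathrm{LinearAttention}(Q, K, V) = \phi(Q)\left(\phi(K)^\top V\right)$$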
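"Changing the computation order" works because matrix multiplication is associative once the softmax is out of the way. The toy sketch below checks this with plain Python lists. It is unnormalized and non-causal, the feature map elu(x) + 1 is just one common choice, and the matrices are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def phi(m):
    """Positive feature map elu(x) + 1, applied element-wise."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


# Toy inputs: n = 4 Tokens, d = 2 dimensions.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
K = [[0.5, 1.0], [1.0, -1.0], [0.0, 2.0], [1.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Order 1: (phi(Q) phi(K)^T) V builds an n x n matrix first.
quadratic_order = matmul(matmul(phi(Q), transpose(phi(K))), V)
# Order 2: phi(Q) (phi(K)^T V) builds only a d x d matrix.
linear_order = matmul(phi(Q), matmul(transpose(phi(K)), V))

same = all(
    math.isclose(a, b, rel_tol=1e-9)
    for row_a, row_b in zip(quadratic_order, linear_order)
    for a, b in zip(row_a, row_b)
)
print("same result from both orders:", same)
```

Both orders give the same numbers, but the second never materializes a matrix whose size depends on n squared. With d fixed, its cost grows linearly in sequence length.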

State Space Models (SSM) like Mamba (2023), Mamba-2 (2024) go even further. SSM views sequence modeling as a state transition process.
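The "state transition" view can be shown with the simplest possible case: a single scalar state updated once per input. The coefficients below are arbitrary illustrative values; real SSM layers use vector states and learned parameters, and Mamba additionally makes those parameters depend on the input.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence."""
    state = 0.0  # fixed-size state, no matter how long the sequence gets
    outputs = []
    for x in inputs:
        state = a * state + b * x  # fold the new input into the running state
        outputs.append(c * state)
    return outputs


print(ssm_scan([1.0, 0.0, 0.0, 1.0]))
```

Each step touches only the current input and the previous state, so time grows linearly with sequence length and memory for the state stays constant. The trade-off is that everything the model remembers about the past has to fit into that state.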

Context Compression and Hierarchical Memory

Another approach is “not expanding the window, but compressing content.” Context Compression techniques allow more information to be placed within a limited window through summarization, selective retention, vector indexing, etc.

Hierarchical Memory Architecture is a common strategy in Agent systems.
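One simple way to picture hierarchical memory is a two-tier structure: recent turns are kept word for word, and older turns are folded into a running summary. The class below is a minimal sketch under that assumption. The names are hypothetical, and the default "summarizer" just keeps the first 40 characters of each old turn as a stand-in for a real model-generated summary.

```python
from collections import deque


class HierarchicalMemory:
    """Two tiers: recent turns kept verbatim, older turns folded into a summary."""

    def __init__(self, max_recent=3, summarize=None):
        self.recent = deque()
        self.max_recent = max_recent
        self.summary = ""
        # Placeholder summarizer (assumption): append a 40-character snippet of the old turn.
        self.summarize = summarize or (
            lambda old, turn: (old + " | " + turn[:40]).strip(" |")
        )

    def add(self, turn):
        self.recent.append(turn)
        # When working memory overflows, compress the oldest turn into the summary tier.
        while len(self.recent) > self.max_recent:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def context(self):
        parts = []
        if self.summary:
            parts.append("SUMMARY OF EARLIER TURNS: " + self.summary)
        parts.extend(self.recent)
        return "\n".join(parts)


memory = HierarchicalMemory(max_recent=3)
for turn in ("t1", "t2", "t3", "t4", "t5"):
    memory.add(turn)
print(memory.context())
```

The context handed to the model stays bounded however long the conversation runs; what is lost is detail from old turns, so the quality of the summarizer decides how much the Agent "forgets."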

1.8 Frontier Theory V: Model Safety and Alignment

The higher the autonomy of an Agent, the more severe the safety issues. An Agent that can autonomously call tools, access data, and execute code — if its behavior deviates from expectations — has far more serious consequences than a chatbot.

Alignment Mechanisms of RLHF and RLVR

RLHF (Reinforcement Learning from Human Feedback) is the alignment technique popularized by OpenAI’s InstructGPT (2022). Its core process is…

RLVR (Reinforcement Learning from Verifiable Rewards) is an evolution of RLHF, using objective verifiable criteria to replace subjective human preferences.

Safety Challenges: Jailbreaking, Prompt Injection, and Agent-Specific Risks

Agent safety risks are more complex than pure dialogue models:

1. Jailbreaking: Users bypass the model’s safety constraints through carefully constructed prompts.
2. Prompt Injection: Attackers embed malicious instructions in data processed by the model.
3. Tool Abuse: Agents are induced to call dangerous tools.
4. Privilege Escalation: Agents gain unexpected permissions through a series of seemingly harmless operations.

1.9 The 2026 Model Landscape: How to Choose a Model

Having understood the principles, let’s return to practice: for different tasks, which model should you choose?

Mainstream Model Capability Matrix (2026)

Dimension	GPT-5.5 Pro	Claude 4 Opus	DeepSeek V4	Gemini 3 Ultra	Llama 4
Reasoning	★★★★★	★★★★☆	★★★★★	★★★★☆	★★★☆☆
Coding	★★★★★	★★★★★	★★★★★	★★★★☆	★★★★☆
Long Context	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Multimodal	★★★☆☆	★★★★☆	★★★☆☆	★★★★★	★★★☆☆
Cost	$$$$	$$$$$	$	$$	Free (self-hosted)
Customizability	Medium	Medium	High (open-source)	Medium	Extreme

Decision Framework for Model Selection

Model selection isn’t about choosing the “best,” but the “most suitable.” Decision framework:

Step 1: Clarify task type
Step 2: Consider constraints
Step 3: Design multi-model routing strategy
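The three steps can be sketched as a tiny routing function. The mapping below is only one possible reading of the capability matrix in this section, and the task labels, flags, and function name are invented for illustration; a production router would also weigh latency, quotas, and measured quality on your own tasks.

```python
# Step 1: task type -> preferred model (illustrative reading of the matrix above).
ROUTES = {
    "multimodal": "Gemini 3 Ultra",
    "long_context": "Claude 4 Opus",
    "reasoning": "GPT-5.5 Pro",
    "coding": "DeepSeek V4",
}
DEFAULT_MODEL = "DeepSeek V4"  # assumed low-cost fallback


def choose_model(task_type, must_self_host=False, low_budget=False):
    """Pick a model name from the task type, after applying hard constraints."""
    # Step 2: constraints override capability preferences.
    if must_self_host:
        return "Llama 4"
    if low_budget:
        return "DeepSeek V4"
    # Step 3: route by task type, falling back to the default model.
    return ROUTES.get(task_type, DEFAULT_MODEL)


print(choose_model("multimodal"))
print(choose_model("reasoning", low_budget=True))
print(choose_model("summarization", must_self_host=True))
```

Checking constraints before capabilities reflects the point of the framework: the "best" model on paper is irrelevant if it breaks the budget or cannot be deployed where the data lives.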

1.10 Chapter Summary: A “User Manual” for Model Capabilities

This chapter covered a lot of content, from basic concepts to frontier theories. Here are the core takeaways:

Must Remember

The essence of large models is “predict the next word.” It’s a super-powerful pattern matching machine, not a human-style thinker.
Token is the model’s atom. All accounting units are Tokens.
The context window is working memory.
Temperature controls randomness.
Reasoning models let Agents “think before doing.”

Should Understand

Scaling Law is AI’s “physical law.”
MoE is the de facto standard for frontier models.
RLVR trains reasoning capability.
Long context requires attention mechanism innovation.
Alignment and safety are baselines — risks amplify once tools and orchestration enter the picture (Chapter 2+).

Next Step

With the model “brain” covered, Chapter 2 covers Agent nature: Model + Harness, the ReAct loop, Function Calling, the four-layer architecture, and the “Agent inflection point.”

References and Further Reading

Papers

Kaplan et al., “Scaling Laws for Neural Language Models”, arXiv:2001.08361, 2020
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla), 2022
Bian et al., “Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs”, arXiv:2510.18245v2, 2025
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, 2022
DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, 2025

Technical Blogs and Reports

NVIDIA Blog: “Mixture of Experts Powers the Most Intelligent Frontier AI Models”, 2025
OpenAI o-series Technical Documentation
Anthropic: “Constitutional AI: Harmlessness from AI Feedback”, 2022
Google DeepMind: Gemini 3 Technical Report, 2025

Industry Analysis

2025 Large Language Model Technology Panoramic Report (CSDN)
2025-2026 Open-source Large Model Deep Report
AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 (Zylos AI Research)

Key Concept Index

Token: Minimum text unit processed by the model
Context Window: Maximum Token count the model can process at once
Temperature: Parameter controlling generation randomness (0-2)
Reasoning Model: Models that perform multi-step reasoning via Chain-of-Thought (o-series, DeepSeek-R1)
Scaling Law: Power-law relationship of model performance to scale/data/compute
MoE: Mixture of Experts architecture, sparse activation reduces compute cost
RLVR: Reinforcement Learning from Verifiable Rewards
RLHF: Reinforcement Learning from Human Feedback
ReAct: Reasoning + Acting Agent cognitive loop
Function Calling: Native structured tool calling
MLA: Multi-head Latent Attention, reducing KV cache usage
SSM: State Space Models, linear-complexity sequence modeling
Test-time Compute: Inference-time computation, letting the model “think more”