Chapter 6 Memory: The Agent’s Persistent Information
An Agent without memory starts every conversation with amnesia.
Last week you told it your company’s name, its core business, and its main competitors. This week it has forgotten all of it, and you have to explain everything again.
This is annoying. It is like visiting the same coffee shop every day and still having to tell the barista what you drink.
Memory lets an Agent retain information across conversations, so it understands you better the more you use it.
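6.1 Why Memory Is Needed
I have tried assistants without memory: by the second day I had to explain the background all over again.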
A Real Dilemma
You ask the Agent to analyze competitors. It does a great job and gives you a detailed report.
The next day, you say “Update yesterday’s competitor analysis with the latest market share data.”
The Agent says: “Which competitors would you like to analyze?”
It doesn’t remember yesterday’s conversation.
This is the daily reality of an Agent without memory. Every conversation starts from zero.
The Nature of Memory
An Agent’s context window is limited. Even if the latest models support a 1M-token context, you cannot stuff all of the historical information into it every time.
Moreover, larger context windows mean slower inference and higher costs.
The essence of memory is this: keep information that does not need to be seen every time outside the context window, and bring it in when it is needed.
This is like the human brain. You don’t keep all your experiences in active memory at once. Most memories are stored in long-term memory; only what’s currently needed gets loaded into working memory.
6.2 Three-Layer Memory Architecture
Working Memory
Working memory is the content within the current context window.
It includes: system instructions, conversation history, retrieved relevant knowledge, current task status. This is the information the Agent can directly “see” at the current step.
Working memory has limited capacity and fast access. It resembles human short-term memory: there is a limit to how much information you can process at once.
Short-term Memory
Short-term memory is the conversation history within the current session.
This covers what the user said, how the Agent replied, and what the intermediate results were. The information usually does not need to be kept after the session ends, but it matters while the session is running.
The key to managing short-term memory is compression. When the conversation gets long, summarize earlier content, retain key information, and discard redundant details.
Implementation: the trim_messages function (from langchain_core.messages, and commonly used in LangGraph applications) trims conversation history so that it stays within the context window. You specify a maximum token count (e.g., 4000), a token counter, and a trimming strategy (e.g., keep the most recent messages).
Short-term memory compression approach: when the token count of the messages approaches the limit (e.g., reaches 80% of it), summarize the earlier messages and keep only the most recent ones verbatim. This preserves key information while keeping the context length under control.
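The compression approach can be sketched in plain Python. This is a minimal illustration, not a production implementation: the Message class, count_tokens, summarize, and the keep_recent parameter are hypothetical names introduced here, the token counter is a crude word count, and the summarizer is a placeholder where a real system would call a model.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())


def summarize(messages: list) -> str:
    # Placeholder summarizer; a real system would call a model here.
    return " | ".join(m.content[:30] for m in messages)


def compress_history(messages, max_tokens=4000, trigger_ratio=0.8, keep_recent=4):
    """Summarize older messages once the history reaches trigger_ratio of max_tokens."""
    total = sum(count_tokens(m.content) for m in messages)
    if total < max_tokens * trigger_ratio or len(messages) <= keep_recent:
        return list(messages)  # still under the threshold, keep everything
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = Message("system", "Summary of earlier conversation: " + summarize(older))
    return [summary] + list(recent)
```

The 80% trigger leaves headroom for the next user turn and the model's reply, so the history is compressed before the limit is actually hit.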
Long-term Memory
Long-term memory is persistent information across sessions.
User preferences, project context, historical decisions, learned experiences. This information is shared between multiple sessions.
Long-term memory is typically stored in external databases. It’s retrieved into context when needed.
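As a rough sketch of the idea, the following stores cross-session facts in SQLite from the Python standard library. The LongTermMemory class, the table layout, and the key/value scheme are assumptions made for illustration; the chapter itself describes retrieval-based storage, which a vector database would handle instead.

```python
import sqlite3


class LongTermMemory:
    """Hypothetical key/value memory that survives across sessions."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist between runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            "user_id TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id, key, value):
        # A newer value for the same key overwrites the older one.
        self.db.execute(
            "INSERT OR REPLACE INTO memories (user_id, key, value) VALUES (?, ?, ?)",
            (user_id, key, value),
        )
        self.db.commit()

    def recall(self, user_id):
        # Return everything known about this user, ready to add to the context.
        rows = self.db.execute(
            "SELECT key, value FROM memories WHERE user_id = ? ORDER BY key",
            (user_id,),
        ).fetchall()
        return dict(rows)
```

At the start of a new session, the result of recall() can be rendered into the system instructions so that the Agent already knows the company name and competitors from earlier conversations.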
6.3 Vector Embeddings: Turning Information into “Mathematics”
An Intuition
“Today’s weather is nice” and “Today’s climate is pleasant”: the two sentences are worded differently, yet their meanings are very close.
How do you let a computer know they’re similar in meaning?
The answer: turn them into numbers.
What are Vector Embeddings
Vector embedding is the technique of converting text into a list of numbers (a vector).
“Today’s weather is nice” gets converted to [0.2, 0.8, 0.1, …]. “Today’s climate is pleasant” gets converted to [0.21, 0.79, 0.11, …]. These two vectors are very close.
“I need to write a financial report” gets converted to [0.9, 0.1, 0.7, …]. This vector is very different from the first two.
The distance between vectors reflects semantic similarity. The closer the distance, the more similar the meaning.
Why Vectors are Needed
Traditional keyword search has limitations. If you search for “weather”, you only find documents that contain that exact word. A document that says “climate” instead will not be found.
Vector search doesn’t have this limitation. It matches based on semantic similarity, not exact keyword matches.
2026 Embedding Models
When choosing an embedding model, consider quality, speed, cost, and language support.
- OpenAI text-embedding-3: Good quality, requires API calls. Suitable for quality-critical scenarios.
- all-MiniLM-L6-v2: Open-source, can run locally, fast. Suitable for cost-sensitive scenarios.
- BGE series: Good Chinese performance, open-source. Suitable for Chinese scenarios.
Implementation: with OpenAI’s embedding models, call the embeddings.create() method with the model name and the input text to get the vector back (1536 dimensions for text-embedding-3-small).
Vector similarity: cosine similarity measures how closely two vectors point in the same direction; the closer the value is to 1, the more similar the texts. As a rough guide, similar sentences often score above 0.9 and dissimilar ones below 0.5, though the exact ranges depend on the embedding model.
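Cosine similarity is short enough to write out by hand. The sketch below uses only the first three components of the illustrative vectors from earlier in this section; they are toy numbers, not the output of any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


# Truncated toy vectors taken from the examples above.
weather = [0.2, 0.8, 0.1]
climate = [0.21, 0.79, 0.11]
report = [0.9, 0.1, 0.7]

print(cosine_similarity(weather, climate))  # very close to 1
print(cosine_similarity(weather, report))   # well below 0.5
```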
2026 Mainstream Embedding Model Comparison:
| Model | Dimensions | Language Support | Cost (per M tokens) | Suitable Scenario |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Multilingual | $0.02 | General purpose, cost-sensitive |
| text-embedding-3-large | 3072 | Multilingual | $0.13 | High quality requirements |
| BGE-M3 | 1024 | Multilingual (strong in Chinese and English) | Free (open-source) | Chinese scenarios |
| all-MiniLM-L6-v2 | 384 | English | Free (open-source) | Local deployment |
| E5-Mistral | 4096 | Multilingual | Free (open-source) | Long document retrieval |
6.4 RAG’s Underlying Principles
What is RAG
RAG (Retrieval-Augmented Generation) is a technique that lets Agents retrieve information from external knowledge bases, then generate answers based on that information.
Core flow: User question → Retrieve relevant knowledge → Feed knowledge and question to model → Model generates answer based on knowledge.
Why RAG is Needed
A model’s training data has a cutoff date. It doesn’t know yesterday’s news, your company’s internal documents, or the latest product prices.
RAG lets the model “look up information” when answering, rather than relying entirely on knowledge learned during training.
RAG’s Complete Flow
Step 1: Process. Convert raw documents (PDF, web pages, databases) into plain text.
Step 2: Chunk. Split long text into small segments. Typically 200-1000 characters per chunk. Chunks too large → imprecise retrieval; chunks too small → lost context.
Step 3: Embed. Use an embedding model to convert each text chunk into a vector.
Step 4: Store. Store vectors in a vector database.
Step 5: Query. The user’s question is also converted into a vector.
Step 6: Retrieve. Find the text chunks most similar to the question vector in the vector database.
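Step 7: Rerank. Re-score the retrieved results with a more precise model and sort them again. Because reranking only reorders what retrieval has already found, the gain shows up in the precision of the top results rather than in recall; improvements of 20-40% are cited, but they vary with the data.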
Step 8: Generate. Feed the reranked text chunks and the user’s question to the model to generate the final answer.
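The steps above (without the rerank step) can be sketched end to end with the standard library. Everything here is a toy stand-in chosen for illustration: embed() uses bag-of-words counts rather than a real embedding model (so it matches shared words, not meaning), VectorStore is an in-memory list rather than a vector database, and build_prompt only assembles the text that would be sent to a model.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=200, overlap=50):
    # Step 2: fixed-size character chunks; neighbours share `overlap` characters.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def embed(text):
    # Step 3 stand-in: word counts instead of a learned embedding.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class VectorStore:
    def __init__(self):
        self.items = []  # Step 4: (chunk, vector) pairs

    def add(self, chunk):
        self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        qv = embed(query)  # Step 5: embed the question
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]  # Step 6: top-k chunks


def build_prompt(question, chunks):
    # Step 8 input: retrieved context plus the question.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Swapping embed() for a real embedding model and VectorStore for a vector database keeps the same shape: ingestion calls chunk_text and add, and a query calls search and build_prompt.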
2026 Vector Database Selection
- Pinecone: Fully managed, zero operations. A common first choice for production RAG. Suitable for teams that don’t want to manage their own infrastructure.
- Weaviate: Native hybrid search (vector + BM25 keyword matching). Hybrid search can recall noticeably more relevant results than pure vector search (gains of 20-40% are cited, though they depend on the data). Suitable for scenarios with high search quality requirements.
- Qdrant: A first choice for performance-critical, self-hosted deployments. Suitable for teams with cost and performance requirements.
- pgvector: PostgreSQL ecosystem. If you’re already using PostgreSQL, just add an extension.
Implementation: with LangChain + Pinecone, a basic RAG setup that embeds and stores documents and then retrieves them by similarity takes only a few lines of code.
Complete RAG pipeline: there are two main flows, document ingestion (process → chunk → embed → store) and query (embed → retrieve → rerank → generate). In a real implementation, overlap adjacent chunks to keep context coherent, and use a reranking model to improve retrieval precision.
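The retrieve-then-rerank part of the query flow can be illustrated as two stages. This sketch is an assumption-laden toy: first_stage counts shared words as a cheap retriever, and rerank_score counts shared word pairs as a stand-in for the more precise model (in practice often a cross-encoder); neither function comes from a real library.

```python
import re


def tokens(text):
    return re.findall(r"\w+", text.lower())


def first_stage(query, docs, k=3):
    # Cheap, recall-oriented retrieval: rank by number of shared words.
    q = set(tokens(query))
    ranked = sorted(docs, key=lambda d: len(q & set(tokens(d))), reverse=True)
    return ranked[:k]


def bigrams(seq):
    return set(zip(seq, seq[1:]))


def rerank_score(query, doc):
    # Stand-in for a precise reranker: rewards words that appear in the same order.
    return len(bigrams(tokens(query)) & bigrams(tokens(doc)))


def rerank(query, candidates, top_n=2):
    # Reorders the candidates only; it cannot add documents retrieval missed.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


docs = [
    "data about the stock market and how to share files",
    "market share data by region",
    "office holiday schedule",
]
query = "market share data"
candidates = first_stage(query, docs, k=2)
print(rerank(query, candidates))  # the market-share document now ranks first
```

Both of the first two documents contain all three query words, so the cheap stage cannot tell them apart; the second stage separates them because only one contains the words as a phrase.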
RAG’s Limitations
Retrieval quality is the biggest bottleneck. If the retrieved information is irrelevant or incomplete, even a very capable model cannot generate a good answer. A common rule of thumb in the industry is that about 80% of RAG problems originate in the retrieval stage, not in the model itself.
Chunking strategy has a big impact. Chunks that are too large make retrieval imprecise; chunks that are too small lose context. There is no universal chunking strategy, so it has to be tuned for each scenario.
The knowledge base needs maintenance. When documents are updated, their vectors must be regenerated. If the knowledge base is out of date, the retrieved information may be wrong.
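One simple way to keep vectors in step with their documents is to store a content hash alongside each indexed document and compare hashes on every sync. The sketch below assumes this hashing scheme; plan_reindex and its argument shapes are hypothetical and not part of any vector database's API.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """Decide which documents need new vectors and which vectors are orphaned.

    documents: {doc_id: current text}
    indexed_hashes: {doc_id: hash recorded when the document was last embedded}
    """
    # New or changed documents: the stored hash is missing or differs.
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    # Documents that no longer exist: their vectors should be removed.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return sorted(to_embed), sorted(to_delete)
```

Only changed documents are re-embedded, which keeps the embedding cost of a sync proportional to what actually changed.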
6.5 Self-Evolving Memory: Agent’s “Dreaming” Learning Mechanism
An Interesting Concept
Anthropic proposed an experiment: let the Agent “dream” between sessions.
After completing a conversation, the Agent launches a background process to review it: What went well? Where did it make mistakes? What new preferences did the user reveal?
From these reviews, the Agent extracts persistent experiences and stores them in long-term memory.
In the next conversation, the Agent retrieves these experiences to avoid repeating mistakes and leverage accumulated knowledge.
Workflow
Session ends → Background review → Extract patterns → Update memory → Retrieve and use in next session.
This is like how the brain consolidates daytime memories during sleep. Important information gets consolidated; unimportant information gets forgotten.
Intelligent Forgetting
More memory isn’t always better. Outdated, low-value information interferes with the Agent’s judgment.
An intelligent forgetting mechanism uses a scoring model to evaluate the value of each memory: How long has it been since the memory was last accessed? Was it helpful when it was accessed? Does it contradict newer information?
Low-scoring memories get gradually demoted and eventually deleted.
Implementation: Claude Code’s “Dreaming” feature (released in 2026) already implements this mechanism. The Agent automatically reviews its sessions and updates memory between them, so you get this capability by using Claude Code rather than building it yourself.
Self-evolving memory implementation approach: after a session ends, use a model to review the conversation and extract user preferences, successful experiences, lessons from errors, and new knowledge, then store them in a vector database. Each time a memory is retrieved, update its access statistics and value score. Periodically clean up low-value memories (for example, those not accessed in 90 days and accessed only a few times overall).
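The access tracking and periodic cleanup can be sketched as follows. This is one possible reading of the approach, not how any particular product implements it: MemoryRecord, touch, should_forget, and the min_accesses threshold of 3 are hypothetical, and the review step that extracts memories from a conversation is left out because it requires a model call.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryRecord:
    text: str
    last_accessed: datetime
    access_count: int = 0
    helpful_count: int = 0


def touch(record, now, helpful):
    # Called each time a memory is retrieved, to update its statistics.
    record.last_accessed = now
    record.access_count += 1
    if helpful:
        record.helpful_count += 1


def should_forget(record, now, max_idle_days=90, min_accesses=3):
    # Forget only memories that are both stale and rarely used.
    idle = now - record.last_accessed
    return idle > timedelta(days=max_idle_days) and record.access_count < min_accesses


def prune(records, now):
    return [r for r in records if not should_forget(r, now)]
```

Requiring both conditions means a memory that was used often is kept even after a long idle period, and a brand-new memory is not deleted just because it has not been used yet.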
6.6 Semantic Caching: Saving Money with Memory
A Money-Saving Trick
A user asks a question, the Agent answers. Another user asks a similar question, and the Agent reasons from scratch again.
This is wasteful. If two questions are semantically similar and the answers are about the same, why not cache the first answer and reuse it for the second question?
This is semantic caching.
How it Works
The user’s question is first semantically matched against historical Q&A in the cache. If a historical Q&A with similarity exceeding a threshold is found, the cached answer is returned directly without calling the model again.
Cost reductions of about 68% are cited for this approach, although the actual saving depends on how often questions hit the cache. Responses are also much faster: a cache lookup takes tens of milliseconds, while model inference may take seconds.
Implementation: Redis is a commonly used semantic caching solution. In 2026, Bifrost and GPTCache also provide out-of-the-box semantic caching, with reported cache hit rates typically between 60% and 85%.
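The lookup logic behind a semantic cache fits in a few lines. The sketch below is a toy under stated assumptions: SemanticCache and answer are hypothetical names, embed() uses word counts in place of a real embedding model, the 0.8 threshold is arbitrary, and entries live in a Python list rather than in Redis or another store.

```python
import math
import re
from collections import Counter


def embed(text):
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (question vector, cached answer)

    def get(self, question):
        qv = embed(question)
        best_score, best_answer = 0.0, None
        for vec, cached in self.entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, cached
        # Reuse the answer only if the closest question is similar enough.
        return best_answer if best_score >= self.threshold else None

    def put(self, question, response):
        self.entries.append((embed(question), response))


def answer(question, cache, call_model):
    cached = cache.get(question)
    if cached is not None:
        return cached, True  # cache hit: no model call
    response = call_model(question)
    cache.put(question, response)
    return response, False
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask; set it too high and the cache rarely hits. Ignoring the cost of the embedding lookup itself, model spend falls roughly in proportion to the hit rate.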
Chapter Summary
Memory is the Agent’s persistent information. Key points:
- Three-layer architecture: Working memory (context window), short-term memory (conversation history), long-term memory (external storage).
- Vector embeddings: Turn text into numbers to enable semantic search.
- RAG: Process → chunk → embed → store → query → retrieve → rerank → generate. Retrieval quality is the biggest bottleneck.
- Vector database selection: Pinecone (managed), Weaviate (hybrid search), Qdrant (self-hosted), pgvector (PostgreSQL).
- Self-evolving memory: Agents accumulate experiences through review. Claude Code’s Dreaming feature is built-in.
- Semantic caching: Cache common Q&A to reduce costs by 68%.
The next chapter covers the Harness’s fourth subsystem: Sandbox. How Agents execute code and manipulate files in a secure environment.