Agent Architecture Refactoring: Why the System Prompt Should Not Be Stored in History

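After open-sourcing Blade Code, I kept turning over one question:

Should the System Prompt be stored in the message History at all?

With Prompt Caching now widely available, I finally thought this through, so today's post is about exactly that.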

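The conclusion first

No. The System Prompt should be injected dynamically on every request, not persisted to the database as the "first message".

Sounds counterintuitive? Read on.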

Starting from a common pattern

Many people write a Coding Agent like this:

class Agent {
  private systemPrompt?: string // instance-level state
  history: Message[] = []

  async initialize() {
    this.systemPrompt = await buildSystemPrompt()
    // ❌ Anti-pattern: System is pushed into history at initialization
    this.history.push({ role: 'system', content: this.systemPrompt })
  }

  async chat(userMessage: string) {
    this.history.push({ role: 'user', content: userMessage })
    const response = await llm.chat(this.history)
    this.history.push({ role: 'assistant', content: response })
    return response
  }
}

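This pattern is everywhere, and the logic is intuitive: build the System Prompt at initialization, push it into the array, then keep appending the conversation.

But the longer you use it, the more you run into some serious problems:

Config changes don't take effect: the history still holds the old System Prompt unless the user manually clears the session
Modes can't be switched: say you want to go from "code execution mode" to "architecture planning mode", but the System Prompt in history is fixed
The Agent instance gets heavy: a System Prompt often carries a lot of context (file tree, rules); keeping it as an instance property inflates memory and gets in the way of horizontal scaling
Long conversations "drift": the System Prompt sits at the very top, buried under hundreds of history messages, and the model gradually "forgets" who it is

The root of the problem: mixed concerns

The root cause: a "meta-instruction" is treated as "conversation data", and on top of that it is stored as "instance state".

System Prompt: a meta-instruction that tells the model "who you are" and "what the current environment is". It is volatile (it changes with config, mode, and environment)
History: business data that records the exchange between the user and the AI. It is append-only

Mix the two and the architecture becomes rigid.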

graph TB
    subgraph "❌ Stateful design"
        direction TB
        A[Agent instance]
        A --> A1["systemPrompt (stale data)"]
        A --> A2["history (mixed with System)"]
    end

    subgraph "✅ Stateless design"
        direction TB
        B[Agent instance]
        B --> B1["chat(context)"]
        B --> B2[Pure function, no internal state]

        C[External storage]
        C --> C1["History Storage (dialogue only)"]
        C --> C2["Config Center (builds System dynamically)"]
    end

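The fix: a Stateless Agent

The idea is simple: the Agent instance does not hold the System Prompt; it is passed in dynamically on every request.

Step 1: remove instance-level state

The Agent becomes a pure executor, and all context comes in through context: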

// Extended ChatContext interface
interface ChatContext {
  messages: Message[] // dialogue history only (no System)
  systemPrompt?: string // passed in dynamically
  // ...other environment info
}

class Agent {
  // ✅ No state properties at all

  async chat(context: ChatContext) {
    // 1. Use the System Prompt passed in, or build one
    const systemPrompt = context.systemPrompt ?? (await this.buildSystemPrompt())

    // 2. Assemble the message list for this request only
    const messages = [
      { role: 'system', content: systemPrompt }, // always first
      ...context.messages,
    ]

    return this.executeLoop(messages)
  }
}

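Step 2: the Service layer does the assembly

// Service layer
async handleChat(sessionId: string, userMessage: string) {
  // 1. Build the System Prompt dynamically (latest file tree, current time, etc.)
  const systemPrompt = await buildSystemPrompt({ mode: 'coding' });

  // 2. Load the dialogue-only history
  const history = await db.getMessages(sessionId);

  // 3. Call the Agent with the history plus the new user message
  return agent.chat({
    messages: [...history, { role: 'user', content: userMessage }],
    systemPrompt, // buildSystemPrompt is assumed to return a string, as in the first snippet
  });
}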

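⚠️ Migration note: clean up old data

If you are migrating from the old architecture, watch out for one pitfall: the old History database may still contain messages with role: system.

They must be filtered out when loading history; otherwise the model receives two System Prompts (the old one and the newly injected one), which leads to erratic behavior.

async loadHistory(sessionId: string) {
  const raw = await db.getMessages(sessionId);
  // ✅ Always filter system messages out of the history
  return raw.filter(msg => msg.role !== 'system');
}

The overall architecture now looks like this: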

flowchart TB
    subgraph ServiceLayer["Calling layer (Service)"]
        S1[Build systemPrompt]
        S2[Load History]
        S3[Assemble ChatContext]
    end

    subgraph AgentLayer["Agent layer (stateless)"]
        A1[Receive context]
        A2[Assemble the full prompt]
        A3[Call the LLM]
        A4[Return the result]
    end

    subgraph Persistence["Persistence layer"]
        P1[(History Storage)]
        P2[(Config Center)]
    end

    subgraph LLMService["LLM service"]
        L1[Prompt Cache]
        L2[Model inference]
    end

    S1 --> S3
    S2 --> S3
    P1 --> S2
    P2 --> S1
    S3 --> A1
    A1 --> A2 --> A3 --> A4
    A3 --> LLMService
    LLMService --> A3

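The biggest objection: doesn't this waste tokens?

This is most people's first reaction: "Sending a System Prompt of several thousand tokens with every request? How is that cost acceptable?"

In 2023 this was a real problem. In 2026 it no longer is.

Because we have Prompt Caching.

How Prompt Cache works

Put simply, you tell the LLM server: "I'm going to send this prefix (the System Prompt) over and over, so keep its KV Cache and don't recompute it every time."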

sequenceDiagram
    participant C as Client
    participant S as LLM Server
    participant Cache as Prompt Cache

    Note over C,Cache: First request (Cache Miss)
    C->>S: [System(10k) + User1]
    S->>Cache: Write to cache
    S->>C: Response (billed: 10k + User1)

    Note over C,Cache: Second request (Cache Hit)
    C->>S: [System(10k) + User1 + Asst1 + User2]
    Cache-->>S: ⚡️ Hit! Computation for the System part is skipped
    S->>C: Response (billed: System at the discounted cache rate + new User/Asst tokens)

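Actual gains

According to figures published by Anthropic and DeepSeek, on a cache hit:

The cost of the cached input tokens drops by up to 90% (DeepSeek goes as low as 0.1 yuan per million tokens)
Time to first token (TTFT) drops by up to 85%

Conclusion: stateless design + Prompt Cache = both flexible (change the prompt at any time) and cheap (the repeated prefix is billed at a fraction of the normal input price).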
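
To make the arithmetic concrete, here is a rough sketch of the input cost of one request with and without a cache hit. The function name `estimateInputCost`, the price of 3 units per million tokens, and the 10% cache-read rate are illustrative assumptions rather than any vendor's price sheet, and the sketch ignores the surcharge some providers apply when the cache is first written.

```typescript
// Illustrative only: the price and the cache-read rate are assumptions, not real vendor pricing.
interface Pricing {
  inputPerMillion: number // price of uncached input tokens, per 1M tokens
  cacheReadRate: number // fraction of the input price charged for cached tokens (e.g. 0.1)
}

// Estimate the input cost of one request whose prefix may be served from cache.
function estimateInputCost(cachedTokens: number, freshTokens: number, pricing: Pricing): number {
  const cachedCost = (cachedTokens / 1_000_000) * pricing.inputPerMillion * pricing.cacheReadRate
  const freshCost = (freshTokens / 1_000_000) * pricing.inputPerMillion
  return cachedCost + freshCost
}

const pricing: Pricing = { inputPerMillion: 3, cacheReadRate: 0.1 }

// Cache miss: the 10k-token System Prompt and 200 new tokens are all billed at full price.
const missCost = estimateInputCost(0, 10_200, pricing)
// Cache hit: the 10k-token prefix is billed at the discounted rate, only 200 tokens at full price.
const hitCost = estimateInputCost(10_000, 200, pricing)

console.log({ missCost, hitCost, saving: 1 - hitCost / missCost })
```

Under these assumed numbers the hit costs roughly an eighth of the miss; the exact ratio depends on how large the System Prompt is relative to the new tokens in each turn.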

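Support across other vendors

Prompt Cache is not exclusive to Anthropic, but each vendor implements it differently:

Vendor	Feature name	How it is triggered	Characteristics and use cases
Anthropic	Prompt Caching	Manual marking with `cache_control` (`cacheControl` in the Vercel AI SDK)	Most flexible. Up to 4 breakpoints can be set; suited to long structured prompts (such as System + Tools + Few-shot)
OpenAI	Prompt Caching	Automatic	Least effort. Prompts of 1024 tokens or more are cached automatically with no code changes, but control is coarser than Anthropic's
Google	Context Caching	Explicit creation (primary) / automatic (partial)	Suited to very long static documents. Explicit creation means managing cache IDs and TTLs; automatic prefix caching is also supported for short prompts
DeepSeek	Context Caching	Automatic (API) / needs configuration (local)	Best value for money. The API caches automatically; local deployment needs `--enable-prefix-caching` turned on in vLLM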

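Recommendations:

Want zero effort (DeepSeek/OpenAI): do nothing special, put the System Prompt first, and the API saves you money automatically
Want maximum control (Anthropic): especially when your prompt has a complex structure (such as Tools → System → Few-shot Examples → History), you can place cache breakpoints after the Tools and the Examples as well. Matching is prefix-based, so a change in one segment invalidates only that segment and whatever follows it; everything before it can still be served from cache, which is why the most stable parts should come first
Working with a whole book or codebase (Google): upload the entire document once and pay for cache storage by the hour; this can serve as an alternative to RAG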
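
Because these caches match from the first token onward, a volatile value placed early in the System Prompt (a timestamp, for instance) can defeat caching for everything after it. The sketch below shows one way to keep stable text at the front. `PromptSegment`, `assemblePrompt`, and `sharedPrefixLength` are hypothetical names, and the character-level prefix comparison is only a crude proxy for the token-level matching providers actually perform.

```typescript
// Hypothetical segment model: every name here is an assumption made for illustration.
interface PromptSegment {
  text: string
  volatile: boolean // true if the text changes between requests (time, file tree, ...)
}

// Put stable segments first so consecutive requests share the longest possible prefix.
function assemblePrompt(segments: PromptSegment[]): string {
  const stable = segments.filter((s) => !s.volatile)
  const changing = segments.filter((s) => s.volatile)
  return [...stable, ...changing].map((s) => s.text).join('\n\n')
}

// Length (in characters) of the common prefix of two prompts: a rough proxy for cache reuse.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0
  while (i < a.length && i < b.length && a[i] === b[i]) i++
  return i
}

const rules = 'You are a coding agent. Follow the project rules.'
const segmentsAt = (time: string): PromptSegment[] => [
  { text: `Current time: ${time}`, volatile: true },
  { text: rules, volatile: false },
]

const first = assemblePrompt(segmentsAt('10:00'))
const second = assemblePrompt(segmentsAt('10:05'))

// The stable rules stay at the front, so the shared prefix covers all of them.
console.log(sharedPrefixLength(first, second) >= rules.length) // true
```

With the timestamp left at the front instead, the two prompts would diverge after a handful of characters and the rules would be recomputed on every request.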

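If you need to support multiple providers, you can adapt at the ChatService layer:

// Add a cache marker depending on the provider
if (provider === 'anthropic') {
  // Anthropic requires explicit marking
  systemMessage.providerOptions = {
    anthropic: { cacheControl: { type: 'ephemeral' } },
  }
}
// OpenAI / DeepSeek cache automatically; no special handling needed
// Google needs a cache object created through its separate caching API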

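💡 A mindset update: the cache still occupies the window, but that is no longer the bottleneck

One technical detail needs clarifying: a cached prompt still occupies the Context Window.

That is because the model still has to load that part of the KV Cache into GPU memory at inference time.

So why is "worrying about the space it takes" out of date?

Window capacity has exploded: in the early GPT-4 (8k) days, a 10k System Prompt was simply not an option (it would not even fit in the window). In the era of Claude 4.5 (1M) or Gemini 3 Pro (2M), it takes up only 1% or even 0.5%. That footprint is well worth the flexibility it buys.
The focus has shifted: the bottleneck of a modern Agent architecture is no longer "can it fit" (Capacity) but "inference cost" (Cost) and "time to first token" (Latency). Prompt Cache addresses precisely these two pain points.

Conclusion: unless your System Prompt is a knowledge base of hundreds of megabytes (in which case use RAG), with modern large-window models you can put in what you need, without resorting to elaborate compression just to save a bit of window.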
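
The window arithmetic can be checked with a few lines of code. This is a minimal sketch: `WindowBudget`, `systemPromptShare`, and `fitsWindow` are hypothetical names, the window sizes are round illustrative figures, and token counts are assumed to come from the caller (nothing here tokenizes text).

```typescript
// Rough budget check. Token counts are supplied by the caller; this sketch does not tokenize.
interface WindowBudget {
  windowSize: number // model context window, in tokens
  reservedForOutput: number // tokens kept free for the model's reply
}

// Share of the window taken by the System Prompt, as a fraction between 0 and 1.
function systemPromptShare(systemTokens: number, budget: WindowBudget): number {
  return systemTokens / budget.windowSize
}

// True if the System Prompt plus history still leave the reserved space for output.
function fitsWindow(systemTokens: number, historyTokens: number, budget: WindowBudget): boolean {
  return systemTokens + historyTokens + budget.reservedForOutput <= budget.windowSize
}

const small: WindowBudget = { windowSize: 8_192, reservedForOutput: 1_024 }
const large: WindowBudget = { windowSize: 1_000_000, reservedForOutput: 8_192 }

console.log(fitsWindow(10_000, 0, small)) // false: 10k tokens exceed an 8k window on their own
console.log(systemPromptShare(10_000, large)) // 0.01, i.e. 1% of a 1M-token window
console.log(fitsWindow(10_000, 200_000, large)) // true
```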

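Benefits of this design

1. Truly stateless

The Agent instance keeps no session-related state, which means:

Scale horizontally at any time: add machines to add capacity, with no state synchronization to worry about
Any instance can handle any request: a request can be routed to any Agent instance
Failure recovery loses no state: if an instance dies, just restart it; all state lives in external storage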

graph LR
    subgraph Requests
        R1[Request 1]
        R2[Request 2]
        R3[Request 3]
    end

    subgraph AgentCluster["Agent cluster (stateless)"]
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end

    subgraph Storage
        S[(Shared storage)]
    end

    R1 --> A1
    R2 --> A3
    R3 --> A2

    A1 --> S
    A2 --> S
    A3 --> S

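This is the same idea as a Stateless Service in a microservice architecture.

2. Highly flexible

Each request can use a different systemPrompt, which enables a number of scenarios:

Mode switching: within one session, switch between "execution mode" and "planning mode"
Hot updates: change the System Prompt and it takes effect on the very next turn
A/B testing: run the same history against different System Prompts and compare the results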

sequenceDiagram
    participant U as User
    participant S as Service
    participant A as Agent

    U->>S: Help me write the code
    S->>S: systemPrompt = execution mode
    S->>A: chat(context)
    A->>U: [code]

    U->>S: Plan the architecture first
    S->>S: systemPrompt = planning mode
    S->>A: chat(context)
    A->>U: [architecture plan]

    U->>S: OK, start implementing
    S->>S: systemPrompt = execution mode
    S->>A: chat(context)
    A->>U: [code]

The history is preserved, while the System Prompt can be switched at any time.
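
A minimal sketch of per-request mode selection might look like the following. The mode names, the prompt texts, and the `buildRequest` helper are hypothetical and are not taken from Blade Code; the point is only that the stored history never changes while the first message of the outgoing request does.

```typescript
type Role = 'system' | 'user' | 'assistant'
interface Message {
  role: Role
  content: string
}

// Hypothetical mode registry; the prompt texts are placeholders.
type Mode = 'coding' | 'planning'
const MODE_PROMPTS: Record<Mode, string> = {
  coding: 'You are in execution mode: write and modify code.',
  planning: 'You are in planning mode: propose an architecture, do not write code.',
}

// Build the outgoing message list for one turn. The stored history is never mutated.
function buildRequest(mode: Mode, history: readonly Message[]): Message[] {
  const dialogue = history.filter((m) => m.role !== 'system') // defensive: drop legacy system rows
  const system: Message = { role: 'system', content: MODE_PROMPTS[mode] }
  return [system, ...dialogue]
}

const history: Message[] = [
  { role: 'user', content: 'Help me write the code' },
  { role: 'assistant', content: '[code]' },
  { role: 'user', content: 'Plan the architecture first' },
]

const planningRequest = buildRequest('planning', history)
const codingRequest = buildRequest('coding', history)

console.log(planningRequest[0].content !== codingRequest[0].content) // true: only the system message differs
console.log(history.length) // 3: the stored history still contains no system message
```

The same helper covers the A/B testing case: call it twice with different prompts against one history and compare the model's answers.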

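3. Simpler context compression

What happens when the conversation gets too long? History cannot accumulate forever, or it will exceed the model's context window, so at some point it has to be compressed.

The two designs handle compression differently:

Stateful design: System lives in history, so compression needs special handling to avoid compressing System away too.

// Stateful design: compress with great care
async compress(history: Message[]) {
  const systemMsg = history[0]; // pull System out first
  const rest = history.slice(1); // only the rest may be compressed
  const summary = await llm.summarize(rest);
  return [systemMsg, { role: 'user', content: summary }];
}

Stateless design: System was never in history, so the compression logic is simple and direct.

// Stateless design: compress directly, System is not a concern
async compress(history: Message[]) {
  const summary = await llm.summarize(history);
  return [{ role: 'user', content: summary }];
}
// System is attached only at send time, fully decoupled from compression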

graph TB
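    subgraph Stateful["Stateful design: compression needs special handling"]
        direction TB
        S1["history = [System, User1, Asst1, ...]"]
        S2["On compression: take System out first"]
        S3["Then compress the rest"]
        S4["Finally put it back"]
        S1 --> S2 --> S3 --> S4
    end

    subgraph Stateless["Stateless design: compression logic is simple"]
        direction TB
        N1["history = [User1, Asst1, ...]"]
        N2["Compress the whole history directly"]
        N3["Attach System at send time"]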
        N1 --> N2 --> N3
    end

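Doesn't look like much of a difference? But once your compression logic gets more complex (segmented compression, preserving key turns, and so on), the stateful design has to keep remembering "don't touch System", while the stateless design never has to think about it.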
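
As an illustration of a slightly more involved strategy, the sketch below keeps the most recent messages verbatim and summarizes everything older. It assumes a system-free history; `splitForCompression`, the `keepLast` parameter, and the injected `summarize` callback are hypothetical names, with the callback standing in for whatever LLM call actually produces the summary.

```typescript
interface Message {
  role: 'user' | 'assistant'
  content: string
}

// Split a system-free history into the part to summarize and the tail to keep verbatim.
function splitForCompression(history: readonly Message[], keepLast: number) {
  const cut = Math.max(0, history.length - keepLast)
  return { toSummarize: history.slice(0, cut), toKeep: history.slice(cut) }
}

// `summarize` is injected by the caller (for example, a call to an LLM); it is an assumption here.
async function compress(
  history: readonly Message[],
  keepLast: number,
  summarize: (messages: readonly Message[]) => Promise<string>,
): Promise<Message[]> {
  const { toSummarize, toKeep } = splitForCompression(history, keepLast)
  if (toSummarize.length === 0) return [...toKeep] // nothing old enough to compress
  const summary = await summarize(toSummarize)
  const summaryMessage: Message = {
    role: 'user',
    content: `Summary of the earlier conversation: ${summary}`,
  }
  return [summaryMessage, ...toKeep]
}

const history: Message[] = [
  { role: 'user', content: 'u1' },
  { role: 'assistant', content: 'a1' },
  { role: 'user', content: 'u2' },
  { role: 'assistant', content: 'a2' },
]

const parts = splitForCompression(history, 2)
console.log(parts.toSummarize.length, parts.toKeep.length) // 2 2
```

No branch of this logic needs to know that a System Prompt exists, which is the practical payoff of keeping it out of History.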

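4. Cost stays under control

With Prompt Cache, the cost does not go up. The System Prompt is sent with every request, but after a cache hit the repeated prefix is billed at the discounted cache rate rather than the full input price.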

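Summary

Pulling the System Prompt out of History is an architectural "decoupling":

Design decision	Benefit
Remove the instance-level System Prompt	Truly stateless: Agent instances can be destroyed, restarted, and scaled horizontally at any time
Inject dynamically on every request	Highly flexible: supports mode switching, A/B testing, and hot config updates
Pair with Prompt Cache	Cost under control: the full prompt is sent every time, but cost and latency stay very low
History stores dialogue only	Clean data: simpler compression and easier data migration

At its core, this pattern is Stateless Service + External State Store projected onto LLM application development. If you are refactoring your Agent, I strongly recommend this design.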

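This post focused on state management: why the System Prompt should not be stored in the database, and how to pair that with Prompt Cache. If you are interested in how a Coding Agent is implemented, take a look at the Blade Code source.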