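You can evaluate an Agent's trajectory in Coze Loop. Unlike traditional evaluation, which looks only at the output, trajectory evaluation targets the Agent's reasoning process and execution logic, so that you can verify that its decision chain is sound.
In Coze Loop, only Volcano Engine agents (火山智能体) can be used as the evaluation target for trajectory evaluation. For details, see "Registering a Volcano Engine agent".
A trajectory is the structured, time-ordered data, in JSON format, that an AI Agent generates while executing a task. It records the full history of the Agent's thinking, actions, and observations across multiple rounds of interaction, starting from the moment it receives the user's instruction.
Coze Loop defines a standard data structure for trajectories. Its purpose is to normalize the heterogeneous Trace data produced by different development frameworks (such as LangChain and Eino) into a standard three-level hierarchy: root node, Agent steps, atomic steps. For details, see "Trajectory data structure".
The example below shows the trajectory data that a trip-planning Agent generated while completing the task "plan a Shanghai trip". The key content of this trajectory is as follows: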
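Through metrics_info, an evaluator can gauge the health of the task without walking through the details: tool calls account for 40% of the steps, and most of the time is spent in LLM inference (3200 ms) rather than in tool responses.
reasoning_tokens quantifies how much thinking the model did to handle the unexpected situation.
The actual inputs and outputs of weather_tool and search_tool verify whether the Agent correctly obtained the key facts "heavy rain on Sunday" and "reservation required 3 days in advance".
{
"id": "trace_shanghai_001",
"root_step": {
"id": "span_root_001",
"name": "Travel_Planning_Session",
"input": "Plan a three-day trip to Shanghai for me, leaving this weekend.",
"output": "Your itinerary is planned: the Bund and Disneyland on Saturday (sunny), and the Shanghai Museum on Sunday (heavy rain). Note: the museum must be reserved 3 days in advance.",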
"basic_info": {
"started_at": "1715400000000",
"duration": "4500"
},
"metrics_info": {
"llm_duration": "3200",
"tool_duration": "1300",
"tool_errors": {},
"tool_error_rate": 0,
"model_errors": {},
"model_error_rate": 0,
"tool_step_proportion": 0.4,
"input_tokens": 850,
"output_tokens": 420
},
"agent_steps": [
{
"id": "span_agent_001",
"parent_id": "span_root_001",
"name": "TravelPlannerAgent",
"input": "Plan a three-day trip to Shanghai for me, leaving this weekend.",
"output": "Your itinerary is planned...",
"basic_info": {
"started_at": "1715400000100",
"duration": "4400"
},
"metrics_info": {
"llm_duration": "3200",
"tool_duration": "1300",
"tool_errors": {},
"tool_error_rate": 0,
"model_errors": {},
"model_error_rate": 0,
"tool_step_proportion": 0.4,
"input_tokens": 850,
"output_tokens": 420
},
"steps": [
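{
"id": "span_step_001",
"parent_id": "span_agent_001",
"type": "model",
"name": "Reasoning_Weather_Check",
"input": "User request: three-day trip to Shanghai, this weekend.",
"output": "Thought: outdoor activities depend on the weather, so check the weather first.\nAction: call weather_tool",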
"basic_info": {
"started_at": "1715400000100",
"duration": "400"
},
"model_info": {
"input_tokens": 100,
"output_tokens": 50,
"reasoning_tokens": 20,
"latency_first_resp": "400",
"input_read_cached_tokens": 0,
"input_creation_cached_tokens": 0
}
},
{
"id": "span_step_002",
"parent_id": "span_agent_001",
"type": "tool",
"name": "weather_tool",
"input": "{\"location\": \"Shanghai\", \"date\": \"this_weekend\"}",
"output": "{\"Saturday\": \"Sunny\", \"Sunday\": \"Heavy Rain\"}",
"basic_info": {
"started_at": "1715400000500",
"duration": "500"
}
},
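{
"id": "span_step_003",
"parent_id": "span_agent_001",
"type": "model",
"name": "Reasoning_Itinerary_Logic",
"input": "Weather result: sunny on Saturday, rain on Sunday.",
"output": "Thought: it rains on Sunday, so Disneyland is out; switch to the museum. Need to confirm the museum's reservation rules.\nAction: call search_tool",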
"basic_info": {
"started_at": "1715400001000",
"duration": "600"
},
"model_info": {
"input_tokens": 200,
"output_tokens": 60,
"reasoning_tokens": 30,
"latency_first_resp": "300",
"input_read_cached_tokens": 0,
"input_creation_cached_tokens": 0
}
},
{
"id": "span_step_004",
"parent_id": "span_agent_001",
"type": "tool",
"name": "search_tool",
"input": "{\"query\": \"Shanghai Museum reservation rules\"}",
"output": "Reservation required 3 days in advance on the official website; real-name registration.",
"basic_info": {
"started_at": "1715400001600",
"duration": "800"
}
},
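{
"id": "span_step_005",
"parent_id": "span_agent_001",
"type": "model",
"name": "Final_Response_Generation",
"input": "Museum rule: reservation required 3 days in advance.",
"output": "Your itinerary is planned: the Bund and Disneyland on Saturday (sunny), and the Shanghai Museum on Sunday (heavy rain). Note: the museum must be reserved 3 days in advance.",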
"basic_info": {
"started_at": "1715400002400",
"duration": "2100"
},
"model_info": {
"input_tokens": 350,
"output_tokens": 150,
"reasoning_tokens": 0,
"latency_first_resp": "500",
"input_read_cached_tokens": 0,
"input_creation_cached_tokens": 0
}
}
]
}
]
}
}
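As a rough illustration of how such a trajectory could be inspected offline, the sketch below walks the atomic steps and recomputes two of the step-level figures discussed above. It is not part of Coze Loop: the helper names iter_steps and summarize are hypothetical, the data is an abbreviated copy of the example, and the layout (agent_steps nested under root_step) is taken from the example as shown.

```python
# Abbreviated copy of the example trajectory: only the fields used below.
TRAJECTORY = {
    "root_step": {
        "metrics_info": {"tool_step_proportion": 0.4, "tool_duration": "1300"},
        "agent_steps": [
            {
                "name": "TravelPlannerAgent",
                "steps": [
                    {"type": "model", "name": "Reasoning_Weather_Check",
                     "basic_info": {"duration": "400"}},
                    {"type": "tool", "name": "weather_tool",
                     "basic_info": {"duration": "500"}},
                    {"type": "model", "name": "Reasoning_Itinerary_Logic",
                     "basic_info": {"duration": "600"}},
                    {"type": "tool", "name": "search_tool",
                     "basic_info": {"duration": "800"}},
                    {"type": "model", "name": "Final_Response_Generation",
                     "basic_info": {"duration": "2100"}},
                ],
            }
        ],
    }
}


def iter_steps(trajectory):
    """Yield every atomic step of every agent step, in recorded order."""
    for agent_step in trajectory["root_step"].get("agent_steps", []):
        yield from agent_step.get("steps", [])


def summarize(trajectory):
    """Hypothetical helper: recompute tool-related figures from the steps."""
    steps = list(iter_steps(trajectory))
    tool_steps = [s for s in steps if s["type"] == "tool"]
    return {
        "tool_calls": [s["name"] for s in tool_steps],
        "tool_step_proportion": len(tool_steps) / len(steps) if steps else 0.0,
        # Durations are stored as strings in milliseconds.
        "tool_duration_ms": sum(int(s["basic_info"]["duration"]) for s in tool_steps),
    }


summary = summarize(TRAJECTORY)
print(summary)
```

For this data the recomputed tool step proportion (2 of 5 steps) and tool duration agree with the pre-aggregated values in metrics_info, which is the kind of cross-check an evaluator might run before trusting the aggregates.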
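Unlike an LLM, an Agent is a system made up of multiple components, including an LLM, prompts, tool calls, and memory. Evaluating only the final result is therefore not enough for an Agent. Trajectory evaluation reaches inside the Agent system and achieves the following: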
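Trajectory data in Coze Loop is derived from Trace data.
A Trace is the call-chain record of one complete request. The trajectory data structure is a structured sequence formed by stitching together the input and output information of the agent, model, tool, and graph nodes in the Trace data. Although Trace data structures differ across Agent development frameworks, Coze Loop can automatically extract the common trajectory data structure from each of them.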
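To make the idea of stitching spans into a trajectory more concrete, here is a minimal sketch under stated assumptions. It is not Coze Loop's extraction logic. The flat span format (span_id, parent_span_id, span_name, span_type, input, output) and the function build_trajectory are hypothetical; only the mapping of id, parent_id, and name to span_id, parent_span_id, and span_name comes from the schema descriptions in this document.

```python
ATOMIC_TYPES = {"model", "tool", "graph"}


def build_trajectory(trace_id, spans):
    """Flatten a span tree into root_step -> agent_steps -> steps."""
    by_id = {span["span_id"]: span for span in spans}

    def base_fields(span):
        # id, parent_id and name are copied from span_id, parent_span_id
        # and span_name, as the schema descriptions state.
        fields = {
            "id": span["span_id"],
            "name": span["span_name"],
            "input": span.get("input", ""),
            "output": span.get("output", ""),
        }
        if span.get("parent_span_id"):
            fields["parent_id"] = span["parent_span_id"]
        return fields

    def nearest_agent(span):
        # Walk up the parent chain until an agent span is found.
        parent_id = span.get("parent_span_id")
        while parent_id in by_id:
            parent = by_id[parent_id]
            if parent["span_type"] == "agent":
                return parent["span_id"]
            parent_id = parent.get("parent_span_id")
        return None

    root_span = next(s for s in spans if not s.get("parent_span_id"))
    agent_steps = {}  # agent span_id -> agent step, in report order
    for span in spans:
        if span["span_type"] == "agent":
            agent_steps[span["span_id"]] = {**base_fields(span), "steps": []}
    for span in spans:
        if span["span_type"] in ATOMIC_TYPES:
            owner = nearest_agent(span)
            if owner is not None:
                step = {**base_fields(span), "type": span["span_type"]}
                agent_steps[owner]["steps"].append(step)

    # Assumption: agent_steps is nested under root_step, as in the example.
    root_step = {**base_fields(root_span), "agent_steps": list(agent_steps.values())}
    return {"id": trace_id, "root_step": root_step}


# Hypothetical spans; the "chain" span stands for framework-specific wrapping.
SPANS = [
    {"span_id": "r1", "parent_span_id": None, "span_name": "Session", "span_type": "root"},
    {"span_id": "a1", "parent_span_id": "r1", "span_name": "TravelPlannerAgent", "span_type": "agent"},
    {"span_id": "m1", "parent_span_id": "a1", "span_name": "Reasoning_Weather_Check", "span_type": "model"},
    {"span_id": "c1", "parent_span_id": "a1", "span_name": "ToolWrapper", "span_type": "chain"},
    {"span_id": "t1", "parent_span_id": "c1", "span_name": "weather_tool", "span_type": "tool"},
]
trajectory = build_trajectory("trace_demo", SPANS)
print([s["type"] for s in trajectory["root_step"]["agent_steps"][0]["steps"]])
```

The wrapper span of an unrecognized type is dropped, and the tool call beneath it is attached to the nearest agent ancestor, which is one plausible reading of how a nested call chain becomes a flat, ordered list of steps.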
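Note
If you report Eino, LangGraph, or LangChain Trace data through the following SDKs, make sure you use the version listed below or later, so that Coze Loop can extract the trajectory successfully:
go get github.com/cloudwego/eino-ext/callbacks/cozeloop@v0.1.7
pip install cozeloop==0.1.20
npm i @cozeloop/langchain@0.0.3
Each kind of node corresponds to a different type of member in the trajectory data:
agent node: the agent_steps array. It usually acts as the parent node and is responsible for maintaining session state, managing context, and executing decision logic.
model node: members of the steps array whose type is model. Each records one complete request and response, and is responsible for turning a structured prompt into unstructured text or structured function-call instructions.
tool node: members of the steps array whose type is tool. Each represents a data exchange between the Agent and the external environment (such as Web Search, a database, or an API).
graph node: members of the steps array whose type is graph. Each represents the entry point of a LangGraph or Eino Graph.
After an Agent's Trace data is reported to Coze Loop, Coze Loop can automatically extract the trajectory data. For example: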
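In an experiment, the actual trajectory data of the evaluation target can also be sent to the evaluator.
You can evaluate trajectory data in the following ways:
In Coze Loop, there are two main workflows for evaluating an Agent's trajectory:
Note
You can use either workflow, or combine the two. For example, you can take trajectory data flowed back from Traces as the reference and use it to evaluate the trajectory data that the Agent produces in real time. See "Evaluating a trip-planning Agent's trajectory through data backflow".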
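One way to picture using a reference trajectory to judge a live one is to compare the order of tool calls. The sketch below is only an illustration of that idea and does not reflect Coze Loop's evaluator interface. The functions tool_sequence, lcs_length, and tool_order_score are hypothetical, and the scoring rule (longest common subsequence of tool names divided by the number of reference tool calls) is an assumption chosen for simplicity.

```python
def tool_sequence(trajectory):
    """Names of the tool steps, in the order they were recorded."""
    names = []
    for agent_step in trajectory["root_step"].get("agent_steps", []):
        for step in agent_step.get("steps", []):
            if step.get("type") == "tool":
                names.append(step["name"])
    return names


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_order_score(reference, actual):
    """Share of reference tool calls that the actual run reproduces in order."""
    ref_tools = tool_sequence(reference)
    if not ref_tools:
        return 1.0
    return lcs_length(ref_tools, tool_sequence(actual)) / len(ref_tools)


def make_trajectory(step_specs):
    """Build a minimal trajectory from (type, name) pairs, for the demo only."""
    steps = [{"type": t, "name": n} for t, n in step_specs]
    return {"root_step": {"agent_steps": [{"steps": steps}]}}


reference = make_trajectory([("model", "plan"), ("tool", "weather_tool"),
                             ("model", "replan"), ("tool", "search_tool")])
same_order = make_trajectory([("tool", "weather_tool"), ("tool", "search_tool")])
skipped_weather = make_trajectory([("tool", "search_tool")])
swapped = make_trajectory([("tool", "search_tool"), ("tool", "weather_tool")])

print(tool_order_score(reference, same_order))
print(tool_order_score(reference, skipped_weather))
print(tool_order_score(reference, swapped))
```

A score below 1.0 signals that the Agent skipped or reordered tool calls relative to the reference, which is information a final-answer comparison alone would not reveal.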
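Caution
The evaluation set does not need to contain trajectory data, but its scenario must be "trajectory evaluation set" (轨迹评测集). For details, see "Managing evaluation sets".
Although trajectory data is derived from Traces, the two sit at different levels of abstraction:
The trajectory data structure consists mainly of the following core components: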
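root_step: the global overview. It records the input, output, and core metrics of the whole task, and contains the pre-aggregated metrics_info (such as token consumption, tool error rate, and first-token latency), so task quality can be assessed quickly without walking through the details.
agent_steps: the interaction chain. It records the Agent's logical execution flow, flattening a complex call chain into an ordered list of steps that clearly reconstructs the Agent's decision path.
steps: the atomic steps, which are standardized execution units. Whatever the underlying implementation, every operation is classified as an agent, model, tool, or graph node, and framework-specific engineering noise is stripped away, leaving only the semantic information needed for evaluation (such as reasoning_tokens and tool_errors).
{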
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"title": "Trajectory",
"description": "Trajectory structure for coze loop tracking",
"properties": {
"id": {
"type": "string",
"description": "trace_id"
},
"root_step": {
"type": "object",
"description": "Root node; records information about the whole trajectory",
"properties": {
"id": {
"type": "string",
"description": "Unique ID; set from span_id when imported from a trace"
},
"name": {
"type": "string",
"description": "Name; set from span_name when imported from a trace"
},
"input": {
"type": "string",
"description": "Input"
},
"output": {
"type": "string",
"description": "Output"
},
"metadata": {
"type": "object",
"description": "Reserved field; can carry business-defined custom attributes",
"additionalProperties": {
"type": "string"
}
},
"basic_info": {
"type": "object",
"properties": {
"started_at": {
"type": "string",
"description": "Unit: milliseconds"
},
"duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"error": {
"type": "object",
"properties": {
"code": {
"type": "integer"
},
"msg": {
"type": "string"
}
}
}
}
},
"metrics_info": {
"type": "object",
"properties": {
"llm_duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"tool_duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"tool_errors": {
"type": "object",
"description": "Tool error distribution, in the format: error code --> list<ToolStepID>",
"additionalProperties": {
"type": "array",
"items": {
"type": "string"
}
}
},
"tool_error_rate": {
"type": "number",
"description": "Tool error rate"
},
"model_errors": {
"type": "object",
"description": "Model error distribution, in the format: error code --> list<ModelStepID>",
"additionalProperties": {
"type": "array",
"items": {
"type": "string"
}
}
},
"model_error_rate": {
"type": "number",
"description": "Model error rate"
},
"tool_step_proportion": {
"type": "number",
"description": "Proportion of tool steps (the denominator is the total number of child steps)"
},
"input_tokens": {
"type": "integer",
"description": "Number of input tokens"
},
"output_tokens": {
"type": "integer",
"description": "Number of output tokens"
}
}
}
}
},
"agent_steps": {
"type": "array",
"description": "List of agent steps; records agent execution information in the trajectory",
"items": {
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Unique ID; set from span_id when imported from a trace"
},
"parent_id": {
"type": "string",
"description": "Parent ID; set from parent_span_id when imported from a trace"
},
"name": {
"type": "string",
"description": "Name; set from span_name when imported from a trace"
},
"input": {
"type": "string",
"description": "Input"
},
"output": {
"type": "string",
"description": "Output"
},
"steps": {
"type": "array",
"description": "Child nodes; the steps the agent went through internally during execution",
"items": {
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Unique ID; set from span_id when imported from a trace"
},
"parent_id": {
"type": "string",
"description": "Parent ID; set from parent_span_id when imported from a trace"
},
"type": {
"type": "string",
"description": "Type"
},
"name": {
"type": "string",
"description": "Name; set from span_name when imported from a trace"
},
"input": {
"type": "string",
"description": "Input"
},
"output": {
"type": "string",
"description": "Output"
},
"model_info": {
"type": "object",
"description": "Populated when type=model",
"properties": {
"input_tokens": {
"type": "integer"
},
"output_tokens": {
"type": "integer"
},
"latency_first_resp": {
"type": "string",
"description": "Time to first response packet, in milliseconds"
},
"reasoning_tokens": {
"type": "integer"
},
"input_read_cached_tokens": {
"type": "integer"
},
"input_creation_cached_tokens": {
"type": "integer"
}
}
},
"metadata": {
"type": "object",
"description": "Reserved field; can carry business-defined custom attributes",
"additionalProperties": {
"type": "string"
}
},
"basic_info": {
"type": "object",
"properties": {
"started_at": {
"type": "string",
"description": "Unit: milliseconds"
},
"duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"error": {
"type": "object",
"properties": {
"code": {
"type": "integer"
},
"msg": {
"type": "string"
}
}
}
}
}
}
}
},
"metadata": {
"type": "object",
"description": "Reserved field; can carry business-defined custom attributes",
"additionalProperties": {
"type": "string"
}
},
"basic_info": {
"type": "object",
"properties": {
"started_at": {
"type": "string",
"description": "Unit: milliseconds"
},
"duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"error": {
"type": "object",
"properties": {
"code": {
"type": "integer"
},
"msg": {
"type": "string"
}
}
}
}
},
"metrics_info": {
"type": "object",
"properties": {
"llm_duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"tool_duration": {
"type": "string",
"description": "Unit: milliseconds"
},
"tool_errors": {
"type": "object",
"description": "Tool error distribution, in the format: error code --> list<ToolStepID>",
"additionalProperties": {
"type": "array",
"items": {
"type": "string"
}
}
},
"tool_error_rate": {
"type": "number",
"description": "Tool error rate"
},
"model_errors": {
"type": "object",
"description": "Model error distribution, in the format: error code --> list<ModelStepID>",
"additionalProperties": {
"type": "array",
"items": {
"type": "string"
}
}
},
"model_error_rate": {
"type": "number",
"description": "Model error rate"
},
"tool_step_proportion": {
"type": "number",
"description": "Proportion of tool steps (the denominator is the total number of child steps)"
},
"input_tokens": {
"type": "integer",
"description": "Number of input tokens"
},
"output_tokens": {
"type": "integer",
"description": "Number of output tokens"
}
}
}
}
}
}
}
}
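Because the schema stores started_at and duration as strings in milliseconds and fills model_info only for steps of type model, a consumer may want a light sanity check before evaluating a trajectory. The following is a minimal sketch, not a Coze Loop API. The function names are hypothetical, and the rule that every atomic step must fall inside its agent step's time window is an assumption that happens to hold for the example in this document. The schema lists agent_steps at the top level while the example nests it under root_step, so the sketch accepts either layout.

```python
import copy


def agent_steps_of(trajectory):
    """Return agent_steps from the top level or from under root_step."""
    return (trajectory.get("agent_steps")
            or trajectory.get("root_step", {}).get("agent_steps", []))


def window(node):
    """Return (start_ms, end_ms); both source fields are strings in ms."""
    info = node["basic_info"]
    start = int(info["started_at"])
    return start, start + int(info["duration"])


def check_trajectory(trajectory):
    """Hypothetical sanity check; returns a list of problem descriptions."""
    problems = []
    for agent_step in agent_steps_of(trajectory):
        agent_start, agent_end = window(agent_step)
        for step in agent_step.get("steps", []):
            start, end = window(step)
            # Assumed rule: a step runs within its agent step's window.
            if start < agent_start or end > agent_end:
                problems.append(f"{step['id']}: outside the time window of {agent_step['id']}")
            # The schema says model_info is populated when type=model.
            if "model_info" in step and step.get("type") != "model":
                problems.append(f"{step['id']}: model_info on a step of type {step.get('type')}")
    return problems


# Abbreviated data modeled on the example trajectory.
GOOD = {
    "root_step": {
        "agent_steps": [{
            "id": "span_agent_001",
            "basic_info": {"started_at": "1715400000100", "duration": "4400"},
            "steps": [
                {"id": "span_step_001", "type": "model",
                 "basic_info": {"started_at": "1715400000100", "duration": "400"},
                 "model_info": {"input_tokens": 100, "output_tokens": 50}},
                {"id": "span_step_002", "type": "tool",
                 "basic_info": {"started_at": "1715400000500", "duration": "500"}},
            ],
        }]
    }
}

BAD = copy.deepcopy(GOOD)
bad_tool = BAD["root_step"]["agent_steps"][0]["steps"][1]
bad_tool["basic_info"]["duration"] = "9000"      # runs past the agent step
bad_tool["model_info"] = {"input_tokens": 1}     # not a model step

print(check_trajectory(GOOD))
print(check_trajectory(BAD))
```

Returning a list of messages rather than raising on the first issue lets a caller log every inconsistency in one pass.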