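This article describes how to let users send voice messages to an agent over HTTP requests. The user sends a voice message; after the agent receives the audio file, the large language model generates a spoken reply. The reply is returned as streamed audio segments, which the application can either play in real time or merge and play once the agent has finished replying.
Note
Sending voice messages through the create-chat API is no longer being developed. WebSocket voice calls are recommended instead, because they offer better performance and lower latency. For details, see "Implementing audio calls with the WebSocket OpenAPI".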
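When calling the create-chat API, specify the audio file in the request, either by the file_id of an uploaded file or by a file_url, and select streaming response. The required input parameters are described in the table below.
| Request parameter | Description |
|---|---|
| bot_id | Agent ID, taken from the URL of the agent development page. |
| user_id | ID of the user talking to the agent. Conversation memory such as context messages and database data is isolated per user_id. If user data isolation is not needed, this parameter can be fixed to any string. |
| stream | Whether to use streaming response. Voice messages support only streaming responses, so stream must be set to true. |
| additional_messages | The messages of the conversation, including its history messages and the user's question in this round. |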
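additional_messages is a JSON array in which each JSON object is one independent message, structured as follows:
| Field | Description |
|---|---|
| role | The entity that sent the message. For a voice message sent by the user, set it to user. |
| type | Message type. A default applies when this field is omitted. |
| content | Message content. In the voice scenario, a JSON array serialized as a string, whose audio object references the file by file_id or file_url. |
| content_type | Type of the message content. In the voice scenario, set it to object_string. |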
Example additional_messages for a voice message, referencing the file by file_id:
[
    {
        "role": "user",
        "content": "[{\"type\":\"audio\",\"file_id\":\"736949598110202****\"}]",
        "content_type": "object_string"
    }
]
Example additional_messages for a voice message, referencing the file by file_url:
[
    {
        "role": "user",
        "content": "[{\"type\":\"audio\",\"file_url\":\"https://example.com/image_search/src=http%3A%2F%2Fci.xiaohongshu.com%2Fe7368218-****-bda3-56ad-5672b2a113b2%3FimageView2%2F2%2Fw%2F1080%2Fformat%2Fjpg&refer=http%3A%2F%2Fci.xiaohongshu.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1720005307&t=1acd734e6e8937****625bcdb0dc57\"}]",
        "content_type": "object_string"
    }
]
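Because content is itself a JSON array serialized into a string, hand-writing the escaped quotes is error-prone. The following Go sketch lets encoding/json do the escaping. The type and function names (audioItem, message, newAudioMessage) are illustrative and not part of the API; only the JSON field names come from the examples above.

```go
package main

import "encoding/json"

// audioItem mirrors one element of the content array shown above.
// Set exactly one of FileID or FileURL.
type audioItem struct {
	Type    string `json:"type"`
	FileID  string `json:"file_id,omitempty"`
	FileURL string `json:"file_url,omitempty"`
}

// message mirrors one element of additional_messages.
type message struct {
	Role        string `json:"role"`
	Content     string `json:"content"`
	ContentType string `json:"content_type"`
}

// newAudioMessage is a hypothetical helper: it serializes the audio item
// into the string-valued content field of a user message.
func newAudioMessage(item audioItem) (message, error) {
	item.Type = "audio"
	raw, err := json.Marshal([]audioItem{item})
	if err != nil {
		return message{}, err
	}
	return message{
		Role:        "user",
		Content:     string(raw),
		ContentType: "object_string",
	}, nil
}
```

Marshalling the returned message as part of the request body then produces the doubly escaped form seen in the examples. Note that encoding/json writes & as \u0026 by default, which decodes to the same URL.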
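In the voice scenario, the create-chat API returns a streaming response. Each conversation.audio.delta event carries one audio segment of the model's reply: its data field is a Message Object whose content_type is audio and whose content is the Base64-encoded audio segment.
The format of content depends on the format of the input audio file:
- PCM: content is the Base64 encoding of a PCM audio segment, sampled at 24 kHz, 16-bit, mono, little-endian.
- OPUS: content is the Base64 encoding of an OPUS audio segment, at 48 kbps, mono, with a frame length of 10 ms. To customize the OPUS encoding, see the description of custom OPUS encoding later in this article.
Request example:
curl --location --request POST 'https://api.coze.cn/v3/chat' \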
  --header 'Authorization: Bearer pat_OYDacMzM3WyOWV3Dtj2bHRMymzxP****' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "bot_id": "734829333445931****",
    "user_id": "123456789",
    "stream": true,
    "auto_save_history": true,
    "additional_messages": [
        {
            "role": "user",
            "content": "[{\"type\": \"audio\", \"file_id\": \"734829333445931****\"}]",
            "content_type": "object_string"
        }
    ]
}'
Response example:
event:conversation.chat.created
// In chat events, the id in the data field is the Chat ID, that is, the ID of this chat.
data:{"id":"7382159*****97202","conversation_id":"73814735*****78089","bot_id":"73794621****98898","completed_at":1718792949,"last_error":{"code":0,"msg":""},"status":"created","usage":{"token_count":0,"output_count":0,"input_count":0}}
event:conversation.chat.in_progress
data:{"id":"7382159*****97202","conversation_id":"73814735*****78089","bot_id":"73794621****98898","completed_at":1718792949,"last_error":{"code":0,"msg":""},"status":"in_progress","usage":{"token_count":0,"output_count":0,"input_count":0}}
event:conversation.message.delta
data:{"id":"7382159494123470858","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","content":"2","content_type":"text","chat_id":"7382159*****97202"}
event:conversation.message.delta
data:{"id":"7382159494123470858","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","content":"0","content_type":"text","chat_id":"7382159*****97202"}
// Voice message
event:conversation.audio.delta
data:{"id":"7382159494123470858", "content_type":"audio", "content":"DQATABQAEgAUABEADgARAA8ADgAMAAsACQAGAAQA/v/6//z/+v***", "conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","chat_id":"7382159*****97202"}
// Some intermediate conversation.message.delta and conversation.audio.delta events of the model's reply are omitted here.
......
event:conversation.message.delta
data:{"id":"7382159494123470858","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","content":"星期三","content_type":"text","chat_id":"7382159*****97202"}
event:conversation.message.delta
data:{"id":"7382159494123470858","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","content":"。","content_type":"text","chat_id":"7382159*****97202"}
event:conversation.message.completed
data:{"id":"7382159494123470858","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"answer","content":"2024 年 10 月 1 日是星期三。","content_type":"text","chat_id":"7382159*****97202"}
event:conversation.message.completed
data:{"id":"7382159494123552778","conversation_id":"73814735*****78089","bot_id":"73794621****98898","role":"assistant","type":"verbose","content":"{\"msg_type\":\"generate_answer_finish\",\"data\":\"\",\"from_module\":null,\"from_unit\":null}","content_type":"text","chat_id":"7382159*****97202"}
event:conversation.chat.completed
data:{"id":"7382159*****97202","conversation_id":"73814735*****78089","bot_id":"73794621****98898","completed_at":1718792949,"last_error":{"code":0,"msg":""},"status":"completed","usage":{"token_count":633,"output_count":19,"input_count":614}}
event:done
data:"[DONE]"
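If the input audio file is in OGG_OPUS format, developers can customize the OPUS encoding of the returned audio. In the create-chat API, set audio_message_config inside the extra_params parameter to configure the encoding of the returned OPUS audio segments. Its value is a JSON string with the following structure:
{
    "opus_codec": {
        "bitrate": 16000,     // Bitrate in bps; defaults to 48000 when omitted. OPUS supports bitrates from 6 kb/s to 510 kb/s.
        "use_cbr": false,     // Whether to use CBR (constant bitrate) encoding; defaults to false.
        "frame_size_ms": 10   // OPUS frame size; defaults to 10 ms. OPUS supports frame sizes from 2.5 ms to 60 ms.
    }
}
Request example: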
curl --location --request POST 'https://api.coze.cn/v3/chat' \
  --header 'Authorization: Bearer pat_OYDacMzM3WyOWV3Dtj2bHRMymzxP****' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "bot_id": "734829333445931****",
    "user_id": "123456789",
    "stream": true,
    "auto_save_history": true,
    "additional_messages": [
        {
            "role": "user",
            "content": "[{\"type\": \"audio\", \"file_id\": \"734829333445931****\"}]",
            "content_type": "object_string"
        }
    ],
    "extra_params": {
        "audio_message_config": "{\"opus_codec\": {\"bitrate\": 16000, \"use_cbr\": false, \"frame_size_ms\": 60}}"
    }
}'
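Since audio_message_config is a JSON string nested inside the request JSON, it can be convenient to build it from a struct and check the documented ranges first. This is a minimal Go sketch. The names opusCodec, audioMessageConfig and buildExtraParams are illustrative, and treating bitrate as bits per second is an assumption based on the 48000 default matching 48 kbps.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// opusCodec mirrors the opus_codec object described above.
type opusCodec struct {
	Bitrate     int     `json:"bitrate"`
	UseCBR      bool    `json:"use_cbr"`
	FrameSizeMs float64 `json:"frame_size_ms"`
}

// audioMessageConfig mirrors the value of audio_message_config.
type audioMessageConfig struct {
	OpusCodec opusCodec `json:"opus_codec"`
}

// buildExtraParams is a hypothetical helper: it validates the documented
// ranges and returns extra_params with the config serialized as a string.
func buildExtraParams(c opusCodec) (map[string]string, error) {
	if c.Bitrate < 6000 || c.Bitrate > 510000 {
		return nil, fmt.Errorf("bitrate %d outside 6000..510000 bps", c.Bitrate)
	}
	if c.FrameSizeMs < 2.5 || c.FrameSizeMs > 60 {
		return nil, fmt.Errorf("frame size %v ms outside 2.5..60 ms", c.FrameSizeMs)
	}
	raw, err := json.Marshal(audioMessageConfig{OpusCodec: c})
	if err != nil {
		return nil, err
	}
	return map[string]string{"audio_message_config": string(raw)}, nil
}
```

The returned map can be assigned to the extra_params field of the request body before marshalling it, which yields the escaped string form used in the request example.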
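After obtaining the audio segments from the create-chat API response, developers can play them in real time, or wait until the streaming response ends, merge the segments into an audio file, and play it to the user in one go.
The following Go sample code processes the audio segments, showing how to wrap PCM in a WAV container and OPUS in an OGG container.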
// Example 1: wrap PCM segments in a WAV file.
import (
    "encoding/base64"
    "encoding/binary"
    "encoding/json"
    "log"
    "os"

    "github.com/go-audio/audio"
    "github.com/go-audio/wav"
)

// eventStream is a placeholder for whatever reads events from the streaming
// HTTP response. Recv returns the next event name and its data payload.
type eventStream interface {
    Recv() (event string, data string)
}

func callCoze(resp eventStream) {
    pcmData := make([]byte, 0)
    // Keep reading events and data from the streaming HTTP response.
    for {
        event, data := resp.Recv()
        // For every conversation.audio.delta event, extract the audio segment from data.
        if event == "conversation.audio.delta" {
            audioData := make(map[string]interface{})
            if err := json.Unmarshal([]byte(data), &audioData); err != nil {
                log.Fatalf("Error Unmarshal: %v", err)
            }
            if base64AudioStr, ok := audioData["content"].(string); ok {
                pcmPart, err := base64.StdEncoding.DecodeString(base64AudioStr)
                if err != nil {
                    log.Fatalf("Error DecodeString: %v", err)
                }
                pcmData = append(pcmData, pcmPart...)
            }
        }
        // The done event ends the stream, so leave the loop.
        if event == "done" {
            break
        }
    }
    writeWav(pcmData)
}

func writeWav(pcmData []byte) {
    sampleRate := 24000 // sample rate in Hz
    bitDepth := 16      // bits per sample
    numChannels := 1    // mono
    f, err := os.Create("pcm-example.wav")
    if err != nil {
        log.Fatalf("Error Create: %v", err)
    }
    defer f.Close()
    // Convert little-endian byte pairs into signed 16-bit samples.
    intData := make([]int, 0, len(pcmData)/2)
    for i := 0; i+1 < len(pcmData); i += 2 {
        intData = append(intData, int(int16(binary.LittleEndian.Uint16(pcmData[i:]))))
    }
    intBuffer := &audio.IntBuffer{
        Format: &audio.Format{
            NumChannels: numChannels,
            SampleRate:  sampleRate,
        },
        Data:           intData,
        SourceBitDepth: bitDepth,
    }
    // The last argument is the WAV audio format; 1 means PCM.
    e := wav.NewEncoder(f, sampleRate, bitDepth, numChannels, 1)
    if err = e.Write(intBuffer); err != nil {
        log.Fatalf("Error Write: %v", err)
    }
    // Close finalizes the WAV header, so it must succeed before the file is closed.
    if err = e.Close(); err != nil {
        log.Fatalf("Error Close: %v", err)
    }
}
// Example 2: wrap OPUS segments in an OGG file.
import (
    "encoding/base64"
    "encoding/json"
    "log"

    "github.com/pion/rtp"
    "github.com/pion/webrtc/v3/pkg/media/oggwriter"
)

// eventStream is a placeholder for whatever reads events from the streaming
// HTTP response. Recv returns the next event name and its data payload.
type eventStream interface {
    Recv() (event string, data string)
}

func callCoze(resp eventStream) {
    opusData := make([][]byte, 0)
    // Keep reading events and data from the streaming HTTP response.
    for {
        event, data := resp.Recv()
        // For every conversation.audio.delta event, extract the audio segment from data.
        if event == "conversation.audio.delta" {
            audioData := make(map[string]interface{})
            if err := json.Unmarshal([]byte(data), &audioData); err != nil {
                log.Fatalf("Error Unmarshal: %v", err)
            }
            if base64AudioStr, ok := audioData["content"].(string); ok {
                opusPart, err := base64.StdEncoding.DecodeString(base64AudioStr)
                if err != nil {
                    log.Fatalf("Error DecodeString: %v", err)
                }
                // Keep segments separate: each one is written as its own packet.
                opusData = append(opusData, opusPart)
            }
        }
        // The done event ends the stream, so leave the loop.
        if event == "done" {
            break
        }
    }
    writeOgg(opusData)
}

func writeOgg(opusData [][]byte) {
    // Write the OPUS segments to opus-example.ogg at a 48000 Hz sample rate, mono.
    writer, err := oggwriter.New("./opus-example.ogg", 48000, 1)
    if err != nil {
        log.Fatalf("Error creating OggWriter: %v", err)
    }
    defer writer.Close()
    for idx, data := range opusData {
        // With the default 10 ms frame, each segment covers 480 samples at 48 kHz,
        // so the timestamp advances by 480 per segment.
        err = writer.WriteRTP(&rtp.Packet{
            Payload: data,
            Header: rtp.Header{
                Timestamp: 480 * uint32(idx+1),
            },
        })
        if err != nil {
            log.Fatalf("Error writing to OggWriter: %v", err)
        }
    }
}
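The constant 480 in the OGG example corresponds to the default 10 ms frame at a 48 kHz clock (48000 × 0.010 = 480 samples). If frame_size_ms is customized through audio_message_config, the timestamp step presumably has to change with it. The Go sketch below derives the step from the frame size; samplesPerFrame and frameTimestamp are illustrative names, and the assumption is that each audio segment holds exactly one OPUS frame of the configured size.

```go
package main

// oggSampleRate is the clock rate used for the OGG timestamps in the example above.
const oggSampleRate = 48000

// samplesPerFrame returns how many 48 kHz samples one OPUS frame of the
// given duration in milliseconds covers.
func samplesPerFrame(frameSizeMs float64) uint32 {
	return uint32(frameSizeMs * oggSampleRate / 1000)
}

// frameTimestamp returns the timestamp for the idx-th frame (0-based),
// following the same idx+1 convention as the OGG example.
func frameTimestamp(idx int, frameSizeMs float64) uint32 {
	return samplesPerFrame(frameSizeMs) * uint32(idx+1)
}
```

With a frame size of 60 ms, as in the custom request example, the step becomes 2880 samples, so the Timestamp expression would read frameTimestamp(idx, 60) instead of 480 * uint32(idx+1).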