Voice Session
Status: Running today. Desktop's agent chat voice session, voice executor, and voice workflow contracts are shipped at the kernel level (`agent-chat-voice-*-contract.md`).
The Desktop voice session is the surface where a user holds a voice conversation with their agent: speech in, agent reply out, captions synchronized, lifecycle states explicit. The contracts split deliberately into session, executor, and workflow.
Three Contracts
| Contract | Owns |
|---|---|
| Voice session | Higher-level voice session lifecycle as it appears in chat |
| Voice executor | Per-turn voice execution mechanics |
| Voice workflow | Cross-turn workflow + identity binding |
The split keeps three questions apart: "did the user start a voice conversation," "how is one turn executing," and "how does the agent's voice identity bind across turns."
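One way to picture the split is as three separate records, each answering only its own question. The sketch below is illustrative only: every class and field name in it (`VoiceSession`, `VoiceTurnExecution`, `VoiceWorkflow`, `SessionState`, `voice_identity_id`, and so on) is hypothetical and is not taken from the contracts themselves.

```python
from dataclasses import dataclass, field
from enum import Enum


class SessionState(Enum):
    # Hypothetical lifecycle states; the real contract defines its own.
    IDLE = "idle"
    ACTIVE = "active"
    ENDED = "ended"


@dataclass
class VoiceSession:
    """Session level: did the user start a voice conversation?"""
    session_id: str
    state: SessionState = SessionState.IDLE


@dataclass
class VoiceTurnExecution:
    """Executor level: how is one turn executing?"""
    turn_id: str
    transcript: str = ""
    reply_text: str = ""


@dataclass
class VoiceWorkflow:
    """Workflow level: which voice identity is bound across turns?"""
    session_id: str
    voice_identity_id: str
    turns: list[VoiceTurnExecution] = field(default_factory=list)

    def add_turn(self, turn: VoiceTurnExecution) -> None:
        # Every turn added here shares the workflow's bound identity.
        self.turns.append(turn)


# Example: one active session, one bound identity, two turns.
session = VoiceSession(session_id="s1", state=SessionState.ACTIVE)
workflow = VoiceWorkflow(session_id=session.session_id, voice_identity_id="voice-a")
workflow.add_turn(VoiceTurnExecution(turn_id="t1", transcript="hello"))
workflow.add_turn(VoiceTurnExecution(turn_id="t2", transcript="and again"))
```

In this sketch the session record never needs to know how a turn ran, and a turn record never carries the voice identity. That mirrors the separation of concerns the three contracts aim for.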
Boundary
| Owns | Does NOT own |
|---|---|
| Desktop chat voice surface lifecycle + UI | Voice creation (K-VOICE-* runtime — see Voice Asset Lifecycle) |
| Per-turn voice executor in chat | TTS / STT provider semantics (Runtime) |
| Workflow + identity binding in chat | Avatar lipsync (Avatar) |
The Desktop voice surface consumes runtime voice and projects it through the captioned chat UI. It does not invent voice cloning or asset storage.
Reader Scenario: User Voice Turn
User taps voice in chat and speaks.
- Voice session begins. Desktop tracks lifecycle.
- STT executes. Per voice executor contract; transcribes user speech.
- Turn submits. Per `RuntimeAgentService` turn lifecycle.
- Agent reply streams. TTS executes per executor contract.
- Captions sync. Desktop chat surface keeps captions aligned to audio.
- Avatar lipsync. If Avatar is also open, runtime presentation stream + Avatar audio pipeline drive `ParamMouthOpenY`.
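The steps above can be sketched as a small pipeline with explicit phases. This is a simplified illustration under stated assumptions, not the shipped executor. The phase names, `run_voice_turn`, and the three injected callables are all hypothetical. Streaming is collapsed into a single call, and Avatar lipsync is left out because Avatar owns it.

```python
from enum import Enum, auto
from typing import Callable


class TurnPhase(Enum):
    # Hypothetical phase names mirroring the scenario steps.
    SESSION_STARTED = auto()
    TRANSCRIBING = auto()
    TURN_SUBMITTED = auto()
    REPLY_STREAMING = auto()
    DONE = auto()


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    submit_turn: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[list[TurnPhase], str, bytes]:
    """Run one turn; return the phases visited, the caption text, and the audio."""
    phases = [TurnPhase.SESSION_STARTED]

    phases.append(TurnPhase.TRANSCRIBING)
    transcript = transcribe(audio)        # STT is supplied by the runtime

    phases.append(TurnPhase.TURN_SUBMITTED)
    reply_text = submit_turn(transcript)  # goes through the agent turn lifecycle

    phases.append(TurnPhase.REPLY_STREAMING)
    reply_audio = synthesize(reply_text)  # TTS is supplied by the runtime

    phases.append(TurnPhase.DONE)
    # The caption is the same text that was synthesized, which is what keeps
    # caption and audio aligned in this simplified sketch.
    return phases, reply_text, reply_audio


# Stand-in callables so the sketch runs without any real provider.
phases, caption, audio_out = run_voice_turn(
    b"fake-pcm",
    transcribe=lambda a: "what is the weather",
    submit_turn=lambda text: f"You asked: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Passing STT, turn submission, and TTS in as callables reflects the boundary described earlier: the chat surface sequences the turn and tracks its lifecycle, while provider behavior stays outside it.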
What Voice Session Does Not Do
- It does not own voice creation (`K-VOICE-*` runtime).
- It does not redefine TTS / STT provider semantics.
- It does not bypass `RuntimeAgentService` turn lifecycle.
- It does not own Avatar lipsync.