Voice Session

Status: Running today. Desktop's agent chat voice session, voice executor, and voice workflow contracts are shipped at the kernel level (agent-chat-voice-*-contract.md).

The Desktop voice session is the surface where a user holds a voice conversation with their agent: speech in, agent reply out, captions synchronized, lifecycle states explicit. The contracts split deliberately into session, executor, and workflow.

Three Contracts

| Contract | Owns |
| --- | --- |
| Voice session | Higher-level voice session lifecycle as it appears in chat |
| Voice executor | Per-turn voice execution mechanics |
| Voice workflow | Cross-turn workflow + identity binding |

The split keeps "did the user start a voice conversation" separate from "how is one turn executing," and keeps both separate from "how does the agent's voice identity bind across turns."

Boundary

| Owns | Does NOT own |
| --- | --- |
| Desktop chat voice surface lifecycle + UI | Voice creation (K-VOICE-* runtime — see Voice Asset Lifecycle) |
| Per-turn voice executor in chat | TTS / STT provider semantics (Runtime) |
| Workflow + identity binding in chat | Avatar lipsync (Avatar) |

The Desktop voice surface consumes runtime voice + projects through the captioned chat UI. It does not invent voice cloning or asset storage.
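One way to express that consume-only boundary in code is to give the Desktop side a handle type with no creation path. `RuntimeVoiceHandle` and `CaptionedProjection` are hypothetical names for illustration, not real platform types:

```typescript
// A voice reference minted by the K-VOICE-* runtime; Desktop never constructs
// the underlying asset, it only receives the handle.
interface RuntimeVoiceHandle {
  readonly voiceId: string;
}

// What the Desktop chat surface produces: the runtime voice projected
// alongside its synchronized captions.
interface CaptionedProjection {
  voiceId: string;
  captions: string[];
}

// Note there is deliberately no createVoice() on this side of the boundary:
// the surface can only project a handle the runtime already owns.
function project(handle: RuntimeVoiceHandle, captions: string[]): CaptionedProjection {
  return { voiceId: handle.voiceId, captions };
}
```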

Reader Scenario: User Voice Turn

User taps voice in chat and speaks.

  1. Voice session begins. Desktop tracks lifecycle.
  2. STT executes. Per the voice executor contract, the user's speech is transcribed.
  3. Turn submits. Per RuntimeAgentService turn lifecycle.
  4. Agent reply streams. TTS executes per executor contract.
  5. Captions sync. Desktop chat surface keeps captions aligned to audio.
  6. Avatar lipsync. If Avatar is also open, runtime presentation stream + Avatar audio pipeline drive ParamMouthOpenY.
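The turn portion of the steps above (STT through captions) can be sketched as one async function. The function shape, event types, and the injected `stt`/`agent`/`tts` callbacks are illustrative stand-ins, not the actual contract or RuntimeAgentService APIs:

```typescript
// Events emitted while one voice turn runs, in order.
type TurnEvent =
  | { kind: "stt"; transcript: string }
  | { kind: "reply"; text: string }
  | { kind: "caption"; text: string; atMs: number };

async function runVoiceTurn(
  userAudio: ArrayBuffer,
  stt: (audio: ArrayBuffer) => Promise<string>,
  agent: (prompt: string) => Promise<string>,
  tts: (text: string) => Promise<ArrayBuffer>,
): Promise<TurnEvent[]> {
  const events: TurnEvent[] = [];

  // Step 2: STT executes per the executor contract.
  const transcript = await stt(userAudio);
  events.push({ kind: "stt", transcript });

  // Steps 3-4: the turn submits and the agent reply comes back
  // (streaming is collapsed to a single awaited reply in this sketch).
  const reply = await agent(transcript);
  events.push({ kind: "reply", text: reply });

  // Step 5: TTS executes; the caption is stamped so the surface can keep
  // it aligned to the audio (alignment offsets are stubbed to 0 here).
  await tts(reply);
  events.push({ kind: "caption", text: reply, atMs: 0 });

  return events;
}
```

Injecting the STT/TTS callbacks mirrors the contract split: the turn function owns sequencing, while provider semantics stay outside it.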

What Voice Session Does Not Do

  • It does not own voice creation (K-VOICE-* runtime).
  • It does not redefine TTS / STT provider semantics.
  • It does not bypass RuntimeAgentService turn lifecycle.
  • It does not own Avatar lipsync.

Source Basis

Nimi AI open world platform documentation.