Voice Asset Lifecycle
Status: Running today. Voice creation (clone + design) and the `VoiceAsset` runtime-managed object are shipped under the `K-VOICE-*` authority surface.
Voice in Nimi is a runtime first-class capability. Voice creation (clone + design) and voice asset lifecycle are owned by Runtime under K-VOICE-001..K-VOICE-018. This page covers the asset side — what VoiceAsset is, how it is created, how it is referenced, and how it differs from one-off speech synthesis.
Voice Creation Scope
Voice creation covers two scenarios:
| Scenario | What it does |
|---|---|
| `voice_clone` | voice / audio → voice (sample-based) |
| `voice_design` | text → voice (description-based) |
Both are submitted through a unified Scenario abstraction:
- `SubmitScenarioJob` with `scenario_type=VOICE_CLONE`
- `SubmitScenarioJob` with `scenario_type=VOICE_DESIGN`
Provider-private parameters do not pass through freely; they go through namespaced ScenarioExtension and are bound by extension registry rules.
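As a rough illustration of the unified submission shape, the sketch below models both scenario types with one request type and checks extension namespaces against a stand-in registry. Only `SubmitScenarioJob`, the two scenario types, and the `ScenarioExtension` concept come from this page; the field names `extensions`, `namespace`, and `params`, the `validateExtensions` helper, and the sample namespaces and parameters are assumptions made for the example.

```typescript
// Hypothetical request shapes. Only the scenario types and the idea of a
// namespaced ScenarioExtension come from this page.
type ScenarioType = "VOICE_CLONE" | "VOICE_DESIGN";

interface ScenarioExtension {
  namespace: string;               // assumed: identifies the extension owner
  params: Record<string, unknown>; // provider-private parameters
}

interface SubmitScenarioJobRequest {
  scenario_type: ScenarioType;
  extensions: ScenarioExtension[];
}

// Stand-in for the extension registry: the set of admitted namespaces.
const admittedNamespaces = new Set<string>(["example_provider"]);

// Returns one error message per extension whose namespace is not admitted.
function validateExtensions(req: SubmitScenarioJobRequest): string[] {
  return req.extensions
    .filter((ext) => !admittedNamespaces.has(ext.namespace))
    .map((ext) => `extension namespace not admitted: ${ext.namespace}`);
}

// Same request type for both scenarios; only scenario_type differs.
const cloneRequest: SubmitScenarioJobRequest = {
  scenario_type: "VOICE_CLONE",
  // "stability" is a made-up provider-private parameter.
  extensions: [{ namespace: "example_provider", params: { stability: 0.5 } }],
};

const designRequest: SubmitScenarioJobRequest = {
  scenario_type: "VOICE_DESIGN",
  extensions: [{ namespace: "unregistered", params: {} }],
};
```

In this sketch `validateExtensions(cloneRequest)` yields no errors, while `designRequest` is rejected because its namespace is not in the stand-in registry.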
VoiceAsset: Runtime-Owned Object Truth
VoiceAsset is the runtime-managed voice resource. Minimum required fields:
| Field | Meaning |
|---|---|
| `voice_asset_id` | Runtime-owned asset id |
| `app_id` | Owning app context |
| `subject_user_id` | Owning user (tenant scope) |
| `workflow_type` | `voice_clone` / `voice_design` |
| `provider` | Backing provider |
| `model_id` | Provider model used at creation |
| `target_model_id` | Model the asset is bound to for synthesis |
| `provider_voice_ref` | Provider-owned native handle truth |
| `persistence` | Logical lifecycle (`persistence_types`) |
| `status` | Asset status (`asset_statuses`) |
Both persistence and status enums live in tables/voice-enums.yaml.
persistence describes logical lifecycle and handle policy. It does not automatically promise a durable local substrate; until durable local substrate is admitted separately, locally-generated VoiceAsset may remain a session-local orchestration object.
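The minimum field set can be written down as a type. The sketch below is one possible rendering, not the runtime's actual schema: it assumes every field is a string, leaves `persistence` and `status` untyped because their values live in `tables/voice-enums.yaml`, and adds a hypothetical `missingFields` helper that reports which required fields a candidate record lacks.

```typescript
// Field names follow the table above; the string types are an assumption.
interface VoiceAsset {
  voice_asset_id: string;
  app_id: string;
  subject_user_id: string;
  workflow_type: "voice_clone" | "voice_design";
  provider: string;
  model_id: string;
  target_model_id: string;
  provider_voice_ref: string;
  persistence: string; // a persistence_types value from tables/voice-enums.yaml
  status: string;      // an asset_statuses value from tables/voice-enums.yaml
}

const REQUIRED_FIELDS: (keyof VoiceAsset)[] = [
  "voice_asset_id",
  "app_id",
  "subject_user_id",
  "workflow_type",
  "provider",
  "model_id",
  "target_model_id",
  "provider_voice_ref",
  "persistence",
  "status",
];

// Hypothetical helper: lists required fields that are absent or empty.
function missingFields(candidate: Partial<VoiceAsset>): (keyof VoiceAsset)[] {
  return REQUIRED_FIELDS.filter(
    (field) => candidate[field] === undefined || candidate[field] === ""
  );
}

// All values are placeholders, including the two enum-backed fields.
const exampleAsset: VoiceAsset = {
  voice_asset_id: "va_example",
  app_id: "app_example",
  subject_user_id: "user_example",
  workflow_type: "voice_clone",
  provider: "example_provider",
  model_id: "example-model",
  target_model_id: "example-tts-model",
  provider_voice_ref: "provider-native-handle",
  persistence: "PLACEHOLDER_PERSISTENCE",
  status: "PLACEHOLDER_STATUS",
};
```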
Asset vs Reference vs Synthesis
This is the most important distinction on this page:
| Concept | What it is | Owner |
|---|---|---|
| `VoiceAsset` | Durable runtime-owned voice resource | Runtime |
| `VoiceReference` | The handle a synthesis call uses to identify which voice to speak | Runtime |
| `provider_voice_ref` | The provider's native handle inside its API | Provider |
| Voice synthesis | One-off tts_synthesize invocation that uses a VoiceReference to produce audio | Runtime (RPC) |
VoiceAsset and provider_voice_ref must stay separate. Runtime must not promote provider_voice_ref to a public primary key. The provider must not bypass VoiceAsset to become the runtime's user resource truth. When a provider returns a native custom voice handle, the runtime converges it into the VoiceAsset + VoiceReference public contract.
VoiceReference Boundary
The synthesis entry point goes through `VoiceReference`. Only three reference kinds are admitted:
| Kind | Use |
|---|---|
| `preset_voice_id` | System-preset voice |
| `voice_asset_id` | User-created VoiceAsset |
| `provider_voice_ref` | Provider-native handle (admitted explicitly) |
The reference kinds enum lives in tables/voice-enums.yaml (reference_kinds).
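A closed set of reference kinds maps naturally onto a union type. The sketch below assumes the `{ kind, value }` shape used in the reader scenarios on this page; `parseVoiceReference` and its error messages are hypothetical and only illustrate rejecting any kind outside the three admitted ones.

```typescript
// The three admitted kinds, as listed in the table above.
type VoiceReferenceKind =
  | "preset_voice_id"
  | "voice_asset_id"
  | "provider_voice_ref";

interface VoiceReference {
  kind: VoiceReferenceKind;
  value: string;
}

const ADMITTED_KINDS: readonly string[] = [
  "preset_voice_id",
  "voice_asset_id",
  "provider_voice_ref",
];

// Hypothetical helper: builds a VoiceReference from untyped input and
// rejects any kind that is not admitted, as well as empty values.
function parseVoiceReference(kind: string, value: string): VoiceReference {
  if (ADMITTED_KINDS.indexOf(kind) === -1) {
    throw new Error(`unknown voice reference kind: ${kind}`);
  }
  if (value.length === 0) {
    throw new Error("voice reference value must not be empty");
  }
  return { kind: kind as VoiceReferenceKind, value };
}
```

For example, `parseVoiceReference("voice_asset_id", "va_example")` returns a typed reference, while a kind such as `"voice_name"` throws.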
VoiceReference may be embedded inside the runtime-owned AgentPresentationProfile as the agent's default voice binding. That embedding does not transfer voice workflow / discovery / asset ownership outside K-VOICE-*.
Discovery: Two Separate Channels
Voice discovery splits cleanly into two channels:
| Discovery API | What it returns |
|---|---|
| `ListPresetVoices` | System-preset voice catalog |
| `ListVoiceAssets` | User's VoiceAsset records |
Callers must not depend on a single mixed API. When a provider supports both global presets and user assets, both channels stay available — but the runtime never returns mixed-stream results.
The voice.discovery_mode catalog setting binds discovery responsibility:
- `static_catalog` → preset discovery only via `ListPresetVoices`
- `dynamic_user_scoped` → user asset discovery via `ListVoiceAssets`
- `mixed` → both channels active, callers invoke separately
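The mapping from `voice.discovery_mode` to discovery channels is small enough to express as a lookup. In the sketch below the mode names and RPC names come from this page; the `channelsFor` function is a hypothetical helper, and it returns the channels as a list precisely so that `mixed` still means two separate calls.

```typescript
type DiscoveryMode = "static_catalog" | "dynamic_user_scoped" | "mixed";
type DiscoveryChannel = "ListPresetVoices" | "ListVoiceAssets";

// Hypothetical helper: which discovery RPCs a caller may invoke per mode.
function channelsFor(mode: DiscoveryMode): DiscoveryChannel[] {
  switch (mode) {
    case "static_catalog":
      return ["ListPresetVoices"];
    case "dynamic_user_scoped":
      return ["ListVoiceAssets"];
    case "mixed":
      // Both channels are active, but each is still called on its own.
      return ["ListPresetVoices", "ListVoiceAssets"];
  }
}
```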
ScenarioJob Lifecycle
Voice creation is asynchronous. ScenarioJob lifecycle (state machine + event stream) aligns with K-JOB-002. Voice does not duplicate a parallel job state table.
The four lifecycle RPCs:
- `SubmitScenarioJob`: start a clone or design job
- `GetScenarioJob`: poll status
- `CancelScenarioJob`: cancel in flight
- `SubscribeScenarioJobEvents`: stream events
Provider-native multi-step workflows (e.g., preview → create) must be encapsulated inside one ScenarioJob lifecycle. Provider internal steps do not become extra public RPCs.
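To make the encapsulation rule concrete, here is a small in-memory sketch of the four RPCs. It is a toy model under stated assumptions: the state names (`PENDING`, `RUNNING`, `SUCCEEDED`, `CANCELED`) are invented for the example because the real state machine is defined by K-JOB-002, the calls are synchronous, and the event subscription returns a snapshot instead of a stream. The point it illustrates is that provider-internal steps run inside one job and never surface as extra public methods.

```typescript
type ScenarioType = "VOICE_CLONE" | "VOICE_DESIGN";
// Hypothetical states; the real state machine is defined by K-JOB-002.
type JobState = "PENDING" | "RUNNING" | "SUCCEEDED" | "CANCELED";

interface JobEvent { job_id: string; state: JobState }
interface JobRecord { scenario_type: ScenarioType; state: JobState; events: JobEvent[] }

class InMemoryScenarioJobs {
  private jobs = new Map<string, JobRecord>();
  private nextId = 1;

  submitScenarioJob(scenarioType: ScenarioType): string {
    const jobId = `job-${this.nextId++}`;
    this.jobs.set(jobId, {
      scenario_type: scenarioType,
      state: "PENDING",
      events: [{ job_id: jobId, state: "PENDING" }],
    });
    return jobId;
  }

  getScenarioJob(jobId: string): JobState {
    return this.record(jobId).state;
  }

  // Cancels only jobs that have not reached a terminal state.
  cancelScenarioJob(jobId: string): void {
    const state = this.record(jobId).state;
    if (state === "PENDING" || state === "RUNNING") this.transition(jobId, "CANCELED");
  }

  // Snapshot of recorded events; a real implementation would stream them.
  subscribeScenarioJobEvents(jobId: string): JobEvent[] {
    return this.record(jobId).events.slice();
  }

  // Runtime-internal, not a public RPC: provider steps such as
  // preview then create all run inside the one job.
  runProviderSteps(jobId: string, steps: Array<() => void>): void {
    if (this.record(jobId).state !== "PENDING") return;
    this.transition(jobId, "RUNNING");
    for (const step of steps) step();
    this.transition(jobId, "SUCCEEDED");
  }

  private record(jobId: string): JobRecord {
    const rec = this.jobs.get(jobId);
    if (rec === undefined) throw new Error(`unknown job: ${jobId}`);
    return rec;
  }

  private transition(jobId: string, state: JobState): void {
    const rec = this.record(jobId);
    rec.state = state;
    rec.events.push({ job_id: jobId, state });
  }
}
```

A caller of this toy model sees only the job's own events, regardless of how many provider steps ran inside it.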
Tenant Isolation
VoiceAsset is user-scoped by default. Cross-app_id or cross-subject_user_id access fails closed. There is no implicit cross-tenant voice asset reuse.
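A fail-closed scope check might look like the sketch below. The `Caller` type, the `readVoiceAsset` function, and the in-memory `Map` store are hypothetical; the example also assumes that a scope mismatch and a missing asset produce the same error, so a caller cannot probe for assets in another tenant.

```typescript
interface Caller { app_id: string; subject_user_id: string }

interface StoredVoiceAsset {
  voice_asset_id: string;
  app_id: string;
  subject_user_id: string;
}

// Hypothetical lookup: access is granted only when both app_id and
// subject_user_id match. Every other case fails with the same error.
function readVoiceAsset(
  store: Map<string, StoredVoiceAsset>,
  caller: Caller,
  voiceAssetId: string
): StoredVoiceAsset {
  const asset = store.get(voiceAssetId);
  if (
    asset === undefined ||
    asset.app_id !== caller.app_id ||
    asset.subject_user_id !== caller.subject_user_id
  ) {
    throw new Error("voice asset not accessible");
  }
  return asset;
}
```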
Target Model Binding
VoiceAsset binds target_model_id at creation. If tts_synthesize later requests a different target model, runtime returns AI_VOICE_TARGET_MODEL_MISMATCH. The binding is contractual; it is not an advisory hint.
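The binding check reduces to a strict comparison. In the sketch below only `target_model_id` and the reason code `AI_VOICE_TARGET_MODEL_MISMATCH` come from this page; the `SynthesisCheck` result type and the `checkTargetModel` function are assumptions about how a caller might surface the outcome.

```typescript
type SynthesisCheck =
  | { ok: true }
  | { ok: false; reason_code: "AI_VOICE_TARGET_MODEL_MISMATCH" };

// Hypothetical pre-synthesis check: the requested model must equal the
// model the asset was bound to at creation. There is no fallback.
function checkTargetModel(
  asset: { target_model_id: string },
  requestedModelId: string
): SynthesisCheck {
  if (asset.target_model_id !== requestedModelId) {
    return { ok: false, reason_code: "AI_VOICE_TARGET_MODEL_MISMATCH" };
  }
  return { ok: true };
}
```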
Voice Handle Policy
Workflow-capable voice families must declare voice_handle_policy once admitted. Minimum fields:
| Field | Source |
|---|---|
| `persistence` | `tables/voice-enums.yaml` `persistence_types` |
| `scope` | `tables/voice-enums.yaml` `handle_scopes` |
| `default_ttl` | per-family |
| `delete_semantics` | `tables/voice-enums.yaml` `delete_semantics` |
| `runtime_reconciliation_required` | per-family |
Workflow-capable families without an admitted voice_handle_policy may not be admitted.
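The admission rule can be sketched as a simple gate. The policy field names follow the table above, but their types are assumptions (in particular, `default_ttl` is modelled here as seconds or `null`), and the `VoiceFamily` shape, the `workflow_capable` flag, and the `canAdmit` function are hypothetical.

```typescript
interface VoiceHandlePolicy {
  persistence: string;        // a persistence_types value
  scope: string;              // a handle_scopes value
  default_ttl: number | null; // assumed: seconds, or null when no TTL applies
  delete_semantics: string;   // a delete_semantics value
  runtime_reconciliation_required: boolean;
}

interface VoiceFamily {
  name: string;
  workflow_capable: boolean;
  voice_handle_policy?: VoiceHandlePolicy;
}

// Hypothetical gate: a workflow-capable family needs a declared policy.
// Families that are not workflow-capable are not constrained by this rule.
function canAdmit(family: VoiceFamily): boolean {
  return !family.workflow_capable || family.voice_handle_policy !== undefined;
}
```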
Reader Scenario: User Clones Their Voice
A user wants to clone their own voice for an agent.
- Submit clone job. App calls `SubmitScenarioJob({ scenario_type: VOICE_CLONE, ... })` with the audio sample inputs.
- Job lifecycle. Runtime executes through admitted `ScenarioJob` state machine; events stream via `SubscribeScenarioJobEvents`.
- Provider native preview / create wraps. The provider's internal preview-then-create steps are encapsulated inside the one `ScenarioJob`; the app sees one lifecycle.
- Result. Runtime emits a `VoiceAsset` with admitted fields and a `VoiceReference{ kind: voice_asset_id }`.
- Bind to agent. App embeds the `VoiceReference` in `AgentPresentationProfile`. The agent now speaks with the cloned voice by default.
Reader Scenario: Synthesis With A Preset Voice
App wants the agent to speak a one-off line using a preset voice.
- Discover. `ListPresetVoices` returns the catalog with `preset_voice_id` values.
- Build VoiceReference. `{ kind: preset_voice_id, value: ... }`.
- Synthesize. Runtime `tts_synthesize` call uses the reference; no `VoiceAsset` is involved (no user asset is created).
- Audio bytes return. Runtime owns the artifact bytes; Avatar's audio pipeline consumes via `runtime.artifacts.readBytes`.
Synthesis is transient. The VoiceAsset lifecycle is for durable voice resources — a user's cloned voice, a designed voice, etc.
Reader Scenario: Discovery Honors The Channel Split
App wants to surface the user's available voices in a settings UI.
- Two calls. App calls `ListPresetVoices` for system voices and `ListVoiceAssets` for user-created voices.
- Render in two sections. UI shows them as separate sections per the `discovery_mode` boundary.
- No mixed API. App does not look for a single "list everything" call: there isn't one, and the channel split is contractual.
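The steps above can be sketched as two independent calls whose results are never merged. The client interface, the `display_name` field, the section titles, and the `buildVoiceSections` function are hypothetical, and the calls are synchronous here only to keep the example short.

```typescript
interface PresetVoice { preset_voice_id: string; display_name: string }
interface UserVoiceAsset { voice_asset_id: string; display_name: string }

// Hypothetical client exposing the two discovery channels separately.
interface VoiceDiscoveryClient {
  listPresetVoices(): PresetVoice[];
  listVoiceAssets(): UserVoiceAsset[];
}

interface VoiceEntry {
  label: string;
  reference: { kind: "preset_voice_id" | "voice_asset_id"; value: string };
}

interface VoiceSection { title: string; entries: VoiceEntry[] }

// One section per channel; results are never merged into a single list.
function buildVoiceSections(client: VoiceDiscoveryClient): VoiceSection[] {
  const presets = client.listPresetVoices();
  const assets = client.listVoiceAssets();
  return [
    {
      title: "System voices",
      entries: presets.map((p): VoiceEntry => ({
        label: p.display_name,
        reference: { kind: "preset_voice_id", value: p.preset_voice_id },
      })),
    },
    {
      title: "Your voices",
      entries: assets.map((a): VoiceEntry => ({
        label: a.display_name,
        reference: { kind: "voice_asset_id", value: a.voice_asset_id },
      })),
    },
  ];
}
```

Each entry already carries the reference kind that matches its channel, so the settings UI can hand a selected entry to synthesis without guessing where it came from.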
What Voice Asset Lifecycle Does Not Do
- It is not a one-off synthesis call. `tts_synthesize` is transient; `VoiceAsset` is durable resource truth.
- It does not mix preset and user-asset discovery into one stream.
- It does not let providers escape into the public asset surface; `provider_voice_ref` stays inside `VoiceReference`.
- It does not silently cross tenants. User scope is fail-closed.
- It does not let workflow-capable TTS families substitute for STT; `audio.transcribe` admits independently per `K-VOICE-016` family-level boundary.
Boundary Summary
| Concern | Owner |
|---|---|
| Voice creation workflows (clone + design) | Runtime (K-VOICE-001..002) |
| `VoiceAsset` object truth | Runtime (K-VOICE-004) |
| `VoiceReference` synthesis entry | Runtime (K-VOICE-003) |
| Provider-native voice handle | Provider (provider_voice_ref) |
| Tenant isolation | Runtime (K-VOICE-006) |
| Target model binding | Runtime (K-VOICE-007) |
| Discovery channel split | Runtime (K-VOICE-009, K-VOICE-013) |
| Voice handle policy | Runtime (K-VOICE-015) |
| Family-level workflow validation boundary | Runtime (K-VOICE-016) |