Multimodal

Multimodal work produces non-text artifacts. Runtime owns the canonical input fields, the artifact shape, the adapter routing, and the delivery gates. Apps display or consume artifacts; they do not redefine artifact truth.

Capability Surface

Runtime's multimodal contract covers:

| Capability | What it generates |
| --- | --- |
| Image | Raster images via image engines |
| Video | Video artifacts via video engines |
| Audio | Audio generation |
| Voice | Text-to-speech (AI_TTS), with voice cloning support (AI_TTS_CREATE_VOICE, AI_TTS_SYNTHESIZE) |
| Music | Music generation, including iteration support |

Each capability has admitted canonical input fields, an admitted artifact shape, and admitted delivery gates. Apps can't invent a new artifact MIME type or skip the delivery gate.

Canonical Input Fields

Every multimodal request has a canonical typed input. Apps build the input under the contract:

| Field | Purpose |
| --- | --- |
| Capability id | Which capability is being invoked |
| Provider context | Optional provider-specific extension |
| Resource references | Input resources (existing artifacts) |
| Generation parameters | Capability-specific parameters |

The canonical fields live in runtime/kernel/tables/multimodal-canonical-fields.yaml. Apps that produce off-contract input shapes fail closed at admission.
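
A minimal sketch of fail-closed admission for a canonical input. The class, function, snake_case field spellings, and capability ids below are hypothetical and chosen for illustration; the real field list is the one admitted in the YAML table, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Hypothetical capability ids; the admitted set is defined by the contract, not here.
ADMITTED_CAPABILITIES = frozenset({"image", "video", "audio", "voice", "music"})

# The four canonical fields from the table above (spellings are assumptions).
CANONICAL_FIELDS = frozenset(
    {"capability_id", "provider_context", "resource_references", "generation_parameters"}
)


class AdmissionError(ValueError):
    """Raised when an input is off-contract."""


@dataclass(frozen=True)
class MultimodalInput:
    capability_id: str
    generation_parameters: Dict[str, Any]
    resource_references: Tuple[str, ...] = ()
    provider_context: Optional[Dict[str, Any]] = None


def admit_input(raw: Dict[str, Any]) -> MultimodalInput:
    """Build a typed input, rejecting anything outside the canonical fields."""
    unknown = set(raw) - CANONICAL_FIELDS
    if unknown:
        # Fail closed: an unrecognized field is an error, never silently dropped.
        raise AdmissionError(f"off-contract fields: {sorted(unknown)}")
    if raw.get("capability_id") not in ADMITTED_CAPABILITIES:
        raise AdmissionError(f"unknown capability: {raw.get('capability_id')!r}")
    if not isinstance(raw.get("generation_parameters"), dict):
        raise AdmissionError("generation_parameters must be a mapping")
    return MultimodalInput(
        capability_id=raw["capability_id"],
        generation_parameters=dict(raw["generation_parameters"]),
        resource_references=tuple(raw.get("resource_references", ())),
        provider_context=raw.get("provider_context"),
    )
```

The point of the sketch is the ordering: unknown fields are checked first, so an off-contract shape is rejected before any of its values are interpreted.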

Provider Async Task Lifecycle

Multimodal generations are typically long-running. Runtime models them as provider async tasks with a typed lifecycle:

| State | Terminal? |
| --- | --- |
| queued | no |
| running | no |
| succeeded | yes |
| failed | yes |
| expired | yes (timeout-equivalent) |

Note the lower_snake casing (vs ScenarioJob's UPPER_SNAKE). Provider async task states are normalized at the provider boundary, so the casing matches provider semantics.

Async-to-ScenarioJob mapping

Provider async terminal states map deterministically into ScenarioJob terminal states:

| Provider async state | ScenarioJob terminal |
| --- | --- |
| succeeded | COMPLETED |
| expired | TIMEOUT |
| failed | FAILED |

The mapping rule (K-MMPROV-027) is admitted; apps see one unified shape across modalities.
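
The lifecycle and its mapping can be written down as two enums and a lookup table. This is a sketch only: the state values come from the tables above, while the Python class and function names are hypothetical and not part of the runtime contract.

```python
from enum import Enum
from typing import Optional


class ProviderTaskState(Enum):
    # lower_snake values, matching the provider async lifecycle.
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    EXPIRED = "expired"


class ScenarioJobTerminal(Enum):
    # UPPER_SNAKE values, matching ScenarioJob.
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    TIMEOUT = "TIMEOUT"


# Only terminal provider states appear as keys; queued and running have no mapping.
_TERMINAL_MAP = {
    ProviderTaskState.SUCCEEDED: ScenarioJobTerminal.COMPLETED,
    ProviderTaskState.FAILED: ScenarioJobTerminal.FAILED,
    ProviderTaskState.EXPIRED: ScenarioJobTerminal.TIMEOUT,
}


def is_terminal(state: ProviderTaskState) -> bool:
    """True for succeeded, failed, and expired."""
    return state in _TERMINAL_MAP


def to_scenario_job_terminal(state: ProviderTaskState) -> Optional[ScenarioJobTerminal]:
    """Return the ScenarioJob terminal for a provider state, or None if not terminal."""
    return _TERMINAL_MAP.get(state)
```

Because the mapping is a plain table, it is deterministic by construction: the same provider state always yields the same ScenarioJob terminal.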

Artifact Normalization

Multimodal output lands as an artifact with typed canonical fields. Apps consume artifacts through the artifact contract:

| Artifact field | Purpose |
| --- | --- |
| Artifact id | Stable identity |
| MIME type | From the contract, not guessed |
| Bytes / reference | Where to read the artifact |
| Provenance | Who produced it, under what request lineage |
| Delivery gate verdict | Whether the gate admitted delivery |

Artifact field admission lives in runtime/kernel/tables/multimodal-artifact-fields.yaml. An artifact missing a required field fails closed.
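
As an illustration of the fail-closed rule, the sketch below normalizes a raw mapping into a typed artifact. It assumes all five fields in the table are required, which this page does not state outright; the field spellings and the `normalize_artifact` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Assumption: every field in the table above is required.
REQUIRED_ARTIFACT_FIELDS = (
    "artifact_id",
    "mime_type",
    "reference",
    "provenance",
    "gate_verdict",
)


class ArtifactRejected(ValueError):
    """Raised when an artifact is missing a required field."""


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    mime_type: str
    reference: str
    provenance: Dict[str, str]
    gate_verdict: str


def normalize_artifact(raw: Dict[str, Any]) -> Artifact:
    """Return a typed artifact, or fail closed if any required field is absent."""
    # Empty strings and empty mappings count as missing in this sketch.
    missing = [name for name in REQUIRED_ARTIFACT_FIELDS if not raw.get(name)]
    if missing:
        raise ArtifactRejected(f"missing required fields: {missing}")
    return Artifact(**{name: raw[name] for name in REQUIRED_ARTIFACT_FIELDS})
```

Note that the MIME type is read from the input and never inferred from the bytes or a file extension, which mirrors the "from the contract, not guessed" rule.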

Delivery Gates

A multimodal artifact does not automatically reach the app the moment generation succeeds. The delivery gate decides when an artifact is allowed to be delivered.

| Gate concern | Why it matters |
| --- | --- |
| Sensitivity classification | Some artifacts may need approval |
| Provenance | Provenance-incomplete artifacts may be quarantined |
| Schema validation | Off-contract artifacts fail closed |
| User policy | User preferences may gate delivery |

The runtime delivery gates table (runtime/kernel/tables/runtime-delivery-gates.yaml) admits the specific gates. Apps see the gate verdict; they do not bypass it.
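
One way to picture the gate is as a fixed list of checks, each returning a reason code or nothing. The sketch below is an assumption-heavy illustration: the check functions, metadata keys, and reason-code strings are invented for the example, and it only distinguishes accepted from quarantined because this page does not say which conditions produce an outright rejection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# A gate check inspects artifact metadata and returns a reason code, or None if it passes.
GateCheck = Callable[[Dict[str, object]], Optional[str]]


def check_schema(meta: Dict[str, object]) -> Optional[str]:
    return None if meta.get("mime_type") else "schema_mismatch"


def check_provenance(meta: Dict[str, object]) -> Optional[str]:
    return None if meta.get("provenance") else "provenance_incomplete"


def check_sensitivity(meta: Dict[str, object]) -> Optional[str]:
    return "sensitivity_classification" if meta.get("sensitive") else None


def check_user_policy(meta: Dict[str, object]) -> Optional[str]:
    return "user_policy" if meta.get("blocked_by_user") else None


# The gate list is fixed up front; callers cannot append or remove checks at call time.
GATES: Tuple[GateCheck, ...] = (
    check_schema,
    check_provenance,
    check_sensitivity,
    check_user_policy,
)


@dataclass(frozen=True)
class GateVerdict:
    verdict: str
    reasons: Tuple[str, ...]


def evaluate_gates(meta: Dict[str, object]) -> GateVerdict:
    """Run every gate and collect reason codes; any reason holds the artifact."""
    reasons = tuple(r for r in (gate(meta) for gate in GATES) if r is not None)
    return GateVerdict("accepted" if not reasons else "quarantined", reasons)
```

Every gate runs even after one fails, so the verdict carries the full set of reasons instead of only the first.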

Music Iteration Support

Music generation admits an iteration model: an artifact can be iterated under typed parameters to produce variations.

| Property | Value |
| --- | --- |
| Iteration kind | MUSIC_GENERATE (admitted under K-MMPROV-*) |
| Lineage | Each iteration references the previous artifact |
| Audit | Iterations recorded as part of workflow lineage |

Iteration is bounded by the admitted contract; apps can't invent new iteration kinds at runtime.
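
The lineage property amounts to a parent pointer on each iteration artifact. A minimal sketch, assuming an in-memory store keyed by artifact id; `MusicArtifact`, `iterate`, and `lineage` are hypothetical names, and only the MUSIC_GENERATE kind comes from this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Only the kind named on this page is admitted; nothing can be added at call time.
ADMITTED_ITERATION_KINDS = frozenset({"MUSIC_GENERATE"})


@dataclass(frozen=True)
class MusicArtifact:
    artifact_id: str
    parent_id: Optional[str] = None  # None marks a first generation


def iterate(
    store: Dict[str, MusicArtifact],
    parent_id: str,
    new_id: str,
    kind: str = "MUSIC_GENERATE",
) -> MusicArtifact:
    """Record a new iteration that references its parent artifact."""
    if kind not in ADMITTED_ITERATION_KINDS:
        raise ValueError(f"iteration kind not admitted: {kind}")
    if parent_id not in store:
        raise KeyError(f"unknown parent artifact: {parent_id}")
    child = MusicArtifact(artifact_id=new_id, parent_id=parent_id)
    store[new_id] = child
    return child


def lineage(store: Dict[str, MusicArtifact], artifact_id: str) -> List[str]:
    """Walk parent pointers from an artifact back to the first generation."""
    chain: List[str] = []
    current: Optional[str] = artifact_id
    while current is not None:
        chain.append(current)
        current = store[current].parent_id
    return chain
```

For example, iterating `a0` into `a1` and then `a1` into `a2` gives a lineage of `a2`, `a1`, `a0` when walked from the newest artifact.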

Multimodal Provider Depth (R16)

Status: Running today. The multimodal provider contract sits downstream of provider capability profiles; it pins how providers participate in multimodal delivery beyond the gate verdict surface.

The multimodal provider contract describes how providers are bound, profiled, and admitted into multimodal capability paths (image / audio / video / file). The boundaries:

| Owned by provider contract | Owned elsewhere |
| --- | --- |
| Per-provider capability profile shape | Connector custody (K-CONN-*) |
| Provider lifecycle within delegated multimodal path | Workflow execution (K-WF-*) |
| Provider drift detection | Delivery gates verdict (K-DGATE-*) |
| Provider native parameter encapsulation | Public delivery surface |

Provider-native parameters do not pass through freely; they go through namespaced extensions and are bound by the admitted extension registry, the same boundary that voice creation enforces.
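
A sketch of what a namespaced extension check could look like. The registry contents, the `acme.image` namespace, and the parameter names are made up for the example; the actual admitted extension registry is not shown on this page.

```python
from typing import Any, Dict

# Hypothetical registry: namespace -> parameter names admitted under that namespace.
EXTENSION_REGISTRY = {
    "acme.image": frozenset({"style_preset", "seed"}),
}


def admit_extensions(extensions: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Pass through only extensions whose namespace and parameters are registered."""
    for namespace, params in extensions.items():
        admitted = EXTENSION_REGISTRY.get(namespace)
        if admitted is None:
            raise ValueError(f"unregistered extension namespace: {namespace}")
        unknown = set(params) - admitted
        if unknown:
            raise ValueError(f"unadmitted parameters in {namespace}: {sorted(unknown)}")
    return extensions
```

Provider-native values stay grouped under their namespace, so they never mix with the canonical generation parameters.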

Delivery Gates Verdicts (R8)

When multimodal delivery happens, the runtime delivery-gates contract emits a verdict. Verdicts include accepted, quarantined, rejected, plus typed reason codes.

A quarantined verdict means: the artifact is held; it does not flow to the consumer; the user sees the quarantine reason explicitly. Quarantine reasons cover sensitivity classification, descriptor drift, schema mismatch, prompt-poisoning detection, and more (see runtime/kernel/tables/runtime-delivery-gates.yaml).

A quarantined verdict is NOT a transient state. It is a typed terminal outcome until either:

  • The user / admitted approver explicitly releases the artifact, or
  • The flow is canceled

There is no silent retry that "fixes" quarantine.
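
The quarantine rule can be modeled as a small state machine with exactly two exits. This is an illustrative sketch, not the runtime's implementation; `QuarantineHold`, `HoldState`, and the approver string format are hypothetical.

```python
from enum import Enum
from typing import Optional


class HoldState(Enum):
    QUARANTINED = "quarantined"
    RELEASED = "released"
    CANCELED = "canceled"


class QuarantineHold:
    """Tracks one quarantined artifact and the only two ways out of quarantine."""

    def __init__(self, artifact_id: str, reason_code: str) -> None:
        self.artifact_id = artifact_id
        self.reason_code = reason_code  # surfaced to the user explicitly
        self.state = HoldState.QUARANTINED
        self.released_by: Optional[str] = None

    def release(self, approver: str) -> None:
        """Exit 1: an explicit release by the user or an admitted approver."""
        if self.state is not HoldState.QUARANTINED:
            raise RuntimeError(f"cannot release from state {self.state.value}")
        if not approver:
            raise PermissionError("release requires an explicit approver")
        self.released_by = approver
        self.state = HoldState.RELEASED

    def cancel(self) -> None:
        """Exit 2: the flow is canceled."""
        if self.state is not HoldState.QUARANTINED:
            raise RuntimeError(f"cannot cancel from state {self.state.value}")
        self.state = HoldState.CANCELED

    def retry(self) -> None:
        """Retrying never clears quarantine; it always raises."""
        raise RuntimeError("retry does not clear quarantine")
```

Making `retry` an explicit error, instead of leaving it undefined, documents in code that the omission is deliberate.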

Voice Cloning Support

Status: Running today. Voice creation (clone + design) and VoiceAsset lifecycle are shipped under K-VOICE-*.

The voice capability admits voice cloning + voice design under typed contracts. Both creation paths run through a unified Scenario abstraction:

| Scenario type | Direction |
| --- | --- |
| VOICE_CLONE | Audio sample → voice |
| VOICE_DESIGN | Text description → voice |

The synthesis-side capability operations:

| Operation | Purpose |
| --- | --- |
| AI_TTS_CREATE_VOICE | Create a voice profile from input audio |
| AI_TTS_SYNTHESIZE | Synthesize speech using an admitted voice profile |
| AI_TTS | Standard TTS using an admitted voice |

The runtime distinguishes durable voice resources from one-off synthesis:

| Concept | Owner | Persistence |
| --- | --- | --- |
| VoiceAsset | Runtime | Durable (subject to admitted lifecycle) |
| VoiceReference | Runtime | Identifies which voice a synthesis call uses |
| provider_voice_ref | Provider | Provider-native handle (does not become public asset truth) |

For the full asset surface — discovery channel split, target model binding, tenant isolation, voice handle policy, and the VoiceReference boundary — see Voice Asset Lifecycle.
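
To make the split between the three concepts concrete, here is a hedged sketch of how they might be typed. The attribute names and the `reference_for` helper are assumptions for illustration; only the concept names and scenario types come from the tables above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VoiceAsset:
    """Durable, runtime-owned voice resource."""

    asset_id: str
    scenario_type: str  # "VOICE_CLONE" or "VOICE_DESIGN"
    # Provider-native handle: stored on the asset but excluded from repr,
    # so it does not leak into logs as if it were public identity.
    provider_voice_ref: str = field(repr=False)


@dataclass(frozen=True)
class VoiceReference:
    """What a synthesis call carries: the runtime id only, never the provider handle."""

    asset_id: str


def reference_for(asset: VoiceAsset) -> VoiceReference:
    """Derive the public reference from a durable asset."""
    return VoiceReference(asset_id=asset.asset_id)
```

The reference type has no field for the provider handle at all, so a synthesis call cannot pass one by accident.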

Reader Scenario: An Image Generation Workflow

An app generates an image with a long-running provider.

  1. Workflow node. An AI_IMAGE node is part of a workflow.
  2. ScenarioJob created. The node fans out to a ScenarioJob.
  3. Provider async task. The provider returns a task id; state moves queued → running.
  4. Polling / streaming. Runtime tracks the task. The workflow event stream emits external-async progress events.
  5. Task succeeds. Provider state moves to succeeded. Per K-MMPROV-027, the ScenarioJob terminal becomes COMPLETED.
  6. Artifact delivery. The image artifact carries typed canonical fields, a MIME type, and provenance. The delivery gate validates schema, provenance, and sensitivity. If admitted, delivery completes.
  7. App receives artifact. The artifact arrives through the SDK's typed artifact shape. The MIME type came from the contract; the app does not guess.

What did not happen: the app did not get a free-form URL with no provenance; the app did not see a guessed MIME type; the artifact did not bypass the delivery gate.
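
Steps 3 through 5 of the walkthrough reduce to a tracking loop: observe provider states, surface progress for the non-terminal ones, and stop at the first terminal state. The sketch below simulates the provider with a plain iterable; `track_task` is a hypothetical helper, not an SDK call.

```python
from typing import Iterable, List, Tuple

# Provider terminal state -> ScenarioJob terminal, per the mapping described on this page.
TERMINAL_MAP = {
    "succeeded": "COMPLETED",
    "failed": "FAILED",
    "expired": "TIMEOUT",
}


def track_task(states: Iterable[str]) -> Tuple[List[str], str]:
    """Consume observed provider states until one is terminal.

    Returns the non-terminal states seen (the progress events) and the
    resulting ScenarioJob terminal.
    """
    progress: List[str] = []
    for state in states:
        if state in TERMINAL_MAP:
            return progress, TERMINAL_MAP[state]
        progress.append(state)
    # Fail closed: a stream that ends without a terminal state is an error.
    raise RuntimeError("provider stream ended before a terminal state")
```

A run that observes queued, running, running, succeeded yields three progress events and a COMPLETED terminal; the gate and delivery steps would follow from there.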

Reader Scenario: A Music Iteration

A user generates music and wants to iterate.

  1. First generation. A music workflow runs; an artifact is produced.
  2. Iteration request. The app issues an iteration with typed parameters referencing the original artifact.
  3. MUSIC_GENERATE admitted. The iteration is admitted under K-MMPROV-*.
  4. Provider async lifecycle. The iteration runs through the provider async lifecycle. State maps into ScenarioJob terminal as before.
  5. New artifact. The iteration artifact references the original; lineage is preserved.

The iteration is a typed operation; lineage is carried in the artifact structure, not in free-text notes.

Reader Scenario: A Provider Async Task Expires

A long video generation hits its provider-side timeout.

  1. Expiry. Provider state moves from running to expired.
  2. Mapping. Per K-MMPROV-027, ScenarioJob terminal becomes TIMEOUT.
  3. Workflow effect. The node's workflow state moves to FAILED (or to a retry path under admitted retry policy).
  4. Audit. The expiry is recorded with reason.

The app sees a typed TIMEOUT rather than a generic "request failed in some way"; the ScenarioJob terminal type tells the app what happened.
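
Step 3 above leaves room for a retry path. As a sketch under assumptions, the function below picks the workflow effect from the ScenarioJob terminal; the attempt-count policy and the "RETRY" label are invented for the example, since the admitted retry policy is not described on this page.

```python
def workflow_effect(terminal: str, attempt: int, max_attempts: int) -> str:
    """Decide what a workflow node does with a ScenarioJob terminal.

    `attempt` is 1-based. The policy here (retry only on TIMEOUT, bounded
    by `max_attempts`) is a hypothetical stand-in for an admitted retry policy.
    """
    if terminal == "COMPLETED":
        return "COMPLETED"
    if terminal == "TIMEOUT" and attempt < max_attempts:
        return "RETRY"
    # FAILED, an exhausted TIMEOUT, or any unknown terminal all fail closed.
    return "FAILED"
```

Because TIMEOUT is a distinct typed terminal, the policy can treat it differently from FAILED without parsing error messages.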

Source Basis

Nimi AI open world platform documentation.