Local Models
Runtime can run AI capabilities on your own hardware. This page covers local engines, the local model lifecycle, and the engine-first routing model. For cloud routing, see Connectors And Providers.
Engine-First Routing
Nimi's local routing is engine-first. You do not pick a model and hope something can run it; you pick an engine, and the engine resolves a model into a runnable bundle.
| Step | What happens |
|---|---|
| 1. Choose engine | E.g., a llama.cpp engine, a stable-diffusion engine, a sidecar engine |
| 2. Engine resolves model | The engine picks the right model bundle based on capability + device |
| 3. Bundle becomes runnable | Quantization, runtime context, and GPU layers are all admitted |
| 4. Route registered | The local route is now an admitted runtime route |
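The flow above can be sketched in code. Everything here is illustrative: the types, the `resolveBundle` method, and the route string format are hypothetical stand-ins, not the actual Runtime API.

```typescript
// Hypothetical sketch of engine-first routing: the caller picks an
// engine; the engine resolves a runnable bundle for the device.

interface DeviceProfile {
  gpuPresent: boolean;
  vramMb: number;
}

interface ModelBundle {
  id: string;
  requiredVramMb: number;
  quantization: string;
}

interface Engine {
  id: string;
  // The engine, not the caller, owns the compatibility decision.
  resolveBundle(capability: string, device: DeviceProfile): ModelBundle | null;
}

function registerLocalRoute(
  engine: Engine,
  capability: string,
  device: DeviceProfile,
): string {
  const bundle = engine.resolveBundle(capability, device);
  if (bundle === null) {
    // Fail closed: no compatible bundle means no route is registered.
    throw new Error(`engine ${engine.id} has no runnable bundle for ${capability}`);
  }
  // The resolved route becomes an admitted runtime route.
  return `local:${engine.id}/${bundle.id}`;
}
```

Note that the app never names a model directly; it only ever sees the route the engine produced.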
This is the opposite of "model-first" routing, where you pick a model name and hope for compatibility. Engine-first means the engine owns the compatibility decision.
Local Engine Catalog
Admitted engine types live in runtime/kernel/tables/local-engine-catalog.yaml. Common engine types include text engines (e.g., llama.cpp variants), image engines (e.g., stable-diffusion variants), audio engines, and sidecar engines for specialized work.
Each engine has:
| Field | Purpose |
|---|---|
| Engine id | Stable identity |
| Engine type | Text / image / audio / sidecar / etc. |
| Runtime mode | How the engine runs (inline, daemon, etc.) |
| Configuration priority | Which config layers apply |
| Capability surface | What this engine can do |
Engines are admitted; new engine types require kernel admission.
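A catalog entry can be sketched as a type. The field names mirror the table above, but the concrete values and enum members are illustrative assumptions, not taken from local-engine-catalog.yaml.

```typescript
// Hypothetical shape of one entry in the local engine catalog.
// Field names follow the table above; values are made up.

type EngineType = "text" | "image" | "audio" | "sidecar";
type RuntimeMode = "inline" | "daemon";

interface EngineCatalogEntry {
  engineId: string;            // stable identity
  engineType: EngineType;
  runtimeMode: RuntimeMode;    // how the engine runs
  configPriority: string[];    // which config layers apply, highest first
  capabilitySurface: string[]; // what this engine can do
}

const exampleEntry: EngineCatalogEntry = {
  engineId: "llamacpp-main",
  engineType: "text",
  runtimeMode: "daemon",
  configPriority: ["user", "workspace", "default"],
  capabilitySurface: ["text.generate"],
};
```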
Device Profile
Runtime's device profile system describes what your hardware can support. The profile is generated from device detection and admits or refuses model bundles based on compatibility.
| Field | Purpose |
|---|---|
| GPU presence | Whether a discrete GPU is available |
| GPU memory | Available VRAM |
| CPU profile | Cores, architecture |
| Device tier | Admitted compatibility tier |
A model bundle that requires more VRAM than the device profile admits fails closed at admission. The platform does not silently load a quantization the device cannot run.
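The fail-closed admission check can be sketched as follows. The field names and tier values are assumptions modeled on the table above, not the actual device-profile contract.

```typescript
// Sketch of fail-closed bundle admission against a device profile.
// Field names are illustrative, not the real contract.

interface DeviceProfile {
  gpuPresent: boolean;
  vramMb: number;
  cpuCores: number;
  tier: "low" | "mid" | "high";
}

interface BundleRequirements {
  minVramMb: number;
  needsGpu: boolean;
}

type Admission = { admitted: true } | { admitted: false; reason: string };

function admitBundle(device: DeviceProfile, req: BundleRequirements): Admission {
  if (req.needsGpu && !device.gpuPresent) {
    return { admitted: false, reason: "no discrete GPU present" };
  }
  if (req.minVramMb > device.vramMb) {
    // Fail closed: never silently load a quantization the device cannot run.
    return { admitted: false, reason: "bundle exceeds admitted VRAM" };
  }
  return { admitted: true };
}
```

The refusal carries a reason rather than silently degrading, which is what makes the later audit-lineage scenario possible.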
Local Adapter Routing
Once an engine is admitted and a model is resolved, requests from apps route through the local adapter routing layer. The adapter normalizes the call shape so an app cannot tell whether a generation came from a local engine or a cloud provider — same streaming shape, same error model, same metadata.
| Routing rule | Source |
|---|---|
| Capability → adapter | tables/local-adapter-routing.yaml |
| Engine → model bundle | tables/local-engine-catalog.yaml |
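The two lookup layers can be sketched as simple maps standing in for the real YAML tables. The capability names, adapter ids, and bundle ids here are invented for illustration.

```typescript
// Minimal sketch of the two routing tables as in-memory maps.
// Keys and values are hypothetical, not the actual YAML contents.

const capabilityToAdapter: Record<string, string> = {
  "text.generate": "local-text-adapter",   // cf. tables/local-adapter-routing.yaml
  "image.generate": "local-image-adapter",
};

const engineToBundle: Record<string, string> = {
  "llamacpp-main": "llama-q4-bundle",      // cf. tables/local-engine-catalog.yaml
  "sd-main": "sd-fp16-bundle",
};

function resolveRoute(
  capability: string,
  engineId: string,
): { adapter: string; bundle: string } {
  const adapter = capabilityToAdapter[capability];
  const bundle = engineToBundle[engineId];
  if (!adapter || !bundle) {
    // Unadmitted capabilities and engines are not routable at all.
    throw new Error(`unadmitted capability or engine: ${capability} / ${engineId}`);
  }
  return { adapter, bundle };
}
```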
HuggingFace Catalog Search
Local model installation supports HuggingFace catalog search. The CLI / Desktop runtime config lets you search admitted model families and install a bundle that matches your engine's expectations.
| Step | What happens |
|---|---|
| 1. Search | Query an admitted catalog source |
| 2. Filter | Engine-compatible bundles only |
| 3. Install | Download + verify + register under engine |
| 4. Activate | Mark the model as active for this engine |
The catalog search is gated by admitted catalog routes. Random URLs are not loadable; only admitted catalog routes can install models.
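The gating and verification can be sketched like this, assuming Node's `node:crypto` for the checksum. The route names and the `installBundle` function are hypothetical; the real install path runs inside Runtime.

```typescript
// Sketch: only admitted catalog routes can install, and the bundle
// is checksum-verified before registration. Names are illustrative.
import { createHash } from "node:crypto";

const admittedCatalogRoutes = new Set(["huggingface"]);

function sha256Hex(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

function installBundle(
  catalogRoute: string,
  data: Buffer,
  expectedSha256: string,
): string {
  if (!admittedCatalogRoutes.has(catalogRoute)) {
    // Random URLs are not loadable; the route must be admitted.
    throw new Error(`catalog route ${catalogRoute} is not admitted`);
  }
  if (sha256Hex(data) !== expectedSha256) {
    // Fail closed on a bad download rather than registering it.
    throw new Error("checksum mismatch; install fails closed");
  }
  return "registered"; // next step: activate under the engine
}
```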
Reader Scenario: Installing A Local Text Model
You want to run a local text model on your machine.
- Pick an engine. You select an admitted text engine (e.g., a llama.cpp engine). The engine has a known capability surface.
- Search. Through CLI or Desktop runtime config, you search admitted catalog routes for a compatible model.
- Filter. The search filters to bundles your engine can run on your device. Bundles that exceed your VRAM profile are either filtered out or shown as "device-too-small."
- Install. The selected bundle is downloaded, verified (checksum), and registered under the engine.
- Activate. The model becomes the active model for this engine. Local capability for text generation is now available.
- Use. An app issues a text request through sdk/runtime. Runtime routes the request to the local engine and streams the result back through the normalized streaming shape.
The app code did not change between cloud routing and local routing. The local adapter normalized the shape.
Reader Scenario: A Model Install That Hits A Device Constraint
You attempt to install a model bundle larger than your VRAM.
- Search returns the bundle with a "device-too-small" marker, or filters it out entirely, depending on your CLI filter.
- If you proceed anyway, install fails closed at admission. The device profile says the bundle does not fit.
- Audit lineage. The failed install is recorded with reason.
- Remediation. You pick a smaller bundle (or a different quantization), or a different engine.
The platform does not silently load a quantization that will OOM. Fail-closed is the contract.
Reader Scenario: Multi-Engine On The Same Machine
You want both a text engine and an image engine running locally.
- Both engines admitted. Each runs under its own engine instance.
- GPU arbitration. Runtime arbitrates GPU access between engines under admitted GPU policy. Concurrent generation is subject to GPU budget.
- Capability surface. Apps can issue text requests routed to the text engine and image requests routed to the image engine. Each routes through the local adapter.
- Audit. Each generation is recorded under the engine that served it.
Multi-engine is the normal case. The engine-first routing is exactly what makes this manageable — capabilities resolve to engines, not to a global model namespace.
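GPU arbitration between concurrent engines can be sketched as a budget: the class, budget numbers, and policy below are assumptions for illustration, not the actual admitted GPU policy.

```typescript
// Illustrative GPU budget arbitration between concurrently admitted
// engines. A request that would exceed the budget is refused (waits).

class GpuArbiter {
  private usedMb = 0;
  private readonly holders = new Map<string, number>();

  constructor(private readonly budgetMb: number) {}

  acquire(engineId: string, requestMb: number): boolean {
    if (this.usedMb + requestMb > this.budgetMb) {
      return false; // concurrent generation is subject to GPU budget
    }
    this.usedMb += requestMb;
    this.holders.set(engineId, (this.holders.get(engineId) ?? 0) + requestMb);
    return true;
  }

  release(engineId: string): void {
    const held = this.holders.get(engineId) ?? 0;
    this.usedMb -= held;
    this.holders.delete(engineId);
  }
}
```

Tracking usage per engine id is what lets each generation be audited under the engine that served it.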
Cuda Dependency Setup
For engines that need CUDA, Runtime provides a materializer-based setup with explicit phases:
| Phase | Meaning |
|---|---|
| queued | Setup queued |
| downloading | Fetching dependencies |
| verifying | Checksum / compatibility verification |
| installing | Installing into runtime-managed location |
| ready_system / ready_managed | Ready under system or managed mode |
| failed | Setup failed; reason recorded |
| repair_required | Setup needs repair |
| cancelled | User cancelled |
The setup never runs PowerShell or bash directly; everything goes through the materializer with a single confirmation UI.
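The phases form a state machine. The allowed transitions below are an assumption inferred from the phase table, not a published contract.

```typescript
// Sketch of the CUDA setup phases as a state machine.
// The transition edges are inferred, not authoritative.

type SetupPhase =
  | "queued" | "downloading" | "verifying" | "installing"
  | "ready_system" | "ready_managed"
  | "failed" | "repair_required" | "cancelled";

const transitions: Record<SetupPhase, SetupPhase[]> = {
  queued: ["downloading", "cancelled"],
  downloading: ["verifying", "failed", "cancelled"],
  verifying: ["installing", "failed", "cancelled"],
  installing: ["ready_system", "ready_managed", "failed"],
  ready_system: [],          // terminal
  ready_managed: [],         // terminal
  failed: ["repair_required", "queued"],
  repair_required: ["queued"],
  cancelled: [],             // terminal
};

function advance(current: SetupPhase, next: SetupPhase): SetupPhase {
  if (!transitions[current].includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}
```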
Source Basis
- .nimi/spec/runtime/local-model.md
- .nimi/spec/runtime/kernel/local-category-capability.md
- .nimi/spec/runtime/kernel/local-engine-contract.md
- .nimi/spec/runtime/kernel/device-profile-contract.md
- .nimi/spec/runtime/kernel/tables/local-engine-catalog.yaml
- .nimi/spec/runtime/kernel/tables/local-adapter-routing.yaml
- .nimi/spec/runtime/kernel/endpoint-security.md
- .nimi/spec/runtime/kernel/scheduling-contract.md