# mlc()
Creates a local inference provider using MLC Engine with WebGPU. Supports device-adaptive model selection via tier configuration, OPFS caching, and WebWorker isolation.
## Import
```ts
import { mlc } from '@webllm-io/sdk/providers/mlc';
```

## Signature
```ts
function mlc(options?: MLCProviderOptions): ResolvedLocalBackend;
```

## Parameters
### options (optional)
Configuration object for the MLC provider.
```ts
interface MLCProviderOptions {
  model?: string;
  tiers?: {
    high?: string | 'auto' | null;
    medium?: string | 'auto' | null;
    low?: string | 'auto' | null;
  };
  useCache?: boolean;
  useWebWorker?: boolean;
}
```

#### model (optional)
Fixed model ID to use regardless of device capability. Overrides tier-based selection.
- Type: `string`
- Default: `undefined` (uses tier-based auto-selection)
- Example: `'Llama-3.1-8B-Instruct-q4f16_1-MLC'`
#### tiers (optional)
Device-adaptive model mapping by performance grade.
- Type: `TiersConfig`

```ts
interface TiersConfig {
  high?: string | 'auto' | null;
  medium?: string | 'auto' | null;
  low?: string | 'auto' | null;
}
```

Tier mapping:
- `high` - Used for grades S and A (≥4GB VRAM)
- `medium` - Used for grade B (≥2GB VRAM)
- `low` - Used for grade C (<2GB VRAM)
Values:
- Model ID string (e.g., `'Llama-3.1-8B-Instruct-q4f16_1-MLC'`)
- `'auto'` - Use the SDK's default model for this tier
- `null` - Disable local inference for this tier
Default tiers:
```ts
{
  high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
  medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
  low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
}
```
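You can mix explicit model IDs, `'auto'`, and `null` per tier. A minimal sketch that keeps the SDK defaults on capable devices but disables local inference on low-end hardware:

```ts
import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

// 'auto' keeps the SDK's default model for that tier;
// null turns local inference off for that tier.
const client = createClient({
  local: mlc({
    tiers: {
      high: 'auto',
      medium: 'auto',
      low: null
    }
  })
});
```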
#### useCache (optional)

Enable OPFS (Origin Private File System) caching for downloaded models.
- Type: `boolean`
- Default: `true`
- Recommended: Keep enabled to avoid re-downloading models
#### useWebWorker (optional)
Run MLC Engine in a WebWorker to prevent UI blocking during inference.
- Type: `boolean`
- Default: `true`
- Recommended: Keep enabled for better UX
## Return Value
Returns a `ResolvedLocalBackend` instance ready for use with `createClient()`.
## Examples
### Basic usage (auto tier selection)
```ts
import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = createClient({
  local: mlc()
});
// Automatically selects model based on device grade
```

### Fixed model (no adaptive selection)
```ts
const client = createClient({
  local: mlc({ model: 'Phi-3.5-mini-instruct-q4f16_1-MLC' })
});
// Always uses Phi-3.5-mini regardless of device capability
```

### Custom tier configuration
```ts
const client = createClient({
  local: mlc({
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: null // Disable local inference on low-end devices
    }
  })
});
```

### Disable caching (testing/development)
```ts
const client = createClient({
  local: mlc({
    useCache: false // Models will be re-downloaded each time
  })
});
```

### Main thread inference (not recommended)
```ts
const client = createClient({
  local: mlc({
    useWebWorker: false // Runs in main thread, may freeze UI
  })
});
```

### Combine with cloud fallback
```ts
import { mlc } from '@webllm-io/sdk/providers/mlc';
import { fetchSSE } from '@webllm-io/sdk/providers/fetch';

const client = createClient({
  local: mlc({
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: null // Low-end devices fall back to cloud
    }
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  })
});
```

### Preload with progress tracking
```ts
const client = createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true,
    useWebWorker: true
  }),
  onProgress: (progress) => {
    console.log(`${progress.stage}: ${Math.round(progress.progress * 100)}%`);
    if (progress.bytesLoaded && progress.bytesTotal) {
      const mb = (progress.bytesLoaded / 1024 / 1024).toFixed(1);
      const totalMb = (progress.bytesTotal / 1024 / 1024).toFixed(1);
      console.log(`Downloaded: ${mb}MB / ${totalMb}MB`);
    }
  }
});

await client.local.load('Llama-3.1-8B-Instruct-q4f16_1-MLC');
```

### Device-specific configuration
```ts
import { checkCapability } from '@webllm-io/sdk';

const cap = await checkCapability();

const client = createClient({
  local: mlc({
    model:
      cap.grade === 'S' ? 'Llama-3.1-8B-Instruct-q4f16_1-MLC' :
      cap.grade === 'A' ? 'Phi-3.5-mini-instruct-q4f16_1-MLC' :
      'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    useCache: cap.grade !== 'C', // Disable cache on low-end
    useWebWorker: true
  })
});
```

### Conditional local provider
```ts
import { checkCapability } from '@webllm-io/sdk';

const cap = await checkCapability();

const client = createClient({
  local: cap.webgpu ? mlc() : false,
  cloud: process.env.OPENAI_API_KEY
});
```

## Model Compatibility
The `mlc()` provider works with MLC-compiled models from the `@mlc-ai/web-llm` library.
Common models:
- `Llama-3.1-8B-Instruct-q4f16_1-MLC` (requires ~4.5GB VRAM)
- `Phi-3.5-mini-instruct-q4f16_1-MLC` (requires ~2GB VRAM)
- `Qwen2.5-1.5B-Instruct-q4f16_1-MLC` (requires ~1GB VRAM)
For a full list of available models, see the MLC Web LLM model library.
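To confirm which model IDs your installed `@mlc-ai/web-llm` version actually bundles, a quick sketch (assuming the package's `prebuiltAppConfig` export, which lists the prebuilt model records):

```ts
import { prebuiltAppConfig } from '@mlc-ai/web-llm';

// Print every prebuilt model ID; any of these can be passed to
// mlc({ model }) or used as a tier value.
for (const record of prebuiltAppConfig.model_list) {
  console.log(record.model_id);
}
```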
## Performance Notes
- First load: Models are downloaded and cached in OPFS (several GB, can take minutes)
- Subsequent loads: Models load from cache (seconds)
- WebWorker overhead: Minimal (~10-20ms per request for message passing)
- Main thread mode: Faster startup but blocks UI during inference
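Because the first load dominates, one option is to warm the cache before the user's first request. A hedged sketch built on the `client.local.load()` call from the preload example above (the loose client type is only for illustration):

```ts
// Warm the OPFS cache while the browser is idle so the first chat
// request doesn't pay the full download cost.
function warmModelCache(client: { local: { load: (id: string) => Promise<unknown> } }) {
  const preload = () =>
    client.local
      .load('Llama-3.1-8B-Instruct-q4f16_1-MLC')
      .catch((err) => console.warn('Preload failed; model will load on demand', err));

  if ('requestIdleCallback' in window) {
    requestIdleCallback(preload);
  } else {
    setTimeout(preload, 2000); // Fallback for browsers without requestIdleCallback
  }
}
```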
## Requirements
- Browser: Chrome 113+, Edge 113+, or Safari 18+ (WebGPU support required)
- Headers: `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp` (for SharedArrayBuffer; see the sketch after this list)
- Dependency: `@mlc-ai/web-llm` must be installed (peer dependency)
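The cross-origin isolation headers have to be sent by your server or dev tooling. A minimal sketch for a Vite setup (hypothetical config; any server that can set response headers works the same way):

```ts
// vite.config.ts
import { defineConfig } from 'vite';

const crossOriginIsolationHeaders = {
  // Required for SharedArrayBuffer
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

export default defineConfig({
  server: { headers: crossOriginIsolationHeaders },
  preview: { headers: crossOriginIsolationHeaders },
});
```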
## Troubleshooting
### Model loading fails
```ts
// Check capability first
const cap = await checkCapability();
if (!cap.webgpu) {
  console.error('WebGPU not available');
}
```

### Out of memory errors
```ts
// Use smaller model or disable cache
const client = createClient({
  local: mlc({
    model: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    useCache: false
  })
});
```

### UI freezing during inference
```ts
// Ensure WebWorker is enabled
const client = createClient({
  local: mlc({
    useWebWorker: true // Should be true (default)
  })
});
```

## See Also
- createClient() - Client creation
- Config Types - All configuration options
- Providers (Fetch) - Cloud inference provider
- checkCapability() - Device detection
- Cache Management - OPFS cache utilities