
mlc()

Creates a local inference provider using MLC Engine with WebGPU. Supports device-adaptive model selection via tier configuration, OPFS caching, and WebWorker isolation.

Import

import { mlc } from '@webllm-io/sdk/providers/mlc';

Signature

function mlc(options?: MLCProviderOptions): ResolvedLocalBackend;

Parameters

options (optional)

Configuration object for the MLC provider.

interface MLCProviderOptions {
  model?: string;
  tiers?: {
    high?: string | 'auto' | null;
    medium?: string | 'auto' | null;
    low?: string | 'auto' | null;
  };
  useCache?: boolean;
  useWebWorker?: boolean;
}

model (optional)

Fixed model ID to use regardless of device capability. Overrides tier-based selection.

  • Type: string
  • Default: undefined (uses tier-based auto-selection)
  • Example: 'Llama-3.1-8B-Instruct-q4f16_1-MLC'

tiers (optional)

Device-adaptive model mapping by performance grade.

  • Type: TiersConfig
interface TiersConfig {
  high?: string | 'auto' | null;
  medium?: string | 'auto' | null;
  low?: string | 'auto' | null;
}

Tier mapping:

  • high - Used for grades S and A (≥4GB VRAM)
  • medium - Used for grade B (≥2GB VRAM)
  • low - Used for grade C (<2GB VRAM)

Values:

  • Model ID string (e.g., 'Llama-3.1-8B-Instruct-q4f16_1-MLC')
  • 'auto' - Use the SDK's default model for this tier
  • null - Disable local inference for this tier

Default tiers:

{
  high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
  medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
  low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
}
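
A tiers value can mix all three kinds of entry. The sketch below (illustrative values only) pins an explicit model for high-end devices, keeps the SDK default for mid-range, and turns local inference off on low-end hardware:

const backend = mlc({
  tiers: {
    high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC', // explicit model ID
    medium: 'auto',                            // SDK's default model for this tier
    low: null                                  // no local inference on this tier
  }
});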

useCache (optional)

Enable OPFS (Origin Private File System) caching for downloaded models.

  • Type: boolean
  • Default: true
  • Recommended: Keep enabled to avoid re-downloading models
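
Cached weights count against the origin's storage quota. For a rough sense of how much space they occupy, you can query the standard browser Storage API (a platform API, not part of this SDK); OPFS usage is included in the reported estimate:

// usage and quota are reported in bytes and may be undefined in some browsers
const { usage, quota } = await navigator.storage.estimate();
console.log(`Storage used: ${((usage ?? 0) / 1024 / 1024).toFixed(0)} MB of ${((quota ?? 0) / 1024 / 1024).toFixed(0)} MB`);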

useWebWorker (optional)

Run MLC Engine in a WebWorker to prevent UI blocking during inference.

  • Type: boolean
  • Default: true
  • Recommended: Keep enabled for better UX

Return Value

Returns a ResolvedLocalBackend instance ready for use with createClient().

Examples

Basic usage (auto tier selection)

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';
const client = createClient({
  local: mlc()
});
// Automatically selects model based on device grade

Fixed model (no adaptive selection)

const client = createClient({
  local: mlc({
    model: 'Phi-3.5-mini-instruct-q4f16_1-MLC'
  })
});
// Always uses Phi-3.5-mini regardless of device capability

Custom tier configuration

const client = createClient({
  local: mlc({
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: null // Disable local inference on low-end devices
    }
  })
});

Disable caching (testing/development)

const client = createClient({
  local: mlc({
    useCache: false // Models will be re-downloaded each time
  })
});

Disable WebWorker (debugging)

const client = createClient({
  local: mlc({
    useWebWorker: false // Runs in main thread, may freeze UI
  })
});

Combine with cloud fallback

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';
import { fetchSSE } from '@webllm-io/sdk/providers/fetch';

const client = createClient({
  local: mlc({
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: null // Low-end devices fall back to cloud
    }
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  })
});

Preload with progress tracking

const client = createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true,
    useWebWorker: true
  }),
  onProgress: (progress) => {
    console.log(`${progress.stage}: ${Math.round(progress.progress * 100)}%`);
    if (progress.bytesLoaded && progress.bytesTotal) {
      const mb = (progress.bytesLoaded / 1024 / 1024).toFixed(1);
      const totalMb = (progress.bytesTotal / 1024 / 1024).toFixed(1);
      console.log(`Downloaded: ${mb}MB / ${totalMb}MB`);
    }
  }
});
await client.local.load('Llama-3.1-8B-Instruct-q4f16_1-MLC');

Device-specific configuration

import { checkCapability } from '@webllm-io/sdk';

const cap = await checkCapability();
const client = createClient({
  local: mlc({
    model: cap.grade === 'S' ? 'Llama-3.1-8B-Instruct-q4f16_1-MLC' :
           cap.grade === 'A' ? 'Phi-3.5-mini-instruct-q4f16_1-MLC' :
           'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    useCache: cap.grade !== 'C', // Disable cache on low-end
    useWebWorker: true
  })
});

Conditional local provider

import { checkCapability } from '@webllm-io/sdk';

const cap = await checkCapability();
const client = createClient({
  local: cap.webgpu ? mlc() : false,
  cloud: process.env.OPENAI_API_KEY
});

Model Compatibility

The mlc() provider works with MLC-compiled models from the @mlc-ai/web-llm library.

Common models:

  • Llama-3.1-8B-Instruct-q4f16_1-MLC (requires ~4.5GB VRAM)
  • Phi-3.5-mini-instruct-q4f16_1-MLC (requires ~2GB VRAM)
  • Qwen2.5-1.5B-Instruct-q4f16_1-MLC (requires ~1GB VRAM)

For a full list of available models, see the MLC Web LLM model library.
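
Because the provider builds on @mlc-ai/web-llm, the IDs accepted here are the model_id values from that library's prebuilt model list. A quick way to enumerate them at runtime (the prebuiltAppConfig export and its fields belong to the @mlc-ai/web-llm package, not this SDK):

import { prebuiltAppConfig } from '@mlc-ai/web-llm';

// Print every prebuilt model ID with its reported VRAM requirement, when available
for (const m of prebuiltAppConfig.model_list) {
  console.log(m.model_id, m.vram_required_MB ? `${m.vram_required_MB} MB VRAM` : '');
}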

Performance Notes

  • First load: Models are downloaded and cached in OPFS (several GB, can take minutes)
  • Subsequent loads: Models load from cache (seconds)
  • WebWorker overhead: Minimal (~10-20ms per request for message passing)
  • Main thread mode: Faster startup but blocks UI during inference
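
To observe the cache effect directly, you can time a load the same way as the preload example above (this assumes the client and the client.local.load() call shown there):

const t0 = performance.now();
await client.local.load('Phi-3.5-mini-instruct-q4f16_1-MLC');
// First run includes the download; later runs should resolve from the OPFS cache
console.log(`Model ready in ${((performance.now() - t0) / 1000).toFixed(1)}s`);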

Requirements

  • Browser: Chrome 113+, Edge 113+, or Safari 18+ (WebGPU support required)
  • Headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp (for SharedArrayBuffer; see the example after this list)
  • Dependency: @mlc-ai/web-llm must be installed (peer dependency)
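
The COOP/COEP headers can be sent by any server that lets you control response headers. As one example, a Vite dev server can set them in vite.config.ts (production hosting needs the equivalent headers configured on your platform):

// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp'
    }
  }
});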

Troubleshooting

Model loading fails

// Check capability first
const cap = await checkCapability();
if (!cap.webgpu) {
  console.error('WebGPU not available');
}

Out of memory errors

// Use smaller model or disable cache
const client = createClient({
  local: mlc({
    model: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    useCache: false
  })
});

UI freezing during inference

// Ensure WebWorker is enabled
const client = createClient({
  local: mlc({
    useWebWorker: true // Should be true (default)
  })
});

See Also