
Local Inference

WebLLM.io enables running large language models entirely in the browser using WebGPU and the MLC inference engine. This eliminates server costs and ensures complete privacy since all data stays on the user’s device.

Prerequisites

To use local inference, you need to install the @mlc-ai/web-llm peer dependency:

npm install @mlc-ai/web-llm
# or
pnpm add @mlc-ai/web-llm
# or
yarn add @mlc-ai/web-llm

Your browser must also support WebGPU. Use checkCapability() to verify device compatibility before enabling local features.
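
For example, a minimal pre-flight check might look like the following (a sketch that assumes checkCapability is exported from @webllm-io/sdk and resolves to the device stats object described later in this guide):

import { checkCapability } from '@webllm-io/sdk';

// Sketch: the field names mirror the stats object documented below;
// the exact return shape may differ in your SDK version.
const stats = await checkCapability();

if (stats.webgpu) {
  console.log(`WebGPU available, device grade: ${stats.grade}`);
} else {
  console.warn('WebGPU not supported; fall back to a remote provider.');
}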

Basic Usage

The simplest way to enable local inference is with local: 'auto':

import { createClient } from '@webllm-io/sdk';

const client = createClient({
  local: 'auto'
});

const completion = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'What is WebGPU?' }
  ]
});

console.log(completion.choices[0].message.content);

With 'auto', the SDK automatically selects the optimal model based on your device’s capabilities (VRAM grade).

Explicit Model Selection

You can specify a model explicitly:

const client = createClient({
  local: 'Llama-3.1-8B-Instruct-q4f16_1-MLC'
});

Or use an object configuration for more control:

const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true,     // Enable OPFS caching (default: true)
    useWebWorker: true  // Run in a Web Worker (default: true)
  }
});

Device-Aware Model Selection with Tiers

For adaptive model selection based on device capabilities, use the tiers configuration:

const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',   // S/A grade (≥4GB VRAM)
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC', // B grade (≥2GB VRAM)
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'     // C grade (<2GB VRAM)
    }
  }
});

The SDK maps device grades to tier keys: S and A grades use high, B grade uses medium, and C grade uses low. All grades support local inference.

WebWorker Execution

By default, inference runs in a Web Worker so that model loading and generation do not freeze the UI:

const client = createClient({
  local: {
    model: 'auto',
    useWebWorker: true // default
  }
});

Important: WebWorker mode requires proper COOP/COEP headers for SharedArrayBuffer support. See the Web Worker guide for configuration details.
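
The exact header setup depends on your server or bundler. As an illustration only (a generic Vite dev-server sketch, not SDK configuration), the two headers required for cross-origin isolation can be set like this:

// vite.config.ts (sketch; adapt to your own server or framework)
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp'
    }
  }
});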

To disable the Web Worker and run inference on the main thread:

const client = createClient({
  local: {
    model: 'auto',
    useWebWorker: false
  }
});

OPFS Model Caching

Models are automatically cached using the Origin Private File System (OPFS) to avoid re-downloading on subsequent visits:

const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true // default
  }
});

To disable caching:

const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: false
  }
});

Check cache status and manage cached models:

// Check if model is cached
const isCached = await client.hasModelInCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');
// Delete model from cache
await client.deleteModelFromCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');
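
A practical pattern (a sketch that uses only the cache methods shown above) is to check the cache before the first request so the UI can warn users about a large one-time download:

const model = 'Llama-3.1-8B-Instruct-q4f16_1-MLC';

if (!(await client.hasModelInCache(model))) {
  // showDownloadNotice() is a hypothetical UI helper; replace with your own.
  showDownloadNotice();
}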

Dynamic Model Selection

Use a function to decide the model at runtime based on device statistics:

const client = createClient({
  local: (stats) => {
    if (stats.grade === 'S' || stats.grade === 'A') {
      return 'Llama-3.1-8B-Instruct-q4f16_1-MLC';
    }
    if (stats.grade === 'B') {
      return 'Phi-3.5-mini-instruct-q4f16_1-MLC';
    }
    // Grade C or unsupported
    return 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC';
  }
});

The stats object contains { grade, webgpu, gpu, connection, battery, memory }.
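
For reference, the selector receives an argument shaped roughly like this (a sketch; the field types below are assumptions, not the SDK's published type definitions):

// Sketch of the stats argument; consult the SDK's own types for the exact shape.
interface DeviceStats {
  grade: string;       // 'S', 'A', 'B', or 'C', as described above
  webgpu: boolean;     // whether WebGPU is available in this browser
  gpu?: unknown;       // GPU adapter details, if exposed
  connection?: unknown; // network connection information
  battery?: unknown;   // battery status, if available
  memory?: unknown;    // device memory estimate
}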

Progress Tracking

Monitor model download and loading progress:

const client = createClient({
  local: 'auto',
  onProgress: (progress) => {
    console.log(`Stage: ${progress.stage}`);
    console.log(`Progress: ${progress.progress}%`);
    console.log(`Model: ${progress.model}`);

    if (progress.bytesLoaded && progress.bytesTotal) {
      const mb = (progress.bytesLoaded / 1024 / 1024).toFixed(1);
      const totalMb = (progress.bytesTotal / 1024 / 1024).toFixed(1);
      console.log(`Downloaded: ${mb}MB / ${totalMb}MB`);
    }
  }
});

Next Steps