Local Inference
WebLLM.io enables running large language models entirely in the browser using WebGPU and the MLC inference engine. This eliminates server-side inference costs and keeps data private, since prompts and responses never leave the user’s device.
Prerequisites
To use local inference, you need to install the @mlc-ai/web-llm peer dependency:
```bash
npm install @mlc-ai/web-llm
# or
pnpm add @mlc-ai/web-llm
# or
yarn add @mlc-ai/web-llm
```

Your browser must also support WebGPU. Use `checkCapability()` to verify device compatibility before enabling local features.
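As a minimal pre-flight sketch (the import path and the returned `webgpu` flag are assumptions, not confirmed API; adjust to the actual return type of `checkCapability()`):

```ts
import { checkCapability, createClient } from '@webllm-io/sdk';

// Assumed: checkCapability() resolves with device capability info,
// including a `webgpu` flag. Verify against the actual return type.
const capability = await checkCapability();

if (!capability.webgpu) {
  // No WebGPU support: skip local inference (see Cloud Fallback under Next Steps).
  throw new Error('WebGPU not available; local inference disabled.');
}

const client = createClient({ local: 'auto' });
```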
Basic Usage
The simplest way to enable local inference is with local: 'auto':
```ts
import { createClient } from '@webllm-io/sdk';

const client = createClient({ local: 'auto' });

const completion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'What is WebGPU?' }]
});

console.log(completion.choices[0].message.content);
```

With 'auto', the SDK automatically selects the optimal model based on your device’s capabilities (VRAM grade).
Explicit Model Selection
You can specify a model explicitly:
```ts
const client = createClient({
  local: 'Llama-3.1-8B-Instruct-q4f16_1-MLC'
});
```

Or use an object configuration for more control:

```ts
const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true,     // Enable OPFS caching (default: true)
    useWebWorker: true  // Run in Web Worker (default: true)
  }
});
```

Device-Aware Model Selection with Tiers
For adaptive model selection based on device capabilities, use the tiers configuration:
```ts
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',   // S/A grade (≥4GB VRAM)
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC', // B grade (≥2GB VRAM)
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'     // C grade (<2GB VRAM)
    }
  }
});
```

The SDK maps device grades to tier keys: S and A grades use `high`, B grade uses `medium`, and C grade uses `low`. All grades support local inference.
WebWorker Execution
By default, inference runs in a Web Worker to prevent UI freezing during model loading and inference:
```ts
const client = createClient({
  local: {
    model: 'auto',
    useWebWorker: true // default
  }
});
```

Important: WebWorker mode requires proper COOP/COEP headers for SharedArrayBuffer support. See the Web Worker guide for configuration details.
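The headers in question are the standard cross-origin isolation pair: `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`. As one possible setup (assuming a Vite dev server; any server that can set response headers works), they could be configured like this:

```ts
// vite.config.ts: serve the cross-origin isolation headers
// required for SharedArrayBuffer (and thus WebWorker mode).
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp'
    }
  }
});
```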
To disable WebWorker (runs on main thread):
```ts
const client = createClient({
  local: {
    model: 'auto',
    useWebWorker: false
  }
});
```

OPFS Model Caching
Models are automatically cached using the Origin Private File System (OPFS) to avoid re-downloading on subsequent visits:
```ts
const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true // default
  }
});
```

To disable caching:

```ts
const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: false
  }
});
```

Check cache status and manage cached models:
```ts
// Check if model is cached
const isCached = await client.hasModelInCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');

// Delete model from cache
await client.deleteModelFromCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');
```

Dynamic Model Selection
Use a function to decide the model at runtime based on device statistics:
```ts
const client = createClient({
  local: (stats) => {
    if (stats.grade === 'S' || stats.grade === 'A') {
      return 'Llama-3.1-8B-Instruct-q4f16_1-MLC';
    }
    if (stats.grade === 'B') {
      return 'Phi-3.5-mini-instruct-q4f16_1-MLC';
    }
    // Grade C or unsupported
    return 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC';
  }
});
```

The `stats` object contains `{ grade, webgpu, gpu, connection, battery, memory }`.
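As a rough sketch of that shape (only the field names come from this page; the concrete types below are assumptions for illustration, not the SDK's actual definitions):

```ts
// Sketch only: field names match the stats object above, but the
// concrete types are assumed rather than taken from the SDK.
interface DeviceStats {
  grade: 'S' | 'A' | 'B' | 'C'; // VRAM-based device grade (other values may exist)
  webgpu: boolean;              // whether WebGPU is available (assumed boolean)
  gpu: unknown;                 // GPU adapter details (shape not documented here)
  connection: unknown;          // network information (shape not documented here)
  battery: unknown;             // battery status (shape not documented here)
  memory: unknown;              // device memory info (shape not documented here)
}
```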
Progress Tracking
Monitor model download and loading progress:
```ts
const client = createClient({
  local: 'auto',
  onProgress: (progress) => {
    console.log(`Stage: ${progress.stage}`);
    console.log(`Progress: ${progress.progress}%`);
    console.log(`Model: ${progress.model}`);

    if (progress.bytesLoaded && progress.bytesTotal) {
      const mb = (progress.bytesLoaded / 1024 / 1024).toFixed(1);
      const totalMb = (progress.bytesTotal / 1024 / 1024).toFixed(1);
      console.log(`Downloaded: ${mb}MB / ${totalMb}MB`);
    }
  }
});
```

Next Steps
- Learn about Cloud Fallback for OpenAI-compatible API support
- Explore Hybrid Routing to combine local and cloud inference
- See Model Loading for progressive loading strategies