Local-Only Mode
Local-only mode ensures complete privacy by running all inference on the user’s device. No data is ever sent to external servers.
Basic Local-Only Setup
```ts
import { createClient } from '@webllm-io/sdk';

const client = await createClient({
  local: 'auto' // No cloud configuration
});

const response = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'What is the capital of France?' }
  ]
});

console.log(response.choices[0].message.content);
```

Showing Model Download Progress
When a model is loaded for the first time, it needs to be downloaded. Display progress to users:
```ts
import { createClient } from '@webllm-io/sdk';

// Create progress UI elements
const progressBar = document.getElementById('progress-bar');
const statusText = document.getElementById('status-text');

const client = await createClient({
  local: {
    model: 'auto', // Auto-select based on device
    onProgress: (report) => {
      // Update progress UI
      const percent = Math.round(report.progress * 100);
      progressBar.style.width = `${percent}%`;
      statusText.textContent = report.text;

      console.log(`${report.text} - ${percent}%`);
    }
  }
});

// Hide progress UI once loaded
progressBar.parentElement.style.display = 'none';

// Client is now ready to use
const response = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }]
});
```

Progress Report Structure
The onProgress callback receives reports with the following structure:
```ts
{
  progress: 0.75,                    // 0.0 to 1.0
  text: "Loading model weights...",
  // Additional fields may vary by stage
}
```
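If you write the handler in TypeScript, a minimal shape for these reports, based only on the two fields shown above, might look like the following sketch. The `ProgressReport` name is illustrative; the SDK may export its own, richer type.

```ts
// Sketch of the progress report shape, based only on the fields shown above.
// Additional fields may vary by stage, hence the open index signature.
interface ProgressReport {
  progress: number; // 0.0 to 1.0
  text: string;     // e.g. "Loading model weights..."
  [extra: string]: unknown;
}

const handleProgress = (report: ProgressReport): void => {
  console.log(`${Math.round(report.progress * 100)}% - ${report.text}`);
};
```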
Explicit Model Selection
Instead of 'auto', you can specify an exact model:
```ts
import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = await createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    onProgress: (report) => {
      console.log(report.text, `${Math.round(report.progress * 100)}%`);
    }
  })
});
```

Disable Web Worker (Advanced)
By default, inference runs in a Web Worker. You can disable this for debugging:
```ts
import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = await createClient({
  local: mlc({
    model: 'auto',
    useWebWorker: false // Run in main thread (may block UI)
  })
});
```

Check Model Cache Before Initialization
Avoid unnecessary downloads by checking if a model is already cached:
```ts
import { createClient, hasModelInCache } from '@webllm-io/sdk';

const modelId = 'Llama-3.1-8B-Instruct-q4f16_1-MLC';

if (await hasModelInCache(modelId)) {
  console.log('Model is cached, initialization will be fast!');
} else {
  console.log('Model needs to be downloaded (~4.5GB)');
  // Show warning to user
}

const client = await createClient({
  local: { model: modelId }
});
```

Disable OPFS Caching (Testing Only)
For testing, you can disable persistent caching:
```ts
import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = await createClient({
  local: mlc({
    model: 'auto',
    useCache: false // Don't cache in OPFS
  })
});
```

Disabling the cache means the model will be re-downloaded on every page refresh. Use this only for testing.
Privacy Benefits
Local-only mode provides:
- ✅ Zero data transmission — All processing happens on-device
- ✅ No API keys required — No authentication needed
- ✅ Offline capable — Works without internet (after initial download)
- ✅ Full control — You own the inference pipeline
- ✅ No usage limits — No rate limiting or quotas
Requirements
- WebGPU support — Chrome 113+, Edge 113+, or compatible browser
- Sufficient VRAM — At least 2GB (Grade C devices supported)
- Storage space — 1.5GB to 8GB depending on model
- COOP/COEP headers — Required for SharedArrayBuffer (see FAQ)
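You can verify the browser-level requirements at runtime before initializing the client. The sketch below uses only standard web platform APIs (`navigator.gpu` for WebGPU, `self.crossOriginIsolated` for the COOP/COEP check); the `checkLocalRequirements` helper is illustrative and not part of the SDK.

```ts
// Illustrative pre-flight check for the browser-level requirements listed above.
// Only standard web platform APIs are used; this helper is not part of the SDK.
async function checkLocalRequirements(): Promise<string[]> {
  const problems: string[] = [];

  // WebGPU support (Chrome 113+, Edge 113+, or compatible).
  // The cast avoids a type error when @webgpu/types is not installed.
  const gpu = (navigator as any).gpu;
  if (!gpu) {
    problems.push('WebGPU is not available in this browser.');
  } else if (!(await gpu.requestAdapter())) {
    problems.push('WebGPU is available, but no suitable GPU adapter was found.');
  }

  // COOP/COEP headers: crossOriginIsolated is true only when both headers are set,
  // which is what makes SharedArrayBuffer available.
  if (!self.crossOriginIsolated) {
    problems.push('Page is not cross-origin isolated (missing COOP/COEP headers).');
  }

  return problems;
}

const problems = await checkLocalRequirements();
if (problems.length > 0) {
  console.warn('Local-only mode may not work here:', problems);
  // Fall back to a cloud provider or show guidance to the user.
}
```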
Next Steps
- Device Detection — Check capabilities before loading
- Cache Management — Manage downloaded models
- Hybrid Mode — Combine local and cloud for best of both worlds