Local-Only Mode

Local-only mode ensures complete privacy by running all inference on the user’s device. No data is ever sent to external servers.

Basic Local-Only Setup

import { createClient } from '@webllm-io/sdk';
const client = await createClient({
  local: 'auto' // No cloud configuration
});

const response = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'What is the capital of France?' }
  ]
});
console.log(response.choices[0].message.content);

Showing Model Download Progress

When a model is loaded for the first time, it needs to be downloaded. Display progress to users:

import { createClient } from '@webllm-io/sdk';
// Grab the progress UI elements (assumed to already exist in the page)
const progressBar = document.getElementById('progress-bar');
const statusText = document.getElementById('status-text');
const client = await createClient({
  local: {
    model: 'auto', // Auto-select based on device
    onProgress: (report) => {
      // Update progress UI
      const percent = Math.round(report.progress * 100);
      progressBar.style.width = `${percent}%`;
      statusText.textContent = report.text;
      console.log(`${report.text} - ${percent}%`);
    }
  }
});

// Hide progress UI once loaded
progressBar.parentElement.style.display = 'none';

// Client is now ready to use
const response = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }]
});

Progress Report Structure

The onProgress callback receives reports with the following structure:

{
  progress: 0.75, // 0.0 to 1.0
  text: "Loading model weights...",
  // Additional fields may vary by stage
}
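
If several parts of your UI consume these reports, it can help to centralize the formatting. The sketch below relies only on the progress and text fields shown above and treats anything else as optional; the formatProgress name is illustrative, not part of the SDK:

// Turn a progress report into a single status line.
// Only `progress` and `text` are relied on; other fields vary by stage,
// so both are defaulted defensively.
function formatProgress(report) {
  const percent = Math.round((report.progress ?? 0) * 100);
  return `${report.text ?? 'Loading...'} (${percent}%)`;
}

// Usage inside the local config:
// onProgress: (report) => { statusText.textContent = formatProgress(report); }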

Explicit Model Selection

Instead of 'auto', you can specify an exact model:

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';
const client = await createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    onProgress: (report) => {
      console.log(report.text, `${Math.round(report.progress * 100)}%`);
    }
  })
});

Disable Web Worker (Advanced)

By default, inference runs in a Web Worker so that model execution does not block the main thread. You can disable this for debugging:

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = await createClient({
  local: mlc({
    model: 'auto',
    useWebWorker: false // Run in main thread (may block UI)
  })
});

Check Model Cache Before Initialization

Avoid unnecessary downloads by checking if a model is already cached:

import { createClient, hasModelInCache } from '@webllm-io/sdk';
const modelId = 'Llama-3.1-8B-Instruct-q4f16_1-MLC';
if (await hasModelInCache(modelId)) {
  console.log('Model is cached, initialization will be fast!');
} else {
  console.log('Model needs to be downloaded (~4.5GB)');
  // Show warning to user
}

const client = await createClient({
  local: { model: modelId }
});
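
Building on the same check, one way to surface that warning is to ask for confirmation before triggering the first download. The initWithConsent helper and the confirm() prompt below are illustrative, not part of the SDK:

import { createClient, hasModelInCache } from '@webllm-io/sdk';

const modelId = 'Llama-3.1-8B-Instruct-q4f16_1-MLC';

async function initWithConsent() {
  if (!(await hasModelInCache(modelId))) {
    // First run: the model (~4.5GB) has to come over the network once,
    // then it is served from cache on later visits.
    const ok = window.confirm(
      'This model is about 4.5GB and will be downloaded once, then cached. Continue?'
    );
    if (!ok) return null;
  }
  // Cached or confirmed: initialize as usual
  return createClient({ local: { model: modelId } });
}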

Disable OPFS Caching (Testing Only)

For testing, you can disable persistent caching:

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

const client = await createClient({
  local: mlc({
    model: 'auto',
    useCache: false // Don't cache in OPFS
  })
});

Disabling the cache means the model is re-downloaded on every page refresh, so only use this setting for testing.

Privacy Benefits

Local-only mode provides:

  • Zero data transmission — All processing happens on-device
  • No API keys required — No authentication needed
  • Offline capable — Works without internet (after initial download)
  • Full control — You own the inference pipeline
  • No usage limits — No rate limiting or quotas

Requirements

  • WebGPU support — Chrome 113+, Edge 113+, or compatible browser
  • Sufficient VRAM — At least 2GB (Grade C devices supported)
  • Storage space — 1.5GB to 8GB depending on model
  • COOP/COEP headers — Required for SharedArrayBuffer (see FAQ)
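
You can check most of these requirements at runtime before initializing a client. The sketch below uses only standard browser APIs (navigator.gpu, crossOriginIsolated, navigator.storage.estimate()); the thresholds and the canRunLocally name are illustrative, and VRAM cannot be queried directly, so the adapter check is only a proxy:

// Rough pre-flight check for local inference.
async function canRunLocally() {
  // WebGPU support (Chrome 113+, Edge 113+, or compatible browsers)
  if (!('gpu' in navigator)) {
    return { ok: false, reason: 'WebGPU is not available in this browser' };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: 'No WebGPU adapter found' };
  }

  // COOP/COEP headers: without them SharedArrayBuffer is unavailable
  if (!crossOriginIsolated) {
    return { ok: false, reason: 'Page is not cross-origin isolated (check COOP/COEP headers)' };
  }

  // Storage: models need roughly 1.5GB to 8GB of free space
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  if (quota - usage < 1.5e9) {
    return { ok: false, reason: 'Less than ~1.5GB of storage available' };
  }

  return { ok: true };
}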

Next Steps