Model Loading
WebLLM SDK handles model downloading, compilation, and caching automatically. This guide covers the loading lifecycle and how to monitor or customize it.
Progressive Loading
When both local and cloud are configured, the SDK uses a progressive loading strategy:
- Cloud responds immediately — no waiting for model download
- Local model downloads in background — via the MLC engine
- Hot-switch to local — once the model is ready, subsequent requests use local inference
This happens transparently. Your code stays the same:
```ts
const client = createClient({
  local: 'auto',
  cloud: { baseURL: 'https://api.example.com/v1' },
});

// First call → cloud (instant)
// After model loads → local (automatic)
const res = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
});
```

Monitoring Progress
Use the onProgress callback to track download, compilation, and warmup stages:
```ts
const client = createClient({
  local: 'auto',
  cloud: { baseURL: 'https://api.example.com/v1' },
  onProgress(progress) {
    console.log(
      `[${progress.stage}] ${progress.model}: ${(progress.progress * 100).toFixed(1)}%`
    );
    if (progress.bytesLoaded && progress.bytesTotal) {
      const mb = (progress.bytesLoaded / 1024 / 1024).toFixed(1);
      const total = (progress.bytesTotal / 1024 / 1024).toFixed(1);
      console.log(`  ${mb} MB / ${total} MB`);
    }
  },
});
```

LoadProgress Interface
```ts
interface LoadProgress {
  stage: 'download' | 'compile' | 'warmup';
  progress: number;     // 0 to 1
  model: string;        // Model ID
  bytesLoaded?: number; // Bytes downloaded so far
  bytesTotal?: number;  // Total bytes to download
}
```

Load States
Each model goes through these states:
| Status | Description |
|---|---|
| idle | Not started |
| downloading | Fetching model weights |
| compiling | Compiling model for WebGPU |
| ready | Model loaded and ready for inference |
| error | Loading failed |
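If your UI needs to show these states, one option is to derive them from the onProgress callback above. The sketch below is app-side code, not an SDK API; treating a completed warmup stage as ready is an assumption, and error would come from catching a failed load or request.

```ts
type LoadStatus = 'idle' | 'downloading' | 'compiling' | 'ready' | 'error';

// App-side view of each model's load state, derived from progress events.
const statusByModel = new Map<string, LoadStatus>();

const client = createClient({
  local: 'auto',
  cloud: { baseURL: 'https://api.example.com/v1' },
  onProgress(progress) {
    // Map the reported stage onto the states from the table above.
    if (progress.stage === 'download') {
      statusByModel.set(progress.model, 'downloading');
    } else if (progress.stage === 'compile') {
      statusByModel.set(progress.model, 'compiling');
    } else if (progress.stage === 'warmup' && progress.progress >= 1) {
      // Assumption: a completed warmup means the model is ready for inference.
      statusByModel.set(progress.model, 'ready');
    }
  },
});
```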
OPFS Caching
By default, models are cached in the browser’s Origin Private File System (OPFS). Unlike HTTP cache or localStorage, OPFS is:
- Persistent — not cleared by “Clear browsing data”
- Large capacity — can store multi-GB model files
- Per-origin — isolated to your domain
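Because OPFS usage counts against the origin's storage quota, it can be worth checking available space before a multi-GB download. Here is a minimal sketch that uses only standard navigator.storage browser APIs and is independent of the SDK:

```ts
// Check quota and request persistence with standard browser storage APIs.
async function checkStorage(): Promise<void> {
  // usage and quota are browser-reported estimates in bytes.
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  const toMB = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`Storage used: ${toMB(usage)} MB of ${toMB(quota)} MB`);

  // Ask the browser to mark this origin's storage (including OPFS) as persistent,
  // reducing the chance of eviction under storage pressure.
  const persisted = await navigator.storage.persist();
  console.log(persisted ? 'Storage marked persistent' : 'Storage may be evicted under pressure');
}
```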
Check Cache Status
```ts
import { hasModelInCache } from '@webllm-io/sdk';

const cached = await hasModelInCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');
if (cached) {
  console.log('Model is cached — loading will be fast');
}
```

Delete Cached Models
```ts
import { deleteModelFromCache } from '@webllm-io/sdk';

await deleteModelFromCache('Llama-3.1-8B-Instruct-q4f16_1-MLC');
```

Disable Caching
```ts
const client = createClient({
  local: {
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: false, // Always re-download
  },
});
```

Model Selection by Device Tier
When local: 'auto' is set, the SDK selects a default model based on the detected device grade:
| Grade | VRAM | Default Model |
|---|---|---|
| S | ≥8 GB | Large model (8B+) |
| A | ≥4 GB | 8B quantized model |
| B | ≥2 GB | Small model (3B or less) |
| C | <2 GB | Lightweight model (Qwen2.5-1.5B) |
Override with explicit tiers:
```ts
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Llama-3.2-3B-Instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    },
  },
});
```
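To confirm which tier was chosen on a given device, one option (a sketch, not a dedicated SDK feature) is to read the model field from the first onProgress event:

```ts
let loggedSelection = false;

const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Llama-3.2-3B-Instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC',
    },
  },
  onProgress(progress) {
    // The first progress event reports the model the SDK resolved for this device.
    if (!loggedSelection) {
      console.log(`Selected model: ${progress.model}`);
      loggedSelection = true;
    }
  },
});
```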