
Three-Level API

WebLLM.io provides three levels of API complexity, allowing you to start simple and progressively opt into more control as your needs evolve. Each level builds on the previous one, providing a smooth learning curve.

API Levels Overview

Level 1: Zero Config
↓ (add device-specific model tiers)
Level 2: Responsive
↓ (add explicit provider functions)
Level 3: Full Control

Level 1: Zero Config

When to use: Prototyping, simple demos, or when you trust the SDK’s defaults.

The simplest possible configuration just specifies 'auto' for local inference:

import { createClient } from '@webllm-io/sdk'
const client = createClient({
  local: 'auto'
})
// That's it! The SDK handles everything:
// - WebGPU detection
// - VRAM estimation
// - Device scoring
// - Model selection
// - Progressive loading

What Happens Automatically

  1. Hardware Detection

    • Checks if WebGPU is available
    • Reads maxStorageBufferBindingSize to estimate VRAM
    • Assigns device grade (S/A/B/C)
  2. Model Selection

    • S grade (≥8GB): Llama-3.1-8B-Instruct-q4f16_1-MLC
    • A grade (≥4GB): Llama-3.1-8B-Instruct-q4f16_1-MLC
    • B grade (≥2GB): Phi-3.5-mini-instruct-q4f16_1-MLC
    • C grade (<2GB): Qwen2.5-1.5B-Instruct-q4f16_1-MLC
  3. Worker Initialization

    • Spawns Web Worker for non-blocking inference
    • Configures OPFS cache by default
  4. Fallback Strategy

    • If local inference unavailable, falls back to cloud (if configured)
    • Transparent error recovery
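
If you want to see roughly what the detection step does, the sketch below reproduces it with plain WebGPU calls. This is an illustrative approximation, not the SDK's internal code: estimateGrade is a hypothetical helper, the thresholds mirror the grade table above, and WebGPU type definitions (e.g. @webgpu/types) are assumed to be available.

type DeviceGrade = 'S' | 'A' | 'B' | 'C' | 'unsupported'

// Approximate the SDK's hardware-detection step with plain WebGPU.
// (Hypothetical helper; maxStorageBufferBindingSize is only a rough
// proxy for usable VRAM, which is why the SDK treats it as an estimate.)
async function estimateGrade(): Promise<DeviceGrade> {
  if (!('gpu' in navigator)) return 'unsupported'    // No WebGPU support
  const adapter = await navigator.gpu.requestAdapter()
  if (!adapter) return 'unsupported'                 // No suitable adapter

  const gb = adapter.limits.maxStorageBufferBindingSize / (1024 ** 3)
  if (gb >= 8) return 'S'
  if (gb >= 4) return 'A'
  if (gb >= 2) return 'B'
  return 'C'
}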

Zero Config with Cloud Fallback

const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
})

Now the SDK will:

  • Try local inference first (auto-selected model)
  • Fall back to gpt-4o-mini if local fails or is unavailable
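
Whichever engine ends up serving the request, application code stays the same. The snippet below is a usage sketch that assumes the client exposes an OpenAI-style chat.completions.create method with streaming; if the SDK's chat surface differs, adapt accordingly.

// Usage sketch (assumes an OpenAI-style chat API on the client object;
// local vs. cloud routing is handled transparently by the SDK).
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
  stream: true
})

let reply = ''
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content ?? ''
}
console.log(reply)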

Level 2: Responsive

When to use: Production apps targeting diverse devices (desktop, tablet, mobile) with different hardware capabilities.

Responsive configuration lets you define model tiers for different device grades while still letting the SDK handle detection and selection:

import { createClient } from '@webllm-io/sdk'
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',    // S/A grade
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',  // B grade
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'      // C grade
    }
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
})

Tier Mapping

The SDK maps device grades to tiers:

Device Grade   VRAM Range   Selected Tier
S              ≥8GB         high
A              ≥4GB         high
B              ≥2GB         medium
C              <2GB         low
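
Conceptually, tier selection is just a lookup from device grade to tier, followed by a lookup into your tiers object. A minimal sketch (illustrative names only, not SDK exports):

// Illustrative grade → tier → model lookup; not an SDK export.
type Grade = 'S' | 'A' | 'B' | 'C'
type Tier = 'high' | 'medium' | 'low'

const gradeToTier: Record<Grade, Tier> = { S: 'high', A: 'high', B: 'medium', C: 'low' }

function selectModel(grade: Grade, tiers: Record<Tier, string>): string {
  return tiers[gradeToTier[grade]]
}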

Additional Configuration Options

You can customize more aspects at this level:

const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
    },
    useWorker: true,  // Run in a Web Worker (default: true)
    useCache: true    // Enable OPFS cache (default: true)
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
    timeout: 30000,   // Request timeout in ms
    maxRetries: 2     // Retry count for network failures
  }
})

When to Use Responsive Config

Responsive configuration is ideal when:

  • Your app targets multiple device types (desktop, tablet, mobile)
  • You want to optimize for quality on high-end devices
  • You need fast inference on low-end devices
  • You want automatic device-appropriate model selection
  • You still want the SDK to handle hardware detection

Level 3: Full Control

When to use: Advanced use cases requiring explicit control over providers, custom model selection, or integration with non-standard backends.

Full control configuration uses explicit provider functions:

import { createClient } from '@webllm-io/sdk'
import { mlc } from '@webllm-io/sdk/providers/mlc'
import { fetchSSE } from '@webllm-io/sdk/providers/fetch'
const client = createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useWorker: true,
    useCache: false,                 // Disable cache for always-fresh downloads
    workerUrl: '/custom-worker.js',  // Custom worker script
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text}`)
    }
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
    headers: {
      'X-Custom-Header': 'value'
    },
    fetch: customFetchImpl  // Custom fetch implementation
  })
})

Provider Functions

mlc() Provider

The mlc() provider wraps @mlc-ai/web-llm for local inference:

import { mlc } from '@webllm-io/sdk/providers/mlc'
const localProvider = mlc({
  model: string,                  // Required: MLC model ID
  useWorker?: boolean,            // Default: true
  useCache?: boolean,             // Default: true
  workerUrl?: string,             // Custom worker script
  initProgressCallback?: (report) => void,
  logLevel?: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'
})

fetchSSE() Provider

The fetchSSE() provider implements OpenAI-compatible streaming:

import { fetchSSE } from '@webllm-io/sdk/providers/fetch'
const cloudProvider = fetchSSE({
  baseURL: string,                   // Required: API endpoint
  apiKey?: string,                   // Authentication key
  model?: string,                    // Default model
  timeout?: number,                  // Request timeout (ms)
  maxRetries?: number,               // Retry count
  headers?: Record<string, string>,  // Custom headers
  fetch?: typeof fetch               // Custom fetch impl
})
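
Because the provider only expects an OpenAI-compatible streaming endpoint, it can also point at a self-hosted server. The example below is a sketch; the URL and model name are placeholders, not real endpoints.

import { createClient } from '@webllm-io/sdk'
import { fetchSSE } from '@webllm-io/sdk/providers/fetch'

// Placeholder endpoint and model name; substitute your own deployment.
const selfHosted = fetchSSE({
  baseURL: 'http://localhost:8000/v1',
  model: 'my-self-hosted-model',
  timeout: 60000,
  maxRetries: 1
})

const client = createClient({ cloud: selfHosted })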

Custom Provider Function

You can implement custom providers for non-standard backends:

import { createClient, CloudFn } from '@webllm-io/sdk'

const customCloudProvider: CloudFn = async ({ messages, options }) => {
  // Forward the request to a custom backend
  const response = await fetch('https://my-api.com/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages, ...options })
  })
  if (!response.ok || response.body === null) {
    throw new Error(`Upstream request failed: ${response.status}`)
  }
  const body = response.body

  // Return an AsyncIterable<string> for streaming
  return {
    async *[Symbol.asyncIterator]() {
      const reader = body.getReader()
      const decoder = new TextDecoder()
      while (true) {
        const { done, value } = await reader.read()
        if (done) break
        // stream: true handles multi-byte characters split across chunks
        yield decoder.decode(value, { stream: true })
      }
    }
  }
}

const client = createClient({
  cloud: customCloudProvider
})
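
Because a custom provider is just a function, you can exercise it directly before wiring it into the client. The messages and options values below are hypothetical minimal examples of what the SDK might pass.

// Calling the provider directly to verify that it streams text chunks.
const stream = await customCloudProvider({
  messages: [{ role: 'user', content: 'ping' }],
  options: { temperature: 0.7 }
})

let text = ''
for await (const chunk of stream) {
  text += chunk
}
console.log(text)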

Disabling Local or Cloud

You can explicitly disable one engine:

// Cloud-only (no local inference)
const client = createClient({
  local: false,
  cloud: fetchSSE({ /* ... */ })
})

// Local-only (no cloud fallback)
const client = createClient({
  local: mlc({ /* ... */ }),
  cloud: false
})

When to Use Full Control

Full control configuration is necessary when:

  • You need a specific model regardless of device grade
  • You’re integrating with a custom or self-hosted backend
  • You want to disable caching for development
  • You need custom progress callbacks or logging
  • You’re implementing a non-standard provider
  • You want to benchmark different configurations

Choosing the Right Level

┌──────────────────────────────────────────────────┐
│ Use Case                     → Recommended Level │
├──────────────────────────────────────────────────┤
│ Quick prototype              → Zero Config       │
│ Simple demo                  → Zero Config       │
│ Multi-device production app  → Responsive        │
│ Device-specific optimization → Responsive        │
│ Custom backend integration   → Full Control      │
│ Specific model requirement   → Full Control      │
│ Advanced debugging           → Full Control      │
│ Non-standard provider        → Full Control      │
└──────────────────────────────────────────────────┘

Migration Path

You can smoothly migrate between levels as your needs evolve:

From Zero Config to Responsive

// Before (Zero Config)
const client = createClient({ local: 'auto' })

// After (Responsive)
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
    }
  }
})

From Responsive to Full Control

// Before (Responsive)
const client = createClient({
  local: {
    tiers: { high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC' }
  }
})

// After (Full Control)
const client = createClient({
  local: mlc({ model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC' })
})

Next Steps