Three-Level API
WebLLM.io provides three levels of API complexity, allowing you to start simple and progressively opt into more control as your needs evolve. Each level builds on the previous one, providing a smooth learning curve.
API Levels Overview
```
Level 1: Zero Config
   ↓  (add device-specific model tiers)
Level 2: Responsive
   ↓  (add explicit provider functions)
Level 3: Full Control
```

Level 1: Zero Config
When to use: Prototyping, simple demos, or when you trust the SDK’s defaults.
The simplest possible configuration. Just specify 'auto' for local inference:
```ts
import { createClient } from '@webllm-io/sdk'

const client = createClient({ local: 'auto' })

// That's it! The SDK handles everything:
// - WebGPU detection
// - VRAM estimation
// - Device scoring
// - Model selection
// - Progressive loading
```

What Happens Automatically
- Hardware Detection
  - Checks if WebGPU is available
  - Reads `maxStorageBufferBindingSize` to estimate VRAM
  - Assigns a device grade (S/A/B/C)
- Model Selection
  - S grade (≥8GB): `Llama-3.1-8B-Instruct-q4f16_1-MLC`
  - A grade (≥4GB): `Llama-3.1-8B-Instruct-q4f16_1-MLC`
  - B grade (≥2GB): `Phi-3.5-mini-instruct-q4f16_1-MLC`
  - C grade (<2GB): `Qwen2.5-1.5B-Instruct-q4f16_1-MLC`
- Worker Initialization
  - Spawns a Web Worker for non-blocking inference
  - Configures the OPFS cache by default
- Fallback Strategy
  - If local inference is unavailable, falls back to cloud (if configured)
  - Transparent error recovery
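To make the detection-and-grading step concrete, here is a minimal sketch of the logic described above. It is illustrative only: `estimateVRAM` and `gradeDevice` are hypothetical names, not SDK exports, and the SDK performs the equivalent work internally when you pass `local: 'auto'`.

```ts
// Illustrative sketch only — not part of @webllm-io/sdk.
type DeviceGrade = 'S' | 'A' | 'B' | 'C'

async function estimateVRAM(): Promise<number | null> {
  // Proper WebGPU typings come from @webgpu/types; cast keeps this sketch standalone.
  const gpu = (navigator as any).gpu
  if (!gpu) return null                      // No WebGPU: local inference is unavailable
  const adapter = await gpu.requestAdapter()
  if (!adapter) return null
  // maxStorageBufferBindingSize serves as a rough proxy for usable VRAM
  return adapter.limits.maxStorageBufferBindingSize as number
}

function gradeDevice(vramBytes: number): DeviceGrade {
  const gb = vramBytes / 1024 ** 3
  if (gb >= 8) return 'S'
  if (gb >= 4) return 'A'
  if (gb >= 2) return 'B'
  return 'C'
}
```

The resulting grade then selects one of the default models listed above.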
Zero Config with Cloud Fallback
```ts
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
})
```

Now the SDK will:
- Try local inference first (auto-selected model)
- Fall back to `gpt-4o-mini` if local fails or is unavailable
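From the caller's side the fallback is invisible. The snippet below is a rough illustration only: this section does not document the request API, so the OpenAI-style `chat.completions.create` call is an assumption, not a confirmed SDK method.

```ts
// Hypothetical usage: assumes an OpenAI-style chat.completions.create method on the client.
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
  stream: true
})

let text = ''
for await (const chunk of stream) {
  // Chunks come from the auto-selected local model when available, otherwise from gpt-4o-mini.
  text += chunk.choices?.[0]?.delta?.content ?? ''
}
console.log(text)
```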
Level 2: Responsive
When to use: Production apps targeting diverse devices (desktop, tablet, mobile) with different hardware capabilities.
Responsive configuration lets you define model tiers for different device grades while still letting the SDK handle detection and selection:
```ts
import { createClient } from '@webllm-io/sdk'

const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',    // S/A grade
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',  // B grade
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'      // C grade
    }
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
})
```

Tier Mapping
The SDK maps device grades to tiers:
| Device Grade | VRAM Range | Selected Tier |
|---|---|---|
| S | ≥8GB | high |
| A | ≥4GB | high |
| B | ≥2GB | medium |
| C | <2GB | low |
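Conceptually, tier resolution is just a lookup from the detected grade into your `tiers` config. The helper below is a sketch mirroring the table above, not an SDK export:

```ts
// Sketch only: resolveTierModel is a hypothetical helper, not part of @webllm-io/sdk.
type DeviceGrade = 'S' | 'A' | 'B' | 'C'

interface TierModels {
  high: string
  medium: string
  low: string
}

function resolveTierModel(grade: DeviceGrade, tiers: TierModels): string {
  switch (grade) {
    case 'S':
    case 'A':
      return tiers.high    // ≥4GB VRAM
    case 'B':
      return tiers.medium  // ≥2GB VRAM
    case 'C':
      return tiers.low     // <2GB VRAM
  }
}
```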
Additional Configuration Options
You can customize more aspects at this level:
```ts
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
    },
    useWorker: true,  // Run in Web Worker (default: true)
    useCache: true    // Enable OPFS cache (default: true)
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
    timeout: 30000,   // Request timeout in ms
    maxRetries: 2     // Retry count for network failures
  }
})
```

When to Use Responsive Config
Responsive configuration is ideal when:
- Your app targets multiple device types (desktop, tablet, mobile)
- You want to optimize for quality on high-end devices
- You need fast inference on low-end devices
- You want automatic device-appropriate model selection
- You still want the SDK to handle hardware detection
Level 3: Full Control
When to use: Advanced use cases requiring explicit control over providers, custom model selection, or integration with non-standard backends.
Full control configuration uses explicit provider functions:
```ts
import { createClient } from '@webllm-io/sdk'
import { mlc } from '@webllm-io/sdk/providers/mlc'
import { fetchSSE } from '@webllm-io/sdk/providers/fetch'

const client = createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useWorker: true,
    useCache: false,                 // Disable cache for always-fresh downloads
    workerUrl: '/custom-worker.js',  // Custom worker script
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text}`)
    }
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini',
    headers: { 'X-Custom-Header': 'value' },
    fetch: customFetchImpl           // Custom fetch implementation
  })
})
```

Provider Functions
mlc() Provider
The mlc() provider wraps @mlc-ai/web-llm for local inference:
```ts
import { mlc } from '@webllm-io/sdk/providers/mlc'

const localProvider = mlc({
  model: string,            // Required: MLC model ID
  useWorker?: boolean,      // Default: true
  useCache?: boolean,       // Default: true
  workerUrl?: string,       // Custom worker script
  initProgressCallback?: (report) => void,
  logLevel?: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'
})
```

fetchSSE() Provider
The fetchSSE() provider implements OpenAI-compatible streaming:
```ts
import { fetchSSE } from '@webllm-io/sdk/providers/fetch'

const cloudProvider = fetchSSE({
  baseURL: string,                   // Required: API endpoint
  apiKey?: string,                   // Authentication key
  model?: string,                    // Default model
  timeout?: number,                  // Request timeout (ms)
  maxRetries?: number,               // Retry count
  headers?: Record<string, string>,  // Custom headers
  fetch?: typeof fetch               // Custom fetch impl
})
```
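Because fetchSSE() speaks the OpenAI wire format, it also works against self-hosted OpenAI-compatible servers. The endpoint and model name below are placeholders for illustration, not values the SDK ships with:

```ts
// Placeholder endpoint and model: substitute whatever your OpenAI-compatible server exposes.
const selfHosted = fetchSSE({
  baseURL: 'http://localhost:8000/v1',
  model: 'my-served-model',
  timeout: 60000,   // Allow more time for slower self-hosted inference
  maxRetries: 1
})

const client = createClient({
  cloud: selfHosted
})
```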
Custom Provider Function

You can implement custom providers for non-standard backends:
```ts
import { createClient, CloudFn } from '@webllm-io/sdk'

const customCloudProvider: CloudFn = async ({ messages, options }) => {
  // Custom implementation
  const response = await fetch('https://my-api.com/chat', {
    method: 'POST',
    body: JSON.stringify({ messages, ...options })
  })

  // Return AsyncIterable<string> for streaming
  return {
    async *[Symbol.asyncIterator]() {
      const reader = response.body!.getReader()
      const decoder = new TextDecoder()
      while (true) {
        const { done, value } = await reader.read()
        if (done) break
        // stream: true keeps multi-byte characters intact across chunk boundaries
        yield decoder.decode(value, { stream: true })
      }
    }
  }
}

const client = createClient({
  cloud: customCloudProvider
})
```

Disabling Local or Cloud
You can explicitly disable one engine:
```ts
// Cloud-only (no local inference)
const client = createClient({
  local: false,
  cloud: fetchSSE({ /* ... */ })
})
```

```ts
// Local-only (no cloud fallback)
const client = createClient({
  local: mlc({ /* ... */ }),
  cloud: false
})
```

When to Use Full Control
Full control configuration is necessary when:
- You need a specific model regardless of device grade
- You’re integrating with a custom or self-hosted backend
- You want to disable caching for development
- You need custom progress callbacks or logging
- You’re implementing a non-standard provider
- You want to benchmark different configurations
Choosing the Right Level
| Use Case | Recommended Level |
|---|---|
| Quick prototype | Zero Config |
| Simple demo | Zero Config |
| Multi-device production app | Responsive |
| Device-specific optimization | Responsive |
| Custom backend integration | Full Control |
| Specific model requirement | Full Control |
| Advanced debugging | Full Control |
| Non-standard provider | Full Control |

Migration Path
You can smoothly migrate between levels as your needs evolve:
From Zero Config to Responsive
```ts
// Before (Zero Config)
const client = createClient({ local: 'auto' })
```

```ts
// After (Responsive)
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC',
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'
    }
  }
})
```

From Responsive to Full Control
```ts
// Before (Responsive)
const client = createClient({
  local: {
    tiers: { high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC' }
  }
})
```

```ts
// After (Full Control)
const client = createClient({
  local: mlc({ model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC' })
})
```

Next Steps
- Learn about Device Scoring for tier selection
- Understand Provider Composition patterns
- Explore the Request Queue for concurrent requests