Hybrid Routing
Hybrid routing combines the privacy and cost benefits of local inference with the instant availability and power of cloud APIs. WebLLM.io automatically decides which backend to use based on device capabilities, model availability, and loading state.
Basic Hybrid Configuration
Enable both local and cloud backends:
```js
import { createClient } from '@webllm-io/sdk';

const client = createClient({
  local: 'auto', // Device-aware model selection
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});

// SDK automatically routes to the best available backend
const completion = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'Explain hybrid inference' }
  ]
});
```
Automatic Routing Logic
The SDK decides which backend to use based on:
- Device capability — If WebGPU is unavailable (`gpu === null`) or the battery is low, cloud is used
- Model loading state — Cloud responds instantly while the local model downloads
- Backend availability — If one backend fails, the other is tried automatically
- Explicit provider override — The request-level `provider` parameter forces a specific backend
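Putting these factors together, the decision order can be pictured roughly like this. The sketch below is illustrative only; the function and field names (`chooseBackend`, `state.hasWebGPU`, and so on) are hypothetical and this is not the SDK's actual implementation. Error-based fallback happens after this initial choice.

```js
// Illustrative only: hypothetical helper, not part of the SDK's public API.
function chooseBackend(request, state) {
  // 1. Explicit override always wins
  if (request.provider) return request.provider;

  // 2. No usable WebGPU or a low battery: prefer cloud when it is configured
  if ((!state.hasWebGPU || state.batteryLow) && state.cloudConfigured) return 'cloud';

  // 3. Local model still downloading: serve from cloud in the meantime
  if (state.cloudConfigured && !state.localModelReady) return 'cloud';

  // 4. Otherwise prefer local; cloud remains the fallback on errors
  return state.localConfigured ? 'local' : 'cloud';
}
```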
Routing by Device Grade
```js
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',   // S/A grade
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC', // B grade
      low: null                                    // C grade devices fall back to cloud
    }
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});
```
When `low` is set to `null`, C grade devices fall back to cloud. By default (without this override), all grades, including C, run local inference with a lightweight model.
Progressive Loading with Cloud Fallback
During model download, the cloud API serves requests immediately:
```js
const client = createClient({
  local: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  },
  onProgress: (progress) => {
    console.log(`Loading model: ${progress.progress}% (${progress.stage})`);
  }
});

// First request uses cloud (local model still downloading)
const completion1 = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'First message' }]
});
console.log('Backend:', completion1.model); // 'gpt-4o-mini'

// Wait for model to finish loading...
// Subsequent requests automatically use local model
const completion2 = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Second message' }]
});
console.log('Backend:', completion2.model); // 'Llama-3.1-8B-Instruct-q4f16_1-MLC'
```
This provides instant responsiveness while still leveraging local inference once available.
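The same `onProgress` callback can also drive a small UI indicator so users know when requests switch from cloud to on-device. A sketch, assuming the reported percentage reaches 100 once loading completes; `updateStatusBadge` and the `#backend-badge` element are hypothetical:

```js
import { createClient } from '@webllm-io/sdk';

// Hypothetical UI helper: swap in your own status display.
const updateStatusBadge = (text) => {
  document.querySelector('#backend-badge').textContent = text;
};

const client = createClient({
  local: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  },
  onProgress: (progress) => {
    // Assumption: the percentage reported here reaches 100 when loading completes
    if (progress.progress >= 100) {
      updateStatusBadge('On-device model ready');
    } else {
      updateStatusBadge(`Downloading model: ${progress.progress}% (${progress.stage})`);
    }
  }
});
```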
Automatic Fallback on Errors
If the primary backend fails, the SDK automatically tries the fallback:
```js
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});

// If local inference fails (e.g., out of memory), cloud is used
const completion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Complex task requiring lots of VRAM' }]
});
```
Fallback happens transparently without application code changes.
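Fallback only helps while at least one backend can serve the request; if both fail, the promise rejects like any other. A minimal application-level sketch for that case (nothing is assumed about the error beyond it being a standard thrown error):

```js
async function completeWithNotice(messages) {
  try {
    return await client.chat.completions.create({ messages });
  } catch (error) {
    // Reached only when neither local nor cloud could serve the request
    console.error('All inference backends failed:', error);
    throw error; // or surface a user-facing message here
  }
}
```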
Force a Specific Provider
Override automatic routing with the `provider` parameter:
```js
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});

// Force local inference
const localCompletion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Private data' }],
  provider: 'local'
});

// Force cloud API
const cloudCompletion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Complex reasoning' }],
  provider: 'cloud'
});
```
If the forced provider is unavailable, the request will fail instead of falling back.
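Because a forced provider never falls back, callers that merely prefer a backend (rather than require it) can retry explicitly. An application-level sketch, not built-in SDK behavior:

```js
async function createWithPreferredProvider(messages, preferred) {
  try {
    return await client.chat.completions.create({ messages, provider: preferred });
  } catch (error) {
    // The preferred backend was unavailable; retry once on the other one
    const fallback = preferred === 'local' ? 'cloud' : 'local';
    console.warn(`${preferred} backend failed, retrying on ${fallback}:`, error);
    return client.chat.completions.create({ messages, provider: fallback });
  }
}

// Prefer on-device inference, but accept cloud if it is unavailable
const completion = await createWithPreferredProvider(
  [{ role: 'user', content: 'Summarize this note' }],
  'local'
);
```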
User-Controlled Provider Selection
Let users choose their preferred inference backend:
```js
function createClientFromUserSettings(settings) {
  const config = {};

  if (settings.enableLocal) {
    config.local = settings.localModel || 'auto';
  }

  if (settings.enableCloud && settings.apiKey) {
    config.cloud = {
      baseURL: settings.cloudEndpoint,
      apiKey: settings.apiKey,
      model: settings.cloudModel
    };
  }

  return createClient(config);
}

// User preferences from UI
const userSettings = {
  enableLocal: true,
  localModel: 'auto',
  enableCloud: true,
  cloudEndpoint: 'https://api.openai.com/v1',
  apiKey: localStorage.getItem('openai_api_key'),
  cloudModel: 'gpt-4o-mini'
};

const client = createClientFromUserSettings(userSettings);
```
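If the user disables both backends, or enables cloud without an API key, `config` ends up empty. A small guard run before creating the client avoids a confusing failure later; this is an application-level check, not something the SDK requires:

```js
// Application-level guard: require at least one usable backend in the user settings.
function assertUsableSettings(settings) {
  const hasLocal = Boolean(settings.enableLocal);
  const hasCloud = Boolean(settings.enableCloud && settings.apiKey);
  if (!hasLocal && !hasCloud) {
    throw new Error('Enable at least one backend: local inference, or cloud with an API key.');
  }
}

assertUsableSettings(userSettings); // throws before createClientFromUserSettings runs
```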
Cost Optimization
Use local inference by default and cloud as a fallback to minimize API costs:
```js
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',   // S/A grade
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC', // B grade
      low: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'     // C grade — still local
    }
  },
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});

// Use local when possible to save API costs
// Cloud only used during model loading or on errors
```
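Applications can also steer individual requests with the `provider` parameter, keeping routine prompts on the free local model and reserving the paid cloud model for heavier ones. The length threshold below is an arbitrary illustration, not an SDK feature:

```js
// Application-level heuristic: short prompts stay on-device, long ones use the cloud model.
// The 2000-character cutoff is arbitrary; tune it for your workload.
async function costAwareCompletion(messages) {
  const totalChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return client.chat.completions.create({
    messages,
    provider: totalChars > 2000 ? 'cloud' : 'local'
  });
}
```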
Privacy-First Routing
Enforce local-only inference for sensitive data:
```js
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
});

function createCompletion(messages, isPrivate) {
  return client.chat.completions.create({
    messages,
    provider: isPrivate ? 'local' : undefined // Force local for private data
  });
}

// Private data stays on device
await createCompletion([
  { role: 'user', content: 'Analyze my medical records: ...' }
], true);

// Non-sensitive data can use cloud
await createCompletion([
  { role: 'user', content: 'What is the capital of France?' }
], false);
```
Detecting Active Backend
Check which model responded:
```js
const completion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log('Model used:', completion.model);

if (completion.model.includes('MLC')) {
  console.log('Local inference was used');
} else {
  console.log('Cloud API was used');
}
```
For streaming responses, check the first chunk:
```js
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true
});

for await (const chunk of stream) {
  if (chunk.model) {
    console.log('Backend:', chunk.model);
    break; // Model name is in the first chunk
  }
}
```
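For analytics or a UI badge it can help to centralize this check. A small helper based on the naming convention used above (local MLC model IDs contain 'MLC'):

```js
// Classify which backend served a response, based on the model identifier.
function backendFromModel(model) {
  return model && model.includes('MLC') ? 'local' : 'cloud';
}

const completion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log('Served by:', backendFromModel(completion.model)); // 'local' or 'cloud'
```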
Next Steps
- Learn about Streaming for real-time responses
- Explore Model Loading for progressive loading strategies
- See Device Capability to check device compatibility