Architecture
WebLLM.io is designed as a dual-engine AI inference SDK that seamlessly combines local browser-based inference with cloud API fallback. This architecture enables progressive enhancement: applications work everywhere with cloud APIs, but leverage local GPU acceleration when available.
High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                       │
│                 createClient() → WebLLMClient                │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                     Hardware Fingerprint                     │
│              WebGPU Detection + VRAM Estimation              │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                        Device Scoring                        │
│            S (≥8GB) / A (≥4GB) / B (≥2GB) / C (<2GB)         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                    Route Decision Engine                     │
│   Consider: Device Grade, Battery, Network, Backend Status   │
└─────────────────────────────────────────────────────────────┘
                               ↓
                  ┌────────────┴────────────┐
                  ↓                         ↓
   ┌────────────────────────┐   ┌────────────────────────┐
   │     Local Backend      │   │     Cloud Backend      │
   │  WebGPU + MLC Engine   │   │   fetchSSE + OpenAI    │
   │    (in Web Worker)     │   │    Compatible API      │
   └────────────────────────┘   └────────────────────────┘
```

Module Layout
The SDK is organized into eight core modules:
1. Core Module
Location: packages/sdk/src/core/
Exports the primary createClient() factory function and fundamental types. Handles configuration resolution, normalization, and validation. Defines error classes and client lifecycle management.
Key responsibilities:
- Client factory and initialization
- Configuration type definitions and resolution
- Error handling abstractions
- Core interfaces and contracts
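A minimal usage sketch of the factory is shown below. It follows the configuration shape used later in this document; named exports and everything beyond that shape are assumptions for illustration.

```ts
// Minimal client creation sketch; option names mirror the examples later in
// this document, everything else is illustrative.
import { createClient } from '@webllm-io/sdk';
import { fetchSSE } from '@webllm-io/sdk/providers/fetch';

const client = createClient({
  local: 'auto', // let the SDK detect WebGPU and pick a model
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
  }),
});
```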
2. Capability Module
Location: packages/sdk/src/capability/
Detects WebGPU availability, estimates VRAM through maxStorageBufferBindingSize, and assigns device grades (S/A/B/C). This module acts as the hardware fingerprinting layer.
Key responsibilities:
- WebGPU feature detection
- VRAM estimation via WebGPU limits
- Device scoring algorithm
- Compatibility checks
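As a rough illustration of this flow, the sketch below performs the same checks with the raw WebGPU API. The SDK's actual checkCapability() implementation and grading thresholds may differ; maxStorageBufferBindingSize is only a coarse proxy for VRAM.

```ts
// Illustrative capability check (requires @webgpu/types for navigator.gpu);
// the SDK's real checkCapability() and thresholds may differ.
type DeviceGrade = 'S' | 'A' | 'B' | 'C';

async function detectCapability(): Promise<{ webgpu: boolean; grade: DeviceGrade }> {
  // WebGPU feature detection
  if (!('gpu' in navigator)) return { webgpu: false, grade: 'C' };

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return { webgpu: false, grade: 'C' };

  // Browsers do not expose VRAM directly; maxStorageBufferBindingSize serves
  // as a coarse proxy for how large a model the device can handle.
  const estimate = adapter.limits.maxStorageBufferBindingSize;

  const GB = 1024 ** 3;
  const grade: DeviceGrade =
    estimate >= 8 * GB ? 'S' :
    estimate >= 4 * GB ? 'A' :
    estimate >= 2 * GB ? 'B' : 'C';

  return { webgpu: true, grade };
}
```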
3. Providers Module
Location: packages/sdk/src/providers/
Contains provider implementations for local and cloud backends. The mlc() provider wraps @mlc-ai/web-llm for local inference. The fetchSSE() provider implements OpenAI-compatible server-sent events parsing without external dependencies.
Key responsibilities:
- mlc() provider for local MLC inference
- fetchSSE() provider for cloud streaming APIs
- Provider composition and wrapping logic
- Custom provider interface definitions
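The provider contract itself is not reproduced here. The sketch below shows a hypothetical custom provider shape to illustrate the idea of pluggable backends; it is not the SDK's actual interface.

```ts
// Hypothetical custom provider; the real interface lives in
// packages/sdk/src/providers/ and may look different.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface StreamingProvider {
  // Assumed contract: turn a message list into a stream of text deltas.
  stream(messages: ChatMessage[], signal?: AbortSignal): AsyncIterable<string>;
}

// Toy provider that echoes the last user message, word by word.
const echoProvider: StreamingProvider = {
  async *stream(messages) {
    const last = messages.at(-1)?.content ?? '';
    for (const word of last.split(' ')) yield word + ' ';
  },
};
```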
4. Inference Module
Location: packages/sdk/src/inference/
Defines the InferenceBackend interface and implements local/cloud backend adapters. Includes the RequestQueue that serializes concurrent requests to the single-threaded MLC engine.
Key responsibilities:
- InferenceBackend interface contract
- Local backend (MLC) adapter
- Cloud backend (fetch) adapter
- Request queue for serialization
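The queue itself is a small serialization primitive. The following is a sketch of the idea, not the SDK's exact implementation:

```ts
// Serialize async jobs so only one request reaches the single-threaded MLC
// engine at a time (sketch only).
class RequestQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(job: () => Promise<T>): Promise<T> {
    // Chain the new job after whatever is currently running; keep the chain
    // alive even if the previous job rejected.
    const result = this.tail.then(job, job);
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Usage: two generate() calls run one after another, never concurrently.
// const queue = new RequestQueue();
// queue.enqueue(() => engine.generate(promptA));
// queue.enqueue(() => engine.generate(promptB));
```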
5. Router Module
Location: packages/sdk/src/router/
Implements the route decision engine that determines whether to use local or cloud inference for each request. Considers device grade, backend readiness, battery status, and network conditions.
Key responsibilities:
- Route decision algorithm
- Fallback strategy management
- Backend health monitoring
- Dynamic routing based on runtime conditions
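A simplified decision function covering the inputs listed above might look like this; the real algorithm and thresholds live in the router module.

```ts
// Illustrative route decision; thresholds and tie-breaking are assumptions.
type DeviceGrade = 'S' | 'A' | 'B' | 'C';

interface RouteContext {
  localReady: boolean;  // MLC engine loaded and healthy
  grade: DeviceGrade;   // from the capability module
  batteryLow: boolean;  // e.g. discharging and below a threshold
  online: boolean;      // e.g. navigator.onLine
}

function decideRoute(ctx: RouteContext): 'local' | 'cloud' {
  const gradeOk = ctx.grade !== 'C'; // assumption: grade C always routes to cloud
  if (ctx.localReady && gradeOk && !ctx.batteryLow) return 'local';
  if (ctx.online) return 'cloud';
  // Offline and local not preferred: still try local rather than fail outright.
  return 'local';
}
```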
6. Chat Module
Location: packages/sdk/src/chat/
Implements the OpenAI-compatible chat.completions API with automatic fallback logic. Handles streaming and non-streaming responses, abort signals, and error recovery.
Key responsibilities:
- chat.completions.create() API
- Streaming and non-streaming response handling
- Automatic local-to-cloud fallback
- Abort signal propagation
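Streaming usage looks roughly like the sketch below, assuming the OpenAI-compatible surface described above; the chunk shape and the placement of the abort signal are assumptions.

```ts
// `client` created via createClient() as shown earlier.
const controller = new AbortController();

const stream = await client.chat.completions.create(
  {
    messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
    stream: true,
  },
  { signal: controller.signal } // assumption: request options carry the signal
);

let text = '';
for await (const chunk of stream) {
  // Assumed to mirror OpenAI's delta format for streamed chunks.
  text += chunk.choices[0]?.delta?.content ?? '';
}
// controller.abort() would cancel the in-flight request on either backend.
```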
7. Loader Module
Location: packages/sdk/src/loader/
Manages progressive model loading with OPFS (Origin Private File System) caching. Tracks load state, emits progress events, and handles cache invalidation.
Key responsibilities:
- Progressive model download and initialization
- OPFS cache management
- Load state tracking and events
- Cache hit/miss optimization
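The OPFS portion can be illustrated with the standard browser API. File layout and function names below are illustrative, not the loader's actual internals.

```ts
// Sketch: model shards written to the Origin Private File System so later
// loads can skip the network.
async function cacheShard(name: string, bytes: ArrayBuffer): Promise<void> {
  const root = await navigator.storage.getDirectory();           // OPFS root
  const dir = await root.getDirectoryHandle('models', { create: true });
  const file = await dir.getFileHandle(name, { create: true });
  const writable = await file.createWritable();
  await writable.write(bytes);
  await writable.close();
}

async function readCachedShard(name: string): Promise<ArrayBuffer | null> {
  try {
    const root = await navigator.storage.getDirectory();
    const dir = await root.getDirectoryHandle('models');
    const file = await (await dir.getFileHandle(name)).getFile();
    return await file.arrayBuffer();                              // cache hit
  } catch {
    return null;                                                  // cache miss
  }
}
```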
8. Utils Module
Location: packages/sdk/src/utils/
Provides shared utilities including a lightweight EventEmitter for progress tracking, a self-implemented SSE parser (~30 lines, zero dependencies), and logging helpers.
Key responsibilities:
- EventEmitter for progress events
- SSE (Server-Sent Events) parser
- Logger abstraction
- Common helper functions
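For reference, a dependency-free parser of roughly that size looks like the sketch below; the SDK's version may differ in detail.

```ts
// Minimal SSE parser: yields the payload of each "data:" line as it arrives.
async function* parseSSE(body: ReadableStream<Uint8Array>): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Events are separated by a blank line.
    let sep: number;
    while ((sep = buffer.indexOf('\n\n')) !== -1) {
      const event = buffer.slice(0, sep);
      buffer = buffer.slice(sep + 2);
      for (const line of event.split('\n')) {
        if (!line.startsWith('data:')) continue;
        const data = line.slice(5).trim();
        if (data === '[DONE]') return; // OpenAI-style stream terminator
        yield data;
      }
    }
  }
}
```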
Request Flow
Standard Request Path
- Application calls client.chat.completions.create(messages)
- Chat module receives the request with messages and options
- Router evaluates:
  - Is the local backend ready?
  - Is the device grade sufficient?
  - Is the battery low?
  - Is the network available?
- Route decision selects local or cloud backend
- Inference backend processes the request:
  - Local: queue request → MLC engine → stream response
  - Cloud: fetch SSE → parse stream → return response
- Chat module returns streaming iterator or response object
- Application consumes response chunks
Fallback Flow
If local inference fails (e.g., out of memory, WebGPU context lost):
- Chat module catches error
- Router marks local backend as unavailable
- Request automatically retries with cloud backend
- Application receives response transparently
This automatic fallback ensures resilience without manual error handling.
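Conceptually, the fallback reduces to a try/catch around the local call. The shapes below are assumptions for illustration only.

```ts
// Conceptual fallback; backend interfaces and error handling are illustrative.
interface Backend {
  complete(messages: unknown): Promise<string>;
}

async function completeWithFallback(
  local: Backend & { ready: boolean },
  cloud: Backend,
  messages: unknown
): Promise<string> {
  if (local.ready) {
    try {
      return await local.complete(messages);
    } catch {
      // e.g. out of memory or a lost WebGPU context
      local.ready = false; // router marks the local backend as unavailable
    }
  }
  return cloud.complete(messages); // transparent retry on the cloud backend
}
```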
Dual Engine Design
The dual-engine architecture provides three key benefits:
1. Progressive Enhancement
Applications work everywhere with cloud APIs, but automatically accelerate with local GPU when available. No feature detection required in application code.
2. Zero Configuration
createClient({ local: 'auto' }) automatically:
- Detects WebGPU support
- Estimates VRAM
- Selects appropriate model
- Falls back to cloud if needed
3. Full Control
Advanced use cases can explicitly configure providers:
```ts
createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY
  })
})
```

WebWorker Isolation
By default, local inference runs in a dedicated Web Worker to prevent UI freezing:
```
Main Thread                            Worker Thread
─────────────────                      ─────────────────
createClient()
      ↓
Initialize Worker ──────────────────→  Load MLC Engine
      ↓                                      ↓
Send inference request ─────────────→  Enqueue in RequestQueue
      ↓                                      ↓
Receive streaming chunks ←───────────  MLCEngine.generate()
      ↓                                      ↓
Render UI                              Continue processing
```

This architecture ensures a smooth user experience even during intensive inference operations.
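The message protocol between the two threads is internal to the SDK and wired up for you by default. The sketch below only illustrates the general postMessage pattern the diagram implies, with hypothetical message shapes.

```ts
// llm.worker.ts – a thin worker file that loads the SDK's worker entry
// (hypothetical wiring; the SDK normally sets this up itself).
import '@webllm-io/sdk/worker';
```

```ts
// main.ts – message shapes invented for illustration only.
const worker = new Worker(new URL('./llm.worker.ts', import.meta.url), { type: 'module' });

worker.postMessage({ type: 'generate', id: 1, messages: [{ role: 'user', content: 'Hi' }] });

worker.onmessage = (event) => {
  if (event.data.type === 'chunk') {
    // Append streaming tokens to the UI; the main thread never blocks on inference.
  }
};
```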
Subpath Exports
The SDK uses Node.js subpath exports for selective importing:
- @webllm-io/sdk — Main entry: createClient, checkCapability, types
- @webllm-io/sdk/providers/mlc — mlc() provider function
- @webllm-io/sdk/providers/fetch — fetchSSE() provider function
- @webllm-io/sdk/worker — Web Worker entry for MLC inference
This design reduces bundle size by allowing tree-shaking of unused providers.
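For example (named exports assumed from the list above):

```ts
// Import only what you use; unused providers can be tree-shaken away.
import { createClient, checkCapability } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';
import { fetchSSE } from '@webllm-io/sdk/providers/fetch';
```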
Next Steps
- Learn about the Three-Level API design
- Understand Device Scoring in detail
- Explore Provider Composition patterns