Architecture

WebLLM.io is designed as a dual-engine AI inference SDK that seamlessly combines local browser-based inference with cloud API fallback. This architecture enables progressive enhancement: applications work everywhere with cloud APIs, but leverage local GPU acceleration when available.

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                       │
│                createClient() → WebLLMClient                 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Hardware Fingerprint                      │
│             WebGPU Detection + VRAM Estimation               │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                       Device Scoring                         │
│          S (≥8GB) / A (≥4GB) / B (≥2GB) / C (<2GB)           │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Route Decision Engine                     │
│  Consider: Device Grade, Battery, Network, Backend Status    │
└─────────────────────────────────────────────────────────────┘
                  ┌────────────┴────────────┐
                  ↓                         ↓
   ┌────────────────────────┐        ┌────────────────────────┐
   │     Local Backend      │        │     Cloud Backend      │
   │  WebGPU + MLC Engine   │        │   fetchSSE + OpenAI    │
   │    (in Web Worker)     │        │     Compatible API     │
   └────────────────────────┘        └────────────────────────┘

Module Layout

The SDK is organized into eight core modules:

1. Core Module

Location: packages/sdk/src/core/

Exports the primary createClient() factory function and fundamental types. Handles configuration resolution, normalization, and validation. Defines error classes and client lifecycle management.

Key responsibilities:

  • Client factory and initialization
  • Configuration type definitions and resolution
  • Error handling abstractions
  • Core interfaces and contracts

2. Capability Module

Location: packages/sdk/src/capability/

Detects WebGPU availability, estimates VRAM through maxStorageBufferBindingSize, and assigns device grades (S/A/B/C). This module acts as the hardware fingerprinting layer.

Key responsibilities:

  • WebGPU feature detection
  • VRAM estimation via WebGPU limits
  • Device scoring algorithm
  • Compatibility checks
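
As a rough sketch of how such a grading pass can be built on the WebGPU API (the function name and return shape below are illustrative; only the S/A/B/C VRAM thresholds come from the diagram above, and the TypeScript assumes WebGPU type definitions such as @webgpu/types):

// Illustrative sketch only; the SDK's real implementation may differ.
type DeviceGrade = 'S' | 'A' | 'B' | 'C';

async function estimateGrade(): Promise<DeviceGrade | null> {
  if (!('gpu' in navigator)) return null;        // no WebGPU support at all

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null;                     // WebGPU present, no adapter

  // maxStorageBufferBindingSize serves as a rough proxy for usable VRAM.
  const gib = adapter.limits.maxStorageBufferBindingSize / 1024 ** 3;

  if (gib >= 8) return 'S';
  if (gib >= 4) return 'A';
  if (gib >= 2) return 'B';
  return 'C';
}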

3. Providers Module

Location: packages/sdk/src/providers/

Contains provider implementations for local and cloud backends. The mlc() provider wraps @mlc-ai/web-llm for local inference. The fetchSSE() provider implements OpenAI-compatible server-sent events parsing without external dependencies.

Key responsibilities:

  • mlc() provider for local MLC inference
  • fetchSSE() for cloud streaming APIs
  • Provider composition and wrapping logic
  • Custom provider interface definitions
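
For example, a cloud-only client can be assembled from the fetchSSE() provider alone. The endpoint URL below is a placeholder, and passing only a cloud provider is assumed here to be a valid configuration:

import { createClient } from '@webllm-io/sdk';
import { fetchSSE } from '@webllm-io/sdk/providers/fetch';

// Cloud-only setup pointed at any OpenAI-compatible endpoint.
// 'https://llm.example.com/v1' is a placeholder, not a real service.
const client = createClient({
  cloud: fetchSSE({
    baseURL: 'https://llm.example.com/v1',
    apiKey: 'sk-...'
  })
});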

4. Inference Module

Location: packages/sdk/src/inference/

Defines the InferenceBackend interface and implements local/cloud backend adapters. Includes the RequestQueue that serializes concurrent requests to the single-threaded MLC engine.

Key responsibilities:

  • InferenceBackend interface contract
  • Local backend (MLC) adapter
  • Cloud backend (fetch) adapter
  • Request queue for serialization
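
In spirit, such a queue can be as small as a promise chain that runs one task at a time. A minimal sketch (not the SDK's actual RequestQueue) looks like this:

// Minimal serializing queue in the spirit described above.
class RequestQueue {
  private tail: Promise<void> = Promise.resolve();

  // Runs tasks strictly one at a time, in submission order.
  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const run = this.tail.then(() => task());
    // Keep the chain alive even if a task rejects; callers still see
    // the rejection through the promise returned above.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}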

5. Router Module

Location: packages/sdk/src/router/

Implements the route decision engine that determines whether to use local or cloud inference for each request. Considers device grade, backend readiness, battery status, and network conditions.

Key responsibilities:

  • Route decision algorithm
  • Fallback strategy management
  • Backend health monitoring
  • Dynamic routing based on runtime conditions
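
Conceptually, the decision reduces to a small function over those inputs. The shape below is an illustration; the field names and exact policy are assumptions, not the SDK's source:

type Route = 'local' | 'cloud';

interface RouteContext {
  localReady: boolean;                       // MLC engine loaded and healthy
  deviceGrade: 'S' | 'A' | 'B' | 'C' | null; // null = no WebGPU
  batteryLow: boolean;                       // e.g. discharging below a threshold
  online: boolean;                           // network reachable
}

function decideRoute(ctx: RouteContext): Route {
  const capable = ctx.deviceGrade !== null && ctx.deviceGrade !== 'C';
  if (ctx.localReady && capable && !ctx.batteryLow) return 'local';
  // Otherwise prefer cloud; if offline, local is the only remaining option.
  return ctx.online ? 'cloud' : 'local';
}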

6. Chat Module

Location: packages/sdk/src/chat/

Implements the OpenAI-compatible chat.completions API with automatic fallback logic. Handles streaming and non-streaming responses, abort signals, and error recovery.

Key responsibilities:

  • chat.completions.create() API
  • Streaming and non-streaming response handling
  • Automatic local-to-cloud fallback
  • Abort signal propagation
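
A basic non-streaming call follows the OpenAI shape used throughout this page; the snippet assumes a client created as shown later in this document:

// Non-streaming request; the router decides local vs. cloud transparently.
const response = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Explain WebGPU in one sentence.' }]
});

console.log(response.choices[0].message.content);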

7. Loader Module

Location: packages/sdk/src/loader/

Manages progressive model loading with OPFS (Origin Private File System) caching. Tracks load state, emits progress events, and handles cache invalidation.

Key responsibilities:

  • Progressive model download and initialization
  • OPFS cache management
  • Load state tracking and events
  • Cache hit/miss optimization
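
OPFS itself is a standard browser API. A minimal cache-lookup sketch (the helper name and file layout are illustrative, not the loader's actual code) might look like:

// Returns the cached model shard, or null on a cache miss.
async function readCachedShard(name: string): Promise<ArrayBuffer | null> {
  const root = await navigator.storage.getDirectory(); // OPFS root directory
  try {
    const handle = await root.getFileHandle(name);     // throws if not cached
    const file = await handle.getFile();
    return await file.arrayBuffer();                   // cache hit
  } catch {
    return null;                                       // cache miss → download
  }
}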

8. Utils Module

Location: packages/sdk/src/utils/

Provides shared utilities including a lightweight EventEmitter for progress tracking, a self-implemented SSE parser (~30 lines, zero dependencies), and logging helpers.

Key responsibilities:

  • EventEmitter for progress events
  • SSE (Server-Sent Events) parser
  • Logger abstraction
  • Common helper functions
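
In the same spirit as that parser, a minimal zero-dependency "data:" line parser can be written as an async generator. This is a sketch, not the SDK's actual code:

// Parses an OpenAI-style SSE stream into raw event payload strings.
async function* parseSSE(body: ReadableStream<Uint8Array>): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';            // keep any trailing partial line
    for (const line of lines) {
      if (!line.startsWith('data:')) continue;
      const data = line.slice(5).trim();
      if (data === '[DONE]') return;       // OpenAI-style stream terminator
      yield data;                          // caller JSON.parses each event
    }
  }
}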

Request Flow

Standard Request Path

  1. Application calls client.chat.completions.create(messages)
  2. Chat module receives request with messages and options
  3. Router evaluates:
    • Is local backend ready?
    • Is device grade sufficient?
    • Is battery low?
    • Is network available?
  4. Route decision selects local or cloud backend
  5. Inference backend processes request:
    • Local: Queue request → MLC engine → stream response
    • Cloud: Fetch SSE → parse stream → return response
  6. Chat module returns streaming iterator or response object
  7. Application consumes response chunks
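
From the application side, the streaming path is consumed as an async iterator; the snippet assumes stream: true yields OpenAI-style delta chunks:

// Streaming request; chunks arrive the same way whether the router
// chose the local or the cloud backend.
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku about GPUs.' }],
  stream: true
});

let text = '';
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content ?? '';
}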

Fallback Flow

If local inference fails (e.g., out of memory, WebGPU context lost):

  1. Chat module catches error
  2. Router marks local backend as unavailable
  3. Request automatically retries with cloud backend
  4. Application receives response transparently

This automatic fallback ensures resilience without manual error handling.

Dual Engine Design

The dual-engine architecture provides three key benefits:

1. Progressive Enhancement

Applications work everywhere with cloud APIs, but automatically accelerate with local GPU when available. No feature detection required in application code.

2. Zero Configuration

createClient({ local: 'auto' }) automatically:

  • Detects WebGPU support
  • Estimates VRAM
  • Selects appropriate model
  • Falls back to cloud if needed

3. Full Control

Advanced use cases can explicitly configure providers:

createClient({
  local: mlc({
    model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',
    useCache: true
  }),
  cloud: fetchSSE({
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY
  })
})

WebWorker Isolation

By default, local inference runs in a dedicated Web Worker to prevent UI freezing:

Main Thread                          Worker Thread
─────────────────                    ─────────────────
createClient()
Initialize Worker ─────────────────→ Load MLC Engine
        ↓                                    ↓
Send inference request ────────────→ Enqueue in RequestQueue
        ↓                                    ↓
Receive streaming chunks ←────────── MLCEngine.generate()
        ↓                                    ↓
Render UI                            Continue processing

This architecture ensures smooth user experience even during intensive inference operations.
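
How the worker gets wired up depends on the bundler. The sketch below uses the standard new Worker(new URL(...)) pattern; only the '@webllm-io/sdk/worker' entry point comes from the subpath exports listed below, and the worker option passed to mlc() is an assumption, not a documented parameter:

import { createClient } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';

// Hypothetical wiring; the `worker` option name is an assumption.
const worker = new Worker(
  new URL('@webllm-io/sdk/worker', import.meta.url),
  { type: 'module' }
);

const client = createClient({
  local: mlc({ model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC', worker })
});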

Subpath Exports

The SDK uses Node.js subpath exports for selective importing:

  • @webllm-io/sdk — Main entry: createClient, checkCapability, types
  • @webllm-io/sdk/providers/mlc — mlc() provider function
  • @webllm-io/sdk/providers/fetch — fetchSSE() provider function
  • @webllm-io/sdk/worker — Web Worker entry for MLC inference

This design reduces bundle size by allowing tree-shaking of unused providers.
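
In practice that means importing only the pieces an app uses, for example:

// A local-first app imports only what it needs; the fetchSSE() provider
// is tree-shaken away because its subpath is never imported.
import { createClient, checkCapability } from '@webllm-io/sdk';
import { mlc } from '@webllm-io/sdk/providers/mlc';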

Next Steps