Request Queue
The MLC inference engine underlying WebLLM.io’s local backend is single-threaded. It can only process one inference request at a time. To handle concurrent requests safely, the SDK implements a RequestQueue that serializes requests in FIFO (first-in, first-out) order.
The Single-Threading Constraint
Unlike cloud APIs that can handle parallel requests, WebGPU-based inference engines run on a single execution thread:
```
Cloud API (Parallel)                Local MLC Engine (Serial)
────────────────────                ─────────────────────────
Request 1 ──────→ Response 1        Request 1 ──→ Response 1
Request 2 ──────→ Response 2                  ↓
Request 3 ──────→ Response 3        Request 2 ──→ Response 2
                                              ↓
(All process simultaneously)        Request 3 ──→ Response 3
                                    (One at a time)
```

Why single-threaded?
- WebGPU contexts are not thread-safe
- Model weights are loaded into GPU memory once
- Inference uses mutable state (e.g., the KV cache)
Attempting concurrent inference would cause:
- Race conditions in GPU memory access
- Corrupted KV cache state
- Incorrect or garbled output
- Potential crashes
RequestQueue Architecture
The RequestQueue solves this by serializing requests:
```
Application Layer (Multiple concurrent calls)
                    ↓
        chat.completions.create() × 3
                    ↓
┌─────────────────────────────────────┐
│            RequestQueue             │
│                                     │
│  ┌─────┐   ┌─────┐   ┌─────┐        │
│  │ Req │ → │ Req │ → │ Req │  FIFO  │
│  │  1  │   │  2  │   │  3  │  Queue │
│  └─────┘   └─────┘   └─────┘        │
└─────────────────────────────────────┘
                    ↓  (dequeue one at a time)
          MLCEngine.generate()
                    ↓
            Response chunks
```

How It Works
1. Request arrives at chat.completions.create()
2. Enqueued into the RequestQueue
3. Queue checks if the engine is idle
   - If idle: dequeue and start inference immediately
   - If busy: wait until the current request completes
4. Inference runs on the MLC engine
5. Streaming chunks are yielded back to the caller
6. Request completes, queue dequeues the next request
7. Repeat until the queue is empty
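Conceptually, these steps amount to chaining each new request onto the completion of the previous one. The sketch below only illustrates that idea — the SDK performs this serialization internally (see Queue Internals below for pseudocode), so application code never needs it:

```typescript
import { createClient } from '@webllm-io/sdk'

const client = createClient({ local: 'auto' })

// Illustration only: FIFO serialization as a promise chain.
let tail: Promise<unknown> = Promise.resolve()

function serialize<T>(task: () => Promise<T>): Promise<T> {
  // Start this task only after every previously enqueued task has settled
  const result = tail.then(task)
  // Keep the chain alive even if this task rejects
  tail = result.catch(() => undefined)
  return result
}

const first = serialize(() =>
  client.chat.completions.create({ messages: [{ role: 'user', content: 'First' }] })
)
const second = serialize(() =>
  client.chat.completions.create({ messages: [{ role: 'user', content: 'Second' }] })
)
// `second` does not start until `first` has settled
```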
Code Example
Without Queue (Incorrect)
This code would fail or produce corrupted results:
```typescript
// ❌ WRONG: Direct concurrent calls to MLC engine
const engine = await CreateMLCEngine(/* ... */)

// These requests race and corrupt each other
const response1 = engine.chat.completions.create({ messages: [/* ... */] })
const response2 = engine.chat.completions.create({ messages: [/* ... */] })
const response3 = engine.chat.completions.create({ messages: [/* ... */] })
```

With Queue (Correct)
WebLLM.io’s internal queue handles serialization automatically:
```typescript
import { createClient } from '@webllm-io/sdk'

const client = createClient({ local: 'auto' })

// ✅ CORRECT: Queue automatically serializes these
const [response1, response2, response3] = await Promise.all([
  client.chat.completions.create({ messages: [{ role: 'user', content: 'Query 1' }] }),
  client.chat.completions.create({ messages: [{ role: 'user', content: 'Query 2' }] }),
  client.chat.completions.create({ messages: [{ role: 'user', content: 'Query 3' }] })
])

// Requests execute in order: 1 → 2 → 3
// Each waits for the previous to complete
```

Queue Behavior
FIFO Ordering
Requests are processed in the order they arrive:
```typescript
console.log('Starting 3 requests...')

// Request 1 starts immediately
const promise1 = client.chat.completions.create({
  messages: [{ role: 'user', content: 'First' }]
})

// Request 2 queued, waits for #1
const promise2 = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Second' }]
})

// Request 3 queued, waits for #2
const promise3 = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Third' }]
})

// Output order is guaranteed: First → Second → Third
```

Streaming and Queue
Streaming requests hold the queue until completion:
```typescript
// Request 1: streaming (holds the queue for its entire duration)
const stream1 = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write an essay' }],
  stream: true
})

// Request 2: queued — it waits until stream1 is fully consumed,
// so do not await it before draining stream1
const stream2Promise = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Another query' }],
  stream: true
})

// Consume stream1 chunks
for await (const chunk of stream1) {
  console.log(chunk.choices[0]?.delta?.content)
}

// Only after stream1 completes does stream2 start
const stream2 = await stream2Promise
```

Abort and Queue
Aborting a request releases the queue immediately:
```typescript
const controller = new AbortController()

// Request 1: starts immediately
const promise1 = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Long response' }],
  signal: controller.signal
})

// Request 2: queued
const promise2 = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Quick query' }]
})

// Abort request 1 after 1 second
setTimeout(() => controller.abort(), 1000)

// Request 2 starts immediately after abort
```
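The aborted call itself settles with an error and should be handled. Continuing the example above, a minimal sketch, assuming the SDK rejects the aborted promise with an abort-style error (the exact error type is not specified here):

```typescript
try {
  const response1 = await promise1
  console.log(response1.choices[0]?.message?.content)
} catch (error) {
  // Assumption: an aborted request rejects rather than resolving.
  // Check the error name/type the SDK actually throws before branching on it.
  console.warn('Request 1 was cancelled:', error)
}

// Request 2 is unaffected and completes normally
const response2 = await promise2
```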
Performance Implications

Queueing Latency
If multiple requests arrive simultaneously, later requests experience queueing delay:
```
Request 1: 0s wait + 5s inference =  5s total
Request 2: 5s wait + 3s inference =  8s total
Request 3: 8s wait + 4s inference = 12s total
```

Mitigation strategies:
- Batch related queries into a single request
- Use cloud backend for parallel requests
- Show queueing status in the UI (see the measurement sketch below)
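One way to see this delay in practice is to record wall-clock time around each call. A minimal sketch using only standard browser APIs; the timing includes both queue wait and inference, since the SDK does not expose queue internals:

```typescript
import { createClient } from '@webllm-io/sdk'

const client = createClient({ local: 'auto' })

// Measure how long each queued request takes end-to-end (wait + inference)
async function timedCompletion(label: string, content: string) {
  const started = performance.now()
  const response = await client.chat.completions.create({
    messages: [{ role: 'user', content }]
  })
  const elapsedMs = performance.now() - started
  console.log(`${label}: ${(elapsedMs / 1000).toFixed(1)}s total`)
  return response
}

// Fired together, so requests 2 and 3 include their time spent waiting in the queue
await Promise.all([
  timedCompletion('Request 1', 'Query 1'),
  timedCompletion('Request 2', 'Query 2'),
  timedCompletion('Request 3', 'Query 3')
])
```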
Batching Example
Instead of multiple requests:
```typescript
// ❌ Slow: 3 sequential requests
const joke = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Tell me a joke' }]
})
const fact = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Tell me a fact' }]
})
const poem = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku' }]
})
```

Combine into one:
```typescript
// ✅ Fast: Single request
const response = await client.chat.completions.create({
  messages: [{
    role: 'user',
    content: `Please provide three things:
1. A joke
2. An interesting fact
3. A haiku`
  }]
})
```

Multi-User Scenarios
For apps with multiple concurrent users, the queue ensures fair access:
```typescript
// User A sends a message
userA.sendMessage('Hello')        // Request 1, starts immediately

// User B sends a message (queued)
userB.sendMessage('Hi there')     // Request 2, waits

// User C sends a message (queued)
userC.sendMessage('What is AI?')  // Request 3, waits

// Execution order: A → B → C
```
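The userA/userB/userC objects above are placeholders. A minimal sketch of what such a sendMessage helper could look like when every session shares one client — the session wrapper is hypothetical, not an SDK API:

```typescript
import { createClient } from '@webllm-io/sdk'

// One client (and therefore one queue) shared by every user session
const client = createClient({ local: 'auto' })

// Hypothetical per-user session wrapper — not part of the SDK
function createUserSession(userId: string) {
  return {
    async sendMessage(content: string) {
      // All sessions funnel into the same FIFO queue, so requests are served in arrival order
      const response = await client.chat.completions.create({
        messages: [{ role: 'user', content }]
      })
      return { userId, reply: response.choices[0]?.message?.content ?? '' }
    }
  }
}

const userA = createUserSession('A')
const userB = createUserSession('B')
```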
Queue Status UI

Show users their position in queue:
```typescript
import { createClient } from '@webllm-io/sdk'

const client = createClient({ local: 'auto' })

// Hypothetical queue status API (not currently exposed)
client.on('queueStatus', ({ position, length }) => {
  if (position > 0) {
    showNotification(`Your request is #${position} in queue`)
  }
})
```
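Since that event API is not exposed today, a workaround is to track pending calls on the application side. A rough sketch, reusing the client and showNotification helper from the snippet above; the counter only approximates queue position, but FIFO ordering makes the approximation accurate for a single client:

```typescript
// Application-side queue depth tracking (no SDK support required)
let pending = 0

async function createWithStatus(content: string) {
  const position = pending // requests already ahead of this one
  pending++
  if (position > 0) {
    showNotification(`Your request is #${position} in queue`)
  }
  try {
    return await client.chat.completions.create({
      messages: [{ role: 'user', content }]
    })
  } finally {
    pending--
  }
}
```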
Queue vs Cloud Parallelism

Consider using the cloud backend for truly parallel workloads:
```typescript
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY
  }
})

// For single-user sequential chat, local is fine
await client.chat.completions.create({ messages: [/* ... */] })

// For bulk parallel processing, force cloud mode
const summaries = await Promise.all(
  documents.map(doc =>
    client.chat.completions.create({
      messages: [{ role: 'user', content: `Summarize: ${doc}` }],
      // Force cloud for parallel execution
      __forceBackend: 'cloud'
    })
  )
)
```

Queue Internals
Implementation Pseudocode
```typescript
class RequestQueue {
  private queue: Request[] = []
  private processing = false

  // Request bundles the call params with a deferred promise (resolve/reject)
  async enqueue(request: Request): Promise<Response> {
    this.queue.push(request)

    if (!this.processing) {
      this.processNext()
    }

    return request.promise
  }

  private async processNext() {
    if (this.queue.length === 0) {
      this.processing = false
      return
    }

    this.processing = true
    const request = this.queue.shift()!

    try {
      const response = await this.executeOnEngine(request)
      request.resolve(response)
    } catch (error) {
      request.reject(error)
    }

    // Process next request
    this.processNext()
  }

  private async executeOnEngine(request: Request): Promise<Response> {
    // Actual MLC engine inference
    return await mlcEngine.chat.completions.create(request.params)
  }
}
```

Thread Safety
The queue itself runs on the JavaScript main thread (or a worker thread when useWorker: true is set). JavaScript’s single-threaded execution model ensures the queue’s own operations are atomic.
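For reference, this is roughly how the worker option mentioned above might be enabled. The exact field name and placement in the config are assumptions here — verify against the SDK’s configuration reference:

```typescript
import { createClient } from '@webllm-io/sdk'

// Assumed configuration shape — check the SDK docs for the actual option.
// With a worker enabled, inference (and its queue) runs off the main thread,
// keeping the UI responsive while requests are still processed one at a time.
const client = createClient({
  local: 'auto',
  useWorker: true
})
```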
Best Practices
1. Avoid Unnecessary Concurrent Requests
If requests are sequential, use await:
```typescript
// ❌ Unnecessarily concurrent
const promises = messages.map(msg =>
  client.chat.completions.create({ messages: [{ role: 'user', content: msg }] })
)
await Promise.all(promises)

// ✅ Sequential is clearer and gives the same result
for (const msg of messages) {
  await client.chat.completions.create({ messages: [{ role: 'user', content: msg }] })
}
```

2. Batch Related Queries
Combine multiple questions into one prompt when possible.
3. Use Abort Signals for User Cancellations
Allow users to cancel queued requests:
```typescript
const controller = new AbortController()

const responsePromise = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Long task' }],
  signal: controller.signal
})

// User clicks "Cancel"
cancelButton.onclick = () => controller.abort()
```

4. Show Queue Status
Inform users when requests are queued, especially in multi-user apps.
5. Consider Cloud for Parallel Workloads
If you need true parallelism (e.g., batch document processing), use cloud backend.
Next Steps
- Learn about Architecture for overall system design
- Understand Provider Composition for backend selection
- Explore Three-Level API for configuration patterns