
Streaming Chat

Streaming responses provide a better user experience by displaying content as it’s generated, rather than waiting for the complete response.

Complete Example

import { createClient } from '@webllm-io/sdk';

const client = await createClient({
  local: 'auto'
});

// Enable streaming with stream: true
const stream = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'Explain how WebGPU enables in-browser AI inference.' }
  ],
  stream: true
});

// Process each chunk as it arrives
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    // Display the token immediately (written to the Node console here;
    // in a browser, append it to the DOM as shown in the next section)
    process.stdout.write(delta);
  }
}

Streaming to the DOM

Here’s a practical example of streaming to a web page:

const messageElement = document.getElementById('assistant-message');
messageElement.textContent = '';

const stream = await client.chat.completions.create({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: userInput }
  ],
  stream: true
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    messageElement.textContent += delta;
  }
}
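
If several parts of your app render streamed output, the same pattern can be factored into a small helper. The streamToElement function below is a hypothetical sketch built from the calls shown above; it is not part of the SDK:

// Hypothetical helper (not an SDK API): streams one prompt into a DOM element.
async function streamToElement(client, element, userInput) {
  element.textContent = '';
  const stream = await client.chat.completions.create({
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: userInput }
    ],
    stream: true
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      element.textContent += delta;
    }
  }
}

// Usage:
// await streamToElement(client, messageElement, userInput);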

Stream Chunk Format

Each chunk follows the OpenAI streaming format:

{
  id: "chatcmpl-123",
  object: "chat.completion.chunk",
  created: 1234567890,
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  choices: [
    {
      index: 0,
      delta: {
        role: "assistant",  // Only in the first chunk
        content: "Hello"    // Token content
      },
      finish_reason: null   // "stop" in the final chunk
    }
  ]
}
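
In practice, each chunk is consumed by reading choices[0].delta.content and checking finish_reason. The readChunk helper below is a small sketch of that, assuming chunks match the shape shown above (the helper name is illustrative, not an SDK API):

// Sketch of a chunk reader, assuming the OpenAI-style shape above.
// Returns the token text for content chunks (or null) and reports
// whether this was the final chunk.
function readChunk(chunk) {
  const choice = chunk.choices?.[0];
  return {
    token: choice?.delta?.content ?? null,
    done: choice?.finish_reason === 'stop'
  };
}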

Handling Stream Completion

let fullResponse = '';

for await (const chunk of stream) {
  const choice = chunk.choices[0];
  if (choice?.delta?.content) {
    fullResponse += choice.delta.content;
    // updateUI is your own rendering function (e.g., the DOM update shown earlier)
    updateUI(fullResponse);
  }
  if (choice?.finish_reason === 'stop') {
    console.log('Generation complete!');
    console.log('Model used:', chunk.model);
  }
}
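
If you prefer to keep the loop out of your UI code, the same completion handling can be wrapped in a helper that forwards tokens to a callback and resolves with the final text plus the model reported in the last chunk. The collectStream name below is illustrative, not an SDK API:

// Hypothetical helper: drains a stream, forwarding each token to onToken,
// and resolves with the full text and the model from the final chunk.
async function collectStream(stream, onToken) {
  let text = '';
  let model = null;
  for await (const chunk of stream) {
    const choice = chunk.choices[0];
    if (choice?.delta?.content) {
      text += choice.delta.content;
      onToken(choice.delta.content);
    }
    if (choice?.finish_reason === 'stop') {
      model = chunk.model;
    }
  }
  return { text, model };
}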

Error Handling

try {
  const stream = await client.chat.completions.create({
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  });

  // Keep the loop inside the try block: errors can occur both when the
  // stream is created and while chunks are being generated.
  for await (const chunk of stream) {
    // Process chunks
  }
} catch (error) {
  console.error('Streaming failed:', error);
  // Handle the error (e.g., show a user-friendly message)
}
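
In a chat UI, handling the error usually means telling the user what happened. Below is a minimal sketch that assumes a status element with the hypothetical id chat-status:

// Minimal sketch: show a friendly message in a hypothetical status element.
function showStreamError(error) {
  const status = document.getElementById('chat-status'); // assumed element
  if (status) {
    status.textContent = 'Sorry, the response could not be generated. Please try again.';
  }
  console.error('Streaming failed:', error);
}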

Performance Tips

  • Use Web Workers: By default, inference runs in a Web Worker to keep the main thread responsive
  • Debounce UI updates: If rendering is expensive, consider debouncing DOM updates (see the sketch after this list)
  • Show loading states: Display a loading indicator before the first chunk arrives
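
One way to debounce DOM updates is to buffer incoming tokens and flush them at most once per animation frame. The sketch below reuses messageElement and stream from the earlier DOM example; appendToken is an illustrative name, not an SDK API:

// Buffer incoming tokens and flush them at most once per animation frame.
let pending = '';
let flushScheduled = false;

function appendToken(delta) {
  pending += delta;
  if (!flushScheduled) {
    flushScheduled = true;
    requestAnimationFrame(() => {
      messageElement.textContent += pending;
      pending = '';
      flushScheduled = false;
    });
  }
}

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    appendToken(delta);
  }
}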

Next Steps