Chat Completions
The chat completions API provides OpenAI-compatible text generation with both streaming and non-streaming modes, automatic provider routing, and abort support.
Interface
```typescript
interface Completions {
  create(req: ChatCompletionRequest & { stream: true }): AsyncIterable<ChatCompletionChunk>;
  create(req: ChatCompletionRequest & { stream?: false }): Promise<ChatCompletion>;
}
```
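As a quick sketch of how the two overloads resolve (assuming client is an already-configured client exposing this interface, as in the examples below):

```typescript
// stream omitted or false: resolves to a complete ChatCompletion
const completion = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hi' }]
});

// stream: true: returns an AsyncIterable<ChatCompletionChunk>
const stream = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hi' }],
  stream: true
});
for await (const chunk of stream) {
  // each chunk carries an incremental delta (see Streaming below)
}
```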
Request
ChatCompletionRequest
```typescript
interface ChatCompletionRequest {
  messages: Message[];
  model?: string;
  stream?: boolean;
  temperature?: number;
  max_tokens?: number;
  response_format?: { type: 'text' | 'json_object' };
  signal?: AbortSignal;
  provider?: 'local' | 'cloud';
}
```
messages (required)
Array of conversation messages in chronological order.
- Type: Message[]
```typescript
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}
```
Example:
```typescript
messages: [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'What about Spain?' }
]
```
model (optional)
Override the model for this request. If not specified, the model configured in the provider is used.
- Type: string
- Default: Provider’s default model
Example:
```typescript
model: 'gpt-4o-mini'                        // Cloud
model: 'Llama-3.1-8B-Instruct-q4f16_1-MLC'  // Local
```
stream (optional)
Enable streaming mode for real-time token-by-token responses.
- Type: boolean
- Default: false
temperature (optional)
Sampling temperature between 0 and 2. Higher values increase randomness.
- Type: number
- Range: 0.0-2.0
- Default: 1.0
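Example (values are only illustrative; lower values keep output close to the most likely tokens, higher values make it more varied):

```typescript
temperature: 0.2  // mostly deterministic, e.g. factual answers
temperature: 1.5  // more varied, e.g. creative writing
```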
max_tokens (optional)
Maximum number of tokens to generate.
- Type: number
- Default: Provider-specific (typically unlimited or the model’s context limit)
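Example (the cap below is arbitrary; when the limit cuts generation off, the choice's finish_reason is typically 'length'):

```typescript
max_tokens: 60  // stop after at most 60 generated tokens
```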
response_format (optional)
Specify output format. Use 'json_object' to enable JSON mode.
- Type: { type: 'text' | 'json_object' }
- Default: { type: 'text' }
Example:
```typescript
response_format: { type: 'json_object' }
```
See also the withJsonOutput() helper.
signal (optional)
AbortSignal to cancel the request.
- Type: AbortSignal
- Default: undefined
Example:
```typescript
const controller = new AbortController();

// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);

const response = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Long task...' }],
  signal: controller.signal
});
```
provider (optional)
Force a specific provider instead of automatic routing.
- Type: 'local' | 'cloud'
- Default: Automatic routing based on availability and device capability
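Example (see also Force local or cloud under Usage Examples):

```typescript
provider: 'local'  // force on-device inference
provider: 'cloud'  // force the configured cloud provider
```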
Response
Non-streaming (stream: false)
Returns a ChatCompletion object with the complete generated response.
```typescript
interface ChatCompletion {
  id: string;
  object: 'chat.completion';
  created: number;
  model: string;
  choices: ChatCompletionChoice[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}

interface ChatCompletionChoice {
  index: number;
  message: Message;
  finish_reason: 'stop' | 'length' | 'content_filter' | null;
}
```
Example:
```typescript
const response = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
// "Hello! How can I assist you today?"

console.log(response.model);
// "gpt-4o-mini" or "Llama-3.1-8B-Instruct-q4f16_1-MLC"

console.log(response.usage);
// { prompt_tokens: 10, completion_tokens: 8, total_tokens: 18 }
```
Streaming (stream: true)
Returns an AsyncIterable<ChatCompletionChunk> for real-time token streaming.
```typescript
interface ChatCompletionChunk {
  id: string;
  object: 'chat.completion.chunk';
  created: number;
  model: string;
  choices: ChatCompletionChunkChoice[];
}

interface ChatCompletionChunkChoice {
  index: number;
  delta: {
    role?: 'assistant';
    content?: string;
  };
  finish_reason: 'stop' | 'length' | 'content_filter' | null;
}
```
Example:
```typescript
const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a poem about the ocean.' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
```
Usage Examples
Basic completion
```typescript
const response = await client.chat.completions.create({
  messages: [
    { role: 'user', content: 'What is 2 + 2?' }
  ]
});

console.log(response.choices[0].message.content);
// "2 + 2 equals 4."
```
Streaming with React
```tsx
import { useState } from 'react';

function ChatComponent() {
  const [output, setOutput] = useState('');

  const handleSubmit = async (userMessage: string) => {
    const stream = await client.chat.completions.create({
      messages: [{ role: 'user', content: userMessage }],
      stream: true
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      setOutput(prev => prev + content);
    }
  };

  return <div>{output}</div>;
}
```
Multi-turn conversation
```typescript
const messages: Message[] = [
  { role: 'system', content: 'You are a math tutor.' }
];

// First turn
messages.push({ role: 'user', content: 'What is a prime number?' });
const response1 = await client.chat.completions.create({ messages });
messages.push(response1.choices[0].message);

// Second turn
messages.push({ role: 'user', content: 'Give me examples.' });
const response2 = await client.chat.completions.create({ messages });
messages.push(response2.choices[0].message);
```
JSON output
```typescript
import { withJsonOutput } from '@webllm-io/sdk';

const response = await client.chat.completions.create(
  withJsonOutput({
    messages: [
      { role: 'user', content: 'List 3 colors in JSON format: {"colors": ["...", "...", "..."]}' }
    ]
  })
);

const data = JSON.parse(response.choices[0].message.content);
console.log(data.colors); // ["red", "blue", "green"]
```
Force local or cloud
```typescript
// Force local inference
const localResponse = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  provider: 'local'
});

// Force cloud inference
const cloudResponse = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  provider: 'cloud'
});
```
Abort streaming request
```typescript
const controller = new AbortController();

const stream = client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a very long story...' }],
  stream: true,
  signal: controller.signal
});

setTimeout(() => controller.abort(), 3000);

try {
  for await (const chunk of stream) {
    console.log(chunk.choices[0]?.delta?.content);
  }
} catch (err) {
  if (err.code === 'ABORTED') {
    console.log('Request aborted');
  }
}
```
Temperature and max_tokens
```typescript
// More creative output
const creative = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a tagline for a coffee shop' }],
  temperature: 1.5,
  max_tokens: 20
});

// More deterministic output
const deterministic = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'What is 15 * 23?' }],
  temperature: 0.0
});
```
Error Handling
```typescript
import { WebLLMError } from '@webllm-io/sdk';

try {
  const response = await client.chat.completions.create({
    messages: [{ role: 'user', content: 'Hello' }]
  });
} catch (err) {
  if (err instanceof WebLLMError) {
    switch (err.code) {
      case 'INFERENCE_FAILED':
        console.error('Inference failed:', err.message);
        break;
      case 'CLOUD_REQUEST_FAILED':
        console.error('Cloud API error:', err.message);
        break;
      case 'ABORTED':
        console.log('Request aborted');
        break;
      default:
        console.error('Unknown error:', err);
    }
  }
}
```
See Also
- WebLLMClient - Client interface
- Structured Output - JSON mode helper
- Errors - Error types and codes
- Config Types - Provider configuration