Playground
The WebLLM.io Playground is an interactive demo application where you can test local and cloud inference capabilities in real-time.
Accessing the Playground
Online
Visit the hosted playground at:
https://webllm.io/playground
Local Development
Run the playground locally:
```bash
# From the monorepo root
pnpm --filter @webllm-io/playground dev
```
The playground will start at http://localhost:5173.
Features
Device Capability Display
When the playground loads, it automatically detects your device’s WebGPU capability and displays:
- Device Grade — S, A, B, or C based on estimated VRAM
- WebGPU Support — Whether local inference is available
- Recommended Model — Auto-selected model for your device tier
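The grade shown here comes from the playground's own probing logic. The sketch below shows what that kind of check can look like using only the standard WebGPU API; the grading thresholds are illustrative assumptions, not the playground's actual values.

```ts
// Probe WebGPU support and make a rough capability guess.
// The S/A/B/C thresholds below are illustrative, not the playground's real ones.
async function probeDevice(): Promise<{ webgpu: boolean; grade: "S" | "A" | "B" | "C" }> {
  if (!("gpu" in navigator)) {
    return { webgpu: false, grade: "C" };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { webgpu: false, grade: "C" };
  }
  // maxBufferSize is one coarse proxy for how large a model the device can hold.
  const maxBufferGB = adapter.limits.maxBufferSize / 1024 ** 3;
  const grade = maxBufferGB >= 4 ? "S" : maxBufferGB >= 2 ? "A" : maxBufferGB >= 1 ? "B" : "C";
  return { webgpu: true, grade };
}
```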
Mode Selection
Choose how WebLLM.io routes your requests:
- Local — Force local inference (requires WebGPU)
- Cloud — Use cloud provider only
- Hybrid — Smart routing based on device capability (default)
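Conceptually, hybrid mode checks device capability first and falls back to the cloud path when local inference is not viable. The sketch below is illustrative routing logic, not the library's actual implementation; `probeDevice` is the hypothetical helper from the previous example.

```ts
type Mode = "local" | "cloud" | "hybrid";

// Illustrative routing decision; the real heuristics may differ.
async function resolveRoute(mode: Mode): Promise<"local" | "cloud"> {
  if (mode === "local") return "local"; // caller insists on WebGPU
  if (mode === "cloud") return "cloud"; // caller insists on the cloud provider
  // Hybrid: prefer local when the device looks capable enough.
  const { webgpu, grade } = await probeDevice();
  return webgpu && grade !== "C" ? "local" : "cloud";
}
```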
Settings Panel
Click the Settings button to configure local and cloud options.
Local Settings
- Model — Override auto-selection with a specific model name
  - Leave empty for automatic device-based selection
  - Example: `Llama-3.1-8B-Instruct-q4f16_1-MLC`
- WebWorker — Run inference in a Web Worker (default: Enabled)
  - Keeps the UI responsive during inference
  - Recommended for production use
- Cache (OPFS) — Enable model caching in the Origin Private File System (default: Enabled)
  - Dramatically improves load times on subsequent visits
  - Models are cached locally and reused
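The WebWorker option corresponds to the worker pattern of the upstream @mlc-ai/web-llm package. The sketch below uses that upstream API directly, with the model name from the example above; the playground's own wrapper may expose this differently.

```ts
// main.ts: create an engine that runs inference in a Web Worker,
// overriding auto-selection with an explicit model name.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f16_1-MLC",
);
```

```ts
// worker.ts: hand incoming messages to the MLC engine handler.
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```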
Cloud Settings
- Base URL — Cloud API endpoint
  - Example: `https://api.openai.com/v1`
  - Compatible with OpenAI-style APIs
- API Key — Authentication token
  - Stored securely in `localStorage`
  - Displayed as password-masked input
- Model — Cloud model identifier
  - Example: `gpt-4o-mini`, `gpt-4-turbo`
- Timeout — Request timeout in milliseconds
  - Default: 30000 (30 seconds)
- Retries — Number of retry attempts on failure
  - Default: 2
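These settings map onto a plain OpenAI-compatible HTTP request. The sketch below uses only the standard fetch API and follows the OpenAI chat-completions convention; the timeout and retry handling is a simplified illustration of what the settings control, not the playground's actual client code.

```ts
// Minimal OpenAI-compatible call using the cloud settings above.
async function cloudChat(
  baseUrl: string,   // e.g. "https://api.openai.com/v1"
  apiKey: string,
  model: string,     // e.g. "gpt-4o-mini"
  timeoutMs = 30000, // Timeout default
  retries = 2,       // Retries default
) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(`${baseUrl}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model,
          messages: [{ role: "user", content: "Hello!" }],
        }),
        signal: AbortSignal.timeout(timeoutMs), // abort slow requests
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.json();
    } catch (err) {
      if (attempt === retries) throw err; // out of retries
    }
  }
}
```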
Settings Persistence
All settings are automatically saved to localStorage under the key webllm-playground-config and restored when you reload the page.
Click Apply & Reinitialize to apply changes and restart the client.
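Persistence of this kind is a JSON round-trip through localStorage. A minimal sketch of the save/restore logic, using the storage key described above; the config shape is an illustrative assumption, not the playground's exact schema.

```ts
const STORAGE_KEY = "webllm-playground-config";

// Illustrative config shape; the playground's actual fields may differ.
interface PlaygroundConfig {
  mode: "local" | "cloud" | "hybrid";
  localModel?: string;
  cloudBaseUrl?: string;
  cloudModel?: string;
}

function saveConfig(config: PlaygroundConfig): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(config));
}

function loadConfig(): PlaygroundConfig | null {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as PlaygroundConfig) : null;
}
```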
Model Tag
Each assistant message displays the model that generated the response:
- For local inference: `Llama-3.1-8B-Instruct-q4f16_1-MLC` (example)
- For cloud inference: `gpt-4o-mini` (example)
The model name is extracted from the first streaming chunk’s model field and displayed as an italic tag below the message.
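With an OpenAI-style streaming response, that extraction amounts to reading the `model` field off the first chunk. The sketch below assumes an engine whose `chat.completions.create({ stream: true })` call yields OpenAI-shaped chunks, as @mlc-ai/web-llm and OpenAI-compatible clients do; the function name is illustrative.

```ts
// Stream a reply and remember which model produced it.
async function streamWithModelTag(
  engine: any,
  messages: { role: string; content: string }[],
) {
  let modelTag: string | undefined;
  let text = "";

  const chunks = await engine.chat.completions.create({ messages, stream: true });
  for await (const chunk of chunks) {
    modelTag ??= chunk.model;                       // first chunk carries the model name
    text += chunk.choices[0]?.delta?.content ?? ""; // accumulate streamed tokens
  }
  return { text, modelTag };
}
```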
Chat Interface
Sending Messages
Type your message in the input box and press Send or hit Enter.
Streaming Responses
Responses stream in real-time, displaying tokens as they’re generated.
Message History
The playground maintains conversation context automatically. Previous messages are included in subsequent requests.
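Maintaining context means resending the accumulated message list with each request. A minimal sketch, assuming an engine with an OpenAI-style chat-completions API:

```ts
// Conversation history is a growing array of OpenAI-style messages.
const history: { role: "system" | "user" | "assistant"; content: string }[] = [];

async function send(engine: any, userText: string): Promise<string> {
  history.push({ role: "user", content: userText });
  const reply = await engine.chat.completions.create({ messages: history });
  const content = reply.choices[0].message.content ?? "";
  history.push({ role: "assistant", content }); // keep the reply for future turns
  return content;
}

// "Clear" simply empties the array: history.length = 0;
```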
Clear Conversation
Click Clear to reset the conversation and start fresh.
Browser Requirements
- Chrome 113+ or Edge 113+ for local inference (WebGPU required)
- Any modern browser for cloud-only mode
Troubleshooting
Local Inference Not Working
- Check WebGPU Support — Ensure your browser supports WebGPU
- Verify MLC Dependency — The playground includes `@mlc-ai/web-llm` by default
- Check Console — Look for detailed error messages in the browser console
Cloud Inference Failing
- Verify API Key — Ensure your API key is correct
- Check Base URL — Confirm the endpoint URL format
- Network Issues — Check browser network tab for CORS or connection errors
Slow Initial Load
The first time you use local inference, the selected model must be downloaded (1-4 GB depending on your device tier). Subsequent loads use the OPFS cache and are much faster.
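If you want to confirm the cache is being used, the Origin Private File System can be inspected from the browser console with standard APIs. The exact directory layout is up to the caching library, so treat the listing as opaque; this is only a diagnostic sketch.

```ts
// List top-level entries in this origin's OPFS and report storage usage.
async function inspectOpfsCache(): Promise<void> {
  const root = await navigator.storage.getDirectory();
  for await (const [name, handle] of root.entries()) {
    console.log(`${handle.kind}: ${name}`);
  }
  const { usage, quota } = await navigator.storage.estimate();
  console.log(`Using ${(usage! / 1024 ** 2).toFixed(1)} MB of ${(quota! / 1024 ** 3).toFixed(1)} GB`);
}
```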
Next Steps
- Quick Start — Integrate WebLLM.io into your app
- Configuration — Advanced configuration options
- Local Inference — Learn about WebGPU models
- Examples — More code examples