Device Scoring
WebLLM.io automatically detects your device’s hardware capabilities and assigns a grade (S/A/B/C) that is used to select the optimal model. This ensures the best possible user experience across devices ranging from high-end desktops to mobile phones.
Device Grade System
The SDK uses a four-tier grading system based on available GPU memory:
| Grade | VRAM Range | Typical Devices | Default Model | Model Size |
|---|---|---|---|---|
| S | ≥8192 MB | High-end desktop GPUs (RTX 3080+, M2 Max+) | Llama-3.1-8B-Instruct-q4f16_1-MLC | ~5.5 GB |
| A | ≥4096 MB | Mid-range GPUs (RTX 3060, M1/M2 Pro, iPad Pro) | Llama-3.1-8B-Instruct-q4f16_1-MLC | ~5.5 GB |
| B | ≥2048 MB | Entry-level GPUs (Integrated Intel/AMD, M1 base) | Phi-3.5-mini-instruct-q4f16_1-MLC | ~2.2 GB |
| C | <2048 MB | Mobile devices, older laptops | Qwen2.5-1.5B-Instruct-q4f16_1-MLC | ~1.0 GB |
Important: All grades support local inference. The C grade uses a lightweight but capable 1.5B parameter model, ensuring even low-end devices can run local AI.
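As a quick reference, the default selection amounts to a simple grade-to-model lookup. The sketch below just restates the table in code; the SDK’s actual internal structure may differ:

```typescript
// Illustrative grade → default model lookup (mirrors the table above).
// With `local: 'auto'`, the SDK applies this mapping automatically.
type DeviceGrade = 'S' | 'A' | 'B' | 'C'

const DEFAULT_MODELS: Record<DeviceGrade, string> = {
  S: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',  // ~5.5 GB
  A: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',  // ~5.5 GB
  B: 'Phi-3.5-mini-instruct-q4f16_1-MLC',  // ~2.2 GB
  C: 'Qwen2.5-1.5B-Instruct-q4f16_1-MLC'   // ~1.0 GB
}
```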
VRAM Estimation Method
Unlike traditional VRAM detection (which requires OS-level APIs), WebLLM.io uses a WebGPU-based estimation technique:
```
WebGPU Adapter
  ↓
adapter.limits.maxStorageBufferBindingSize
  ↓
Proxy for VRAM capacity
  ↓
Device Grade (S/A/B/C)
```

Why maxStorageBufferBindingSize?
The maxStorageBufferBindingSize limit indicates the maximum size of a single storage buffer binding in bytes. This value correlates strongly with total GPU memory:
- High VRAM devices expose large buffer limits (≥1GB)
- Low VRAM devices expose smaller limits (<128MB)
While not a perfect 1:1 mapping, this heuristic works reliably across browsers and platforms.
Detection Code Example
```typescript
// Simplified version of capability detection
async function detectDeviceGrade(): Promise<'S' | 'A' | 'B' | 'C'> {
  if (!navigator.gpu) {
    throw new Error('WebGPU not supported')
  }

  const adapter = await navigator.gpu.requestAdapter()
  if (!adapter) {
    throw new Error('No WebGPU adapter available')
  }

  const maxBufferSize = adapter.limits.maxStorageBufferBindingSize
  const vramMB = Math.floor(maxBufferSize / (1024 * 1024))

  if (vramMB >= 8192) return 'S'
  if (vramMB >= 4096) return 'A'
  if (vramMB >= 2048) return 'B'
  return 'C'
}
```

Model Selection Strategy
Automatic Selection (Zero Config)
With local: 'auto', the SDK selects the best model for your device:
```typescript
import { createClient } from '@webllm-io/sdk'

const client = createClient({
  local: 'auto'
})

// Automatically selects:
// S/A grade → Llama-3.1-8B-Instruct-q4f16_1-MLC
// B grade   → Phi-3.5-mini-instruct-q4f16_1-MLC
// C grade   → Qwen2.5-1.5B-Instruct-q4f16_1-MLC
```

Custom Tiers (Responsive Config)
You can override default models for each tier:
```typescript
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-70B-Instruct-q3f16_1-MLC',  // For S/A grade
      medium: 'Llama-3.1-8B-Instruct-q4f16_1-MLC', // For B grade
      low: 'TinyLlama-1.1B-Chat-q4f16_1-MLC'       // For C grade
    }
  }
})
```

The SDK maps grades to tiers:

- S and A grades → `high` tier
- B grade → `medium` tier
- C grade → `low` tier
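In code, this tier resolution can be sketched as follows; `resolveModel` is a hypothetical helper for illustration, not part of the SDK’s public API:

```typescript
// Hypothetical sketch of grade → tier resolution for a custom tiers config.
type DeviceGrade = 'S' | 'A' | 'B' | 'C'
interface Tiers { high: string; medium: string; low: string }

function resolveModel(grade: DeviceGrade, tiers: Tiers): string {
  if (grade === 'S' || grade === 'A') return tiers.high // S and A share the high tier
  if (grade === 'B') return tiers.medium
  return tiers.low // C grade
}
```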
Device Grade Examples
Grade S: High-End Desktop
Hardware:
- NVIDIA RTX 4090 (24GB VRAM)
- Apple M2 Max (32GB unified memory)
- AMD RX 7900 XTX (24GB VRAM)
Model Selection:
- Default: Llama-3.1-8B-Instruct (~5.5GB)
- Can run: Any quantized model up to 13B parameters
- Inference Speed: ~30-50 tokens/sec
Grade A: Mid-Range
Hardware:
- NVIDIA RTX 3060 (12GB VRAM)
- Apple M1/M2 Pro (16GB unified memory)
- iPad Pro M2 (8GB RAM)
Model Selection:
- Default: Llama-3.1-8B-Instruct (~5.5GB)
- Can run: 7B-8B quantized models comfortably
- Inference Speed: ~15-25 tokens/sec
Grade B: Entry-Level
Hardware:
- Intel Iris Xe (integrated GPU)
- AMD Radeon 680M (integrated GPU)
- Apple M1 base (8GB unified memory)
Model Selection:
- Default: Phi-3.5-mini-instruct (~2.2GB)
- Can run: 3B-4B parameter models
- Inference Speed: ~8-15 tokens/sec
Grade C: Mobile/Low-End
Hardware:
- iPhone 15 Pro
- Android flagship phones
- Low-power laptops with integrated graphics
Model Selection:
- Default: Qwen2.5-1.5B-Instruct (~1.0GB)
- Can run: Lightweight models up to 1.5B parameters
- Inference Speed: ~5-10 tokens/sec
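The speeds listed for each grade are rough expectations; real throughput depends on the exact GPU, browser, and prompt. Timing a streamed completion gives a quick estimate on a given device. The sketch below assumes an OpenAI-compatible streaming interface (`stream: true` returning an async iterable), which may differ from the SDK’s actual API:

```typescript
// Rough tokens/sec estimate: count streamed chunks over wall-clock time.
// Assumes `client` was created with createClient() and supports `stream: true`.
const start = performance.now()
let chunks = 0

const stream = await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a short paragraph about WebGPU.' }],
  stream: true
})

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) chunks++ // roughly one token per chunk
}

const seconds = (performance.now() - start) / 1000
console.log(`~${(chunks / seconds).toFixed(1)} tokens/sec`)
```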
Checking Device Capability
Use the checkCapability() function to inspect your device grade:
```typescript
import { checkCapability } from '@webllm-io/sdk'

const capability = await checkCapability()

console.log(capability)
// {
//   webgpu: true,
//   gpu: { vendor: 'Apple', name: 'Apple M1 Pro', vram: 5120 },
//   grade: 'A',
//   connection: { type: 'wifi', downlink: 10, saveData: false },
//   battery: { level: 0.85, charging: false },
//   memory: 16
// }
```

This is useful for debugging or showing system information in your UI:

```typescript
// Show capability to user
const cap = await checkCapability()
document.getElementById('gpu-info').textContent =
  `Device Grade: ${cap.grade} (${cap.gpu?.vram ?? 0}MB estimated VRAM)`
```

Model Size vs Quality Trade-offs
Understanding the trade-offs helps in custom tier configuration:
| Model Parameters | Model Size | Quality | Speed | VRAM Required |
|---|---|---|---|---|
| 70B (q3f16_1) | ~40 GB | Excellent | Very Slow | S grade only |
| 13B (q4f16_1) | ~8 GB | Very Good | Slow | S grade |
| 8B (q4f16_1) | ~5.5 GB | Good | Medium | S/A grade |
| 3.5B (q4f16_1) | ~2.2 GB | Fair | Fast | B grade |
| 1.5B (q4f16_1) | ~1.0 GB | Basic | Very Fast | C grade |

Quantization Levels:

- q4f16_1: 4-bit weights, 16-bit activations (good balance)
- q3f16_1: 3-bit weights, 16-bit activations (higher compression)
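The sizes in the table follow a simple rule of thumb: weight memory ≈ parameter count × bits per weight ÷ 8, plus overhead for activations and KV cache. A back-of-the-envelope check (not SDK code):

```typescript
// Rough weight size: parameters × (bits per weight / 8) bytes.
// Actual VRAM use is higher because of activations and KV cache.
function approxWeightSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * 1e9 * bitsPerWeight / 8) / 1e9
}

console.log(approxWeightSizeGB(8, 4))    // 4 GB of weights → ~5.5 GB total (q4f16_1)
console.log(approxWeightSizeGB(1.5, 4))  // 0.75 GB of weights → ~1.0 GB total
```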
Handling Detection Failures
If WebGPU is unavailable or detection fails, the SDK falls back to cloud:
```typescript
const client = createClient({
  local: 'auto',
  cloud: {
    baseURL: 'https://api.openai.com/v1',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o-mini'
  }
})

// If WebGPU unavailable → automatically uses cloud
await client.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }]
})
```

Detection Edge Cases
- Browser doesn’t support WebGPU
  - Solution: Fall back to cloud backend
  - Affected: Safari < 18, Firefox < 120 (partial support)
- WebGPU disabled by user/policy
  - Solution: Cloud fallback
  - Common in enterprise environments with strict security policies
- Adapter request denied
  - Solution: Cloud fallback
  - Rare; may occur on headless systems or virtual machines
- Unusually low maxStorageBufferBindingSize
  - Solution: Assign C grade and use lightweight model
  - May occur on very old integrated GPUs
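If you implement your own detection (like the detectDeviceGrade example earlier), the first three cases reduce to a plain try/catch; with local: 'auto' plus a cloud config the SDK performs this fallback for you. A minimal sketch, where 'cloud' is just an illustrative sentinel value:

```typescript
// Sketch: degrade to a cloud backend whenever WebGPU detection fails.
// `detectDeviceGrade` is the function from the earlier example.
async function detectOrFallback(): Promise<'S' | 'A' | 'B' | 'C' | 'cloud'> {
  try {
    return await detectDeviceGrade() // throws if WebGPU or the adapter is unavailable
  } catch {
    return 'cloud'
  }
}
```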
Best Practices
1. Trust Auto-Detection for Most Use Cases
The default model selection is optimized for each grade:
```typescript
// Recommended for most apps
const client = createClient({ local: 'auto' })
```

2. Customize Tiers for Specific Needs
If you need specific quality/speed trade-offs:
```typescript
const client = createClient({
  local: {
    tiers: {
      high: 'Llama-3.1-8B-Instruct-q4f16_1-MLC',   // Quality focus
      medium: 'Phi-3.5-mini-instruct-q4f16_1-MLC', // Balanced
      low: 'TinyLlama-1.1B-Chat-q4f16_1-MLC'       // Speed focus
    }
  }
})
```

3. Show Capability Info to Users
Let users know what they’re getting:
```typescript
const cap = await checkCapability()
showNotification(
  `Running locally on Grade ${cap.grade} device (${cap.gpu?.vram ?? 0}MB VRAM)`
)
```

4. Test Across Device Grades
Mock the WebGPU adapter limits in development builds to simulate lower-grade devices:
```typescript
// Mock lower-grade device for testing
if (import.meta.env.DEV) {
  Object.defineProperty(navigator.gpu, 'requestAdapter', {
    value: async () => ({
      limits: { maxStorageBufferBindingSize: 2048 * 1024 * 1024 } // Force B grade
    })
  })
}
```

Next Steps
- Learn about the Three-Level API for progressive configuration
- Understand Request Queue for concurrent inference
- Explore Architecture for overall system design