Run a Language Model in Your Browser With WebGPU: How It Works
WebGPU lets browsers use the GPU for AI inference. Here is how this enables full LLMs to run locally in Chrome and Edge without any server.
Running a language model in a browser sounds like it shouldn't work. These models are supposed to need massive servers. But WebGPU changes what's possible, and the results are genuinely surprising.
What WebGPU is
WebGPU is a web standard (supported in Chrome 113+ and, to varying degrees, other modern browsers) that gives web applications low-level access to your GPU, including general-purpose compute shaders. This is not a slow emulation layer: it is real GPU compute that can run thousands of operations in parallel. This is the same hardware that games use for rendering and that machine learning engineers use for training.
Before WebGPU, browser-based computation was limited to WebGL (designed for graphics, not general compute) or WebAssembly running on the CPU (much slower for this kind of work). WebGPU makes browser-based neural network inference fast enough to be usable.
How a model runs in the browser
The model file (a quantized set of learned weights, typically 2-8 GB for a usable chat model) is downloaded once and cached by the browser. A JavaScript library like Transformers.js or WebLLM (from the MLC-LLM project) loads the weights and handles the computation using WebGPU. When you send a message to our Browser AI Chat, the inference runs directly on your GPU through the browser, with no server involved after that initial download.
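The "download once, then reuse" step can be sketched roughly as follows. This is a minimal illustration, not how any particular library does it: `WeightCache`, `MemoryCache`, `Downloader`, and `loadWeights` are hypothetical names invented for this example, and a real page would typically back the cache with the browser's Cache API or IndexedDB rather than an in-memory map.

```typescript
// Hypothetical minimal cache interface. In a real browser app this role is
// usually played by the Cache API or IndexedDB.
interface WeightCache {
  get(url: string): Promise<ArrayBuffer | undefined>;
  set(url: string, data: ArrayBuffer): Promise<void>;
}

// Hypothetical download function; in a browser this would wrap fetch().
type Downloader = (url: string) => Promise<ArrayBuffer>;

// In-memory stand-in so the sketch runs anywhere.
class MemoryCache implements WeightCache {
  private store = new Map<string, ArrayBuffer>();

  async get(url: string): Promise<ArrayBuffer | undefined> {
    return this.store.get(url);
  }

  async set(url: string, data: ArrayBuffer): Promise<void> {
    this.store.set(url, data);
  }
}

// Return cached weights if present; otherwise download them and cache them.
async function loadWeights(
  url: string,
  cache: WeightCache,
  download: Downloader,
): Promise<ArrayBuffer> {
  const cached = await cache.get(url);
  if (cached !== undefined) {
    return cached; // second and later visits skip the multi-gigabyte download
  }
  const data = await download(url);
  await cache.set(url, data);
  return data;
}
```

Calling `loadWeights` twice with the same URL and cache triggers only one download, which is why the first visit is slow and later visits start much faster.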
Performance you can expect
On a laptop with an integrated GPU, you might see 5-15 tokens per second. On a machine with a dedicated GPU (like an NVIDIA RTX 3060 or better), you can reach 30-60 tokens per second. That's fast enough for natural conversation. On a low-end device without WebGPU support, the model falls back to the CPU and runs at roughly 1-3 tokens per second.
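To get a feel for what those speeds mean in practice, here is a small back-of-the-envelope calculation. It assumes a 200-token reply (an arbitrary length chosen for illustration) and ignores the time spent processing your prompt, so treat the results as rough lower bounds. The names `SpeedRange` and `secondsForReply` are made up for this sketch.

```typescript
// Token-per-second ranges quoted in the text above.
interface SpeedRange {
  label: string;
  minTps: number;
  maxTps: number;
}

const ranges: SpeedRange[] = [
  { label: "CPU fallback", minTps: 1, maxTps: 3 },
  { label: "integrated GPU", minTps: 5, maxTps: 15 },
  { label: "dedicated GPU", minTps: 30, maxTps: 60 },
];

// Time to generate a reply, ignoring prompt processing.
function secondsForReply(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

const replyTokens = 200; // assumed reply length, for illustration only

for (const r of ranges) {
  const fastest = secondsForReply(replyTokens, r.maxTps);
  const slowest = secondsForReply(replyTokens, r.minTps);
  console.log(`${r.label}: ${fastest.toFixed(1)}-${slowest.toFixed(1)} s`);
}
```

Under these assumptions a dedicated GPU finishes the reply in a few seconds, an integrated GPU takes somewhere between about 13 and 40 seconds, and the CPU fallback takes over a minute.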
Browser requirements
- Chrome 113 or later: best WebGPU support (desktop first; Android support arrived in later versions)
- Edge 113 or later: same as Chrome (both use the Chromium engine)
- Firefox: experimental WebGPU support, behind a flag as of mid-2025
- Safari: WebGPU available only behind a feature flag in Safari 17 (macOS Sonoma and iOS 17); enabled by default starting with Safari 26
If WebGPU is unavailable, the tool automatically falls back to WebAssembly on the CPU, which works but runs slower.
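A fallback check along these lines might look like the sketch below. The `navigator.gpu.requestAdapter()` call is part of the WebGPU API and resolves to `null` when no suitable GPU is available; everything else (`pickBackend`, `Backend`, and the small `NavigatorLike` and `GpuLike` types) is a hypothetical stand-in so the example compiles without extra type packages. It is not necessarily how our tool or any specific library implements the check.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Decide which backend to use: WebGPU if an adapter is available,
// otherwise WebAssembly on the CPU.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) {
    return "wasm"; // browser does not expose WebGPU at all
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    // requestAdapter() resolves to null when no usable GPU is found.
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm"; // treat any failure as "no WebGPU"
  }
}

// In a browser you would call something like:
//   const backend = await pickBackend(navigator as NavigatorLike);
```

Passing the navigator object in as a parameter, rather than reading the global directly, keeps the function easy to test with fake objects.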
Memory requirements
A 7-billion-parameter model at 4-bit quantization needs roughly 4-6 GB of RAM/VRAM: about 3.5 GB for the weights themselves, plus working memory for the conversation context. If you're running Chrome with 20 tabs open on an 8 GB machine, the model may load slowly or fail to load at all. Close other tabs and applications before running a browser LLM for best results.
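Here is roughly where a number in that range can come from. The weight size follows directly from the parameter count and bit width. The context (KV cache) size depends on the model's architecture, so the values below (32 layers, 32 attention heads of dimension 128, 16-bit cache entries, a 4,096-token context) are assumptions modeled on a typical 7B design, not measurements of any specific model. Real 4-bit formats also store some extra scaling data, so actual files tend to be a little larger than this estimate.

```typescript
const GB = 1e9; // decimal gigabytes

// Bytes needed to store the weights alone.
function weightBytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8;
}

// Bytes needed for the key/value cache at a given context length.
// The factor of 2 covers one key and one value vector per token, per layer.
function kvCacheBytes(
  layers: number,
  kvHeads: number,
  headDim: number,
  contextTokens: number,
  bytesPerValue: number,
): number {
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerValue;
}

const weights = weightBytes(7e9, 4); // 7B parameters at 4 bits each

// Assumed architecture values for illustration only.
const kvCache = kvCacheBytes(32, 32, 128, 4096, 2);

console.log(`weights:  ${(weights / GB).toFixed(2)} GB`);
console.log(`KV cache: ${(kvCache / GB).toFixed(2)} GB`);
console.log(`total:    ${((weights + kvCache) / GB).toFixed(2)} GB`);
```

With these assumptions the weights come to 3.5 GB and a full 4,096-token context adds a bit over 2 GB, which lands inside the 4-6 GB range. Shorter conversations use less cache, so memory use grows as the chat gets longer.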