WebSAM | Xevion

WebSAM puts the Segment Anything models from Meta inside a browser tab. You load an image, click on something, and it gets cut out. Nothing is uploaded and no server runs the model: inference happens locally, on your own GPU where WebGPU is available and on the CPU otherwise, and that constraint is the whole reason it was worth building.

How it works

Segment Anything takes an image and a prompt, which can be a point, a box, or nothing at all, and returns a mask for whatever you pointed at. WebSAM runs that entire pipeline client-side. You pick a model, and the only server involved, a Cloudflare Worker, hands back a short-lived presigned URL for the weights it keeps on Cloudflare R2. The browser streams them down once, caches them in the origin private file system (OPFS), the sandboxed storage a browser gives each origin, and after that every click is local. That Worker never sees your image and never runs the model.
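The download-once flow can be sketched roughly as below. This is a minimal illustration, not WebSAM's actual code: the `/api/presign` route, its `{ url }` response shape, and the names `loadWeights` and `readCached` are assumptions, and the sketch buffers the whole download instead of streaming it to keep things short. The small `*Like` interfaces mirror the parts of the OPFS and fetch APIs it touches.

```typescript
// Structural stand-ins for OPFS handles and fetch, so the sketch needs no DOM
// typings. In a browser, `root` would be `await navigator.storage.getDirectory()`
// and `fetchFn` would be the global `fetch`.
interface WritableLike {
  write(data: ArrayBuffer): Promise<void>;
  close(): Promise<void>;
}
interface FileHandleLike {
  getFile(): Promise<{ arrayBuffer(): Promise<ArrayBuffer> }>;
  createWritable(): Promise<WritableLike>;
}
interface DirectoryLike {
  getFileHandle(name: string, options?: { create?: boolean }): Promise<FileHandleLike>;
}
interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
  arrayBuffer(): Promise<ArrayBuffer>;
}
type FetchLike = (url: string) => Promise<ResponseLike>;

// Hypothetical Worker route; the real endpoint is not named in the text.
const PRESIGN_ENDPOINT = "/api/presign";

// Return the cached bytes, or null when the file has never been stored.
async function readCached(root: DirectoryLike, name: string): Promise<ArrayBuffer | null> {
  try {
    const handle = await root.getFileHandle(name);
    return await (await handle.getFile()).arrayBuffer();
  } catch (err) {
    // OPFS reports a missing file as a DOMException named "NotFoundError".
    if ((err as { name?: string }).name === "NotFoundError") return null;
    throw err;
  }
}

async function loadWeights(
  modelId: string,
  root: DirectoryLike,
  fetchFn: FetchLike,
): Promise<ArrayBuffer> {
  const cached = await readCached(root, modelId);
  if (cached) return cached; // every later visit stops here, with no network

  // 1. Ask the Worker for a short-lived presigned URL (assumed response shape).
  const presign = await fetchFn(`${PRESIGN_ENDPOINT}?model=${encodeURIComponent(modelId)}`);
  if (!presign.ok) throw new Error(`presign failed: ${presign.status}`);
  const { url } = (await presign.json()) as { url: string };

  // 2. Download the weights directly from storage.
  const download = await fetchFn(url);
  if (!download.ok) throw new Error(`download failed: ${download.status}`);
  const bytes = await download.arrayBuffer();

  // 3. Persist them in the origin's private file system for next time.
  const handle = await root.getFileHandle(modelId, { create: true });
  const writable = await handle.createWritable();
  await writable.write(bytes);
  await writable.close();
  return bytes;
}
```

Passing the directory handle and fetch function in as arguments is only a convenience for the sketch: it keeps the cache logic testable outside a browser.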

The work splits in two. Pushing the image through the large vision backbone is the expensive step, so it runs a single time per image and the result is kept around. Every click afterward only runs the small decoder against that cached encoding, which is what keeps the interaction feeling instant.
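A minimal sketch of that split, assuming nothing about the real encoder or decoder beyond "one is expensive, one is cheap". `createSegmenter` and its type parameters are hypothetical names; the encoding is memoized per image object, so only the first prompt pays for the backbone.

```typescript
type Encoder<I, E> = (image: I) => Promise<E>;
type Decoder<E, P, M> = (embedding: E, prompt: P) => Promise<M>;

// Wrap an encoder/decoder pair so the encoder runs at most once per image.
function createSegmenter<I extends object, E, P, M>(
  encode: Encoder<I, E>,
  decode: Decoder<E, P, M>,
): (image: I, prompt: P) => Promise<M> {
  // WeakMap: the cached encoding is released when the image itself is dropped.
  const embeddings = new WeakMap<I, Promise<E>>();

  return async function segment(image: I, prompt: P): Promise<M> {
    let pending = embeddings.get(image);
    if (!pending) {
      // Cache the promise, not the value, so clicks that arrive while the
      // encoder is still running share the same in-flight work.
      pending = encode(image);
      embeddings.set(image, pending);
      // Forget a failed encode so the next click can retry it.
      pending.catch(() => embeddings.delete(image));
    }
    return decode(await pending, prompt);
  };
}
```

Every call after the first resolves the cached promise immediately and goes straight to the decoder.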

The models

Three families, trading download size for mask quality:

SAM 2.1: Tiny through Large, 145 MB to 878 MB, runs on WebGPU
SAM 2: the same size range, also WebGPU
SlimSAM-77: a 14 MB INT8 distillation that runs on CPU through WASM, for browsers without WebGPU
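One way to picture the picker is as a filter over a small catalog. The sketch below is illustrative only: the ids and the `pickerOrder` function are made up, and the catalog lists just the variants whose sizes are stated above, leaving out the intermediate ones.

```typescript
type Backend = "webgpu" | "wasm";

interface ModelOption {
  id: string; // hypothetical identifier
  family: "SAM 2.1" | "SAM 2" | "SlimSAM-77";
  backend: Backend;
  sizeMB: number;
}

// Only the sizes quoted in the list above; variants between Tiny and Large are omitted.
const CATALOG: ModelOption[] = [
  { id: "sam2.1-large", family: "SAM 2.1", backend: "webgpu", sizeMB: 878 },
  { id: "sam2.1-tiny", family: "SAM 2.1", backend: "webgpu", sizeMB: 145 },
  { id: "sam2-large", family: "SAM 2", backend: "webgpu", sizeMB: 878 },
  { id: "sam2-tiny", family: "SAM 2", backend: "webgpu", sizeMB: 145 },
  { id: "slimsam-77", family: "SlimSAM-77", backend: "wasm", sizeMB: 14 },
];

// Keep the models this browser can run, smallest download first.
function pickerOrder(catalog: ModelOption[], hasWebGPU: boolean): ModelOption[] {
  return catalog
    .filter((m) => m.backend === "wasm" || hasWebGPU)
    .sort((a, b) => a.sizeMB - b.sizeMB);
}
```

In a browser, `hasWebGPU` could come from checking that `navigator.gpu` exists and that `navigator.gpu.requestAdapter()` resolves to a non-null adapter. Without WebGPU the list collapses to SlimSAM-77 alone.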

The hard parts

Getting the SAM 2 encoders onto WebGPU took a detour: loading the raw ONNX graph crashed ONNX Runtime's graph optimizer during a transpose pass. The fix is to convert each encoder offline into the pre-optimized ORT format with ONNX Runtime's convert_onnx_models_to_ort tool, so the browser never runs the step that breaks.

Why convert the encoders offline?

ONNX Runtime can optimize a model graph when it loads one, but doing that in the browser means shipping the optimizer and paying the cost on every cold start, and here it was the exact step that crashed. The .ort format bakes the already-optimized graph into the file itself.

So each encoder is converted once, ahead of time, and the resulting .ort file is uploaded to Cloudflare R2 rather than shipped with the app. The browser pulls it through a short-lived presigned URL and caches it in OPFS, so the optimizer never runs on the client and the heavy weights stay out of the deploy.

SAM 1, the architecture SlimSAM is distilled from, and SAM 2 also disagree at the tensor level, with different label dtypes and a different number of encoder outputs, so the decoder carries two prompt-encoding paths instead of one. And since a single decode can still stall a frame, the whole inference layer runs in a Web Worker. The canvas never waits on the model.
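The two prompt-encoding paths might look something like the sketch below. The concrete dtypes are assumptions chosen for illustration (int64 labels for the SAM 1 path, int32 for SAM 2); the text only says that they differ. The function name and types are hypothetical, and the differing encoder outputs are not modeled here. The label values follow the usual Segment Anything convention of 1 for foreground and 0 for background.

```typescript
type Family = "sam1" | "sam2";

interface PointPrompt {
  x: number;
  y: number;
  foreground: boolean;
}

interface EncodedPoints {
  coords: Float32Array; // flattened [x0, y0, x1, y1, ...]
  labels: BigInt64Array | Int32Array; // dtype depends on the model family
}

function encodePoints(family: Family, points: PointPrompt[]): EncodedPoints {
  const coords = new Float32Array(points.length * 2);
  points.forEach((p, i) => {
    coords[2 * i] = p.x;
    coords[2 * i + 1] = p.y;
  });

  // Same label values either way; only the typed array differs.
  const labels =
    family === "sam1"
      ? BigInt64Array.from(points, (p) => BigInt(p.foreground ? 1 : 0)) // assumed: int64
      : Int32Array.from(points, (p) => (p.foreground ? 1 : 0)); // assumed: int32

  return { coords, labels };
}
```

Keeping the branch inside one function means the rest of the decoder code only has to know which family it is talking to, not which typed array that family expects.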

Using it

Point mode marks foreground with a click and background with a right-click. Box mode drags a rectangle. Everything mode segments the whole image at once. Moving the cursor previews the mask underneath before you commit to it, and masks draw with animated outlines rather than a flat fill, so the edge stays visible against the image.

Keyboard shortcuts cover the rest: P and B switch between point and box, Esc clears every prompt, and D downloads the mask (Shift+D for the transparent cut-out).
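The shortcut table is small enough to express as one pure function. This is a sketch with hypothetical action names, not the app's actual handler; `KeyLike` mirrors the two `KeyboardEvent` fields it reads.

```typescript
type Action =
  | "point-mode"
  | "box-mode"
  | "clear-prompts"
  | "download-mask"
  | "download-cutout";

// The subset of KeyboardEvent this mapping needs.
interface KeyLike {
  key: string;
  shiftKey: boolean;
}

function actionForKey(event: KeyLike): Action | null {
  // Lowercasing lets "d" and "D" (what the browser reports with Shift held)
  // share a case; the Shift state is read separately.
  switch (event.key.toLowerCase()) {
    case "p":
      return "point-mode";
    case "b":
      return "box-mode";
    case "escape":
      return "clear-prompts";
    case "d":
      return event.shiftKey ? "download-cutout" : "download-mask";
    default:
      return null; // not a shortcut; let the event through
  }
}
```

A `keydown` listener could call this and dispatch on the result, ignoring `null`.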

Honest limitations

The Large weights are 878 MB. Instant once cached, but a real commitment on first load, which is why the picker leads with the smaller encoders and SlimSAM stays the works-anywhere default.