Phase 0 result · 2026–04–20
The physics work.
We measured cross–machine inference over a private network between two consumer machines. The gate was “less than 2× overhead vs. running on the faster machine alone.” We came in at 0.87×.
What we tested
Two machines, two Ollama instances, one model (llama3.2:1b), one coordinator. The task: eight independent sub–prompts, split four to each machine, run in parallel, results concatenated. Baseline: the same eight prompts run sequentially on the faster machine alone.
Network: a private mesh between the two machines. No third–party API. No data left the boxes.
Headlines
- 1.15×
- distributed run faster than single–machine baseline (2,515 ms vs. 2,903 ms)
- 0.87×
- overhead ratio — gate PASSED (< 2.0)
- 22 ms
- average added latency per remote request, 13–60 ms range
- ~4 %
- of total request time was network, on a coarse–grained call
Sub–request timings (good run)
local wall=430.4 server=370.0 over=60.4 tokens=45
local wall=291.4 server=270.2 over=21.3 tokens=32
local wall=403.8 server=376.2 over=27.6 tokens=48
local wall=400.6 server=398.4 over= 2.2 tokens=53
remote wall=789.8 server=759.6 over=30.2 tokens=46
remote wall=568.0 server=554.9 over=13.1 tokens=33
remote wall=570.2 server=540.5 over=29.6 tokens=32
remote wall=570.6 server=554.0 over=16.7 tokens=33
The failure mode that taught us how to build this
The first run dispatched all eight sub–prompts in parallel to both machines. Each individual generation, normally ~340 ms, ballooned to 9.6–13 seconds — a 17× per–request slowdown. Total time: 6.47× slower than the baseline.
The cause is well–known if you’ve looked: bare Ollama doesn’t have continuous batching. Hammer it with concurrent requests and it thrashes. The fix is one of two things:
- Schedule serially per provider at the orchestrator. Parallelism comes from distributing across providers, not from concurrent requests at one.
- Or run a serving engine that does batch properly —
vllmorllama-server --cont-batching.
This is now an architectural condition for the platform, not an optimization. Capacity–aware concurrency control at the routing layer. Documented before we built it.
What we are not claiming
- True layer–pipelined inference across the public internet (one model sharded across many machines). Phase 0 was capacity pooling, not layer parallelism. The thesis is sound; the wire test is open.
- Speedup beyond the slower machine’s share. Hardware asymmetry made the slower box the binding leg. A capability–aware split would push speedup higher; we ran a fair 50/50 to keep the test honest.
- Anything about cost, billing, or production SLAs. This was a physics test, not a service.
What this unlocks
Network overhead at 22 ms on a coarse request is a small tax, not a wall. That means a global pool of contributors, served async, is realistic. Here’s how that becomes a network.
Reproducibility
The coordinator code, the raw report JSONs (good run and failing run), and a run-poc.sh all live in our project repo under phase-0-poc/. The exact invocation that produces the gate–passing run is documented in–tree. We’ll publish a public mirror with the launch.
Source data: phase-0-poc/report-B-serial-per-side.json (good run) and report-A-all-parallel.json (failing run). Strategy alternatives, latency breakdowns, failure modes, and conditions for advancing to Phase 1 are in PHASE-0-RESULT.md in the same tree.