Phase 0 result · 2026-04-20

The physics works.

We measured cross-machine inference over a private network between two consumer machines. The gate was “less than 2× overhead vs. running on the faster machine alone.” We came in at 0.87×, meaning the distributed run finished faster than the baseline.

What we tested

Two machines, two Ollama instances, one model (llama3.2:1b), one coordinator. The task: eight independent sub-prompts, split four to each machine, run in parallel, results concatenated. Baseline: the same eight prompts run sequentially on the faster machine alone.

Network: a private mesh between the two machines. No third-party API. No data left the boxes.

Headlines

1.15×: distributed run faster than single-machine baseline (2,515 ms vs. 2,903 ms)
0.87×: overhead ratio, gate PASSED (< 2.0)
22 ms: average added latency per remote request (13–30 ms across the four remote requests; 2–60 ms across all eight)
~4%: share of total request time spent on the network, on a coarse-grained call
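The two ratio headlines follow directly from the two totals. A minimal sketch of the arithmetic (the variable names are ours, for illustration, and are not taken from the coordinator code):

```python
# Totals from the headlines above, in milliseconds.
baseline_ms = 2903.0     # eight prompts, sequential, faster machine alone
distributed_ms = 2515.0  # four prompts per machine, both machines in parallel

GATE = 2.0  # pass if the distributed run takes less than 2x the baseline

overhead_ratio = distributed_ms / baseline_ms  # below 1.0 means faster than baseline
speedup = baseline_ms / distributed_ms

print(f"overhead ratio: {overhead_ratio:.2f}x")
print(f"speedup:        {speedup:.2f}x")
print("gate:", "PASSED" if overhead_ratio < GATE else "FAILED")
```

The overhead ratio and the speedup are reciprocals of each other, which is why 0.87× and 1.15× describe the same run.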

Sub-request timings (good run, all times in ms)

local  wall=430.4 server=370.0 over=60.4 tokens=45
local  wall=291.4 server=270.2 over=21.3 tokens=32
local  wall=403.8 server=376.2 over=27.6 tokens=48
local  wall=400.6 server=398.4 over= 2.2 tokens=53
remote wall=789.8 server=759.6 over=30.2 tokens=46
remote wall=568.0 server=554.9 over=13.1 tokens=33
remote wall=570.2 server=540.5 over=29.6 tokens=32
remote wall=570.6 server=554.0 over=16.7 tokens=33
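The overhead figures can be recomputed from these rows. In the table, over matches wall minus server to within rounding. A small sketch that parses the rows as printed; the regex and variable names are illustrative assumptions, not the coordinator's report format:

```python
import re
from statistics import mean

TIMINGS = """
local  wall=430.4 server=370.0 over=60.4 tokens=45
local  wall=291.4 server=270.2 over=21.3 tokens=32
local  wall=403.8 server=376.2 over=27.6 tokens=48
local  wall=400.6 server=398.4 over= 2.2 tokens=53
remote wall=789.8 server=759.6 over=30.2 tokens=46
remote wall=568.0 server=554.9 over=13.1 tokens=33
remote wall=570.2 server=540.5 over=29.6 tokens=32
remote wall=570.6 server=554.0 over=16.7 tokens=33
"""

# Tolerate the padding space in "over= 2.2".
ROW = re.compile(
    r"(local|remote)\s+wall=\s*([\d.]+)\s+server=\s*([\d.]+)\s+over=\s*([\d.]+)"
)

rows = [
    (side, float(wall), float(server), float(over))
    for side, wall, server, over in ROW.findall(TIMINGS)
]
local = [r for r in rows if r[0] == "local"]
remote = [r for r in rows if r[0] == "remote"]

mean_remote_over = mean(r[3] for r in remote)           # added latency per remote call
network_share = sum(r[3] for r in remote) / sum(r[1] for r in remote)
local_serial_ms = sum(r[1] for r in local)              # four local calls back to back
remote_serial_ms = sum(r[1] for r in remote)            # four remote calls back to back

print(f"mean remote overhead: {mean_remote_over:.1f} ms")
print(f"network share of remote wall time: {network_share:.1%}")
print(f"serial time per side: local {local_serial_ms:.1f} ms, remote {remote_serial_ms:.1f} ms")
```

On these rows the mean remote overhead is about 22.4 ms, roughly 3.6% of remote wall time. The per-side serial totals are about 1,526 ms (local) and 2,499 ms (remote), which is consistent with the remote machine being the binding leg of the 2,515 ms run.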

The failure mode that taught us how to build this

The first run dispatched all eight sub-prompts in parallel to both machines. Each individual generation, normally ~340 ms, ballooned to 9.6–13 seconds, a 17× per-request slowdown. Total time: 6.47× slower than the baseline.

The cause is well known if you’ve looked: bare Ollama doesn’t have continuous batching. Hammer it with concurrent requests and it thrashes. The fix is one of two things:

Schedule serially per provider at the orchestrator. Parallelism comes from distributing across providers, not from concurrent requests at one.
Or run a serving engine that batches properly, such as vLLM or llama-server --cont-batching.

This is now an architectural condition for the platform, not an optimization. Capacity-aware concurrency control at the routing layer. Documented before we built it.
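The first fix, serial per provider with parallelism across providers, can be sketched with asyncio. This is a sketch under stated assumptions, not the coordinator in phase-0-poc/: fake_generate is a hypothetical stand-in for the HTTP call to one machine's inference server, and the provider names are made up.

```python
import asyncio
from collections import defaultdict

in_flight = defaultdict(int)      # requests currently running, per provider
max_in_flight = defaultdict(int)  # high-water mark, per provider

async def fake_generate(provider: str, prompt: str) -> str:
    """Hypothetical stand-in for one request to a provider's inference server."""
    in_flight[provider] += 1
    max_in_flight[provider] = max(max_in_flight[provider], in_flight[provider])
    await asyncio.sleep(0.01)  # simulated generation time
    in_flight[provider] -= 1
    return f"{provider}:{prompt}"

async def run_provider(provider: str, prompts: list[str]) -> list[str]:
    # Serial per provider: wait for each reply before sending the next request.
    results = []
    for prompt in prompts:
        results.append(await fake_generate(provider, prompt))
    return results

async def run_all(assignment: dict[str, list[str]]) -> list[str]:
    # Parallel across providers: one serial worker per provider, run concurrently.
    per_provider = await asyncio.gather(
        *(run_provider(name, prompts) for name, prompts in assignment.items())
    )
    # Concatenate in assignment order.
    return [r for results in per_provider for r in results]

prompts = [f"p{i}" for i in range(8)]
assignment = {"local": prompts[:4], "remote": prompts[4:]}  # the 50/50 split
results = asyncio.run(run_all(assignment))

print(results)
print(dict(max_in_flight))  # each provider should peak at one request in flight
```

The high-water mark is the point of the exercise: no provider ever sees more than one request at a time, while both providers stay busy.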

What we are not claiming

True layer-pipelined inference across the public internet (one model sharded across many machines). Phase 0 was capacity pooling, not layer parallelism. The thesis is sound; the wire test is open.
Speedup beyond the slower machine’s share. Hardware asymmetry made the slower box the binding leg. A capability-aware split would push speedup higher; we ran a fair 50/50 to keep the test honest.
Anything about cost, billing, or production SLAs. This was a physics test, not a service.
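To make the capability-aware split concrete, here is a back-of-envelope sketch. It is a projection, not a measurement: we did not run this split. It assumes each machine's cost per request equals its mean wall time in the good run (381.55 ms local, 624.65 ms remote, computed from the timing table) and ignores variation in token counts. The function and its rounding rule are illustrative, not part of the coordinator.

```python
def split_by_capacity(n_items: int, mean_ms: dict[str, float]) -> dict[str, int]:
    """Assign request counts in proportion to each provider's throughput."""
    rates = {name: 1.0 / ms for name, ms in mean_ms.items()}
    total_rate = sum(rates.values())
    exact = {name: n_items * rate / total_rate for name, rate in rates.items()}

    # Largest-remainder rounding, so the counts always sum to n_items.
    counts = {name: int(share) for name, share in exact.items()}
    leftover = n_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

def estimated_makespan(counts: dict[str, int], mean_ms: dict[str, float]) -> float:
    """Serial per provider, parallel across providers: the slowest side decides."""
    return max(counts[name] * mean_ms[name] for name in counts)

# Assumed per-request cost: mean wall time per side from the good run.
mean_ms = {"local": 381.55, "remote": 624.65}

even = {"local": 4, "remote": 4}
weighted = split_by_capacity(8, mean_ms)

print("even split:    ", even, f"{estimated_makespan(even, mean_ms):.0f} ms")
print("weighted split:", weighted, f"{estimated_makespan(weighted, mean_ms):.0f} ms")
```

Under those assumptions the weighted split is five local and three remote, with an estimated finish around 1,908 ms against about 2,499 ms for the even split. Treat that as a rough direction, not a result.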

What this unlocks

Network overhead at 22 ms on a coarse request is a small tax, not a wall. That means a global pool of contributors, served async, is realistic. Here’s how that becomes a network.

Reproducibility

The coordinator code, the raw report JSONs (good run and failing run), and a run-poc.sh script all live in our project repo under phase-0-poc/. The exact invocation that produces the gate-passing run is documented in-tree. We’ll publish a public mirror with the launch.

Source data: phase-0-poc/report-B-serial-per-side.json (good run) and report-A-all-parallel.json (failing run). Strategy alternatives, latency breakdowns, failure modes, and conditions for advancing to Phase 1 are in PHASE-0-RESULT.md in the same tree.