# The Arboria Swarm Benchmark

A fixed set of scenarios and baselines. Every Arboria paper, and any follow-on work that wants to be comparable, reports against the same matrix. This is what separates "we tuned flocking" from "we made measurable progress against prior art."

Source: `gossamer.benchmarks`.
## Scenarios

Each scenario defines an initial state, a per-step reward, and a terminal metric. The full reference is in `gossamer/benchmarks/scenarios.py`; short form:
| Name | Question | Terminal metric | Agent range |
|---|---|---|---|
| `dispersal` | How fast can a clumped swarm spread without colliding? | Mean nearest-neighbor distance at termination | 100 – 10,000 |
| `rendezvous` | How fast does a scattered swarm meet at a common point? | Final mean distance to centroid (lower is better) | 100 – 10,000 |
| `coverage` | Explore a bounded region; maximize cells visited per unit time. | Unique cells visited / total cells | 500 – 50,000 |
| `leader_follower` | One agent is exogenously driven; keep the swarm within range. | Mean follower distance to leader path | 100 – 10,000 |
| `predator_prey` | Adversarial agents chase; measure survival and evasion. | Survival rate | 100 – 10,000 |
| `byzantine` | Inject k% silently faulty agents; measure robustness of the base scenario. | Terminal metric under perturbation | 100 – 10,000 |
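For concreteness, the `rendezvous` terminal metric described above (final mean distance to the swarm centroid) is a few lines of NumPy. This is a sketch of the metric's definition, not the implementation in `gossamer/benchmarks/scenarios.py`:

```python
import numpy as np

def rendezvous_metric(pos: np.ndarray) -> float:
    """Mean distance from each agent to the swarm centroid (lower is better).

    pos has shape (num_agents, dims).
    """
    centroid = pos.mean(axis=0)
    return float(np.linalg.norm(pos - centroid, axis=1).mean())
```

A swarm that has fully converged to one point scores exactly 0.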
## Baselines

Every new policy must report against the same set so reviewers have a stable reference point:

- `random`: uniform random accelerations; the lower bound.
- `greedy`: a scenario-appropriate hand-crafted greedy policy (go-to-centroid for rendezvous, push-from-nearest for dispersal, persistent random walk for coverage, etc.).
- `gossamer_flocking`: classical Boids via `gossamer.algorithms.coordination.flocking.flock_step`.
- `mappo`: a learned policy trained via `gossamer.learning.mappo` against the same scenarios.
Add your own policy to the leaderboard by implementing a `Baseline` callable (`(pos, vel, rng) -> accel`) and passing it to `gossamer.benchmarks.run_benchmark`.
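A minimal `Baseline` callable might look like the following. The `(pos, vel, rng) -> accel` signature comes from above; the commented registration call is an assumed usage, so check `run_benchmark`'s actual parameters before copying it:

```python
import numpy as np

def random_baseline(
    pos: np.ndarray, vel: np.ndarray, rng: np.random.Generator
) -> np.ndarray:
    """Uniform random accelerations in [-1, 1] per axis -- a lower-bound policy
    in the spirit of the `random` baseline."""
    return rng.uniform(-1.0, 1.0, size=pos.shape)

# Hypothetical registration -- verify against the real signature:
# from gossamer.benchmarks import run_benchmark
# run_benchmark(scenario="dispersal", baseline=random_baseline)
```

The callable is pure (state in, acceleration out), which is what lets the harness sweep seeds and agent counts around it.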
## Leaderboard

Run `leaderboard()` over any subset of scenarios and baselines; `generate_leaderboard_md()` emits a Markdown table ready to paste into a paper or the docs. The canonical leaderboard is regenerated on every tool release and published alongside `latest.mdx`.

```python
from gossamer.benchmarks import leaderboard, generate_leaderboard_md

results = leaderboard(num_seeds=5)
print(generate_leaderboard_md(results))
```

## Reproducibility
Each benchmark row carries: `scenario`, `baseline`, `num_agents`, `steps`, `seed`, `metric`, `mean_reward`, `elapsed_sec`. Seeds are published; rerunning with the same seed yields identical numbers.
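The determinism contract is the standard seeded-generator property. A toy stand-in for a benchmark row (the names here are illustrative, not gossamer API) shows the guarantee the published seeds rely on:

```python
import numpy as np

def toy_metric(seed: int, steps: int = 100) -> float:
    """A metric driven entirely by the seed: same seed, same number."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(size=steps).sum())
```

Any benchmark built only from a seeded `Generator` replays bit-for-bit.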
The benchmark harness uses a pure-NumPy stepper by default, so the suite has no physics-engine dependency at test time. For numbers that feed directly into a Leviathan paper, re-run with `ENGINE_MODE=inprocess` to route through the actual C++ core.
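A pure-NumPy stepper of the kind described can be as small as a clamped Euler update, since the whole state advance is array arithmetic. This is a sketch under assumed dynamics (`dt`, `v_max` are invented here), not the harness's actual stepper:

```python
import numpy as np

def numpy_step(pos, vel, accel, dt=0.1, v_max=1.0):
    """One Euler step with a speed clamp. pos/vel/accel: (num_agents, dims)."""
    vel = vel + dt * accel
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    # Rescale any agent exceeding v_max; the epsilon avoids division by zero.
    scale = np.minimum(1.0, v_max / np.maximum(speed, 1e-12))
    vel = vel * scale
    return pos + dt * vel, vel
```

Because nothing here touches an engine, the same scenario code runs in CI and, with `ENGINE_MODE=inprocess`, against the C++ core.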