# The Arboria Swarm Benchmark

A fixed set of scenarios and baselines. Every Arboria paper, and any follow-on work that wants to be comparable, reports against the same matrix. This is what separates "we tuned flocking" from "we made measurable progress against prior art."

Source: `gossamer.benchmarks`.
## Scenarios

Each scenario defines an initial state, a per-step reward, and a terminal metric. The full reference is in `gossamer/benchmarks/scenarios.py`; short form:
| Name | Question | Terminal metric | Agent range |
|---|---|---|---|
| `dispersal` | How fast can a clumped swarm spread without colliding? | Mean nearest-neighbor distance at termination | 100 – 10,000 |
| `rendezvous` | How fast does a scattered swarm meet at a common point? | Final mean distance to centroid (lower is better) | 100 – 10,000 |
| `coverage` | Explore a bounded region; maximize cells visited per unit time. | Unique cells visited / total cells | 500 – 50,000 |
| `leader_follower` | One agent is exogenously driven; keep the swarm within range. | Mean follower distance to leader path | 100 – 10,000 |
| `predator_prey` | Adversarial agents chase; measure survival and evasion. | Survival rate | 100 – 10,000 |
| `byzantine` | Inject k% silently faulty agents; measure robustness of the base scenario. | Terminal metric under perturbation | 100 – 10,000 |
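For concreteness, the `rendezvous` terminal metric described above (final mean distance to the swarm centroid) is a few lines of NumPy. This is a sketch of the metric's definition, not the implementation in `gossamer/benchmarks/scenarios.py`:

```python
import numpy as np

def rendezvous_metric(pos: np.ndarray) -> float:
    """Mean distance from each agent to the swarm centroid (lower is better).

    pos has shape (num_agents, dims).
    """
    centroid = pos.mean(axis=0)
    return float(np.linalg.norm(pos - centroid, axis=1).mean())
```

A swarm that has fully converged to one point scores exactly 0.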
## Baselines

Every new policy must report against the same set so reviewers have a stable reference point:

- `random`: uniform random accelerations; the lower bound.
- `greedy`: a scenario-appropriate hand-crafted greedy policy (go-to-centroid for rendezvous, push-from-nearest for dispersal, persistent random walk for coverage, etc.).
- `gossamer_flocking`: classical Boids via `gossamer.algorithms.coordination.flocking.flock_step`.
- `mappo`: a learned policy trained via `gossamer.learning.mappo` against the same scenarios.
Add your own policy to the leaderboard by implementing a `Baseline` callable (`(pos, vel, rng) -> accel`) and passing it to `gossamer.benchmarks.run_benchmark`.
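A minimal `Baseline` callable might look like the following. The `(pos, vel, rng) -> accel` signature comes from above; the commented registration call is an assumed usage, so check `run_benchmark`'s actual parameters before copying it:

```python
import numpy as np

def random_baseline(
    pos: np.ndarray, vel: np.ndarray, rng: np.random.Generator
) -> np.ndarray:
    """Uniform random accelerations in [-1, 1] per axis -- a lower-bound policy
    in the spirit of the `random` baseline."""
    return rng.uniform(-1.0, 1.0, size=pos.shape)

# Hypothetical registration -- verify against the real signature:
# from gossamer.benchmarks import run_benchmark
# run_benchmark(scenario="dispersal", baseline=random_baseline)
```

The callable is pure (state in, acceleration out), which is what lets the harness sweep seeds and agent counts around it.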
## Leaderboard

Run `leaderboard()` over any subset of scenarios and baselines; `generate_leaderboard_md()` emits a Markdown table ready to paste into a paper or the docs. The canonical leaderboard is regenerated on every tool release and published alongside `latest.mdx`.

```python
from gossamer.benchmarks import leaderboard, generate_leaderboard_md

results = leaderboard(num_seeds=5)
print(generate_leaderboard_md(results))
```

## Reproducibility
Each benchmark row carries: `scenario`, `baseline`, `num_agents`, `steps`, `seed`, `metric`, `mean_reward`, `elapsed_sec`. Seeds are published; rerunning with the same seed yields identical numbers.
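The determinism contract is the standard seeded-generator property. A toy stand-in for a benchmark row (the names here are illustrative, not gossamer API) shows the guarantee the published seeds rely on:

```python
import numpy as np

def toy_metric(seed: int, steps: int = 100) -> float:
    """A metric driven entirely by the seed: same seed, same number."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(size=steps).sum())
```

Any benchmark built only from a seeded `Generator` replays bit-for-bit.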
The benchmark harness uses a pure-NumPy stepper by default, so the suite has no physics-engine dependency at test time. For numbers that feed directly into a Leviathan paper, re-run with `ENGINE_MODE=inprocess` to route through the actual C++ core.
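A pure-NumPy stepper of the kind described can be as small as a clamped Euler update, since the whole state advance is array arithmetic. This is a sketch under assumed dynamics (`dt`, `v_max` are invented here), not the harness's actual stepper:

```python
import numpy as np

def numpy_step(pos, vel, accel, dt=0.1, v_max=1.0):
    """One Euler step with a speed clamp. pos/vel/accel: (num_agents, dims)."""
    vel = vel + dt * accel
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    # Rescale any agent exceeding v_max; the epsilon avoids division by zero.
    scale = np.minimum(1.0, v_max / np.maximum(speed, 1e-12))
    vel = vel * scale
    return pos + dt * vel, vel
```

Because nothing here touches an engine, the same scenario code runs in CI and, with `ENGINE_MODE=inprocess`, against the C++ core.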