Modern MARL, GNNs, and World Models
The Techniques pages cover the classical swarm algorithms — Boids, PSO, ACO, GA, RL. Research today moves through a second layer of methods that sit on top of (or replace) those classical policies. This page is the map.
Multi-Agent Reinforcement Learning (MARL)
MAPPO — Multi-Agent PPO with parameter sharing and a centralized critic. The current default baseline for continuous-control swarm tasks because it scales, is stable, and lands sensible policies with modest sample budgets. Reference: Yu et al., 2021, The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games.
QMIX / VDN / MADDPG — value-decomposition and mixed actor-critic families. Useful when rewards are discrete or sparse; less suited to continuous physics than MAPPO for our use cases.
IPPO — independent PPO per agent, no centralized critic. A useful ablation comparator to measure how much the centralized critic buys.
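The structural difference between these variants fits in a few lines. The following is a hypothetical toy, not the gossamer API: linear maps stand in for the actor and critic networks, but the information flow (shared actor parameters, local observations at execution time, the joint state at the critic during training) is the CTDE pattern MAPPO uses, while dropping the centralized critic recovers IPPO.

```python
# Toy CTDE sketch (illustrative only, not gossamer.learning.mappo).
# Plain linear maps stand in for the actor and critic networks.
import random

N_AGENTS, OBS_DIM = 3, 4

def make_weights(dim):
    return [random.uniform(-0.1, 0.1) for _ in range(dim)]

random.seed(0)
actor_w = make_weights(OBS_DIM)              # one shared actor, reused by every agent
critic_w = make_weights(OBS_DIM * N_AGENTS)  # centralized critic sees the joint observation

def act(obs):
    """Decentralized execution: each agent acts on its local observation only."""
    return sum(w * x for w, x in zip(actor_w, obs))

def value(joint_obs):
    """Centralized training: the critic scores the concatenated joint state."""
    return sum(w * x for w, x in zip(critic_w, joint_obs))

observations = [[random.random() for _ in range(OBS_DIM)] for _ in range(N_AGENTS)]
actions = [act(o) for o in observations]                # per-agent, local info only
baseline = value([x for o in observations for x in o])  # joint info, training only
print(len(actions), isinstance(baseline, float))
```

IPPO is this same sketch with `value` replaced by a per-agent critic over `obs` alone, which is what makes it a clean ablation for the centralized critic.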
The Arboria reference implementation lives in gossamer.learning.mappo; production runs use CleanRL or MARLlib.
Graph Neural Network policies
Swarms are naturally graphs: agents as nodes, near-neighbor interactions as edges, interaction rules as message passing. gossamer.graph.MessagePassingPolicy is the common interface for both hand-crafted policies (classical Boids becomes a zero-parameter GNN layer) and learned policies (attention over neighbors, edge-conditioned networks, etc.). This matters because:
- It handles variable agent counts by construction — no padding, no masking hacks.
- It is permutation-invariant — agent ordering never matters, and because the message functions are shared across nodes, the same policy that trains on 100 agents runs on 100,000 without retraining.
- It composes: consensus and flocking can chain into a single forward pass without ad-hoc glue code.
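As a concrete illustration of the zero-parameter case, here is a hand-rolled sketch (not the MessagePassingPolicy interface itself) of Boids cohesion and alignment written as a single message-passing step: neighbor messages, a permutation-invariant mean aggregation, and a parameter-free update.

```python
# Boids as a zero-parameter message-passing layer (illustrative sketch,
# not the gossamer.graph API). 2D positions/velocities, directed edges.
positions = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}
velocities = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 1.0)}
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]  # near-neighbor graph

def mean(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in (0, 1))

def boids_step(node):
    # message: each neighbor sends its position and velocity along an edge
    nbr_pos = [positions[src] for src, dst in edges if dst == node]
    nbr_vel = [velocities[src] for src, dst in edges if dst == node]
    # aggregate: the mean is invariant to neighbor ordering and count
    cohesion = tuple(m - p for m, p in zip(mean(nbr_pos), positions[node]))
    alignment = tuple(m - v for m, v in zip(mean(nbr_vel), velocities[node]))
    # update: steer toward the aggregated signals, with no learned parameters
    return tuple(c + a for c, a in zip(cohesion, alignment))

steering = {n: boids_step(n) for n in positions}
print(steering[0])  # (0.0, 1.5)
```

Swapping the mean for a learned attention-weighted sum, or the update for a small MLP, turns the same skeleton into a learned policy, which is the point of sharing one interface.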
The two libraries we default to are PyTorch Geometric for PyTorch users and Jraph for JAX users. Arboria’s interface mirrors theirs so switching is cheap.
World models
Classical RL learns a policy directly. World models learn a predictive model of the environment, then plan or train a policy against the model — more sample-efficient when the environment is expensive to simulate, which is the regime many swarm questions live in.
JEPA — Joint-Embedding Predictive Architecture (LeCun, 2022). Predicts representations of future observations in a latent space rather than pixels. Promising for swarm research because most per-agent observations are low-dimensional but the joint state is high-dimensional; latent prediction sidesteps that mismatch. See V-JEPA for video, I-JEPA for images, and the hierarchical variants.
Dreamer (Hafner et al., V1 through V3). Recurrent latent world model with an actor-critic policy trained on imagined rollouts. The most battle-tested baseline in model-based RL; the V3 paper shows strong sample-efficiency across 150+ tasks.
Graph world models. GraphCast-style architectures that predict future node states directly over the interaction graph. Likely the best architectural fit for swarms, and a natural anchor for the research direction outlined in CLAUDE.md §B.1. Open research — fewer shipping implementations than JEPA / Dreamer.
When world models help and when they don’t. Pure consensus convergence doesn’t need them; classical control theory is sharper. Delayed / partitioned communication (ICCD-style) does need them — anticipation is worth modeling explicitly rather than reacting.
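The control flow is easy to see in miniature. In this toy sketch the "world model" is a hand-fixed linear map rather than a learned latent network, but the loop is the same one Dreamer runs: plan by rolling the model forward and scoring imagined trajectories, never touching the expensive simulator.

```python
# Toy model-based planning loop (illustrative; real Dreamer/JEPA models
# learn latent dynamics from data — here the dynamics are hand-fixed so
# the control flow stays visible).
def world_model(latent, action):
    # predicted next latent state: decay toward zero plus the action's push
    return 0.9 * latent + action

def imagined_return(latent, actions):
    """Roll the model forward and score the imagined trajectory."""
    total = 0.0
    for a in actions:
        latent = world_model(latent, a)
        total += -abs(latent)  # reward: stay near latent = 0
    return total

# plan: pick the action sequence with the best imagined return
candidates = [(-0.5, -0.5), (0.0, 0.0), (0.5, 0.5)]
start = 1.0
best = max(candidates, key=lambda seq: imagined_return(start, seq))
print(best)  # (-0.5, -0.5)
```

The anticipation argument above lives in this loop: under delayed communication, an agent scoring imagined trajectories can act on where the swarm will be rather than where stale messages say it was.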
Emergent communication
When MARL agents are given a learnable communication channel, they learn protocols. Whether those protocols generalize and transfer is the research question.
- Channel constraints (bandwidth, latency, loss, energy cost) produce structurally different protocols. See gossamer.learning.comm_channel.CommChannel.
- Standard instruments: topographic similarity, context independence, compositionality, disentanglement.
- Key question in this thread: do the learned protocols rediscover CRDT-like abstractions under DTN conditions? See the research direction in CLAUDE.md §B.3.
Foundational references: Foerster et al. 2016 (DIAL/RIAL), Lazaridou & Baroni 2020 (review), Chaabouni et al. 2020 (compositionality).
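Of those instruments, topographic similarity is the easiest to show end-to-end: correlate pairwise distances in meaning space with pairwise distances between the messages agents emit. A toy sketch (research code typically uses Spearman rank correlation; plain Pearson keeps this self-contained):

```python
# Topographic similarity on a toy protocol (illustrative sketch).
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# toy protocol: meanings (attribute tuples) -> messages (symbol strings)
protocol = {(0, 0): "aa", (0, 1): "ab", (1, 0): "ba", (1, 1): "bb"}

pairs = list(combinations(protocol, 2))
meaning_d = [hamming(m1, m2) for m1, m2 in pairs]
message_d = [hamming(protocol[m1], protocol[m2]) for m1, m2 in pairs]

print(round(pearson(meaning_d, message_d), 3))  # 1.0: perfectly compositional
```

A protocol that assigned arbitrary strings to meanings would score near zero on the same instrument, which is what makes it a useful probe for whether channel constraints push protocols toward compositional structure.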
Putting it together
A realistic 2026 Arboria paper stack looks like:
- Physics: Leviathan (C++) for large-N, Brax/JAX for differentiable experiments.
- Policy substrate: gossamer.graph.MessagePassingPolicy — the same interface for classical and learned baselines.
- Learned policy: MAPPO with CTDE (centralized training, decentralized execution), optionally with a learnable comm channel.
- World model (where anticipatory behavior is needed): Dreamer or a graph JEPA.
- Evaluation: the Arboria Swarm Benchmark with seed-controlled reproducibility.
That’s the surface area modern swarm research needs to cover to be legible to today’s reviewers.