Agent Updates

Admin only. This panel is hidden from non-admin operators. If the tab isn’t visible in your Settings nav, your account doesn’t have the Admin role.

The Agent Updates panel is the rollout dial for the next Mimir agent binary. Your CI pipeline (or whoever runs your deployment) publishes a signed manifest pointing at a new agent build; this panel decides what percentage of your fleet picks it up. The dial ranges from 0% (every agent stays on its current version, the update is paused) to 100% (every agent that asks for an update gets the new one).

Two reasons most admins live in this panel during a release:

  1. Staged rollout. Push to 10% Monday, watch dashboards for 24 hours, push to 50% Tuesday, push to 100% Wednesday. If anything goes sideways, drop back to 0% and the rollout pauses immediately — agents already on the new version stay there, but no new ones cross over.
  2. Emergency stop. Something in the new build is producing bad telemetry, crashing on a specific OS family, or triggering a hot-path alert. Slam it to 0% and hold while you triage.

The actual mechanics of getting a new manifest onto the server (signing, promoting, revoking pins) are in your operator runbook; those flows aren’t exposed in this panel. Once the manifest is on the server, this panel is the only knob you need.

What you get

A single card with three read-outs, a slider, preset buttons, and a Save button:

  • Status: Paused (when rollout is 0%) or Active — N% (when rollout is above 0%). The text is muted-grey for paused, green for active.
  • Latest version: the agent version Mimir would deliver if a host’s cohort lands inside the rollout band, shown in monospace. This is the manifest that’s currently available, not necessarily what every host is running.
  • Updated hosts: X / Y, where Y is the number of hosts in the fleet that aren’t offline and X is the number of those hosts already reporting the latest version. The denominator updates as hosts enroll, go offline, or are decommissioned; the numerator climbs as agents poll, match their cohort, and download the binary.
  • Rollout percentage: a 0–100 range slider with a numeric read-out to its right. The slider steps in 5% increments, the granularity that handles every staged rollout we’ve actually seen in the wild.
  • Quick: five preset buttons below the slider for 0%, 10%, 25%, 50%, and 100%. The button matching the current slider value is outlined in the accent color so you can see the snap point at a glance.
  • Save: disabled until the slider differs from the currently-saved rollout. An “Unsaved change: X% → Y%” hint appears next to the button while you’re editing.
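The display rules in this list are simple enough to write down. The following is a minimal Python sketch, not Mimir code: the helper names (status_label, snap, save_hint) are hypothetical, and the rounding rule in snap is an assumption about how a 5% step slider would behave.

```python
from typing import Optional

PRESETS = (0, 10, 25, 50, 100)  # the five Quick buttons
STEP = 5                        # slider granularity, in percent


def status_label(rollout_pct: int) -> str:
    """Status text as the card describes it: Paused at 0%, Active otherwise."""
    return "Paused" if rollout_pct == 0 else f"Active — {rollout_pct}%"


def snap(raw_pct: float) -> int:
    """Clamp to 0-100 and snap to the nearest 5% step (assumed rounding rule)."""
    clamped = max(0.0, min(100.0, raw_pct))
    return int(round(clamped / STEP)) * STEP


def save_hint(saved_pct: int, pending_pct: int) -> Optional[str]:
    """Return the unsaved-change hint, or None when Save should stay disabled."""
    if pending_pct == saved_pct:
        return None
    return f"Unsaved change: {saved_pct}% → {pending_pct}%"
```

Returning None from save_hint doubles as the “Save is disabled” signal, which matches the rule that the button only enables when the pending value differs from the saved one.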

Below the controls, a small footer reminds you:

0% = paused. Changes take effect at next agent check-in (up to 4 hours).

That last line is load-bearing, but treat its 4 hours as a floor, not a ceiling. Mimir doesn’t push updates to agents; agents poll. The poll interval is 4 hours plus up to 20% random jitter, so consecutive polls from one agent are between 4 and ~4.8 hours apart, and an agent can take anywhere from a few seconds to ~4.8 hours after you click Save to see the change, depending on where it is in its cycle. Once an agent does notice an available update, it applies a second, client-side stagger window of up to 24 hours (default) before actually downloading and applying it. That spread is deliberate: it prevents a single bad rollout from infecting the whole fleet in one minute, even at 100%. End-to-end, plan for “most hosts will have the new version within 24 hours of clicking 100%”, not “every host within four hours.” The slider doesn’t quietly fail; the change is just lagged by where each host happens to be in its poll-plus-stagger cycle.
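To make the arithmetic concrete, here is a small worked calculation using only the defaults quoted above (4-hour poll, up to 20% jitter, up to 24-hour stagger). The function names are illustrative. It assumes the stagger window starts when the agent first sees the update and it ignores download time, neither of which is specified here.

```python
POLL_HOURS = 4.0        # base poll interval
JITTER_FRACTION = 0.20  # up to 20% added to each interval
STAGGER_HOURS = 24.0    # default client-side stagger window


def max_poll_interval() -> float:
    """Longest gap between two polls from the same agent."""
    return POLL_HOURS * (1 + JITTER_FRACTION)


def worst_case_apply_delay() -> float:
    """Upper bound on Save -> update applied, for an agent that polled
    just before Save and then draws the full stagger window."""
    return max_poll_interval() + STAGGER_HOURS


print(f"{max_poll_interval():.1f}")       # 4.8
print(f"{worst_case_apply_delay():.1f}")  # 28.8
```

The 28.8-hour figure is the unlucky tail under these assumptions. Most agents poll sooner and draw a shorter stagger, which is why “most hosts within 24 hours” remains a reasonable planning number.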

How to use it

Standard staged rollout

  1. Open Settings → Agent Updates.
  2. Confirm the Latest version matches what your operator says is ready to ship. If it doesn’t, the new manifest hasn’t been promoted on the server yet — talk to whoever runs your deployment.
  3. Click the 10% preset (or drag the slider) and click Save. The status pill flips to Active — 10%.
  4. Wait the soak period your team has agreed on. Watch the Updated hosts counter creep up; cross-check against your dashboard alerts and any per-host signals you care about.
  5. Click the next preset (25%, 50%, 100%) and click Save. Each save is an independent transaction — you don’t have to “approve” anything between steps.

Emergency pause

  1. Click the 0% preset.
  2. Click Save.
  3. New polls return no update available for every host.

Hosts that already downloaded and applied the new version stay on it — Mimir does not roll backward via this dial. To roll back, your operator promotes the previous version’s manifest on the server side; rolling forward to that “older” manifest is indistinguishable from a normal forward rollout from this panel’s perspective.

What “rollout percentage” actually means

The percentage controls a deterministic cohort, not a random sample. Every host’s cohort number is computed once from its host-id and the manifest version. The same host always lands at the same number for a given version; the rollout dial defines a band, and any host whose cohort falls inside the band gets the update.

Two implications:

  1. A 10% rollout selects the same 10% reliably. Agent A either gets the new build at 10% or doesn’t; it won’t flip between polls. If you set 10%, observe, then drop to 0%, then push back to 10%, Mimir picks the same 10% both times.
  2. Increasing the dial monotonically grows the cohort. Going from 10% to 25% means the 10% that already had it still have it, plus another 15% join. You don’t shuffle the sample by changing the dial.

This makes staged rollouts much less surprising than random sampling. If you saw an issue on a specific host at 10%, you’ll still see it on that same host at 50%; you don’t have to wonder whether the problem moved.
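One common way to build a deterministic cohort like this is to hash the host-id together with the manifest version and reduce the digest to a number from 0 to 99. The sketch below assumes SHA-256 and a “cohort number below the dial” band. Mimir’s actual hash and band definition aren’t documented on this page, so treat both as illustrative; the version string 2.4.0 and the host names are made up.

```python
import hashlib


def cohort(host_id: str, version: str) -> int:
    """Map (host-id, version) to a stable number in 0..99.
    The hash choice is an assumption, not Mimir's documented scheme."""
    digest = hashlib.sha256(f"{host_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(host_id: str, version: str, rollout_pct: int) -> bool:
    """A host is inside the band when its cohort number is below the dial."""
    return cohort(host_id, version) < rollout_pct


hosts = [f"host-{i}" for i in range(1000)]   # hypothetical fleet
at_10 = {h for h in hosts if in_rollout(h, "2.4.0", 10)}
at_25 = {h for h in hosts if in_rollout(h, "2.4.0", 25)}

# Raising the dial only ever adds hosts; nobody is reshuffled out.
print(at_10 <= at_25)  # True
```

Because the cohort number depends only on the host-id and the version, dropping to 0% and returning to 10% selects the same hosts, and every host in the 10% set is also in the 25% set.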

Reading the “Updated hosts” counter

X / Y is a progress bar over the currently-reachable fleet:

  • Y counts every host whose status isn’t offline — anything online or stale shows up in the denominator. Hosts that have fully gone offline (no heartbeat past MIMIR_OFFLINE_AFTER, default 24 hours) drop out of Y entirely. A long-quiet decommissioned VM stops contributing once its status flips to offline.
  • X counts non-offline hosts whose most recent heartbeat reports the literal Latest version string. Hosts that are between polls, hosts that crashed mid-update, and hosts that haven’t reported their new version yet are not counted in X.
  • At 100% rollout, X reaches Y only once every non-offline host has polled, downloaded the new binary, and reported the new version on its next heartbeat. The lag between “download successful” and “X increments” is the time until the next post-update heartbeat.

If you want to see which hosts haven’t picked up the latest version, the Hosts page lets you filter by agent version. At 100% rollout, hosts still reporting an older version are typically candidates for follow-up — agents that crashed, services that aren’t restarting cleanly, or stale-status hosts caught between heartbeats. Hosts that have fully gone offline don’t appear here at all (they drop out of the denominator); find them via the Hosts page’s status filter instead.
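The counting rule can be sketched as a small function. The Host record, its field names, and the sample fleet are hypothetical stand-ins; only the rule itself (offline hosts excluded from both numbers, exact version-string match for X) comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Host:
    name: str
    status: str         # "online", "stale", or "offline"
    agent_version: str  # version reported on the most recent heartbeat


def updated_hosts(hosts: Iterable[Host], latest_version: str) -> Tuple[int, int]:
    """Return (X, Y): Y counts non-offline hosts, X counts those whose
    reported version matches latest_version exactly."""
    reachable = [h for h in hosts if h.status != "offline"]
    updated = [h for h in reachable if h.agent_version == latest_version]
    return len(updated), len(reachable)


fleet = [
    Host("web-1", "online", "2.4.0"),
    Host("web-2", "stale", "2.3.9"),
    Host("db-1", "online", "2.3.9"),
    Host("old-vm", "offline", "2.4.0"),  # offline: excluded from both X and Y
]
print(updated_hosts(fleet, "2.4.0"))  # (1, 3)
```

Note that old-vm reports the latest version but still doesn’t count, because an offline host drops out of the denominator and the numerator alike.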

Troubleshooting

The slider moves but Save stays disabled. You moved back to the saved value. The button only enables when the pending percentage differs from the current saved one.

Save succeeded but no hosts are picking up the update. Two common cases:

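  1. You’re inside the poll-plus-stagger window. Wait. Most agents check in well before four hours (the worst case is the footer’s 4 hours plus jitter, about 4.8 hours), and each agent then waits out its client-side stagger window, up to 24 hours by default, before it downloads and applies the update.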
  2. The Latest version field is empty. That means no manifest is currently published; the dial has nothing to roll out. Coordinate with your operator.

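Updated hosts never catches up to total hosts. Some lag is expected; see Reading the “Updated hosts” counter above. Drill into the Hosts page, filter by agent version, and check whether the laggards are stale, crashed mid-update, or failing to restart cleanly. Fully offline hosts have already dropped out of the denominator, so they aren’t the ones holding the counter back.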

Status says “Paused” but Latest version is populated. This is the normal “manifest published, rollout not yet started” state. Slide above 0% to start.

Status went from “Active” back to “Paused” on its own. It shouldn’t. The dial only changes when you (or another admin) click Save here, when the server-side rollout setting is changed via the API, or when someone hits the manifest revoke endpoint (which zeros the rollout as part of the kill-switch flow). Check the audit log — most likely a peer admin paused the rollout in response to an incident, or your security team revoked the active manifest.

See also

  • Hosts → filter by agent version — find the laggards once rollout completes.
  • Your operator runbook for the manifest-publishing pipeline (signing, the high-water mark file, pin revocation, the per-launcher stagger window). All of that is server-side operator territory, not exposed from this panel.