Distinguishing a real anomaly from normal variation

Infrastructure signals are variable by nature, cost among them. The objective is to react to the change that matters while staying silent for the many that do not.

The PYXIS3 team

Cloud spend is never a flat line. Batch jobs run overnight. Traffic peaks during the day and falls off at night. Deployments land. Month-end reporting processes a quarter's worth of data in a few hours. A sale, a launch, or a traffic surge raises usage for a week before it settles. All of this is normal, and all of it makes spend inherently variable. The same is true beyond spend: idle capacity, headroom, and saturation on steady workloads drift just as much, and the detection approach described below applies equally to all of them.

The difficult problem in detecting anomalies is therefore not noticing that a number moved. Numbers move constantly, and detecting movement alone is both trivial and unhelpful. The difficult problem is reacting to the movement that matters while staying silent for the movement that does not. An approach that fails in either direction is worse than none at all.

Why a fixed threshold cannot succeed

A fixed budget threshold handles this distinction poorly, because it carries one number and the workload has many states. Set the threshold low and it fires on every busy day, every legitimate batch run, every expected peak. The people receiving those alerts learn within a week that the alert is usually wrong, and they begin dismissing it unread. The one time it is correct, it is dismissed along with all the times it was not.

Figure: A variable metric curve with two overlays. A single flat threshold line is crossed by several routine peaks, each marked as a false alarm, while a learned baseline with a shaded tolerance band follows the curve and is exceeded only once, by the genuine runaway. The same data judged two ways: the flat line fires on the wrong days, the band fires on the right one.

The natural response is to raise the threshold to stop the false alarms. The threshold then sits so high that a real runaway has to grow large before crossing it, so it fires late, or not at all, on precisely the event it was installed to catch. No single setting resolves both failure modes, because they are the same setting pulled in opposite directions: low and noisy, or high and unresponsive. Both outcomes produce an alert no one trusts, which is the least useful kind.
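The trade-off can be made concrete with a small sketch. The daily figures, the position of the runaway, and the three candidate thresholds are all invented for illustration; the point is only that lowering the false-alarm count and shortening the detection delay pull the same number in opposite directions.

```python
# Hypothetical daily spend. Routine peaks reach 150 to 180, then a runaway
# starts on day 10 and ramps upward. Every value here is made up.
daily_spend = [100, 150, 95, 105, 148, 100, 180, 98, 102, 149,
               110, 130, 160, 200, 260]
RUNAWAY_START = 10  # index of the first runaway day (known only in this toy)


def evaluate_threshold(series, threshold, runaway_start):
    """Return (false_alarms, delay) for one fixed threshold.

    false_alarms counts normal days above the threshold.
    delay is how many days the runaway runs before it is first caught,
    or None if the threshold never fires on it.
    """
    false_alarms = sum(1 for v in series[:runaway_start] if v > threshold)
    delay = None
    for offset, value in enumerate(series[runaway_start:]):
        if value > threshold:
            delay = offset
            break
    return false_alarms, delay


for threshold in (120, 160, 200):
    false_alarms, delay = evaluate_threshold(daily_spend, threshold, RUNAWAY_START)
    print(f"threshold={threshold}: false alarms={false_alarms}, "
          f"runaway caught after {delay} day(s)")
```

On this invented series the lowest threshold fires four times on normal days and catches the runaway one day in, while the highest never misfires but lets the runaway run for four days first.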

Compare against the metric's own history

The effective approach is to stop comparing against a number someone guessed and compare instead against the metric's own past behavior. Learn a normal level for each metric that matters from its own history, then monitor for movement beyond a tolerance band around that normal. Variation inside the band does not fire. Movement past it does. Because the comparison is against that specific metric's normal rather than a global ceiling, a busy day that is busy in the usual way triggers nothing, while a quiet day with an unusual increase does, even when the absolute amount is small.

This matters because the same dollar increase can be entirely normal for one metric and a clear anomaly for another. Judging each against its own baseline is what allows a system to be both sensitive and quiet, which a single shared threshold can never achieve.
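A minimal sketch of the idea follows. It assumes the baseline is the median of recent history and the band is a multiple of the median absolute deviation (MAD); those are one reasonable choice picked for illustration, not a description of any particular product's method. The function names, the multiplier `k`, and both histories are hypothetical.

```python
from statistics import median


def learn_band(history, k=3.0):
    """Learn a baseline and a tolerance half-width from a metric's own history.

    Baseline: median of the history.
    Half-width: k times the median absolute deviation (MAD).
    Note: a perfectly flat history gives a zero-width band, so a real
    system would want a minimum width. That is omitted here.
    """
    baseline = median(history)
    mad = median([abs(x - baseline) for x in history])
    return baseline, k * mad


def is_anomalous(value, history, k=3.0):
    """True when value falls outside the band (either direction)."""
    baseline, half_width = learn_band(history, k)
    return abs(value - baseline) > half_width


# Two hypothetical metrics, daily cost in dollars.
compute_history = [1000, 1150, 950, 1200, 1050, 900, 1100]   # naturally variable
transfer_history = [40, 42, 38, 41, 39, 43, 40]              # naturally steady

# The same +100 dollars over each metric's baseline.
print(is_anomalous(1150, compute_history))   # inside compute's wide band
print(is_anomalous(140, transfer_history))   # far outside transfer's narrow band
```

With these numbers the compute baseline is 1050 with a band of plus or minus 300, so 1150 is unremarkable, while the data-transfer baseline is 40 with a band of plus or minus 3, so 140 is flagged even though the absolute increase is identical.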

What "learns your normal" actually means

The phrase "learns your normal" is overused across this industry, so precision is warranted. An accurate version means a learned baseline plus a tolerance band. It does not forecast future periods and it does not understand your business. It provides two concrete things: a reference level drawn from your own data instead of a number a person guessed, and a control for how sensitive the detection should be.

That control is a genuine trade-off, not a free improvement. Raise the sensitivity and the watch catches smaller drifts earlier and fires more often, including more false alarms. Lower it and the watch stays quieter, reacting only to larger movements, at the cost of catching the smaller ones later. The correct setting differs by metric and turns on one question: how much would this metric cost if it ran away undetected? Set tight tolerances on the metrics that can cause material damage, and looser ones on the metrics that are naturally variable and inexpensive.
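As a rough illustration of that trade-off, the sketch below expresses sensitivity as a percentage tolerance around an already-learned baseline, where a tighter tolerance means higher sensitivity. The baseline, the observations, and the three tolerance settings are assumptions chosen to make the effect visible.

```python
BASELINE = 100.0  # assumed: the learned normal level for a hypothetical metric
observations = [98, 104, 93, 108, 101, 117, 96, 112, 135, 99]


def alerts(values, baseline, tolerance_pct):
    """Return the values outside baseline +/- tolerance_pct percent."""
    limit = baseline * tolerance_pct / 100.0
    return [v for v in values if abs(v - baseline) > limit]


# Tight tolerance = high sensitivity; loose tolerance = low sensitivity.
for tolerance_pct in (5, 15, 30):
    fired = alerts(observations, BASELINE, tolerance_pct)
    print(f"tolerance={tolerance_pct}%: {len(fired)} alert(s) -> {fired}")
```

At 5 percent the watch fires on five of the ten observations, at 15 percent on two, and at 30 percent only on the largest. Which of those is right depends on what an undetected runaway on this metric would cost.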

Monitor the components, not only the total

A single budget number over the whole account is coarse for another reason: it monitors only the sum, and the sum conceals movement within it. A forty percent increase in data-transfer cost can be a serious problem while total monthly spend still appears normal, because the rest of the bill is large enough to mask it. The signal is in the individual line, not the aggregate. A system that monitors each component independently detects the data-transfer increase immediately. A system monitoring only the total detects nothing until the leak grows large enough to move the whole account, which is exactly the delay you were trying to avoid.
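The masking effect is simple arithmetic, sketched here with an invented bill. The component names, the dollar amounts, and the 10 percent tolerance are assumptions; a flat percentage stands in for whatever per-component band would actually be learned.

```python
# Hypothetical monthly bill by component, in dollars.
last_month = {"compute": 60_000, "storage": 25_000,
              "database": 10_000, "data_transfer": 5_000}
this_month = {"compute": 60_600, "storage": 24_800,
              "database": 10_100, "data_transfer": 7_000}
TOLERANCE_PCT = 10.0  # assumed tolerance, applied to total and components alike


def pct_change(before, after):
    """Percentage change from before to after."""
    return (after - before) * 100.0 / before


def flagged_components(before, after, tolerance_pct):
    """Return {component: percent change} for components outside tolerance."""
    changes = {name: pct_change(before[name], after[name]) for name in before}
    return {name: change for name, change in changes.items()
            if abs(change) > tolerance_pct}


total_change = pct_change(sum(last_month.values()), sum(this_month.values()))
print(f"total: {total_change:+.1f}% (flagged: {abs(total_change) > TOLERANCE_PCT})")
print("components flagged:", flagged_components(last_month, this_month, TOLERANCE_PCT))
```

Here data transfer rises 40 percent, yet because it is only a twentieth of the bill the total moves 2.5 percent and stays well inside tolerance. Only the per-component check surfaces it.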

A new resource has no history to compare against

A baseline drawn from a metric's own past has one real limitation: a resource created yesterday has no past. A learned watch cannot judge it for as long as it takes to establish a normal, and the early days of a new account or a newly launched service are precisely when a misconfiguration is most likely to run undetected. The answer is not to wait. It is to fall back to a coarser comparison while the specific history accumulates: judge the new resource against the typical behavior of its peers, or against a reasonable absolute ceiling, until it has enough of its own record to be judged against itself. Sensitivity should tighten as the history deepens, rather than start tight on data that does not yet exist.
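One way to sketch that fallback is shown below. The 14-day minimum, the ceiling value, the linear tightening schedule, and all function names are hypothetical choices for illustration; the text above does not prescribe specific numbers.

```python
from statistics import median

MIN_HISTORY_DAYS = 14      # assumed: days needed before trusting own history
ABSOLUTE_CEILING = 500.0   # assumed: coarse daily ceiling of last resort


def reference_level(own_history, peer_levels):
    """Choose what a resource is judged against, and report the source.

    Order of preference: its own history, then the typical level of its
    peers, then a fixed ceiling. A caller would treat the ceiling as a
    hard limit, not as a baseline with a band around it.
    """
    if len(own_history) >= MIN_HISTORY_DAYS:
        return median(own_history), "own history"
    if peer_levels:
        return median(peer_levels), "peer median"
    return ABSOLUTE_CEILING, "absolute ceiling"


def tolerance_pct(history_days, loose=50.0, tight=15.0, full_history_days=60):
    """Tighten tolerance linearly from loose to tight as history deepens."""
    progress = min(history_days / full_history_days, 1.0)
    return loose - (loose - tight) * progress


# A resource created yesterday, with three comparable peers.
print(reference_level([120.0], peer_levels=[80.0, 100.0, 90.0]))
print(tolerance_pct(history_days=1), tolerance_pct(history_days=60))
```

The new resource is judged against the peer median with a loose tolerance on day one, and moves to its own baseline with a tighter tolerance once enough history exists.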

Spend can spike, and it can also creep

Anomaly detection tends to focus on the spike: the number that jumps within an hour. The more expensive pattern is often the opposite. A slow creep of five percent more each week never triggers a watch tuned to sudden movement, because no single day looks unusual against the day before it. Yet a quarter of compounding five-percent weeks, thirteen in all, leaves the bill roughly 89 percent higher than where it started. A tolerance band that flags only sharp deviations will miss this entirely. Catching it requires comparing against a longer reference than the previous day: this week against the same week last month, this month against the corresponding month of the previous quarter. The sharp jump and the slow creep are different failures, and a watch that catches only one of them allows the other to grow.
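The arithmetic of a creep is easy to check. The sketch below generates a synthetic series growing five percent per week, spread smoothly across days, and compares the latest value against three lookback windows. The starting level of 100 and the chosen windows are assumptions for illustration.

```python
WEEKLY_GROWTH = 0.05
DAYS = 91  # thirteen weeks, roughly one quarter

# Spread the weekly growth evenly across seven days.
daily_factor = (1 + WEEKLY_GROWTH) ** (1 / 7)
spend = [100.0 * daily_factor ** day for day in range(DAYS + 1)]


def pct_change_vs(series, day, lookback_days):
    """Percentage change of series[day] against the value lookback_days earlier."""
    previous = series[day - lookback_days]
    return (series[day] - previous) / previous * 100.0


today = DAYS
for lookback in (1, 28, 91):
    change = pct_change_vs(spend, today, lookback)
    print(f"vs {lookback:>2} day(s) ago: {change:+.2f}%")
```

Against the previous day the change is about 0.7 percent, which no spike detector would notice. Against four weeks earlier it is about 21.6 percent, and against the start of the quarter about 88.6 percent.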

Detection has no value without an action

Detection is worth something only if an action follows when a watch fires. A precisely tuned signal that lands in an unread inbox is a slower, more expensive dashboard. The purpose of detecting drift within the hour is to act within that hour, and acting requires reaching the resource, not only sending an email about it. Connected to the account, the same system that detected the drift can reach the resource behind it and hold the spend, stop the cause when the action is reversible and within your limits, open the task, or route it to the one person who can decide. The watch identifies the change you did not anticipate. The access is what allows it to act. The action is what makes detection worthwhile. All three are required, or none of them delivers value.

See it on your own estate

We connect to your accounts, map every resource inside them, and show you what PYXIS3 would operate and the savings it would realize in the first month, before you pay anything.
