Where infrastructure drifts out of shape

A single large overspend is rare. The costly pattern is hundreds of small resources that drift out of use, lose their owner, and never stop running. Bringing them back into shape is operations work, not bill review.

The PYXIS3 team

Infrastructure rarely fails loudly. It drifts. A resource is spun up for a reason, the reason passes, and the resource keeps running. Multiply that by a few thousand resources across a dozen accounts and you have the dominant shape of inefficiency in the cloud: not one dramatic misconfiguration, but a long tail of small things that outlived their purpose and never stopped running. Year after year, industry surveys put the share of cloud spend that is simply wasted at roughly a quarter to a third, and almost none of it is one large line. It is the tail. It persists precisely because no single piece is ever urgent enough to prioritize.

Knowing that waste exists is not the difficult part. The bill reports the total. It does not tell you which of ten thousand resources to act on, and it cannot act on any of them. Clearing distributed waste requires operating the account directly: inspecting every instance, disk, snapshot, address, load balancer, and database, determining which are safe to change, and then making the change. That is the work, and it is why an operator connected to the resources is more effective than a report that only describes them.

Scale is what makes distributed waste so durable. A large overspend is noticed and removed. A few dollars of idle spend, repeated across hundreds of resources and dozens of accounts, never rises to anyone's attention and never gets touched. Over a year it is often the largest entry in the waste column. The sections below describe where it accumulates, roughly in the order teams tend to overlook it.

Idle and underused compute

The most common case is the instance running at single-digit CPU utilization, retained in case it is needed again. It does meaningful work for a few minutes a day and bills for all twenty-four hours. Across every temporary instance that became permanent, the total is significant.

Less obvious, and more common than most teams assume, is the stopped instance. Stopping a machine ends the compute charge, but the disk attached to it continues to bill at its full provisioned size, month after month, whether the instance restarts or not. A fleet of stopped but undeleted machines is a standing storage charge that goes untracked because nothing appears to be running. The same applies to managed services left provisioned with no traffic: a database, a cache, a cluster. If it is provisioned, it is billed, independent of usage. The remedy for resources that are genuinely no longer needed is to Retire them; the remedy for those that are merely oversized is to Rightsize.
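
To make the standing charge of stopped machines concrete, here is a minimal sketch. The record shape, the instance names, and the per-gigabyte rate are all assumptions for illustration, not any provider's actual price or API.

```python
from dataclasses import dataclass

# Hypothetical flat rate in dollars per provisioned GB-month. Real rates
# vary by provider, region, and disk type.
ASSUMED_RATE_PER_GB_MONTH = 0.10


@dataclass
class Instance:
    name: str
    state: str    # "running" or "stopped"
    disk_gb: int  # provisioned size of the attached disk


def stopped_disk_charge(instances, rate=ASSUMED_RATE_PER_GB_MONTH):
    """Monthly storage charge carried by stopped instances.

    Billing follows provisioned size, so a stopped machine still
    contributes its full disk_gb.
    """
    return sum(i.disk_gb * rate for i in instances if i.state == "stopped")


# Illustrative fleet; the names and sizes are made up.
fleet = [
    Instance("build-agent-old", "stopped", 500),
    Instance("demo-env", "stopped", 200),
    Instance("api-prod", "running", 100),
]
monthly = stopped_disk_charge(fleet)
print(f"standing charge: ${monthly:.2f}/month, ${monthly * 12:.2f}/year")
```

The point of the sketch is that the running instance contributes nothing to this figure: the charge comes entirely from machines that appear to be off.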

Orphaned storage and networking

This category is routinely overlooked because there is no server to inspect. When an instance is removed, its disk is often left behind, detached and unused, and continues to bill at the full per-gigabyte rate for as long as it exists. Snapshots and backups accumulate the same way: each is inexpensive, retiring them is no one's defined responsibility, and over time they total a material amount. An allocated public IP address bound to nothing bills by the hour. So do load balancers and NAT gateways: the hourly fee accrues regardless of how little traffic passes through, and a per-gigabyte fee is added on everything they process. None of these appear as a large resource in any console, so cleanup efforts pass them by. Each item here is a resource that nothing depends on, which is the clearest case for Retire: there is no capacity to lose, only a charge to stop.
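
The checks described above can be sketched as a filter over an inventory. This is an illustration only: the `Resource` record, its field names, the sample costs, and the 90-day snapshot threshold are assumptions, and a real inventory would come from your provider's APIs rather than a hand-written list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    kind: str                   # "disk", "snapshot", "public_ip", "load_balancer", "nat_gateway"
    name: str
    attached_to: Optional[str]  # what a disk or address is bound to, or None
    monthly_cost: float         # dollars, hypothetical figures
    gb_processed_30d: float = 0.0
    age_days: int = 0


def is_retire_candidate(r: Resource, snapshot_max_age_days: int = 90) -> bool:
    """Return True when nothing appears to depend on the resource."""
    if r.kind in ("disk", "public_ip"):
        return r.attached_to is None
    if r.kind in ("load_balancer", "nat_gateway"):
        return r.gb_processed_30d == 0
    if r.kind == "snapshot":
        # Age is only a proxy for "unneeded"; the threshold is an assumption.
        return r.age_days > snapshot_max_age_days
    return False


inventory = [
    Resource("disk", "vol-old-build", None, 8.00),
    Resource("disk", "vol-api", "api-prod", 10.00),
    Resource("public_ip", "ip-unused", None, 3.60),
    Resource("nat_gateway", "nat-staging", None, 32.00, gb_processed_30d=0.0),
    Resource("snapshot", "snap-ancient", None, 1.50, age_days=900),
]

candidates = [r for r in inventory if is_retire_candidate(r)]
total = sum(r.monthly_cost for r in candidates)
for r in candidates:
    print(f"{r.kind:<12} {r.name:<14} ${r.monthly_cost:.2f}/month")
print(f"total recoverable: ${total:.2f}/month")
```

No single row in that output would justify a ticket on its own, which is the pattern this section describes.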

Oversized for a peak that rarely occurs

An instance or database sized for its busiest day runs at that size on every ordinary day as well. Sizing has an outsized effect because the price ladder is geometric rather than linear: each step up is roughly double the step below it. The corollary is direct. An instance one tier larger than required wastes about half of its cost every hour, with no error and no alert, because over-provisioning never causes a failure. It only adds cost.

The cause is structural incentive. Sizing up carries no operational penalty, while sizing down too far risks an outage, so the bias runs consistently toward larger. Left unaddressed, that bias compounds across a fleet into a permanent surcharge. The remedy is to Rightsize the chronically underused resource down a tier, which on a geometric ladder is a large saving rather than a marginal one.
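
The ladder arithmetic can be written out in a few lines. This sketch assumes an idealized ladder in which every tier costs exactly twice the tier below it; real price lists only approximate that, and the function name is ours.

```python
def wasted_fraction(tiers_oversized: int, ratio: float = 2.0) -> float:
    """Fraction of the current hourly price that buys unused capacity.

    Assumes each tier costs `ratio` times the tier below it, so an
    instance k tiers too large pays ratio**k times what it needs to.
    """
    if tiers_oversized < 0:
        raise ValueError("tiers_oversized must be non-negative")
    return 1.0 - ratio ** (-tiers_oversized)


for k in range(4):
    print(f"{k} tier(s) too large: {wasted_fraction(k):.1%} of spend wasted")
```

With a ratio of two, one tier too large wastes 50.0% of the spend, two tiers 75.0%, and three tiers 87.5%. Each step down recovers half of whatever is currently being paid.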

Non-production running around the clock

Development, test, and staging environments are used during working hours and ignored the rest of the time, yet they are almost always left running continuously. The arithmetic is stark. A working week runs somewhere around forty to fifty hours. A calendar week is one hundred and sixty-eight. An environment left on all week is therefore idle for roughly three hours out of every four, and every one of those idle hours bills at the same rate as a working one.

Figure: a 7-by-24 grid of 168 cells, one per hour of a week. Roughly 45 cells across weekday daytime are lit to mark working hours; the other 123, including every night and the full weekend, are dark. Every cell, lit or dark, carries the same hourly charge.

Turning non-production off on nights and weekends removes roughly 70 percent of its compute charge and has no effect on how the team works, because the environment was idle during the hours it was off. Attached disks continue to bill while the instances are stopped, so the saving applies to compute hours rather than to storage. The saving carries no operational trade-off. It remains unrealized because capturing it requires configuring a schedule once, and that task rarely reaches the top of anyone's list.
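
The weekly arithmetic can be checked directly. The sketch below computes the share of the week an environment is off for several working-week lengths, and for one hypothetical schedule (weekdays from 08:00 to 19:00) that leaves a margin around working hours. The schedule and the function names are assumptions for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(on_hours_per_week: float) -> float:
    """Share of a calendar week in which the environment is not needed."""
    if not 0 <= on_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours must be between 0 and 168")
    return 1 - on_hours_per_week / HOURS_PER_WEEK


def is_on(weekday: int, hour: int, start: int = 8, end: int = 19) -> bool:
    """Hypothetical schedule: on Monday to Friday (weekday 0-4), start <= hour < end."""
    return weekday < 5 and start <= hour < end


# Count the lit cells of the weekly grid under the hypothetical schedule.
scheduled_on = sum(is_on(d, h) for d in range(7) for h in range(24))

for hours in (40, 45, 50, scheduled_on):
    print(f"{hours} h on per week -> {idle_fraction(hours):.0%} of the week off")
```

Even the generous schedule, which keeps the environment up for 55 hours a week, leaves it off for about two thirds of the calendar week.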

Storage in the wrong tier, and data transfer

Storage is priced in tiers, and the gap between them is large. Hot, low-latency storage costs many times what cold, archival storage does. Most data goes cold within weeks of being written and then remains in the expensive tier indefinitely, because no lifecycle rule was configured to move it down. Logs and backups are the most common offenders: retained by default, never expired, and growing without bound.
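
A rough sketch of what a lifecycle rule changes. The two prices are placeholders chosen only to show the shape of the calculation, not any provider's rates, and the model ignores retrieval, transition, and minimum-duration fees that real cold tiers often carry.

```python
# Hypothetical prices in dollars per GB-month; substitute published rates.
HOT_PRICE = 0.020
COLD_PRICE = 0.004


def yearly_cost(gb: float, days_hot: int,
                hot: float = HOT_PRICE, cold: float = COLD_PRICE) -> float:
    """Cost of storing `gb` for 365 days: hot for `days_hot` days, cold after."""
    days_hot = max(0, min(days_hot, 365))
    hot_months = days_hot / 365 * 12
    cold_months = 12 - hot_months
    return gb * (hot_months * hot + cold_months * cold)


logs_gb = 10_000  # illustrative volume
never_tiered = yearly_cost(logs_gb, days_hot=365)
tiered_at_30 = yearly_cost(logs_gb, days_hot=30)
print(f"no lifecycle rule:   ${never_tiered:,.2f}/year")
print(f"cold after 30 days:  ${tiered_at_30:,.2f}/year")
print(f"difference:          ${never_tiered - tiered_at_30:,.2f}/year")
```

The absolute figures depend entirely on the assumed prices. What carries over is that the saving scales with the price gap between tiers and with how early the data goes cold.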

Data transfer carries its own cost that is frequently underestimated. Sending data to the internet or across regions is charged per gigabyte, while keeping it within one region is often free or close to it. A workload that communicates heavily across regions, or moves large volumes outbound, can spend more on data transfer than on the compute performing the work. The charge does not appear on a per-instance view and surfaces only when the transfer line on the bill is examined.

Commitment gaps

Nearly every account carries a steady baseline of usage that runs continuously: the always-on services, the production floor that never drops to zero. Paying full on-demand rates for that baseline, when a one- or three-year commitment would discount it substantially, is waste by omission rather than by error. Nothing is misconfigured. You are paying list price for usage you could have purchased at a discount. The remedy is to Commit the floor and discount it, and the dollars forgone are equivalent to any other form of waste.
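
As a sketch of what committing the floor means in numbers: take the minimum usage over a period as the floor, price that at a discount, and leave everything above it on demand. The 30 percent discount, the unit rate, and the usage series are assumptions, and using the strict minimum as the floor is a simplification of how a real commitment would be sized.

```python
def commitment_saving(hourly_usage, discount=0.30, on_demand_rate=1.0):
    """Compare all-on-demand cost with committing the usage floor.

    hourly_usage: units of capacity used in each hour of the period.
    discount: hypothetical fractional discount on committed capacity.
    Returns (floor, saving) where saving is in the same currency as the rate.
    """
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    floor = min(hourly_usage)
    hours = len(hourly_usage)
    all_on_demand = sum(hourly_usage) * on_demand_rate
    committed = floor * hours * on_demand_rate * (1 - discount)
    above_floor = sum(u - floor for u in hourly_usage) * on_demand_rate
    return floor, all_on_demand - (committed + above_floor)


usage = [10, 12, 15, 11, 10, 14]  # illustrative units per hour
floor, saving = commitment_saving(usage)
print(f"floor: {floor} units, saving over {len(usage)} hours: {saving:.2f}")
```

The saving reduces to floor times hours times rate times discount. Usage above the floor is untouched, which is why committing only the floor carries little risk of paying for capacity that goes unused.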

Why it persists

The common factor is scale. Each item is individually too small to prioritize. The orphaned disk is a few dollars. The oversized instance is a partial saving that requires verification. The idle gateway is a rounding error. None of them clears the threshold that would justify interrupting active work, so none of them is addressed. They persist because each one is minor in isolation. They also accumulate in the interval between cost reports, the period when no one is reviewing spend, which is precisely when small charges compound into large ones.

That interval is the reason a continuous operator is necessary. A person cannot economically spend an afternoon on a five-dollar disk. An operator connected to the account can: it inventories every disk, every stopped-instance volume, every idle gateway, and every oversized instance across all your accounts, attaches a dollar figure to each, Retires and Schedules the safe ones automatically within the limits you set, and presents the Rightsize and Commit decisions with the figure already attached. It does not merely report the waste. It removes it and records exactly what it changed. The work was never technically difficult. It was simply never economical at human scale, which is exactly the work suited to continuous automation.
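
The division of labor described above can be sketched as a triage step. This is not how PYXIS3 is implemented; the `Finding` record, the `safe` flag, and the `max_auto_saving` guardrail are hypothetical stand-ins for "the limits you set".

```python
from dataclasses import dataclass

AUTO_ACTIONS = {"Retire", "Schedule"}  # applied automatically when safe


@dataclass
class Finding:
    resource: str
    action: str            # "Retire", "Schedule", "Rightsize", or "Commit"
    monthly_saving: float  # dollars, hypothetical figures
    safe: bool             # True when nothing depends on the resource


def triage(findings, max_auto_saving=50.0):
    """Split findings into changes to apply now and decisions to present.

    max_auto_saving is an assumed guardrail: anything larger goes to
    review even when it looks safe. Both lists are sorted by saving,
    largest first, and the applied list doubles as the change record.
    """
    applied, review = [], []
    for f in sorted(findings, key=lambda f: f.monthly_saving, reverse=True):
        auto = f.action in AUTO_ACTIONS and f.safe
        if auto and f.monthly_saving <= max_auto_saving:
            applied.append(f)
        else:
            review.append(f)
    return applied, review


findings = [
    Finding("vol-old-build", "Retire", 8.00, safe=True),
    Finding("env-staging", "Schedule", 40.00, safe=True),
    Finding("db-reports", "Rightsize", 120.00, safe=True),
    Finding("baseline-compute", "Commit", 300.00, safe=True),
    Finding("lb-legacy", "Retire", 18.00, safe=False),
]
applied, review = triage(findings)
for f in applied:
    print(f"applied {f.action:<9} {f.resource:<17} ${f.monthly_saving:.2f}/month")
for f in review:
    print(f"review  {f.action:<9} {f.resource:<17} ${f.monthly_saving:.2f}/month")
```

In this sketch the Rightsize and Commit findings always land in the review list with their dollar figure attached, and a Retire that is not provably safe is held back rather than applied.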

And drift is rarely only a cost problem. The same orphaned resource that bills for nothing is often the one left reachable from the internet, running with no owner or tag, or quietly approaching its limits. An operator that sees the estate rather than the bill catches all of it in one pass: it retires what is idle, hardens what is exposed, brings what has drifted out of policy back to standard, and reinforces what is running hot. They are the same problem wearing different clothes, a resource that no longer matches what it should be, and cost is only the face of it that shows up on the invoice.

See it on your own estate

We connect to your accounts, map every resource inside them, and show you what PYXIS3 would operate and the savings it would realize in the first month, before you pay anything.
