SLO Math for Humans

Feb 27, 2026, in DevOps

The actual calculations, the conversations with leadership, and why your first SLO will probably be wrong

Most SRE guides explain Service Level Objectives like this:

“An SLO is a target value or range for a service level that’s measured by an SLI.”

That definition is correct.

It’s also useless.

What people actually need to know:

What number do I pick?
How do I defend that number to my VP?
How do I know if we’re meeting it?
What do I do when we’re not?

Let’s talk about the math you actually use — not the theory, the practice.

The Starting Point: What Are You Promising?

An SLO is a promise.

That’s it.

You’re telling users:

“This thing will work X% of the time.”

The math comes from deciding what X should be and what “work” actually means.

Most teams start here:

SLO = 99.9% availability

That sounds good. Three nines. Professional.

But what does it actually mean?

It means you’re allowed about 43 minutes of downtime in a 30-day month.

If you’re down for 44 minutes, you’ve blown your SLO.

If you’re down for 42 minutes, you’re fine.

The math:

1 month = 30 days × 24 hours × 60 minutes = 43,200 minutes

99.9% uptime = 43,200 × 0.999 = 43,156.8 minutes

Allowed downtime = 43,200 − 43,156.8 = 43.2 minutes
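The same arithmetic works as a small helper, which is handy once you start comparing targets. This is a minimal sketch; the function name and the 30-day default are assumptions for illustration, not part of any standard library.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window.

    slo is a fraction such as 0.999, not a percentage such as 99.9.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (1.0 - slo)


budget = allowed_downtime_minutes(0.999)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

Passing days=7 or days=90 gives the budget for a weekly or quarterly window instead.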

That’s the promise.

Forty-three minutes.

Not 44.

Now the real question is: can you keep it?

The Real Question: What Reliability Do You Actually Need?

This is where most teams get stuck.

They pick a number that sounds good instead of a number that makes business sense.

So flip the question.

What happens if you’re down for an hour?

Do customers leave?
Do you lose revenue?
Do you violate a contract?
Does someone call the CEO?

If the answer is “yes” to any of those, you probably need a tight SLO.

If the answer is “probably not,” you don’t.

Here’s a rough guide:

99% uptime = 7.2 hours of downtime per month

Good for: Internal tools, batch jobs, non-critical features

99.5% uptime = 3.6 hours per month

Good for: Business-critical systems with workarounds

99.9% uptime = 43 minutes per month

Good for: Customer-facing services, revenue systems

99.95% uptime = 21.6 minutes per month

Good for: Payments, authentication, critical infrastructure

99.99% uptime = 4.3 minutes per month

Good for: Regulatory, contractual, existential risk systems

Notice something?

The tighter your SLO, the less room you have for mistakes.

At 99%, you can survive a bad deploy.

At 99.99%, one bad deploy and your month is over.
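To make that concrete, here is a rough sketch of how much of the monthly budget a single incident eats at each tier. The 15-minute bad deploy is a hypothetical figure chosen for illustration; substitute your own detection and rollback time.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, the same 30-day month used above


def budget_share(incident_minutes: float, slo: float) -> float:
    """Fraction of the monthly error budget that one incident consumes."""
    budget = MINUTES_PER_MONTH * (1.0 - slo)
    return incident_minutes / budget


# Hypothetical: a bad deploy that takes 15 minutes to notice and roll back.
for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    share = budget_share(15, slo)
    print(f"{slo:.2%} SLO: {share:.0%} of the monthly budget")
```

Under that assumption the incident costs a few percent of the budget at 99% and several times the whole budget at 99.99%.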

Higher SLOs cost more.

In engineering time.

In architecture complexity.

In operational stress.

So the real question isn’t:

“What’s the best SLO?”

It’s:

“What’s the cheapest SLO that keeps the business safe?”

That’s a systems question, not a math question.

Error Budgets: The Part Everyone Gets Wrong

Once you set an SLO, you create an error budget.

The error budget is simply the complement of your SLO (1 − SLO): how much failure you’re allowed before you’ve broken your promise.

Example:

SLO = 99.9%
Error budget = 0.1%
In 30 days = 43.2 minutes

That’s your budget.

Spend it intentionally.

Here’s what most teams miss:

Your error budget isn’t just for outages.

It’s for everything that makes your service unavailable.

Deployments that cause downtime? Budget.
Database migrations that lock tables? Budget.
Experiments that go sideways? Budget.
Restarting the wrong service? Budget.

It all counts.
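One way to keep yourself honest is a running ledger where every source of unavailability is charged against the same budget. The sketch below assumes a 99.9% SLO over 30 days; the events and their durations are made up for illustration.

```python
BUDGET_MINUTES = 43.2  # 99.9% over a 30-day month

# Hypothetical month: every entry is downtime, planned or not.
events = [
    ("deploy caused downtime", 5.0),
    ("migration locked tables", 12.0),
    ("experiment went sideways", 8.0),
    ("restarted the wrong service", 3.0),
]


def budget_status(events, budget_minutes=BUDGET_MINUTES):
    """Return (minutes spent, minutes remaining, fraction of budget spent)."""
    spent = sum(minutes for _, minutes in events)
    return spent, budget_minutes - spent, spent / budget_minutes


spent, remaining, fraction = budget_status(events)
print(f"spent {spent:.1f} min, {remaining:.1f} min left ({fraction:.0%} used)")
# spent 28.0 min, 15.2 min left (65% used)
```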

So when someone asks:

“Can we deploy this risky feature Friday afternoon?”

The real answer is:

“How much error budget do we have left?”

If you’ve already spent 95% of your budget with a week left in the month, the answer is no.

If you’ve spent 20%, ship it.

The budget creates discipline without emotional debate.
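That decision can be written down as a policy so nobody has to argue it on a Friday. The gate below is one possible rule, not a standard: it allows risky changes only while budget spending is at or below the pace of the calendar, and the 90% hard stop is an assumed cutoff you would tune yourself.

```python
def can_ship_risky_change(spent_fraction: float,
                          elapsed_fraction: float,
                          hard_stop: float = 0.9) -> bool:
    """Hypothetical gate: allow risk only while spending is at or below pace.

    spent_fraction:   share of the error budget already used (0.0 to 1.0+)
    elapsed_fraction: share of the SLO window that has passed (0.0 to 1.0)
    hard_stop:        assumed cutoff at or above which nothing risky ships
    """
    if spent_fraction >= hard_stop:
        return False
    return spent_fraction <= elapsed_fraction


# 95% spent with a week left in a 30-day month (23 of 30 days elapsed)
print(can_ship_risky_change(0.95, 23 / 30))  # False
# 20% spent at the same point in the month
print(can_ship_risky_change(0.20, 23 / 30))  # True
```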

Burn Rate: How Fast Are You Spending?

Burn rate tells you how quickly you’re consuming your error budget.

This is what separates mature SRE from basic uptime tracking.

The math:

Burn rate = (actual error rate) / (allowed error rate)

Example:

SLO = 99.9% (0.1% allowed error rate)
Current error rate = 0.5%
Burn rate = 0.5 / 0.1 = 5x

A burn rate of 5x means you’re spending your monthly budget five times faster than planned.

At that rate, you’ll exhaust your entire budget in about six days.
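The example above, as a sketch in code. The function names are illustrative; the only inputs are the observed error rate and the SLO, both as fractions.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than planned the budget is being spent."""
    allowed_error_rate = 1.0 - slo
    return error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget never runs out
    return window_days / rate


rate = burn_rate(error_rate=0.005, slo=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
# burn rate 5.0x, budget gone in 6.0 days
```

A burn rate of exactly 1x means the budget runs out on the last day of the window, which is the pace the SLO was designed for.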

This is why serious SRE teams alert on burn rate — not raw error counts.

Burn rate > 2x for 1 hour? Page someone.
Burn rate > 10x for 10 minutes? Wake people up.
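Those two rules can be expressed as data plus a small check. This is a sketch under the assumption that your metrics system already gives you an average burn rate per window; the class, the function, and the dictionary shape are all hypothetical, and real alerting would normally live in your monitoring tool rather than in application code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnAlert:
    threshold: float      # burn-rate multiple that triggers the alert
    window_minutes: int   # window over which the burn rate is averaged
    action: str


# Thresholds taken from the two rules above, most urgent first.
ALERTS = (
    BurnAlert(threshold=10.0, window_minutes=10, action="wake people up"),
    BurnAlert(threshold=2.0, window_minutes=60, action="page someone"),
)


def triggered_action(burn_by_window: Mapping[int, float]) -> Optional[str]:
    """Return the action of the first rule that fires, or None.

    burn_by_window maps a window length in minutes to the average burn
    rate observed over that window.
    """
    for alert in ALERTS:
        observed = burn_by_window.get(alert.window_minutes)
        if observed is not None and observed > alert.threshold:
            return alert.action
    return None


print(triggered_action({10: 12.0, 60: 3.0}))  # wake people up
print(triggered_action({10: 4.0, 60: 3.0}))   # page someone
print(triggered_action({10: 1.0, 60: 0.8}))   # None
```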

The formula for time until budget exhaustion:

Time remaining = (remaining budget) / (budget consumed per unit of time)

If you have 20 minutes left and you’re burning at 1 minute per hour, you have 20 hours.

If you’re burning at 5 minutes per hour, you have 4 hours.

That’s your incident clock.
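A minimal sketch of that incident clock, assuming you track the remaining budget in minutes and can estimate how many budget minutes each hour of the incident is costing. The function name is illustrative.

```python
def hours_remaining(remaining_budget_minutes: float,
                    burn_minutes_per_hour: float) -> float:
    """Hours until the error budget is gone at the current consumption rate."""
    if burn_minutes_per_hour <= 0:
        return float("inf")  # nothing is being consumed right now
    return remaining_budget_minutes / burn_minutes_per_hour


print(hours_remaining(20, 1))  # 20.0
print(hours_remaining(20, 5))  # 4.0
```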

The Conversation with Leadership

At some point, you’ll need to explain this to someone who doesn’t live in this world.

Here’s the practical version.

“We’re setting a 99.9% availability target for checkout. That means about 43 minutes of downtime per month.”

They’ll ask:

“Why not 100%?”

Because 100% is an asymptote. You can approach it, but you can’t reach it at any finite cost.

“At 99.9%, we can deploy safely and move the product forward.

At 99.99%, every deploy becomes high risk and we’d likely need more redundancy, more testing gates, and more people. That’s real cost.”

Then they’ll ask:

“What happens if we miss it?”

“As long as we’re inside the error budget, we’re operating normally.

If we exceed it, we stop feature work and focus on stability until we’re back under control.”

That’s the contract.

Not perfection.

Predictability.

Why Your First SLO Will Probably Be Wrong

You won’t know the right SLO until you try to live inside it.

You’ll set 99.9%.

Then you’ll realize:

Deployments cause 5 minutes of downtime.
You deploy 4 times a week.
That’s 20 minutes per week.
That’s 80 minutes in a four-week month, and closer to 86 over a full 30 days.
Your budget is 43 minutes.

You’re already out of budget.
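Here is a sketch of that reality check, plus the number that tells you how far the deployment process has to improve. It counts 30/7 weeks in the window, so the deploy total comes out slightly higher than a four-week estimate. The function names are illustrative, and the second one assumes deploys are the only thing spending budget, which is optimistic.

```python
WINDOW_DAYS = 30
SLO = 0.999


def deploy_downtime_minutes(minutes_per_deploy, deploys_per_week,
                            days=WINDOW_DAYS):
    """Total deploy-caused downtime over the window."""
    return minutes_per_deploy * deploys_per_week * (days / 7)


def max_minutes_per_deploy(deploys_per_week, slo=SLO, days=WINDOW_DAYS):
    """Per-deploy downtime at which deploys alone use the whole budget."""
    budget = days * 24 * 60 * (1 - slo)
    deploys = deploys_per_week * (days / 7)
    return budget / deploys


spent = deploy_downtime_minutes(5, 4)
print(f"deploys alone cost {spent:.1f} min")                       # 85.7 min
print(f"per-deploy ceiling: {max_minutes_per_deploy(4):.1f} min")  # 2.5 min
```

At four deploys a week, each deploy has to cost well under 2.5 minutes before there is any budget left for real incidents.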

Now you have two choices:

Improve your deployment process.
Adjust your SLO.

Most teams do both.

Start looser. Measure reality. Tighten over time.

SLOs are not aspirational slogans.

They’re negotiated agreements with physics, architecture, and process maturity.

The Formula Sheet

Total time in period:

30 × 24 × 60 = 43,200 minutes

Allowed downtime:

Total time × (1 − SLO)

43,200 × (1 − 0.999) = 43.2 minutes

Current availability:

Successful requests / Total requests

Error budget remaining:

Allowed downtime − Actual downtime

Burn rate:

Actual error rate / Allowed error rate

Time until exhausted:

Remaining budget / Budget consumed per unit of time

That’s it.

No magic.

Final Thought

SLO math isn’t complicated.

It’s just unfamiliar.

After a while, you stop reaching for a calculator.

You know what 99.9% feels like.

You instinctively know when you’re in danger during an incident.

But the math isn’t the hard part.

The hard part is the conversation:

What are we promising?
What does that promise cost?
What happens when we break it?

That’s not algebra.

That’s alignment.

The math just forces everyone to be honest.