Service Level Agreement
Agent Prompt Snippet
Ensure the project has an SLA definition with measurable uptime, latency, throughput, and error rate targets, including how they are monitored and what happens when they are breached.
Purpose
A Service Level Agreement (SLA) is the public or internal contract that specifies what level of reliability a service commits to delivering to its users. It defines concrete, measurable targets—uptime percentage, latency percentiles, throughput limits, error rate thresholds—and establishes what happens when those targets are missed (remedies, escalation procedures, or simply accountability).
Without an SLA, “the service should be reliable” is the only guidance engineers have when making architectural trade-offs. With an SLA, those trade-offs become objective: a p99 latency target of 200ms rules out certain database access patterns; a 99.9% monthly uptime target (roughly 43.8 minutes of allowed downtime) determines whether you need multi-region failover.
The SLA serves three audiences simultaneously. For users and customers, it sets expectations and provides recourse. For engineers, it translates business requirements into architectural constraints. For operations teams, it defines what constitutes an incident and when escalation is required.
There is an important distinction between SLA, SLO, and SLI. An SLI (Service Level Indicator) is what you measure (e.g., latency). An SLO (Service Level Objective) is the internal target (e.g., p99 < 200ms). An SLA is the external commitment (e.g., 99.9% of requests complete in < 500ms, measured monthly). This document covers all three in service of the external SLA.
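The SLI/SLO/SLA layering can be sketched in code. This is an illustrative model, not a real library API; the class names, thresholds, and the idea of the external SLA being looser than the internal SLO are assumptions drawn from the example targets above.

```python
from dataclasses import dataclass

@dataclass
class SLI:
    """What you measure: a named indicator and its current value."""
    name: str
    value_ms: float

@dataclass
class SLO:
    """A target: an indicator name plus the threshold it must stay under."""
    indicator: str
    target_ms: float

    def met_by(self, sli: SLI) -> bool:
        return sli.name == self.indicator and sli.value_ms < self.target_ms

# The external SLA commitment is typically looser than the internal SLO,
# giving the team margin before a customer-visible breach.
internal_slo = SLO(indicator="p99_latency", target_ms=200)
external_sla = SLO(indicator="p99_latency", target_ms=500)

measured = SLI(name="p99_latency", value_ms=320)
print(internal_slo.met_by(measured))  # False — internal SLO missed
print(external_sla.met_by(measured))  # True — external SLA still met
```

A measured p99 of 320ms misses the internal objective (triggering internal action) while remaining within the external commitment, which is exactly the margin the SLO/SLA split is designed to provide.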
Who needs this document
| Persona | Why they need it | How they use it |
|---|---|---|
| Sam (Indie Dev) | Sets expectations with paying customers; defines what “good enough” means for engineering decisions | Writes SLA targets before launch; uses them to scope reliability investment |
| Claude Code (AI Agent) | Needs quantified targets to evaluate whether proposed implementations are architecturally sufficient | Reads SLA before proposing caching strategy, database topology, or retry logic |
| Priya (Eng Lead) | Drives SLA targets into sprint planning and incident review; holds team accountable to commitments | Uses SLA as input to quarterly OKRs; chairs SLA review after every incident |
| DevOps (CI Operator) | Implements monitoring alerts based on SLA targets; on-call response is triggered by SLA breaches | Configures Prometheus/Datadog alert thresholds from SLA targets; writes runbook entries for SLA breach scenarios |
What separates a good version from a bad one
Criterion 1: Latency targets use percentiles, not averages
✓ Strong: “Latency targets for /v1/query endpoint: p50 < 50ms, p95 < 200ms, p99 < 500ms, p99.9 < 2000ms. Measured over 5-minute windows, reported monthly. Alert triggers when p99 exceeds 500ms for two consecutive 5-minute windows.”
✗ Weak: “Average response time should be under 100ms.” (Averages hide tail latency. A service where 1% of requests take 10 seconds can still have an “average” response time of 150ms. The 1% of users experiencing 10-second delays are real customers.)
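The weak example's failure mode is easy to demonstrate with synthetic data, assuming the scenario described above: 99% of requests at 50ms and 1% at 10 seconds.

```python
import statistics

# Synthetic latency sample mirroring the weak example: 990 fast requests,
# 10 pathological ones.
latencies_ms = [50.0] * 990 + [10_000.0] * 10

mean = statistics.mean(latencies_ms)                        # 149.5 ms
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]   # 10000 ms

print(f"mean={mean:.1f}ms  p99={p99:.0f}ms")
```

The average comfortably "passes" a 200ms target while the p99 reveals that one in a hundred users waits ten seconds.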
Criterion 2: Uptime is specified with measurement window and exclusions
✓ Strong: “Monthly uptime commitment: 99.9% (≤ 43.8 minutes downtime per month). Uptime is measured as the percentage of 1-minute intervals in which the health check at /healthz returns HTTP 200 within 5 seconds. Excludes: (1) scheduled maintenance windows (announced 48h in advance), (2) force majeure events per Section 12.”
✗ Weak: “99.9% uptime.” (No measurement method, no exclusions, no time window. This figure is meaningless without the methodology that defines it.)
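The strong example's methodology can be sketched as a computation over per-minute health checks. The data here is synthetic, and the 30-day month is a simplifying assumption.

```python
# One boolean per minute: did /healthz return 200 within 5 seconds?
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 for a 30-day month

checks = [True] * MINUTES_PER_MONTH
checks[100:140] = [False] * 40          # a 40-minute outage
maintenance = set(range(100, 110))      # first 10 minutes were an announced window

# Excluded intervals are removed from the denominator, not counted as up.
counted = [ok for i, ok in enumerate(checks) if i not in maintenance]
uptime = sum(counted) / len(counted)
print(f"uptime={uptime:.4%}")  # 30 unexcluded down minutes out of 43,190
```

Note the design choice the weak example leaves undefined: excluded intervals come out of the denominator entirely, so a maintenance window neither helps nor hurts the measured figure.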
Criterion 3: Error budget and consequences are stated
✓ Strong: “Error budget: 0.1% monthly downtime (43.8 minutes). When 50% of the error budget is consumed, the team shifts to reliability work and freezes feature deployments. When 100% is consumed, the SRE lead initiates a post-mortem and the product roadmap for the following month is reprioritized toward reliability. Customer SLA credits: 10% service credit for each additional 1% downtime beyond the commitment.”
✗ Weak: “If the SLA is breached, we will work to resolve the issue.” (No incentive structure, no process trigger, no customer remedy. This is not an SLA—it is a press release.)
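The burn thresholds and credit schedule from the strong example translate directly into code. This is a minimal sketch; the function names are hypothetical and the 30.42-day average month matches the 43.8-minute budget stated above.

```python
BUDGET_MINUTES = 43.8  # 0.1% of a 30.42-day average month

def budget_state(downtime_minutes: float) -> str:
    """Map error-budget consumption to the process trigger it fires."""
    burn = downtime_minutes / BUDGET_MINUTES
    if burn >= 1.0:
        return "post-mortem: reprioritize next month's roadmap toward reliability"
    if burn >= 0.5:
        return "freeze features: shift to reliability work"
    return "normal operations"

def service_credit_pct(downtime_minutes: float) -> int:
    """10% service credit per additional 1% downtime beyond the commitment."""
    minutes_per_month = 43_800  # 30.42-day average month
    downtime_pct = downtime_minutes / minutes_per_month * 100
    excess_pct = max(0.0, downtime_pct - 0.1)
    return int(excess_pct) * 10

print(budget_state(25.0))              # ~57% burned → feature freeze
print(service_credit_pct(481.8))       # 1% excess downtime → 10% credit
```

The point of encoding this is that each threshold is a process trigger, not just a number: crossing 50% changes what the team works on, and crossing the commitment changes what the customer is owed.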
Criterion 4: Each metric has a corresponding alert
✓ Strong: “Each SLO has a corresponding Prometheus alert rule in infra/alerts/slo_alerts.yaml. Alerts fire at 50% error budget burn (early warning) and 100% burn (page). Alert routing: 50% → Slack #sre-alerts, 100% → PagerDuty on-call rotation.”
✗ Weak: “We monitor our systems using industry-standard tools.” (No link between SLA targets and actual monitoring configuration. The SLA is aspirational, not enforced.)
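The two-tier routing from the strong example can be sketched as a simple function. The routing targets are the ones stated above; the function itself is illustrative and is not Prometheus or PagerDuty configuration.

```python
def route_alert(budget_burn: float) -> list[str]:
    """Return notification targets for a given error-budget burn fraction."""
    routes = []
    if budget_burn >= 0.5:
        routes.append("slack:#sre-alerts")   # early warning
    if budget_burn >= 1.0:
        routes.append("pagerduty:on-call")   # page the rotation
    return routes

print(route_alert(0.6))  # ['slack:#sre-alerts']
print(route_alert(1.2))  # ['slack:#sre-alerts', 'pagerduty:on-call']
```

The key property is that the page at 100% burn never fires without the Slack warning having fired first, so a page should never be the on-call engineer's first signal.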
Common mistakes
Setting SLAs before measuring baselines. Teams write “99.9% uptime” because it sounds professional, without knowing what their current uptime is. Set SLA targets from 90-day baseline measurements, not from what sounds reasonable. If your service is currently at 99.5%, committing to 99.9% is a reliability roadmap, not an SLA.
Including only uptime, ignoring latency and error rate. A service can be “up” (returning HTTP 200) while being completely unusable due to extreme latency or high error rates. A complete SLA specifies: uptime, latency (p50/p95/p99), error rate (5xx responses as % of total), and—for data services—data durability.
Committing to SLAs without the infrastructure to achieve them. An SLA is a budget for operational investment. A 99.99% SLA requires multi-region failover, automated recovery, and permanent on-call. If you haven’t invested in those capabilities, committing to that SLA is a debt you’ll pay during incidents.
No process for SLA review after incidents. Incidents are the primary driver of SLA evolution. Every significant incident should include an SLA review: did we breach? Was the target set correctly? Does the target need to change? Without this review loop, SLAs drift out of alignment with operational reality.
How to use this document
When to create it
Write the SLA definition before the first production deployment. For internal services, the SLA is an internal commitment that drives engineering decisions. For external-facing services, the SLA is published and often legally binding—it must be reviewed by legal before publication.
Who owns it
The engineering lead or SRE team owns the technical SLO targets. The product or business team owns the customer-facing SLA language. Both must align: the engineering team’s SLOs must provide sufficient margin to reliably meet the SLA.
How AI agents should reference it
get_standard_docs(type="backend_service", features=[])
→ sla_definition in documents[]
→ agent reads SLA before designing caching, replication, or retry strategies
→ agent verifies proposed architecture can meet the specified latency targets
→ agent flags if a proposed change reduces reliability below SLA thresholds
The prompt_snippet — “Ensure the project has an SLA definition with measurable uptime, latency, throughput, and error rate targets, including how they are monitored and what happens when they are breached” — tells the agent to verify all four targets (uptime, latency, throughput, error rate) plus monitoring and consequence sections.
How it connects to other documents
The SLA definition constrains the Architecture Overview (must have redundancy commensurate with uptime commitments), the API Contract (latency targets apply per-endpoint), and the Failover Runbook (defines when to invoke failover procedures). Cost Model decisions are directly constrained by SLA targets—higher availability requires more infrastructure.
Recommended Reading
- Site Reliability Engineering by Betsy Beyer et al. (Google) — The definitive reference on SLIs, SLOs, and SLAs; error budgets and reliability measurement.
- The SRE Workbook by Betsy Beyer et al. (Google) — Practical companion to SRE; includes worked examples of SLO definitions and alerting strategies.
- Seeking SRE edited by David Blank-Edelman — Diverse perspectives on reliability engineering in organizations of different sizes and structures.