Observability Patterns for Distributed Systems

When a request crosses a dozen services, “check the logs” stops being useful advice. Observability is what lets us ask new questions about a running system without shipping new code to answer them. These are the patterns we rely on.

The three signals, and what each is for

We treat metrics, logs, and traces as complementary rather than interchangeable:

Metrics answer how much and how often — and they’re cheap enough to keep for everything. They’re how we notice a problem.
Traces answer where — they show the path of a single request across services and where the time went. They’re how we localize a problem.
Logs answer why — the detailed, contextual record of what a specific component did. They’re how we explain a problem.

You reach for them in that order: a metric alerts, a trace narrows it to a service, logs explain the failure.

Structured logs over string soup

Free-text logs are easy to write and miserable to query. We emit structured logs — key/value fields, not interpolated sentences — so they can be filtered and aggregated.

{
  "level": "error",
  "msg": "downstream_request_failed",
  "service": "sync-worker",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "destination": "crm",
  "status_code": 503,
  "attempt": 3,
  "latency_ms": 812
}

The single most valuable field there is trace_id. Propagating it through every service and stamping it on every log line is what turns three disconnected signals into one coherent story.
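As a minimal sketch of what emitting a line like the one above could look like, here is a formatter built on Python's standard logging module. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra=` are names made up for this illustration, not part of any particular library or of our production code.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Anything passed via `extra=` becomes an attribute on the record.
        # `fields` is a hypothetical key chosen for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream, name: str = "sync-worker") -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.error(
        "downstream_request_failed",
        extra={"fields": {
            "service": "sync-worker",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "destination": "crm",
            "status_code": 503,
            "attempt": 3,
            "latency_ms": 812,
        }},
    )
```

Keeping `msg` a stable event name and pushing the variable parts into fields is what makes the output easy to filter and aggregate.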

Correlate everything with a trace ID

A request gets a trace ID at the edge, and every service passes it along in headers and includes it in logs and spans. With that one piece of plumbing, you can pivot from a slow trace straight to the exact log lines for that request — no guessing about timestamps.
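The plumbing described above can be sketched roughly as follows. This assumes a hypothetical `X-Trace-Id` header and hypothetical helper names (`start_request`, `outgoing_headers`, `log_fields`); a real deployment would more likely lean on a tracing library and a standard header format rather than hand-rolled code like this.

```python
import contextvars
import uuid
from typing import Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch

# Holds the trace ID for whatever request the current context is handling.
_trace_id: contextvars.ContextVar = contextvars.ContextVar("trace_id", default=None)


def start_request(incoming_headers: dict) -> str:
    """Reuse the caller's trace ID, or mint one if this service is the edge."""
    # Note: a plain dict lookup is case-sensitive; real HTTP headers are not.
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers(extra: Optional[dict] = None) -> dict:
    """Build headers for a downstream call, carrying the current trace ID."""
    headers = dict(extra or {})
    trace_id = _trace_id.get()
    if trace_id is not None:
        headers[TRACE_HEADER] = trace_id
    return headers


def log_fields(**fields) -> dict:
    """Build fields for a structured log line, stamped with the trace ID."""
    return {"trace_id": _trace_id.get(), **fields}
```

A handler would call `start_request(request_headers)` once on entry, then use `outgoing_headers()` for every downstream call and `log_fields(...)` for every log line, so the same ID shows up everywhere the request went.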

Design alerts around symptoms, not causes

Alerting on causes (“CPU is high”) produces noise; CPU being high is often fine. We alert on symptoms that map to user pain — error rate, latency, and freshness — and we tie them to service-level objectives. An alert should mean “a customer is likely being affected right now,” and it should be actionable. If an alert fires and the answer is “yeah, that’s normal,” it’s not an alert, it’s noise — and we delete or tune it.
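One common way to tie an error-rate alert to an SLO is to compare the observed error rate with the error budget the SLO allows, often called the burn rate. The sketch below is illustrative only: the function names, the 99.9% target, and the paging threshold of 10 are assumptions for the example, not values we are recommending.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing is burning
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    return (errors / total) / error_budget


def should_page(errors: int, total: int,
                slo_target: float = 0.999,   # assumed SLO for this sketch
                threshold: float = 10.0) -> bool:  # assumed paging threshold
    """Page only when the budget is burning much faster than the SLO allows."""
    return burn_rate(errors, total, slo_target) >= threshold
```

With these assumed numbers, 50 failures out of 1,000 requests is far over budget and pages, while 1 failure out of 10,000 stays quiet. Note that nothing in the check mentions CPU or memory: it fires on what users experience.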

Keep cardinality under control

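High-cardinality labels (like user ID on a metric) can quietly explode storage and cost. In most metrics backends each distinct combination of label values is stored as its own time series, so a single unbounded label multiplies the series count. We keep unbounded identifiers in traces and logs, where they belong, and keep metric labels low-cardinality and bounded. This is the difference between an observability bill that scales linearly and one that surprises you.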
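To make the arithmetic concrete, here is a small sketch. It assumes a backend that stores one time series per distinct combination of label values, which is a common model but not universal. The helper names `status_class` and `max_series` are invented for this example.

```python
from math import prod


def status_class(status_code: int) -> str:
    """Collapse a raw HTTP status code into a bounded label value like '5xx'."""
    return f"{status_code // 100}xx"


def max_series(label_values: dict) -> int:
    """Worst-case series count for one metric: the product of the number
    of distinct values each label can take."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "service": ["api", "sync-worker"],
    "status_class": ["2xx", "4xx", "5xx"],
}
# Adding a user_id label multiplies the total by the number of users seen.
with_user_id = {**bounded, "user_id": [f"u{i}" for i in range(10_000)]}

print(max_series(bounded))
print(max_series(with_user_id))
```

The bounded label set has a small, fixed ceiling; the version with `user_id` grows with every new user, which is why that identifier belongs on the trace or log line instead.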

Takeaways

Use metrics to detect, traces to localize, and logs to explain.
Emit structured logs and propagate a trace ID through every service.
Alert on user-facing symptoms tied to SLOs, not on raw resource causes.
Keep metric cardinality bounded; put high-cardinality detail in traces/logs.
Instrument for the unknown questions, because those are the ones that page you.
