When a request crosses a dozen services, “check the logs” stops being useful advice. Observability is what lets us ask new questions about a running system without shipping new code to answer them. These are the patterns we rely on.
The three signals, and what each is for
We treat metrics, logs, and traces as complementary rather than interchangeable:
- Metrics answer how much and how often — and they’re cheap enough to keep for everything. They’re how we notice a problem.
- Traces answer where — they show the path of a single request across services and where the time went. They’re how we localize a problem.
- Logs answer why — the detailed, contextual record of what a specific component did. They’re how we explain a problem.
You reach for them in that order: a metric alert fires, a trace narrows the problem to a service, and that service's logs explain the failure.
Structured logs over string soup
Free-text logs are easy to write and miserable to query. We emit structured logs — key/value fields, not interpolated sentences — so they can be filtered and aggregated.
{
  "level": "error",
  "msg": "downstream_request_failed",
  "service": "sync-worker",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "destination": "crm",
  "status_code": 503,
  "attempt": 3,
  "latency_ms": 812
}
The single most valuable field there is trace_id. Propagating it through every service and stamping it on every log line is what turns three disconnected signals into one coherent story.
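Here is a minimal sketch of emitting a record like that with Python's standard logging module. The JsonFormatter class, the build_logger helper, and the convention of passing fields under extra={"fields": ...} are hypothetical names chosen for illustration, so treat this as one possible shape rather than a prescribed implementation.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": self.service,
        }
        # Assumption: callers pass structured fields as extra={"fields": {...}},
        # which logging attaches to the record as the attribute `fields`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=sys.stdout, service: str = "sync-worker") -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger()
log.error(
    "downstream_request_failed",
    extra={"fields": {
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "destination": "crm",
        "status_code": 503,
        "attempt": 3,
        "latency_ms": 812,
    }},
)
```

Because msg is a stable event name and the variable parts live in fields, a query such as "all downstream_request_failed events with status_code 503" becomes a filter rather than a regular expression.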
Correlate everything with a trace ID
A request gets a trace ID at the edge, and every service passes it along in headers and includes it in logs and spans. With that one piece of plumbing, you can pivot from a slow trace straight to the exact log lines for that request — no guessing about timestamps.
Design alerts around symptoms, not causes
Alerting on causes (“CPU is high”) produces noise; CPU being high is often fine. We alert on symptoms that map to user pain — error rate, latency, and freshness — and we tie them to service-level objectives. An alert should mean “a customer is likely being affected right now,” and it should be actionable. If an alert fires and the answer is “yeah, that’s normal,” it’s not an alert, it’s noise — and we delete or tune it.
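One common way to tie an error-rate symptom to an SLO is a burn rate: the observed error rate divided by the error budget the SLO allows. The sketch below assumes that approach. The Slo class, the function names, and the paging threshold of 10 are illustrative assumptions, not values taken from our alerting rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A success-rate objective, e.g. target=0.999 for 99.9% of requests."""
    target: float

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def error_budget(self) -> float:
        """Fraction of requests that may fail while still meeting the SLO."""
        return 1.0 - self.target


def burn_rate(errors: int, total: int, slo: Slo) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so no user-facing symptom
    return (errors / total) / slo.error_budget


def should_page(errors: int, total: int, slo: Slo, threshold: float = 10.0) -> bool:
    """Page only when the budget is burning at `threshold` times the
    sustainable rate. The default threshold is an assumption."""
    return burn_rate(errors, total, slo) >= threshold


slo = Slo(target=0.999)
print(should_page(errors=50, total=1000, slo=slo))  # True: about 50x budget
print(should_page(errors=2, total=1000, slo=slo))   # False: about 2x budget
```

Nothing here looks at CPU or memory. The only inputs are counts of failed and total requests, which is what keeps the alert attached to user pain.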
Keep cardinality under control
High-cardinality labels (like user ID on a metric) can quietly explode storage and cost, because every distinct combination of label values becomes its own time series. We keep unbounded identifiers in traces and logs, where they belong, and keep metric labels low-cardinality and bounded. This is the difference between an observability bill that scales linearly and one that surprises you.
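The arithmetic behind that surprise is multiplication: in the worst case, the number of time series for one metric is the product of the number of distinct values of each label. The label names and counts below are made-up figures for illustration only.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case time series for one metric: the product of the
    distinct-value counts of its labels."""
    return prod(label_cardinalities.values())


# Assumption: illustrative numbers, not measurements from a real system.
bounded = {"service": 12, "status_class": 5, "region": 4}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # 240
print(series_count(with_user_id))  # 24000000
```

Adding one unbounded label multiplies the total instead of adding to it, which is why a single user_id label can dominate the cost of an otherwise small metric.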
Takeaways
- Use metrics to detect, traces to localize, and logs to explain.
- Emit structured logs and propagate a trace ID through every service.
- Alert on user-facing symptoms tied to SLOs, not on raw resource causes.
- Keep metric cardinality bounded; put high-cardinality detail in traces/logs.
- Instrument for the unknown questions, because those are the ones that page you.