
Reading Production Logs Without Going Mad

Date · 8 May, 2026
Cat · Notes
Read · 3 min

Logs are the single largest source of "we know we have the information but we can't find it" in software. Every team produces them. Most teams don't use them well. This is what we've learnt about making logs actually useful when something is on fire.

The rule that fixes most teams' logs

Every log line should answer three questions: when, what, and about whom.

  • When: ISO 8601 timestamp with timezone, ideally with microseconds.
  • What: Event name (not a sentence). "user.signup.attempted" beats "User attempted to sign up with the email field empty."
  • About whom: Stable identifiers — user ID, request ID, account ID — not display names.

A log line that doesn't answer all three is something you wrote for a human reading it once. Useful for development, useless for incident response.

Structured logs, always

JSON. Or logfmt. Pick one, stick with it, do not mix. Plain-text logs are unparseable at scale. Every modern logging library supports structured output. There is no defensible reason to ship plain text in 2026.

{
  "ts": "2026-05-08T14:23:11.421Z",
  "level": "warn",
  "event": "rate_limit.exceeded",
  "request_id": "req_3kz9a",
  "user_id": "usr_8821",
  "endpoint": "POST /api/checkout",
  "current_rate": 142,
  "limit": 100
}

Now you can grep, group, filter, aggregate. The same line in prose is unsearchable.
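
Producing a line like that takes a few lines of setup. A minimal sketch with Pino, one of the Node options mentioned later; note that Pino's default key names differ slightly from the example (time rather than ts):

import pino from "pino";

// One JSON object per line to stdout; the container runtime collects it.
const log = pino({
  messageKey: "event",                                  // message slot holds the event name
  timestamp: pino.stdTimeFunctions.isoTime,             // ISO 8601 timestamps
  formatters: { level: (label) => ({ level: label }) }, // "warn", not 40
});

// Emits a line like the one above: event name, stable IDs, structured fields.
log.warn(
  { request_id: "req_3kz9a", user_id: "usr_8821",
    endpoint: "POST /api/checkout", current_rate: 142, limit: 100 },
  "rate_limit.exceeded"
);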

Log levels mean something

Most teams degrade their log levels into noise within a year. The discipline:

  • ERROR — something failed that requires human attention. Should page someone or end up in a triage queue. If you have a million errors, none of them are errors anymore.
  • WARN — something is suspicious but the system handled it. Used for fallbacks, retries, degraded mode.
  • INFO — meaningful business events. Order placed, user registered. One line per request is too many; one line per business event is right.
  • DEBUG — turned off in production by default, turned on when you need to trace a specific issue.

ERROR is the level that degrades first. The aspiration: an ERROR should be rare enough that a human looks at every one.
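
With the structured logger from the earlier sketch, the discipline looks like this (event names and fields are hypothetical):

log.error({ order_id: "ord_991" }, "payment.capture.failed");           // pages someone or lands in triage
log.warn({ cache: "sessions", fallback: "db" }, "cache.fallback.used"); // handled, but suspicious
log.info({ order_id: "ord_991", total_cents: 4200 }, "order.placed");   // one business event, one line
log.debug({ duration_ms: 12 }, "db.query.completed");                   // off in production by default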

Correlation IDs

One ID, generated at the entry point of a request, propagated through every downstream call, attached to every log line. Without this, debugging a multi-service request is archaeology. With it, you grep one ID and see the whole story.
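
A minimal sketch of manual propagation in an Express service, reusing the Pino logger above (pino-http and OpenTelemetry wrap the same pattern with less ceremony):

import { randomUUID } from "node:crypto";
import express from "express";

const app = express();

// Generate one ID at the entry point, echo it to callers, and hang a
// child logger off the request so every line downstream carries it.
app.use((req, res, next) => {
  const requestId = (req.headers["x-request-id"] as string) ?? randomUUID();
  res.setHeader("x-request-id", requestId);
  (req as any).log = log.child({ request_id: requestId });
  next();
});

app.post("/api/checkout", (req, res) => {
  (req as any).log.info({ user_id: "usr_8821" }, "checkout.started");
  res.sendStatus(202);
});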

OpenTelemetry handles this automatically if you let it. Use OpenTelemetry.

Sampling, not filtering

Production traffic is too high to log every request fully. The naive answer is to drop log lines — and you always drop the wrong ones. The better answer is to sample: log 100% of errors and slow requests, log 1% of everything else, but keep the structure consistent so the sample can be aggregated.
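
A sketch of that policy as a head-sampling decision at the logger (the 1s threshold and 1% rate are assumptions; tail sampling in the pipeline is the more robust version):

// Keep all problems and slow requests, ~1% of the rest.
function shouldKeep(level: string, durationMs: number): boolean {
  if (level === "error" || level === "warn") return true; // 100% of problems
  if (durationMs > 1000) return true;                     // 100% of slow requests
  return Math.random() < 0.01;                            // 1% of everything else
}

function logRequest(level: "info" | "warn" | "error",
                    fields: { duration_ms: number }, event: string) {
  if (!shouldKeep(level, fields.duration_ms)) return;
  // Tag the rate so aggregations can reweight sampled lines later.
  const sampleRate = level === "info" && fields.duration_ms <= 1000 ? 0.01 : 1;
  log[level]({ ...fields, sample_rate: sampleRate }, event);
}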

Most modern logging pipelines (Honeycomb, Datadog, Grafana) support this natively. Use it. Your bill will thank you and the data quality will improve.

Things that have never helped us

  • "Verbose mode" logs in production. Always overwhelming, never useful.
  • Logging full request bodies. PII risk + storage cost + signal-to-noise problem. Log structure, not contents.
  • Logging response bodies on success. Same problems, less reason.
  • Manual console.log sprinkled before deployment. You will forget at least one. Use a logger, route it through your normal pipeline.

The two tools we actually use

You don't need a big stack. We rely on two patterns in 2026:

  1. A structured logging library (Pino in Node, Zap in Go, structlog in Python) writing JSON to stdout. The container runtime collects it.
  2. An observability backend (we're mostly on Grafana Loki or Honeycomb, depending on the project's budget). The backend handles indexing, querying, alerting.

That's it. We don't self-host Elasticsearch. We don't route logs through a third system just to queue them. Simpler infrastructure has simpler failure modes when the alarm goes off at 3am.

The habit that pays off

When an incident is over, before the post-mortem, write down the single query that would have surfaced the issue fastest. If that query is hard or impossible, the logs aren't doing their job — fix the logs before you fix the bug. Teams that do this consistently halve their mean-time-to-resolution within a quarter. Teams that don't treat logs as a product end up rediscovering the same blind spots every time.
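
As a worked example: had the rate-limit line from earlier been the missing signal, the query on Grafana Loki is one line (LogQL; the app label is a placeholder for your own):

{app="checkout"} | json | event="rate_limit.exceeded" | current_rate > 100

If expressing that takes three parsers and a regex, the log line is wrong, not the query.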

Good logs are written for the person who will read them at 3am. That person is usually you, six months from now, with less context than you have today. Write accordingly.