superlog-debug

Pull production observability context from the Superlog MCP to ground debugging, incident response, and 'how is this behaving in prod right now?' questions. Triggers whenever the user is investigating a bug, regression, or incident; asking about real traffic, error rates, latency, or throughput; validating a deploy; or wants the recent history of a service.

Skill file

Preview skill file
---
name: superlog-debug
description: "Pull production observability context from the Superlog MCP to ground debugging, incident response, and 'how is this behaving in prod right now?' questions. Triggers whenever the user is investigating a bug, regression, or incident; asking about real traffic, error rates, latency, or throughput; validating a deploy; or wants the recent history of a service."
---

# Superlog production debugging

When the user is investigating something happening in production, **query their actual telemetry before reasoning about the code**. Superlog ingests OpenTelemetry traces, logs, and metrics for the user's services — the answer to "why did this 500?" / "is the deploy healthy?" / "what changed?" is almost always in there.

## When to activate

Activate as soon as the user signals they are looking into real production behavior. Examples:

- "users can't sign in" / "checkout is broken" / "this is 500ing in prod"
- "is the deploy healthy?" / "did my PR regress anything?"
- "what's our p95 on /api/foo right now?" / "how many errors in the last hour?"
- "what's been happening with the worker overnight?"
- An incident channel message, a Sentry/PagerDuty link, an on-call ping.

Skip when:

- The work is local only (worktree dev, tests, type errors, lint).
- The user already pasted the relevant logs / traces / span.
- The question is purely "how is this code structured?" with no prod angle.

## Prerequisite: the Superlog MCP must be installed

The tools below are exposed by the Superlog MCP server. If they are not available in the current agent, install it first.

For **Claude Code**:

```
claude mcp add --transport http superlog https://api.superlog.sh/mcp
```

For **Codex**:

```
codex mcp add superlog --url https://api.superlog.sh/mcp
codex mcp login superlog
```

For **Cursor and others**: copy the `mcpServers` snippet from <https://superlog.sh/> → Connect.

After install, restart the agent and authenticate via the OAuth flow the MCP triggers on first use.

## First call every session

```
superlog.get_active_project
```

If it is not the project the user is debugging, switch:

```
superlog.list_projects
superlog.set_active_project { project_id: "<id>" }
```

You can also pass `project_id` on any individual call without changing session state. If the user has multiple environments wired as separate projects (e.g. `prod` vs `staging`), confirm which one before querying.

## Tool cheatsheet

All query tools default to the active project and to the **last 1 hour**. Override with `range: { since, until }` — both accept ISO-8601 (`"2026-05-28T14:00:00Z"`) or ClickHouse expressions (`"now() - INTERVAL 30 MINUTE"`).

| Tool | Use for |
|---|---|
| `superlog.list_services` | Confirm which services emitted telemetry in the window |
| `superlog.query_logs` | Search log bodies; filter by `service`, `severity`, `resource_attrs`, `log_attrs` |
| `superlog.query_traces` | Find slow/errored spans; filter by `service`, `span_name`, `min_duration_ms`, `status_code`, `span_attrs` |
| `superlog.query_metrics` | Pull recent metric points by `metric_name` + `service` |
| `superlog.list_alerts` / `get_alert` | Check existing alerts that may already cover what the user is seeing |
| `superlog.list_dashboards` / `get_dashboard` | Find an existing dashboard for the affected surface before building queries from scratch |

### `query_logs` tips

- `severity` matches the stored OTel severity text case-insensitively. Use the short forms the SDKs actually emit — `"ERROR"`, `"WARN"`, `"INFO"`, `"DEBUG"` — **not** `"WARNING"`, which matches nothing because the stored text is `WARN`. Filter aggressively — error-only is usually the right starting point during an incident.
- `search` is a case-insensitive substring match on the log body. Cheap and effective for finding a stack trace or a specific message.
- `log_attrs` filters **per-record** attributes (e.g. `event.name`, custom structured fields the app sets on each log).
- `resource_attrs` filters **service-level** attributes from the OTel resource (e.g. `deployment.environment.name=production`, `service.version=1.42.0`). Use the current OTel semconv keys — `deployment.environment.name` superseded the old `deployment.environment`, and apps on recent SDKs only emit the new one. Each entry also accepts an optional `op` of `eq` (default), `neq`, or `not_contains`.
- The response includes a `resource_attrs` map per log, so you can read service / env / version off the result without a second call.
- Logs emitted **outside** any span will have empty `trace_id` / `span_id` strings — that is expected, not a bug. Don't conclude the trace pipeline is broken just because some logs lack trace context.

### `query_traces` tips

- **HTTP status codes are in `span_attrs`, not in `status_code`.** OTel's `status_code` is the semantic OK/ERROR/UNSET — and by OTel HTTP semconv, only 5xx server responses auto-flip a span to ERROR. To find 4xx responses, use:

  ```
  span_attrs: [{ key: "http.response.status_code", value: "404" }]
  ```

- The `status_code` filter takes the enum form (`STATUS_CODE_OK`, `STATUS_CODE_ERROR`, `STATUS_CODE_UNSET`), but the value the response returns on each span is the title-case form (`"Ok"`, `"Error"`, `"Unset"`) — don't try to equality-match the response against the enum.
- For "slow X requests" use `service` + `span_name` + `min_duration_ms: 1000`.
- For route-level filtering use `span_attrs` with `http.route` (the template, e.g. `/api/users/:id`) — not the raw URL, which has high cardinality.
- For database calls, filter by an HTTP-style client span pointing at the DB host, or by `db.system` if the app's instrumentation emits it (some SDKs don't — check `span_attrs` on a sample span first).
- Each returned span includes both `span_attrs` and `resource_attrs`, so you can read the service version, env, and instance off the result directly.

### `query_metrics` tips

- Always pass `metric_name` (and usually `service`). The call works without them but returns whatever was most recently written, which is rarely what you want.
- Each returned point has `kind` (`gauge`/`sum`/`histogram`/`summary`), `metric_name`, `unit`, `service`, plus the per-point `attributes` (the series dimensions — route, status, tenant, etc.) and `resource_attrs`. Use `attributes` to disambiguate which series a point belongs to.
- **gauge/sum** points carry a scalar `value`. **histogram/summary** points have no scalar value: they carry `count` and `sum` instead (and histograms also carry `min`/`max`). So for a histogram, average = `sum / count`, and `value` is `null` — that's expected, read `count`/`sum`/`min`/`max`.
- For rates, pull the underlying counter over two time windows and diff yourself rather than expecting the MCP to compute it. For latency percentiles, build a dashboard widget (the dashboard query layer reconstructs histogram quantiles); the raw `query_metrics` points won't give you p95 directly.

## Investigation playbooks

### "Is X broken in prod right now?"

1. `list_services` over the last 30 minutes — is the service even reporting? Silence is a signal.
2. `query_logs` with `service: "<svc>"`, `severity: "ERROR"`, last 30 minutes. Quote a sample stack trace verbatim to the user.
3. `query_traces` with `service: "<svc>"`, `status_code: "STATUS_CODE_ERROR"` and/or `span_attrs` filtering on `http.response.status_code` to catch 4xx.
4. If errors cluster on a route, pull the slowest recent traces for that route and read `SpanAttributes` for the failing operation (DB statement, downstream URL, exception message).

### "This endpoint is slow"

1. `query_traces` with `service`, plus either `span_name` or `span_attrs: [{ key: "http.route", value: "/api/foo" }]`, plus `min_duration_ms: 1000`, last hour.
2. Sort/eyeball the slowest spans and expand `span_attrs` (and `resource_attrs` for service version / env). Look for: long DB calls (high duration with `http.host` pointing at the DB or `db.system` set), downstream HTTP fanout, queue waits.
3. If you find a slow child operation, re-query that span name directly to see if it is globally slow or just slow on this route.

### "Did my deploy regress anything?"

1. Note the deploy timestamp (ask the user if not obvious).
2. `query_logs` `severity: "ERROR"` for the affected service across two equal windows: one before, one after. Compare counts and unique error signatures.
3. `query_metrics` for the request rate, error rate, and latency metrics over both windows.
4. `query_traces` `status_code: "STATUS_CODE_ERROR"` post-deploy, scoped by `resource_attrs: [{ key: "service.version", value: "<new-version>" }]` if the app tags releases. The returned spans carry `resource_attrs`, so you can confirm which version emitted each error.

### "What was this service doing overnight?"

1. `list_services` with the overnight `range` to confirm continuous reporting.
2. `query_logs` `severity: "ERROR"` then `severity: "WARN"` across the window — read for clusters and time-of-day patterns.
3. `query_metrics` on the service's key business counters to spot drops or spikes.

### "Are users hitting this code path?"

1. `query_traces` with `service` + `span_name` (or `span_attrs` filtering on a custom business attribute like `feature.flag` or `tenant.id`).
2. If counts are zero, double-check the span name and attribute keys against the source — typos here are the #1 cause of false negatives.

## Output discipline

- **Lead with the finding**, not the methodology. "47 ERROR logs in `api` in the last 15 min, all `TimeoutError` from `db.query`" beats a paragraph about which tool you called.
- **Quote real evidence**: timestamps, trace IDs, span IDs, exact error messages. The user should be able to pivot from your answer into the Superlog UI for any of them.
- **State the window**: "in the last 30 min" / "since 14:00 UTC". A finding without a window is not actionable.
- **Distinguish zero from unknown**. "Zero matching spans" means the query ran and returned nothing — call that out explicitly. "I couldn't query because the active project isn't this service's project" is a different sentence; say so.
- **Don't over-recommend**. If telemetry shows the service is healthy, say so plainly and stop. Don't manufacture a "next step" to look busy.

## Pairing with other context

The Superlog MCP only sees what the user's apps export as OpenTelemetry. It does **not** see:

- Row-level database state — query the DB directly (or via a Postgres/MySQL MCP if installed).
- Cloud-provider host/container logs that never made it into the OTel pipeline (CloudWatch, GCP Logging, Fly logs, Railway logs, etc.).
- Queue depth (SQS, RabbitMQ, …) unless the app explicitly emits a metric for it.
- Anything the app simply never instrumented.

If the MCP returns nothing for a service, **don't conclude the service is silent** — first check whether telemetry is reaching ingest at all. A missing/misconfigured exporter, a wrong `service.name`, or a project-mismatched ingest key all show up as "MCP sees nothing" even when the service is alive and noisy at the host level. The `superlog-onboard` skill covers the install side if the user needs to fix that.

One specific failure to rule out first: **every ingest request is 401-ing even though the token is valid.** The token is sent under the wrong header name. Ingest reads the token only from `x-api-key: <token>` or `Authorization: Bearer <token>` (literal `Bearer ` prefix required); anything else — `api-key`, `x-superlog-token`, or `Authorization: <token>` without `Bearer ` — returns 401 on every request, so the install looks broken when the key is actually fine. A quick way to confirm the key works independently of the app's exporter config:

```bash
curl -i -X POST https://intake.superlog.sh/v1/traces \
  -H "x-api-key: sl_public_..." \
  -H "content-type: application/json" --data '{"resourceSpans":[]}'
# 2xx → key is valid; the app is sending the wrong header. Fix the exporter to use x-api-key.
# 401 → the token itself is wrong/revoked.
```

## Hard rules

- **Read-only by default.** `query_*` and `list_*` calls are safe. `create_alert`, `update_dashboard`, `delete_*`, etc. mutate the user's account — never invoke those without an explicit ask, and even then confirm the exact change first.
- **Don't echo full ingest tokens or secrets** that show up in span/log attributes back to chat. Truncate if you must reference them.
- **Don't switch the active project silently.** If you call `set_active_project`, tell the user you did and why.
- **Match the scope of the question.** A user asking "is prod ok?" wants a 2-line answer, not a 12-query sweep. Escalate depth only as the investigation requires it.

Source

Creator's repository · superloglabs/skills

View on GitHub

Security

Security checks in progress
Results will appear here once audits complete
What this skill can do
Reads your filesConnects to the internetRuns code on your machine
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk