---
name: vtex-io-observability-and-ops
description: "Apply when making VTEX IO services easier to observe, troubleshoot, and operate in production. Covers metrics, structured logging, failure visibility, rate-limit awareness, and production readiness checks for backend apps. Use for integration monitoring, error diagnosis, or improving the operational quality of VTEX IO services before or after release."
---
# Observability & Operational Readiness
## When this skill applies
Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.
- Adding metrics to important client calls or flows
- Improving logs for routes, workers, or integrations
- Surfacing failures clearly for operations and support
- Reviewing whether a service is ready for production
- Monitoring rate-limit-sensitive integrations
Do not use this skill for:
- app policy declaration
- trust-boundary modeling
- frontend analytics or browser monitoring
- route contract design by itself
## Decision rules
- Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
- Use `ctx.vtex.logger` with appropriate log levels such as `info`, `warn`, and `error` instead of `console.log`, so logs are properly collected and searchable in the VTEX logging stack.
- Treat `ctx.vtex.logger` as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
- Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
- Choose metric names that reflect the integration and operation, such as `partner-get-order` or `partner-sync-catalog`, so counts, latency, and error rates can be tracked over time.
- Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
- For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
- Review whether expensive or fragile flows expose enough operational signals before releasing them.
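The rate-limit rule above combines backoff-aware retries with caching of frequent reads. A minimal sketch of both pieces follows, assuming hypothetical helpers named `backoffDelayMs` and `TtlCache` that are not part of `@vtex/api`. Prefer the platform's own client retry and cache options when they cover the case; this only illustrates the reasoning.

```typescript
// Illustrative helpers only; none of these names come from @vtex/api.

interface BackoffOptions {
  baseMs: number // delay before the first retry
  maxMs: number // upper bound for a computed delay
}

// Delay before retry number `attempt` (1-based).
// A Retry-After value (in seconds) from the server takes precedence when present.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  retryAfterSeconds?: number
): number {
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000
  }
  const exponential = opts.baseMs * 2 ** (attempt - 1)
  return Math.min(exponential, opts.maxMs)
}

// Minimal in-memory cache with a fixed time-to-live, for frequent reads.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>()
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key) // expired: drop it so the caller refetches
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}
```

The sketch leaves out jitter and a maximum attempt count. A production retry loop would likely want both so that many instances do not retry in lockstep.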
## Hard constraints
### Constraint: Important failures must be visible in logs, metrics, or durable state
Routes, event handlers, and workers MUST NOT hide important failures from operators.
**Why this matters**
If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.
**Detection**
If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.
**Correct**
```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}
```
**Wrong**
```typescript
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  return
}
```
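When several handlers need to log failures the same way, a small helper can build the structured entry from whatever was thrown. The sketch below assumes a hypothetical `toErrorLog` function and assumes that HTTP client errors expose the status code under `response.status`. Neither is a documented platform contract, so adjust the shape to the errors your clients actually throw.

```typescript
// Hypothetical helper; the error shape read below is an assumption.
interface ErrorLogEntry {
  message: string
  errorName: string
  errorMessage: string
  status?: number
  [key: string]: unknown
}

function toErrorLog(
  message: string,
  error: unknown,
  context: Record<string, unknown> = {}
): ErrorLogEntry {
  // Normalize non-Error throwables (strings, plain objects) into an Error.
  const err = error instanceof Error ? error : new Error(String(error))

  // Read only the status code; the response body is deliberately left out.
  const response = (error as { response?: { status?: unknown } } | null | undefined)?.response
  const status = typeof response?.status === 'number' ? response.status : undefined

  // Context goes first so it cannot overwrite the fixed fields.
  const entry: ErrorLogEntry = {
    ...context,
    message,
    errorName: err.name,
    errorMessage: err.message,
  }
  if (status !== undefined) entry.status = status
  return entry
}
```

Inside a `catch` block, the entry could be passed to `ctx.vtex.logger.error(...)` with `orderId` and `ctx.vtex.account` as context, followed by `throw error` so the failure still propagates.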
### Constraint: Metrics should be attached to important integration calls
Client calls that are operationally important SHOULD pass a `metric` name in the request options so request behavior can be tracked consistently.
**Why this matters**
Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.
**Detection**
If a key downstream integration call has no `metric` and operations depend on it, STOP and add a meaningful metric name.
**Correct**
```typescript
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
```
**Wrong**
```typescript
return this.http.get(`/orders/${id}`)
```
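For a client with several methods, keeping metric names in one place helps them stay consistent. The sketch below is self-contained and therefore uses stand-in types (`HttpLike`, `RequestOptions`) and a hypothetical `/catalog/sync` path. In a real app the client would extend a base class from `@vtex/api` and use its `this.http` instead of taking one in the constructor.

```typescript
// Stand-in types so the sketch runs on its own; they only mirror the `metric` option.
interface RequestOptions {
  metric?: string
}

interface HttpLike {
  get<T>(url: string, options?: RequestOptions): Promise<T>
  post<T>(url: string, data?: unknown, options?: RequestOptions): Promise<T>
}

// One place for metric names: <integration>-<operation>.
const METRICS = {
  getOrder: 'partner-get-order',
  syncCatalog: 'partner-sync-catalog',
} as const

class PartnerClient {
  private http: HttpLike

  constructor(http: HttpLike) {
    this.http = http
  }

  getOrder<T>(id: string): Promise<T> {
    // Encode the ID so unexpected characters cannot change the path.
    return this.http.get<T>(`/orders/${encodeURIComponent(id)}`, {
      metric: METRICS.getOrder,
    })
  }

  syncCatalog<T>(payload: unknown): Promise<T> {
    // '/catalog/sync' is a hypothetical endpoint used only for illustration.
    return this.http.post<T>('/catalog/sync', payload, {
      metric: METRICS.syncCatalog,
    })
  }
}
```

Because every method names its metric explicitly, a reviewer can spot a call that is missing one by scanning the class.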
### Constraint: Logs must stay useful without leaking sensitive data
Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.
**Why this matters**
Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk without making diagnosis any more reliable.
**Detection**
If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.
**Correct**
```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
```
**Wrong**
```typescript
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
```
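Logging only known-safe identifiers, as in the correct example, remains the simplest approach. Where a larger object has to be logged, a redaction pass can act as a safety net. The sketch below assumes a hypothetical `redact` helper and an illustrative list of sensitive key fragments. The list is not exhaustive and should be reviewed against the data the service actually handles.

```typescript
// Illustrative list; extend it for the fields your service handles.
const SENSITIVE_FRAGMENTS = [
  'authorization',
  'cookie',
  'token',
  'password',
  'secret',
  'apikey',
  'appkey',
]

const MAX_DEPTH = 5

function isSensitiveKey(key: string): boolean {
  // Normalize 'X-VTEX-API-AppToken' style keys to 'xvtexapiapptoken' before matching.
  const normalized = key.toLowerCase().replace(/[^a-z0-9]/g, '')
  return SENSITIVE_FRAGMENTS.some((fragment) => normalized.includes(fragment))
}

// Returns a copy of `value` with sensitive fields replaced; the input is not mutated.
function redact(value: unknown, depth = 0): unknown {
  if (depth > MAX_DEPTH) return '[truncated]' // avoid logging very deep structures
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, depth + 1))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = isSensitiveKey(key) ? '[redacted]' : redact(inner, depth + 1)
    }
    return out
  }
  return value
}
```

Key-based redaction cannot catch personal data stored under neutral field names, so it complements the rule above rather than replacing it.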
## Preferred pattern
Operationally healthy VTEX IO services should:
- emit metrics for important client calls so counts, latency, and error rates are visible
- log failures with enough structured context such as domain IDs, account, and `routeId`
- avoid silent error swallowing
- sanitize sensitive data before logging
- review retries, caching, and throughput with rate-limit behavior in mind
Use observability to shorten diagnosis time, not just to create more logs.
## Common failure modes
- Catching and ignoring errors in async flows.
- Logging too little context to diagnose production incidents.
- Logging too much sensitive data.
- Omitting metrics from important integration calls.
- Treating rate-limit failures as isolated bugs instead of operational signals.
## Review checklist
- [ ] Are important failures visible to operators?
- [ ] Do key integrations emit useful metrics?
- [ ] Are logs structured and safe?
- [ ] Are retries, caching, and rate-limit behavior considered together?
- [ ] Would someone on call be able to diagnose this flow from the available signals?
## Reference
- [Using Node Clients](https://developers.vtex.com/docs/guides/using-node-clients) - Client usage patterns relevant to metrics and retries
- [Best practices for avoiding rate-limit errors](https://developers.vtex.com/docs/guides/best-practices-for-avoiding-rate-limit-errors) - Operational guidance for stable integrations
Creator's repository · vtex/skills