querying-mlflow-metrics

Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.

Skill file

Preview skill file↓↑

---
name: querying-mlflow-metrics
description: Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.
---

# MLflow Metrics

Run `scripts/fetch_metrics.py` to query metrics from an MLflow tracking server.

## Examples

**Token usage summary:**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG
```
Output: `AVG: 223.91  SUM: 7613`

**Hourly token trend (last 24h):**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
    -t 3600 --start-time="-24h" --end-time=now
```
Output: Time-bucketed token sums per hour

**Latency percentiles by trace:**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name
```

**Error rate by status:**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status
```

**Quality scores by evaluator (assessments):**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_value -a AVG,P50 -d assessment_name
```
Output: Average and median scores for each evaluator (e.g., correctness, relevance)

**Assessment count by name:**
```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_count -a COUNT -d assessment_name
```

**JSON output:** Add `-o json` to any command.

## Arguments

| Arg | Required | Description |
|-----|----------|-------------|
| `-s, --server` | Yes | MLflow server URL |
| `-x, --experiment-ids` | Yes | Experiment IDs (comma-separated) |
| `-m, --metric` | Yes | `trace_count`, `latency`, `input_tokens`, `output_tokens`, `total_tokens` |
| `-a, --aggregations` | Yes | `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `P50`, `P95`, `P99` |
| `-d, --dimensions` | No | Group by: `trace_name`, `trace_status` |
| `-t, --time-interval` | No | Bucket size in seconds (3600=hourly, 86400=daily) |
| `--start-time` | No | `-24h`, `-7d`, `now`, ISO 8601, or epoch ms |
| `--end-time` | No | Same formats as start-time |
| `-o, --output` | No | `table` (default) or `json` |

For SPANS metrics (`span_count`, `latency`), add `-v SPANS`.
For ASSESSMENTS metrics, add `-v ASSESSMENTS`.

See [references/api_reference.md](references/api_reference.md) for filter syntax and full API details.

Source

Creator's repository · mlflow/skills

View on GitHub ↗

Security

Security checks in progress

Results will appear here once audits complete

Checked by 3 independent security firms

Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub

Does it sneak in hidden code?Not yet checkedPending · Socket

Does it have known bugs?Not yet checkedPending · Snyk