tao-route-visual-changenet-samples

Routes the weakest VCN samples (output of `tao-analyze-gaps-visual-changenet`) into per-augmentation-module subsets: one parquet for k-NN mining and one for AnomalyGen (Cosmos SDG).

Best for: ML engineers tuning computer-vision models and wondering why a data pipeline step broke accuracy.

---
name: tao-route-visual-changenet-samples
description: Routes the weakest VCN samples (output of `tao-analyze-gaps-visual-changenet`) into per-augmentation-module
  subsets — one parquet for k-NN mining, one for AnomalyGen (Cosmos SDG) — based on each module's label eligibility. Use as the
  immediate next step after DEFT gap analysis in a VCN AOI SDA iteration.
license: Apache-2.0
compatibility: Standalone — no external runtime requirements.
metadata:
  author: NVIDIA Corporation
  version: "0.1.0"
allowed-tools: Read Bash
tags:
- data
- routing
- vcn
- aoi
- sda
---

# TAO VCN Sample Routing Skill

You are the dispatcher between gap analysis and the augmentation modules in a VCN AOI SDA pipeline. Each augmentation module can only act on labels it knows how to handle:

- **k-NN Mining** can only mine real-image neighbors for labels that already exist in the **source pool CSV**. There is no point looking for `SHIFT` neighbors if the pool has no `SHIFT` rows.
- **AnomalyGen** (Cosmos SDG) can only generate synthetic anomalies for the classes its inference pipeline supports: `PASS`, `EXCESS_SOLDER`, `MISSING`, `BRIDGE`. A weak sample with a label outside this set is unroutable to AnomalyGen.

This skill runs **once per SDA iteration immediately after gap analysis**. It splits the gap-analysis parquet into one filtered parquet per module so each module operates on its own eligible subset, and it writes a human-readable summary of the per-label routing decisions.

The work is intentionally trivial: read a parquet, do two `.isin(...)` filters, write two parquets, write one summary. The skill exists to make those decisions auditable — every label must show up in the summary with a yes/no verdict for each module so a downstream reviewer can spot when a label is silently dropped because no module accepted it.

---

## Inputs

1. **`gaps_parquet`** — the gap-analysis output (typically `<exp_dir>/rca_results/<timestamp>/gaps.parquet` from `tao-analyze-gaps-visual-changenet`). Required columns: `filepath`, `label`. Other columns (`siamese_score`, `weakness`) are preserved verbatim.
2. **`source_pool_csv`** — VCN-format mining source pool CSV with a `label` column. Empty string or non-existent path is allowed; the mining subset will simply be empty in that case.
3. **Output directory** — where the two routed parquets, the summary, and the report are written. Default: a timestamped folder under the gap-analysis result directory: `<rca_result_dir>/routing_results/<timestamp>/`.
4. **`anomalygen_supported_labels`** *(optional)* — override the default AnomalyGen-eligible label set. Default: `{"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}`. **Warning:** This must stay in sync with `ANOMALYGEN_SUPPORTED_LABELS` in `mdo-kratos-workflows/pipelines/sda/routing.py` and the AnomalyGen integration's actual generator coverage. Adding a new defect class to AnomalyGen means adding it here too.
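
The input contract above could be checked before any routing happens. The following is a minimal sketch, not part of the reference component: the helper names `check_gaps_columns` and `resolve_supported_labels` are hypothetical, and the `TOMBSTONE` label is invented purely to illustrate an override.

```python
REQUIRED_COLUMNS = ("filepath", "label")
DEFAULT_ANOMALYGEN_LABELS = frozenset({"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"})


def check_gaps_columns(columns) -> None:
    """Raise if a required gaps column is absent (hypothetical helper)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"gaps parquet is missing required columns: {missing}")


def resolve_supported_labels(override=None) -> frozenset:
    """Return the AnomalyGen label set, uppercased; fall back to the default."""
    if not override:
        return DEFAULT_ANOMALYGEN_LABELS
    return frozenset(str(label).upper() for label in override)


# Column names as they would come from df.columns on a typical gaps parquet.
check_gaps_columns(["filepath", "label", "siamese_score", "weakness"])

# TOMBSTONE is an illustrative label only, not a class AnomalyGen is known to support.
supported = resolve_supported_labels(["pass", "Tombstone"])
print(sorted(supported))
```

Uppercasing the override keeps it consistent with the case-insensitive comparison used in the Method section. Any override still has to match what AnomalyGen can actually generate.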

---

## Method

The whole skill is two `.isin(...)` masks against the uppercased label column.

### Step 1 — Load and uppercase

```python
import pandas as pd

df = pd.read_parquet(gaps_parquet)
# Comparison key only; the original "label" column is left untouched.
labels_upper = df["label"].astype(str).str.upper()
```

The match is **case-insensitive** for both module checks. The original `label` column is preserved unchanged in the output parquets — only the comparison key is uppercased.
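
To illustrate the comparison key, here is a small sketch on toy data (the rows are invented for the example). The mask is computed on the uppercased key, while the filtered rows keep their original casing.

```python
import pandas as pd

# Toy rows; real input comes from the gaps parquet.
df = pd.DataFrame(
    {
        "filepath": ["a.png", "b.png", "c.png"],
        "label": ["bridge", "Missing", "shift"],
    }
)

# Uppercase a copy of the labels for matching; df["label"] is never modified.
labels_upper = df["label"].astype(str).str.upper()
mask = labels_upper.isin({"BRIDGE", "MISSING"})
subset = df[mask]

print(subset["label"].tolist())  # original casing is kept in the output rows
```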

### Step 2 — Mining subset

```python
import os

pool_missing = False  # set up front so the summary step can always read it
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    # Uppercase the pool labels so the comparison is case-insensitive.
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]   # empty, but with the same schema
mn_df.to_parquet(mining_gaps_parquet, index=False)
```

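If the pool CSV is missing (or the path is an empty string), the mining subset is an empty DataFrame **with the same columns as the input**, so downstream readers don't crash on a schema mismatch. A pool CSV that has a header but no rows yields the same empty subset through an all-False mask, although the code above does not set `pool_missing` in that case. Flag either case in the summary.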
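
One way to treat a missing pool, a zero-byte file, and a header-only file uniformly is sketched below. This is an assumption-laden variant, not the reference behavior: `load_pool_labels` is a hypothetical helper, and unlike the reference recipe it also flags an empty pool and a pool without a `label` column.

```python
import os
import tempfile

import pandas as pd


def load_pool_labels(source_pool_csv):
    """Return (uppercased label set, pool_unusable flag). Hypothetical helper."""
    if not source_pool_csv or not os.path.isfile(source_pool_csv):
        return set(), True
    try:
        pool_df = pd.read_csv(source_pool_csv)
    except pd.errors.EmptyDataError:
        # Zero-byte file: pandas finds no header to parse.
        return set(), True
    if "label" not in pool_df.columns or pool_df.empty:
        return set(), True
    labels = {str(l).upper() for l in pool_df["label"].unique()}
    return labels, False


# Demo with a throwaway CSV that holds only a header row.
with tempfile.TemporaryDirectory() as tmp:
    header_only = os.path.join(tmp, "pool.csv")
    with open(header_only, "w", encoding="utf-8") as f:
        f.write("filepath,label\n")
    header_only_result = load_pool_labels(header_only)
    missing_result = load_pool_labels(os.path.join(tmp, "does_not_exist.csv"))

print(header_only_result, missing_result)
```

Returning the flag alongside the label set lets the summary step report why the mining subset is empty without re-reading the file.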

### Step 3 — AnomalyGen subset

```python
ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)
```

Rows whose label is in the AnomalyGen-supported set are written verbatim to `anomalygen_gaps.parquet`. The schema matches the input parquet exactly — downstream AnomalyGen (Cosmos SDG) needs no other changes.

### Step 4 — Per-label routing breakdown

For every distinct label in the input gaps parquet (uppercased), record:
- `count` — how many rows have this label
- `mining` — yes if the label is in `pool_labels`, otherwise no
- `anomalygen` — yes if the label is in `ANOMALYGEN_SUPPORTED`, otherwise no

A label can route to **both** modules (e.g. PASS rows route to AnomalyGen, and if the source pool also contains PASS rows they route to Mining too). A label can also route to **none** — flag those, since they are silently dropped and may signal a configuration mismatch.

Write the breakdown to `routing_summary.txt`. The format mirrors the reference component exactly:

```
Weak-sample routing summary
Total weak samples: <N>
Mining subset:      <N_mn> -> <mining_gaps_parquet>
AnomalyGen subset:  <N_ag> -> <anomalygen_gaps_parquet>

[If pool missing:]
No source pool CSV at '<path>'; mining subset is empty.

Per-label breakdown (count, mining, anomalygen):
  PASS: 50 (mining=yes, anomalygen=yes)
  MISSING: 32 (mining=no, anomalygen=yes)
  SHIFT: 14 (mining=yes, anomalygen=no)
  EXCESS_SOLDER: 9 (mining=yes, anomalygen=yes)
  ...
```

### Step 5 — Sanity checks

After both subsets are written, verify:
- The sum of subset sizes is *not* required to equal `len(df)` — overlap is allowed (a label can route to both modules). What matters is that **every input row appears in at least one subset, OR appears in the "none" list with an explicit reason**.
- If `len(mn_df) == 0` and `len(ag_df) == 0`, something is wrong — flag prominently in the report.
- If an entire label group routes to no module, the `Recommended Actions` section must call this out so the user can either seed the source pool with that label or extend AnomalyGen's supported set.
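
The checks above can be sketched as a small function over the two masks. This is an illustration under stated assumptions rather than the reference component's code: `sanity_check` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd


def sanity_check(df, mn_mask, ag_mask):
    """Return (dropped label counts, warnings). Hypothetical helper."""
    warnings = []
    covered = mn_mask | ag_mask  # overlap is allowed, so test the union
    dropped = (
        df.loc[~covered, "label"].astype(str).str.upper().value_counts().to_dict()
    )
    if not covered.any():
        warnings.append("Both subsets are empty; nothing downstream can run.")
    for label, count in dropped.items():
        warnings.append(f"{label}: {count} rows routed to no module.")
    return dropped, warnings


# Toy input: SHIFT is in neither label set here, so its row is dropped.
df = pd.DataFrame(
    {"filepath": list("abcd"), "label": ["PASS", "PASS", "shift", "BRIDGE"]}
)
key = df["label"].astype(str).str.upper()
mn_mask = key.isin({"PASS"})
ag_mask = key.isin({"PASS", "BRIDGE"})
dropped, warnings = sanity_check(df, mn_mask, ag_mask)
print(dropped, warnings)
```

Working from the masks rather than the written parquets keeps the check independent of file I/O. The dropped counts feed directly into the report's dropped-labels table.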

---

## Reference Python Recipe

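This is the exact computation, lifted from `mdo-kratos-workflows/pipelines/sda/routing.py`. Run it as a single Python script via Bash; it produces every artifact except the report. The script expects `gaps_parquet`, `source_pool_csv`, `mining_gaps_parquet`, `anomalygen_gaps_parquet`, and `logs_dir` to be defined before it runs, and it needs pandas with a parquet engine installed. To match the layout under Outputs, point `logs_dir` at the timestamped output folder.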

```python
import os
import pandas as pd

ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}

df = pd.read_parquet(gaps_parquet)
labels_upper = df["label"].astype(str).str.upper()

# Mining subset
pool_missing = False
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]
os.makedirs(os.path.dirname(mining_gaps_parquet) or ".", exist_ok=True)
mn_df.to_parquet(mining_gaps_parquet, index=False)

# AnomalyGen subset
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
os.makedirs(os.path.dirname(anomalygen_gaps_parquet) or ".", exist_ok=True)
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)

# Per-label breakdown
summary_lines = [
    "Weak-sample routing summary",
    f"Total weak samples: {len(df)}",
    f"Mining subset:      {len(mn_df)} -> {mining_gaps_parquet}",
    f"AnomalyGen subset:  {len(ag_df)} -> {anomalygen_gaps_parquet}",
    "",
]
if pool_missing:
    summary_lines.append(f"No source pool CSV at {source_pool_csv!r}; mining subset is empty.")
    summary_lines.append("")
summary_lines.append("Per-label breakdown (count, mining, anomalygen):")
label_counts = labels_upper.value_counts()
for label, count in label_counts.items():
    in_mn = (not pool_missing) and label in pool_labels
    in_ag = label in ANOMALYGEN_SUPPORTED
    summary_lines.append(
        f"  {label}: {count} "
        f"(mining={'yes' if in_mn else 'no'}, "
        f"anomalygen={'yes' if in_ag else 'no'})"
    )
summary_text = "\n".join(summary_lines) + "\n"

os.makedirs(logs_dir, exist_ok=True)
with open(os.path.join(logs_dir, "routing_summary.txt"), "w", encoding="utf-8") as f:
    f.write(summary_text)
print(summary_text.strip())
```

---

## Outputs

Write everything into a timestamped folder. The packaging hook will copy `routing_config/` and `claude_session.jsonl` automatically when `Routing_Report.md` is written.

```
<output_dir>/routing_results/YYYY-MM-DD_HHMMSS/
├── Routing_Report.md           # Full routing report
├── mining_gaps.parquet         # Subset routed to k-NN Mining
├── anomalygen_gaps.parquet     # Subset routed to AnomalyGen (Cosmos SDG)
├── routing_summary.txt         # Plain-text per-label breakdown
├── routing_config/             # Auto-copied by hook
└── claude_session.jsonl        # Auto-copied by hook
```

At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. If the user specifies a custom output path, use it directly but maintain the internal layout.
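
The layout above can be assembled from the timestamp string once Bash has produced it. A minimal sketch follows, assuming a hypothetical helper `build_output_paths`. The `/tmp/exp` directory and the timestamp value are placeholders, and the helper appends `routing_results/<timestamp>/` as shown in the tree, which may not be what a user-supplied custom path should receive.

```python
import os


def build_output_paths(output_dir: str, timestamp: str) -> dict:
    """Lay out the artifact paths for one routing run (hypothetical helper)."""
    run_dir = os.path.join(output_dir, "routing_results", timestamp)
    return {
        "run_dir": run_dir,
        "report": os.path.join(run_dir, "Routing_Report.md"),
        "mining": os.path.join(run_dir, "mining_gaps.parquet"),
        "anomalygen": os.path.join(run_dir, "anomalygen_gaps.parquet"),
        "summary": os.path.join(run_dir, "routing_summary.txt"),
    }


# The timestamp is whatever `date +%Y-%m-%d_%H%M%S` printed; this value is a placeholder.
paths = build_output_paths("/tmp/exp", "2025-01-31_142500")
for name, path in paths.items():
    print(f"{name}: {path}")
```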

---

## Report Structure

Keep the report short (400–800 words). Routing is deterministic, so the report's value lies in making the decisions auditable, not in narrative.

```
# VCN Routing Report: <Iteration / Experiment Name>

## 1. Verdict
- Total weak samples in: <N>
- Mining subset:     <N_mn> rows  →  `mining_gaps.parquet`
- AnomalyGen subset: <N_ag> rows  →  `anomalygen_gaps.parquet`
- Source pool present? <yes/no — and the path>
- One-line headline: "<X> labels routed, <Y> labels dropped (no module accepted)"

## 2. Inputs
| Input | Path | Notes |
|-------|------|-------|
| gaps_parquet     | … | rows=<N>, columns=<col list> |
| source_pool_csv  | … | rows=<M> or "not provided" / "missing" |

## 3. Per-Label Routing Decisions
| Label | Count in gaps | In source pool? | Mining? | AnomalyGen? | Routed To |
|-------|----------------|------------------|----------|--------------|-----------|

(One row per distinct label in `gaps_parquet`, uppercased. `Routed To` is one of:
`mining only`, `anomalygen only`, `mining+anomalygen`, `neither (DROPPED)`.
Use `neither (DROPPED)` whenever no module accepted the label. Sort by count descending.)

## 4. Module-Level Summaries
### 4.1 k-NN Mining
- Pool labels (from source_pool_csv): <list, or "pool missing">
- Labels accepted from input: <list>
- Total rows routed: <N_mn>
- Per-label row counts: <breakdown>

### 4.2 AnomalyGen (Cosmos SDG)
- Eligible labels (configured): PASS, EXCESS_SOLDER, MISSING, BRIDGE
- Labels accepted from input: <list>
- Total rows routed: <N_ag>
- Per-label row counts: <breakdown>

## 5. Dropped Labels (routed to NEITHER module)
| Label | Count | Why dropped | Suggested fix |
|-------|-------|-------------|----------------|

(Empty table is OK and means no labels were dropped. If non-empty, every row needs a
"why" — typically one of: "not in source pool AND not in AnomalyGen supported set",
"source pool missing entirely AND label not in AnomalyGen set", "label name doesn't
match any module's expected canonicalization".)

## 6. Recommended Actions
1. **If any labels are dropped**: seed the source pool with that label, OR extend
   `ANOMALYGEN_SUPPORTED_LABELS` (and the AnomalyGen generator coverage).
2. **If source pool is missing**: provide `source_pool_csv` to enable the Mining branch.
   Without it, half of the augmentation pipeline is dark.
3. **If AnomalyGen subset is empty**: gap analysis only surfaced labels AnomalyGen cannot
   generate; rely on Mining for this iteration, or extend the AnomalyGen integration.
4. **If both subsets are empty**: stop the SDA iteration. Nothing downstream can run.
```
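
The per-label table in section 3 and the headline in section 1 could be derived from the same yes/no verdicts. The sketch below is one possible rendering, not the reference implementation: `render_label_table` and `yes_no` are hypothetical helpers, and the "Mining?" column simply repeats the pool-membership verdict, since mining eligibility is defined by pool membership.

```python
def yes_no(flag: bool) -> str:
    return "yes" if flag else "no"


def render_label_table(rows):
    """Render (label, count, in_pool, in_anomalygen) tuples as Markdown lines.

    Returns (headline, lines). Hypothetical helper for illustration.
    """
    lines = [
        "| Label | Count in gaps | In source pool? | Mining? | AnomalyGen? | Routed To |",
        "|-------|---------------|-----------------|---------|-------------|-----------|",
    ]
    dropped = 0
    # Sort by count, descending, as the report structure asks.
    for label, count, in_pool, in_ag in sorted(rows, key=lambda r: -r[1]):
        if in_pool and in_ag:
            verdict = "mining+anomalygen"
        elif in_pool:
            verdict = "mining only"
        elif in_ag:
            verdict = "anomalygen only"
        else:
            verdict = "neither (DROPPED)"
            dropped += 1
        lines.append(
            f"| {label} | {count} | {yes_no(in_pool)} | {yes_no(in_pool)} "
            f"| {yes_no(in_ag)} | {verdict} |"
        )
    routed = len(rows) - dropped
    headline = f"{routed} labels routed, {dropped} labels dropped (no module accepted)"
    return headline, lines


# Toy verdicts; the counts are invented for the example.
headline, lines = render_label_table(
    [("SHIFT", 14, False, False), ("PASS", 50, True, True), ("MISSING", 32, False, True)]
)
print(headline)
print("\n".join(lines))
```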

---

## Execution Order

1. Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/routing_results/<timestamp>/`.
2. Run the Python recipe (Steps 1–4) to produce `mining_gaps.parquet`, `anomalygen_gaps.parquet`, and `routing_summary.txt`. Print summary stats to stdout so the script-check hook can verify it ran.
3. Build the per-label decision table by reading the input gaps parquet alongside both routed parquets and computing the routed-to verdict per label; labels that appear only in the input are the dropped ones.
4. Write `Routing_Report.md` last — writing it triggers the packaging hook, which copies session logs and skill config alongside.

Source

Creator's repository · nvidia/skills


License: Apache-2.0
