---
name: nemo-evaluator-plugin
description: Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.
metadata:
  owner: nemo-platform
  maturity: active
license: Apache-2.0
---

# Evaluator Plugin

Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is `nemo evaluator`; the legacy generated `nemo evaluation` API command group is not the target surface for new guidance.

## CLI Interface

### Prerequisites

- all commands in this file assume that the shell's working directory is the root of the Nvidia-NeMo/nemo-platform repo
- activate the Python virtual environment before invoking the `nemo` CLI: `source .venv/bin/activate`

Check plugin status from the CLI:

```bash
nemo evaluator info
```

## Metric Types

### Explore Available Metrics

To view available metric names, run:

```bash
nemo evaluator metric-types
```

To view a specific metric schema, pass a metric name from the `metric_types` list above:

```bash
nemo evaluator metric-types <metric-name>
```

Inspect all the registered metric schema contracts:

```bash
nemo evaluator evaluate explain
```

> Note: use `nemo evaluator evaluate explain` as the source of truth for the current plugin input schema. It returns a large JSON schema response, so strongly prefer `nemo evaluator metric-types` when you only need metric names and their corresponding schemas.

## Evaluation Spec

An evaluation spec is the payload passed to the CLI as input to run an evaluation.
At a high level, a spec describes:

- `metrics`: bundled Evaluator SDK metric configurations
- `dataset`: inline rows to evaluate, or a platform FilesetRef that contains the dataset
- `params`: optional Evaluator SDK execution parameters
- `target`: optional model or agent target for online evaluation

See the LLM-judge spec example at [assets/specs/llm_as_judge.json](./assets/specs/llm_as_judge.json).
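Before handing a spec file to the CLI, a quick structural check can catch a missing top-level section early. The following is a minimal sketch based only on the four top-level fields listed above. It assumes `metrics` and `dataset` are required, because only `params` and `target` are described as optional, and the helper names `check_spec_keys` and `check_spec_file` are hypothetical, not part of the plugin. It does not validate nested fields; `nemo evaluator evaluate explain` remains the source of truth for the schema.

```python
import json

# Top-level spec fields described in this skill. Treating `metrics` and
# `dataset` as required is an assumption: only `params` and `target`
# are documented as optional.
REQUIRED_KEYS = {"metrics", "dataset"}
OPTIONAL_KEYS = {"params", "target"}


def check_spec_keys(spec: dict) -> list[str]:
    """Return human-readable notes about the spec's top-level keys."""
    present = set(spec)
    notes = []
    for key in sorted(REQUIRED_KEYS - present):
        notes.append(f"missing required key: {key}")
    # Keys outside the four listed fields are only flagged, not rejected,
    # because the real schema may allow more than this skill describes.
    for key in sorted(present - REQUIRED_KEYS - OPTIONAL_KEYS):
        notes.append(f"key not described in this skill: {key}")
    return notes


def check_spec_file(path: str) -> list[str]:
    """Load a JSON spec file and check its top-level keys."""
    with open(path, encoding="utf-8") as handle:
        return check_spec_keys(json.load(handle))
```

An empty list means the top-level layout matches what this skill describes, for example when calling `check_spec_file("skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json")` from the repo root.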
### Metric Bundle Payloads

The checked-in [spec examples](./assets/specs) use bundled SDK metrics. The fields under `metrics[*].payload` are generated by `bundle_metric(metric, CloudpickleMetricBundlePackager())`.

To see the pattern for configuring a pre-defined SDK metric, for example `ExactMatchMetric`, and converting it into bundled metric JSON, inspect `build_metric_bundle_example()` in [generate_example_specs.py](./scripts/generate_example_specs.py) and run:

```bash
uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py
```

## Run Evaluations

### Run Using File Spec Reference

When using the `nemo evaluator evaluate run` command, results are saved to local temporary directories and a link to them is printed to stdout. Prefer the `--spec-file` named argument over inline shell JSON because metric bundles include serialized payloads.

Examples of various specs are provided in the [assets/specs](./assets/specs/) directory.

#### Evaluate using `exact-match` metric

See the spec example at [assets/specs/exact_match_metric.json](./assets/specs/exact_match_metric.json).

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
```

#### Evaluate using a benchmark metric set

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json
```

#### Evaluate using `LLM-Judge` metric

Uses an LLM to score responses. See the spec example at [assets/specs/llm_as_judge.json](./assets/specs/llm_as_judge.json).

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json
```

### Run Evaluation As A Durable Job

Use the `nemo evaluator evaluate submit` command to create a durable evaluation job. This command returns a job handler object instead of the evaluation result.
```bash
nemo evaluator evaluate submit \
  --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
```

The submit response includes the generated job's `name` field, for example `nemo-evaluator-zlhn1ecd`. Wait for the job to complete, then list and download the job results.

```bash
nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
```

## Python SDK Interface

The Evaluator Python SDK client is exposed as the `evaluator` attribute on a `NeMoPlatform` instance:

```python
from nemo_platform import NeMoPlatform

platform_client = NeMoPlatform(base_url="http://localhost:8080")
status = platform_client.evaluator.plugin_status()
```

See examples of using the plugin SDK interface in [plugin_sdk_examples.py](./assets/examples/plugin_sdk_examples.py).

## Security

Do not print secrets to stdout, since stdout can be collected as logs.

## Additional Resources

- For LLM-judge setup notes, see [LLM Judge Notes](references/llm-judge.md).
- For evaluator API key auth, see [Evaluator API Auth](references/api-auth.md).
- For local and cluster troubleshooting, see [Evaluation Troubleshooting](references/troubleshooting.md).
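To tie together the durable job steps described under Run Evaluation As A Durable Job, the sketch below takes a submit response, reads the job name, and builds the follow-up `nemo jobs` commands. It is only an illustration under stated assumptions: it assumes the submit response is a JSON object with a top-level `name` field, and the helper names `job_name_from_submit` and `follow_up_commands` are hypothetical. The command strings themselves are the ones listed in this skill.

```python
import json
import shlex


def job_name_from_submit(response_text: str) -> str:
    """Extract the job name from a submit response.

    Assumption: the response is a JSON object with a top-level `name`
    field; adjust the lookup if the real response is shaped differently.
    """
    return json.loads(response_text)["name"]


def follow_up_commands(job_name: str) -> list[str]:
    """Build the status, listing, and download commands for one job."""
    name = shlex.quote(job_name)  # keep the name safe for shell use
    return [
        f"nemo jobs get-status {name}",
        f"nemo jobs get {name}",
        f"nemo jobs results list {name}",
        f"nemo jobs results download aggregate-scores --job {name}"
        " --output-file aggregate-scores.json",
        f"nemo jobs results download row-scores --job {name}"
        " --output-file row-scores.jsonl",
    ]


if __name__ == "__main__":
    # Sample response text using the example job name from this skill.
    sample = '{"name": "nemo-evaluator-zlhn1ecd"}'
    for command in follow_up_commands(job_name_from_submit(sample)):
        print(command)
```

The helpers only build command strings; running them, and deciding when the job has completed, is left to the caller because the output format of `nemo jobs get-status` is not described here.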
Creator's repository · promptingcompany/nv-skills
License: Apache-2.0