momentic-result-classification

---
name: momentic-result-classification
description:
  Classify or explain Momentic test run results using Momentic MCP tools.
  Use when the user asks to categorize a failure, understand why a run failed,
  triage test results, or compare run results to past run results.
---

# Momentic result classification (MCP)

Momentic is an end-to-end testing framework in which each test is composed of browser interaction steps. Each step combines Momentic-specific behavior (AI checks, natural-language locators, AI actions, etc.) with Playwright capabilities wrapped in our YAML step schema. Running these tests produces results data that can be used to analyze the outcome of the test. The results data contains metadata about the run as well as any assets generated by the run (e.g. screenshots, logs, network requests, video recordings). Your job is to use these test results to classify failures that occurred in Momentic test runs.

## Instructions

1. Given a failing test run, identify the earliest point where the current run entered a bad state. Do not stop at the final failing assertion or missing locator target.
2. Explain the root cause at action/state level: what the step tried to do, what specific element or state it relied on, what actually happened, and what evidence proves it.
3. Bucket the failure into one of the categories below, explaining the reasoning for choosing the specific category.

## Helpful MCP tools

`momentic_get_run` — Returns some metadata about the run and a summary of the full run results. Use the metadata to help you parse through the run results (e.g. which attempt to look at, which step failed, etc.). If the current run details were already supplied in the initial context, do not call this again for that same run unless you explicitly need a different attempt.

`momentic_list_runs` — Recent runs for a test so you can compare the result of past runs over time. **Always pass `gitBranchName` when it exists on the run in question** so that it's more likely you're looking at the same version of the test. Pass `recovered=true` when you want to inspect recovered runs.

`momentic_get_step_result` — Returns the result of a specific step, with other information such as full step trace and before/after screenshots. Use `parentStepIdChain` for steps nested inside other steps. Only request `includeTrace=true` when you need it, because it can be very large.

`momentic_get_test_steps_for_run` — Returns the simplified test steps recorded on a run (`stepsSnapshot`, `beforeStepsSnapshot`, `afterStepsSnapshot`). You can use this to understand the intent of the test if you need more information than what you can glean from the test name and description.

## Investigation workflow

Start with the current run before relying on history.

1. Call `momentic_get_run` and identify the failing attempt, section (`beforeSteps`, main steps, or `afterSteps`), failing step, and any `parentStepIdChain`.
2. Pull the failing step result with screenshots and trace. If the step is nested, also pull the nearest parent container or module result.
3. Decide whether the failing step's before-screenshot is the correct baseline for that action. If it is already wrong, walk backward through the current run until you find the step/container that produced that bad state.
4. For repeated modules or repeated workflows, compare invocations inside the same current run before comparing older runs. The later failure is often caused by an earlier invocation that succeeded, recovered, or left an invalid postcondition.
5. Treat successful containers with failed or recovered child steps as partial failures until you inspect the container's final after-screenshot and URL.
6. Use past runs only for specific comparison questions once the current-run behavior is understood.
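
The backward walk described in step 3 can be sketched as follows. This is a minimal illustration, not Momentic code: `StepObservation` and its `baseline_ok` / `postcondition_ok` flags are hypothetical stand-ins for the judgments you make from each step's before/after screenshots and URL.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StepObservation:
    """Hypothetical record of what the investigator concluded about one step."""
    name: str
    baseline_ok: bool       # before-screenshot shows the state the step expects
    postcondition_ok: bool  # after-screenshot shows the state the step should produce


def earliest_divergence(steps: Sequence[StepObservation], failing_index: int) -> Optional[int]:
    """Walk backward from the failing step to the first step that broke the state."""
    i = failing_index
    # While a step already started from a bad baseline, the cause lies earlier.
    while i > 0 and not steps[i].baseline_ok:
        i -= 1
    # The candidate must itself have left a broken postcondition;
    # otherwise the observations are inconsistent and need another look.
    return i if not steps[i].postcondition_ok else None
```

For example, if a click left the app on the wrong page, a broad assertion then passed, and a later type step failed, the function returns the index of the click rather than the index of the type step.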

Before classifying, be able to answer:

- What is the test's intended behavior?
- What is the earliest divergent step/container?
- What did that step intend to do?
- Which element/state did it actually interact with or observe?
- What changed in the screenshot, URL, DOM, trace, or recovery log after the step?
- Why is the later failure a consequence of that earlier divergence?

Avoid vague root causes such as "setup was unreliable" or "the page was in the wrong state." Name the broken postcondition directly: for example, "the row-level plus button was clicked, but the app stayed on the parent page instead of opening the child-page editor; the following global `Add to` assertion passed against unrelated page text, so the untargeted type step never entered the child title."

## Evidence standards

- Screenshots are the default truth source for page state. Use trace fields and DOM/HTML to explain why the screenshot changed or did not change.
- Verify every causal claim. Do not say an overlay, side peek, modal, or menu was present unless the relevant before/after screenshot, URL, or DOM proves it.
- Separate "the target is missing now" from "the browser is in the state where that target should exist." A missing target is often a symptom of an earlier failed action.
- For click/type/action steps, record the intended action, actual interacted element when available, before/after URL, and whether the expected UI state appeared.
- For assertions, check whether the assertion is scoped enough to prove the intended state. A broad page-content assertion can pass for unrelated text.
- For recovery, inspect both the failed child step and the recovered container final state. Recovery can pass a retried assertion while leaving state that later steps did not expect.

## Background

### Test run result structure

When Momentic tests are run via the CLI, the results are stored in a "run group". The data for this run group is stored in a single directory within the Momentic project. By default, the directory is called `test-results`, but it can be changed in the Momentic project settings or for a single run of a run group. The run group results folder has the following structure:

```
test-results/
├── metadata.json       data about the run group, including git metadata and timing info.
└── runs/               one zip for each test run in the run group.
    ├── <runId_1>.zip   a zipped run directory containing data about this specific test run. Follows the structure described below.
    └── <runId_2>.zip
```

When unzipped, run directories have the following structure:

```
<runId>/
├── metadata.json                  run-level metadata.
└── attempts/<n>/                  one folder per attempt (1-based n).
    ├── metadata.json              attempt outcome and step results.
    ├── console.json               optional browser console output.
    └── assets/
        ├── <snapshotId>.jpeg      before/after screenshot for each step (see attempt metadata.json for snapshot ID).
        ├── <snapshotId>.html      before/after DOM snapshot for each step (see attempt metadata.json for snapshot ID).
        ├── har-pages.log          HAR pages (ndjson).
        ├── har-entries.log        HAR network entries (ndjson).
        ├── resource-usage.ndjson  CPU/memory samples taken during the attempt.
        ├── <videoName>            video recording (when video recording is enabled).
        └── browser-crash.zip      browser crash dump (only present on crash).
```
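
If you need to read a run group directly from disk, a minimal sketch might look like the following. It relies only on the file names shown above. Whether each archive contains a top-level `<runId>/` folder is an assumption, so the search is recursive, and the JSON is returned as-is because the field names inside `metadata.json` are not described here. `extract_runs` and `load_attempts` are illustrative helper names.

```python
import json
import zipfile
from pathlib import Path


def extract_runs(results_dir: str, out_dir: str) -> list[Path]:
    """Unzip every archive under <results_dir>/runs/ into <out_dir>/<zip stem>/."""
    extracted = []
    for archive in sorted(Path(results_dir, "runs").glob("*.zip")):
        target = Path(out_dir, archive.stem)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


def load_attempts(run_dir: Path) -> dict[int, dict]:
    """Map attempt number to the parsed attempts/<n>/metadata.json."""
    attempts = {}
    # rglob tolerates an extra top-level <runId>/ folder inside the archive (assumption).
    for meta in run_dir.rglob("attempts/*/metadata.json"):
        attempts[int(meta.parent.name)] = json.loads(meta.read_text(encoding="utf-8"))
    return dict(sorted(attempts.items()))
```

Only extract archives you trust, since `extractall` writes whatever paths the zip contains.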

When getting run results via the Momentic MCP, tools such as `momentic_get_run` will return links to the MCP working directory (default `.momentic-mcp`). This directory will contain unzipped run result folders, following the structure above, named `run-result-<runId>`.
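
As a rough sketch, the unzipped folder for a run can be located and its ndjson assets read line by line. The helper names `mcp_run_dir` and `read_ndjson` are illustrative, and the shape of each ndjson record is not specified here, so records are yielded as plain dictionaries.

```python
import json
from pathlib import Path
from typing import Iterator


def mcp_run_dir(run_id: str, workdir: str = ".momentic-mcp") -> Path:
    """Path of the unzipped results folder for one run in the MCP working directory."""
    return Path(workdir) / f"run-result-{run_id}"


def read_ndjson(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line.

    Applies to har-pages.log, har-entries.log, and resource-usage.ndjson.
    """
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Reading lazily matters here because HAR entry logs can be large; iterate and filter rather than loading the whole file.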

### Element locators

Certain step types that interact with elements have a "target" property, or **locator**, that specifies which element the step should interact with.

#### Locator caches

Locators identify elements by sending the page state (HTML/XML) and a screenshot to an LLM. The LLM identifies which element on the page the user is referring to. Momentic attempts to cache the LLM's answer so that future runs don't require AI calls. On future runs, the page state is checked against the cached element to determine whether the element is still usable or the page has changed enough that another AI call is required.

A locator cache can bust for a variety of reasons:

- the element description has changed, in which case we'll always bust the cache
- the cached element could not be located in the current page state
- the cached element was located in the page state, but fails certain checks specified on the cache entry, such as requiring a certain position, shape, or content.
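
The three bust conditions above can be read as an ordered check. The sketch below is purely illustrative and is not Momentic's implementation: `CacheEntry`, the `page` dictionary, and the returned reason strings are all hypothetical and are not real `cacheBustReason` values.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CacheEntry:
    """Hypothetical cache entry; not Momentic's real cache schema."""
    description: str
    element_id: str
    # Checks such as required position, shape, or content.
    checks: list[Callable[[dict], bool]] = field(default_factory=list)


def cache_bust_reason(entry: CacheEntry, description: str, page: dict[str, dict]) -> Optional[str]:
    """Return why the cache must be busted, or None if the cached element is usable."""
    if entry.description != description:
        return "description changed"  # a changed description always busts the cache
    element = page.get(entry.element_id)
    if element is None:
        return "cached element not found"
    if not all(check(element) for check in entry.checks):
        return "cached element failed checks"
    return None
```

Note that a `None` result only means the cache is technically usable; it says nothing about whether the cached element is the one the user intended.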

You can find the `cacheBustReason` on the `trace` property in the results for a given step, but only when you explicitly request `includeTrace=true`. The `cache` property is also listed on the results, showing the full cache saved for that element.

#### Identifying bad caches

Sometimes the element that was cached is not the element that the user intended to target. This can cause failures or unexpected behaviors in tests. In these cases, it helps to verify exactly why the wrong cache was saved in the first place. Only request `includeTrace=true` for these cache-debugging cases or when you suspect incorrect Momentic execution data. Use the `runId` property of the `targetUpdateLoggerTags` on the incorrect cache to get the details of the original run, calling `momentic_get_run` with this runId. This will return the run where the cache target was updated.
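
A small helper for pulling that run ID out of a step result might look like this. It assumes the step result is a JSON object with `cache` as a top-level key and `targetUpdateLoggerTags` nested directly under it; that nesting is an assumption, so verify it against a real response before relying on it.

```python
from typing import Any, Mapping, Optional


def cache_update_run_id(step_result: Mapping[str, Any]) -> Optional[str]:
    """Return cache.targetUpdateLoggerTags.runId, or None when any level is missing."""
    cache = step_result.get("cache") or {}
    tags = cache.get("targetUpdateLoggerTags") or {}
    return tags.get("runId")
```

The returned value is what you would pass to `momentic_get_run` to inspect the run that saved the cache.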

### Module caching

Cached modules skip executing their steps when the module cache key and resolved inputs are unchanged, and reuse the cached return value from the module's last step.
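
Conceptually this behaves like memoization keyed on the module cache key plus the resolved inputs. The sketch below illustrates that idea only; `ModuleCache` and its `run` method are hypothetical and say nothing about how Momentic actually stores or invalidates module caches.

```python
import json
from typing import Any, Callable


class ModuleCache:
    """Illustrative memo keyed by module cache key plus resolved inputs."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Any] = {}

    def run(self, cache_key: str, inputs: dict[str, Any],
            steps: list[Callable[[dict[str, Any]], Any]]) -> Any:
        # Serialize inputs deterministically so equal inputs give equal keys.
        key = (cache_key, json.dumps(inputs, sort_keys=True))
        if key in self._store:
            return self._store[key]  # cache hit: no step executes
        result = None
        for step in steps:
            result = step(inputs)  # the last step's value is the module's return value
        self._store[key] = result
        return result
```

The practical consequence for triage is that on a cache hit none of the module's steps touch the browser, so page state after a cached module can differ from page state after an uncached one.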

Authentication modules can also save and restore browser auth state from the module cache, including cookies, localStorage, and IndexedDB. They may use a page-content check after restoring auth state to decide whether the cache is still valid.

### File uploads

A file upload step prepares one file for the next native file picker, so it must run before the action that opens the picker.

Sources can be remote URLs, `file://` references to earlier downloads, CLI-local paths, or uploaded user files. The step can also override the presented filename, and Momentic wires the prepared file into the browser's file chooser handling.

## Using past runs

Past runs are comparison evidence, not a substitute for reconstructing the current run. Use them when the current run does not answer:

- When did this test start failing?
- What differed vs the last passing run?
- Did the same action behave differently on an earlier run?
- Is this a test weakness, an application change, a real application bug, or a temporary slowdown?

Use step results and screenshots on past runs to answer these questions. Do NOT rely only on summaries from `momentic_get_run` or `momentic_list_runs` to understand what happened in a test run. Look at the specific run details, including step results and screenshots, before citing a past run as evidence.

When looking at past runs, use the following workflow:

1. Call the `momentic_list_runs` tool to identify the runs you want more detail on. Always pass `gitBranchName` when it exists on the run in question.
2. Call `momentic_get_run` for that specific run to get the run details.
3. Call `momentic_get_step_result` for the same step/container or closest equivalent you are comparing, especially for screenshots.
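
The three calls above can be sketched with an injected `call_tool` function standing in for whatever MCP client you use. The tool names, `gitBranchName`, and `recovered` come from this document; the `testId` and `stepId` parameter names and the response shapes are assumptions, which is why choosing a run is left to a caller-supplied `pick_run`.

```python
from typing import Any, Callable, Optional

ToolCall = Callable[[str, dict[str, Any]], dict[str, Any]]


def list_runs_args(test_id: str, git_branch_name: Optional[str],
                   recovered: bool = False) -> dict[str, Any]:
    """Build momentic_list_runs arguments; `testId` is a hypothetical parameter name."""
    args: dict[str, Any] = {"testId": test_id}
    if git_branch_name:  # always pass the branch when the current run has one
        args["gitBranchName"] = git_branch_name
    if recovered:
        args["recovered"] = True
    return args


def fetch_past_step(call_tool: ToolCall, test_id: str, branch: Optional[str],
                    pick_run: Callable[[dict[str, Any]], str],
                    step_id: str) -> tuple[dict[str, Any], dict[str, Any]]:
    """List runs, pick one, then fetch its run details and the matching step result."""
    runs = call_tool("momentic_list_runs", list_runs_args(test_id, branch))
    run_id = pick_run(runs)  # response shape is unknown here, so the caller decides
    run = call_tool("momentic_get_run", {"runId": run_id})
    step = call_tool("momentic_get_step_result", {"runId": run_id, "stepId": step_id})
    return run, step
```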

When past runs are irrelevant because the current run already proves the root cause, say that briefly instead of forcing historical evidence.

### Multi-attempt runs

When `momentic_list_runs` shows a passing run with `attempts > 1`, treat it as a partial failure worth investigating, not a clean passing run. Use the `attemptNumber` parameter to retrieve earlier failed attempt results for that run to understand what was going wrong before the retry succeeded.
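
A simple filter for spotting these runs is sketched below. The `attempts` field is the one referenced above; the `status` key and the `"PASSED"` value are assumptions about the response shape and should be adjusted to match what `momentic_list_runs` actually returns.

```python
from typing import Any, Iterable


def retried_passes(runs: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return passing runs that needed more than one attempt."""
    return [
        run for run in runs
        # `status` / "PASSED" are hypothetical; `attempts` defaults to 1 when absent.
        if run.get("status") == "PASSED" and run.get("attempts", 1) > 1
    ]
```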

### Flakiness and intermittent failures

- In order to consider a test flaky or failing intermittently, it must be intermittently failing for the same app and test behavior.
- Just because a test failed once does not mean that it is flaky. It could have failed because of an application change.
- You need to determine whether there was an application or test change between runs by analyzing the screenshots and other run data.
- You cannot make assumptions about flakiness or intermittent failures without verifying whether an application or test change caused the failure.
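
The rule above can be expressed as a small check: a test counts as flaky only when runs that share the same application state and the same test version have mixed outcomes. `RunEvidence` and its labels are hypothetical; they represent conclusions you reach by inspecting screenshots and step data, not fields returned by any tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical summary of one run after manual inspection."""
    passed: bool
    app_state: str     # investigator's label for the application version/UI observed
    test_version: str  # investigator's label for the version of the test that ran


def is_flaky(runs: Sequence[RunEvidence]) -> bool:
    """True only if some (app_state, test_version) group has both a pass and a failure."""
    outcomes: dict[tuple[str, str], set[bool]] = {}
    for run in runs:
        outcomes.setdefault((run.app_state, run.test_version), set()).add(run.passed)
    return any(len(seen) == 2 for seen in outcomes.values())
```

A pass on one application state followed by a failure on a different one does not satisfy the check, which matches the point that a single failure after an application change is not flakiness.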

### Test temporality

- Past results may not match today’s test file. The test may have changed, meaning the result was produced by a different version of the test.
- You can call `momentic_get_test_steps_for_run` to help you determine whether the test itself changed between runs, although note that this tool returns a _summary_ of each test step. If you suspect that specific details on certain steps have changed between test runs, full step details are included in the response from `momentic_get_step_result`; only request `includeTrace=true` when those fields and screenshots still are not enough.

## Common failure modes to watch for

- A setup module appears to pass but leaves the wrong page, overlay, filter, search, selected row, or side peek open. Classify from the step/container that left the bad postcondition, not only from the next step that failed.
- A click reports success and targets the intended element, but the application does not transition to the intended state. Verify the post-state; do not assume the click worked because the locator was correct.
- A weak global assertion such as "page contains X" passes because unrelated text on the page matches. The next step may then type or click in the wrong context.
- A type step without a specific target can silently type nowhere useful if the preceding action failed to focus the intended field.
- A locator or cache can be technically valid but semantically wrong. Check the interacted element and, for bad caches, inspect the original cache-update run from `targetUpdateLoggerTags.runId`.
- A recovered step can hide the first failure. Inspect failed child steps inside recovered modules and compare the recovered final state to the next step's expected baseline.
- A timeout is not automatically `INFRA`. First rule out missing data, wrong page state, changed app flow, bad locator/assertion, and setup failure.

## Identifying related vs unrelated issues

- Use test name, description, and, if needed, the simplified test steps returned by `momentic_get_test_steps_for_run` to determine what the test is intending to verify.
- Failures outside that intent are unrelated; otherwise, consider them related.
- Any failures in setup (`beforeSteps` or `beforeResults`) or teardown (`afterSteps` or `afterResults`) are almost always considered unrelated.
- Related vs unrelated changes only apply to bugs and changes. For example, an `INFRA` failure is still `INFRA` regardless of whether it is in setup or the main section.

## Bug vs change

- Bug: something very clearly went wrong when it should not have, such as an error message appearing. It is obvious just by looking at a single step or two that this is a bug.
- Change: a clear change in the application behavior that you can prove through screenshots.

## Recoverability

Along with the category, determine one recoverability value:

- `RECOVERABLE` — The failure can be automatically fixed by updating the test itself so that future runs pass.
  - Examples: an application change that requires a test update; vague locators or assertions that can be rewritten to pass stably.
- `ONE_TIME_RECOVERABLE` — The failure can be recovered for this specific run without persisting a test change.
  - Examples: a random modal that can be dismissed without affecting test purpose; a temporary delay where waiting or retrying would likely succeed.
- `NON_RECOVERABLE` — The failure cannot be automatically addressed and requires manual intervention.
  - Examples: missing credentials; missing local files required for upload; outages likely caused by third-party systems where test steps cannot fix the issue.

## Formal classification output

- Exactly one category id — no new labels, no multi-label.
- Ground your decision in data. Be sure that you've fully investigated the run before assigning the category.
- Prefer human-readable references over UUIDs when the step/module can be identified colloquially: `module create-subpage-under-parent-page`, `the last invocation of module <name>`, `substep 4 (0-indexed)`, `the failed setup assertion`, etc. Tool calls still require exact IDs, but final reasoning should be readable.
- When referencing past runs in final output, use clickable Momentic URLs rather than bare UUIDs: `https://app.momentic.ai/runs/<runId>`. Do not shorten UUIDs inside those URLs.
- The reasoning must include the earliest divergent step/container and the broken postcondition it produced, not just the final failing step.

```text
Reasoning: <a few sentences tied to the earliest divergence, screenshots/traces, past runs if used, and test intent>
Category: <one id from the list>
Recoverable: <RECOVERABLE | ONE_TIME_RECOVERABLE | NON_RECOVERABLE>
Confidence: <high | medium | low>
```

Confidence levels:

- `high` — direct evidence, such as a clear screenshot of a label change or crash
- `medium` — strong inference from multiple signals but no single conclusive screenshot or data point
- `low` — ambiguous evidence; the classification required significant inference or the root cause is unclear

## Category ids

Use these strings verbatim:

- `NO_FAILURE` — The run had no failures; all attempts passed.
- `APPLICATION_CHANGE` — The test is out of date because the application's flow or UI has changed; updating the test to match the new behavior would permanently fix the failure.
- `BUG` — Something clearly went wrong in the application that shouldn't have, such as an error message appearing or expected content failing to render.
- `TEST_AUTHORSHIP` — The test can be permanently updated to prevent the failure while still validating its original intent, and you can recommend a specific authorship change such as adding or modifying a step, rewriting a vague assertion, or making a locator description more specific. If you cannot name a concrete change, choose a different category. Timeouts, slow page loads, and any failure whose recommended fix is to "wait longer" or to increase a timeout are NOT authorship issues — those are `INFRA`, even when the test could technically be edited to wait longer.
  - Examples: race conditions that can be fixed by adding or modifying steps **other than** waits/timeouts (e.g. replacing a "type with pressEnter" step with an explicit "select from list" step so the test no longer races the application); vague assertions or locator descriptions that can be rewritten to be more specific.
- `TEST_SETUP` — Missing test data or files necessary to run the test, where the fix requires user action outside of the test itself.
  - Examples: missing file for a file upload step; missing or incorrect credentials needed by the test.
- `INFRA` — The failure was unrelated to the application or application code and was caused by an infrastructure outage, long load times, or some other issue due to outside factors.
  - Examples: browser crash; high resource usage; rate limiting; a step or assertion that timed out waiting for the page or application to reach a slow-but-eventual state.
- `MOMENTIC_ISSUE` — Some issue occurred with the execution of the test or Momentic data was incorrect (e.g. cache is wrong, global locator redirect did something weird, AI hallucinations).
  - Examples: unexpected behavior when viewing the run trace; the AI clearly misread or hallucinated data that is unambiguous in the screenshot, and no reasonable test alternative exists to avoid the AI step.
- `OTHER` — The failure doesn't fit any of the other categories.
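
If you assemble the final output programmatically, a small validator helps guarantee that only the verbatim ids are emitted. The sketch below uses the category, recoverability, and confidence strings and the run URL pattern from this document; the `Classification` class and `run_url` helper are illustrative names, not part of any Momentic API.

```python
from dataclasses import dataclass

CATEGORIES = {
    "NO_FAILURE", "APPLICATION_CHANGE", "BUG", "TEST_AUTHORSHIP",
    "TEST_SETUP", "INFRA", "MOMENTIC_ISSUE", "OTHER",
}
RECOVERABILITY = {"RECOVERABLE", "ONE_TIME_RECOVERABLE", "NON_RECOVERABLE"}
CONFIDENCE = {"high", "medium", "low"}


def run_url(run_id: str) -> str:
    """Clickable URL for a past run; the full UUID is kept unshortened."""
    return f"https://app.momentic.ai/runs/{run_id}"


@dataclass
class Classification:
    reasoning: str
    category: str
    recoverable: str
    confidence: str

    def __post_init__(self) -> None:
        # Reject anything that is not one of the verbatim ids.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.recoverable not in RECOVERABILITY:
            raise ValueError(f"unknown recoverability: {self.recoverable}")
        if self.confidence not in CONFIDENCE:
            raise ValueError(f"unknown confidence: {self.confidence}")

    def render(self) -> str:
        """Format the four-line classification output."""
        return (
            f"Reasoning: {self.reasoning}\n"
            f"Category: {self.category}\n"
            f"Recoverable: {self.recoverable}\n"
            f"Confidence: {self.confidence}"
        )
```

Validation catches invented labels and multi-label values early, but it cannot check that the reasoning names the earliest divergent step; that still has to come from the investigation.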

Source

Creator's repository · momentic-ai/skills
