video-analysis

---
name: video-analysis
version: 1.1.0
description: |
  Video understanding for any model — native passthrough for small files,
  frame extraction + audio transcription fallback for large files.

  Use when the user asks to analyze, describe, or understand a video file
  (e.g. "what's in this video", "summarize this clip", "transcribe this recording").
metadata:
  starchild:
    emoji: "🎥"
    skillKey: video-analysis
delivery: script
user-invocable: true
disable-model-invocation: false
---

# Video Analysis

Analyze video files using either **native model understanding** or **frame extraction + transcription**.

## How It Works

```
analyze_video(path, question)
      │
      ├─ file_size ≤ threshold (default 20MB)
      │     → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
      │     → Model sees full video natively (best quality)
      │
      └─ file_size > threshold
            → ffmpeg extracts keyframes (scene detection for long videos)
            → Whisper transcribes audio track
            → Returns frame image paths + transcript text
            → Agent feeds these to the current chat model
```
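The branch in the diagram comes down to a single size comparison. A minimal sketch, assuming the threshold is measured in units of 1024 × 1024 bytes; `choose_mode` and `choose_mode_for_file` are hypothetical helper names, not part of the skill's exports:

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: the skill may use 1_000_000 instead


def choose_mode(size_bytes: int, native_size_limit_mb: float = 20) -> str:
    """Return "native" when the file fits under the threshold, else "extraction"."""
    if size_bytes <= native_size_limit_mb * BYTES_PER_MB:
        return "native"
    return "extraction"


def choose_mode_for_file(path: str, native_size_limit_mb: float = 20) -> str:
    """Same decision, reading the file size from disk."""
    return choose_mode(os.path.getsize(path), native_size_limit_mb)
```

Because the comparison is `≤`, a file of exactly 20 MB still goes to native mode, and a threshold of 0 sends every non-empty file to extraction.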

## Quick Start

⚠️ **Invocation — do NOT use dotted imports.** The directory name contains a
hyphen (`video-analysis`), so `from skills.video-analysis.exports import ...`
is a **Python syntax error** (`-` is parsed as minus). This is true for every
hyphenated skill, not just this one. Use one of the two patterns below.

**Pattern A — from the skill directory (recommended for scripts):**
```bash
cd /data/workspace/skills/video-analysis && \
  python3 -c "from exports import analyze_video; \
    import json; \
    print(json.dumps(analyze_video('output/videos/clip.mp4', \
      question='What happens in this video?'), ensure_ascii=False))"
```
Note: pass the video path **workspace-relative** (analyze.py resolves it
against `WORKSPACE_DIR`), even though you cd into the skill dir.

**Pattern B — inside a starchild-clawd script:**
```python
from core.skill_tools import video_analysis
result = video_analysis.analyze_video("output/videos/clip.mp4",
                                      question="What happens in this video?")
```

❌ **Do NOT** `exec(open('skills/video-analysis/analyze.py').read())` — analyze.py
uses `__file__` at import time, which is undefined under `exec`, so it crashes.
Load it by file path with `importlib.util.spec_from_file_location` if you must
avoid both patterns above.
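If neither pattern is available, loading by file path could look like the sketch below. It uses only the standard-library `importlib.util` API. The helper name `load_module_from_path` and the module name in the usage comment are illustrative choices, and the sketch assumes analyze.py does not rely on sibling modules being importable from `sys.path`.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(path, module_name):
    """Import a .py file by path so that __file__ is set, unlike exec()."""
    path = Path(path).resolve()
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)
    return module


# Usage, run from the workspace root (path taken from the warning above):
# analyze = load_module_from_path("skills/video-analysis/analyze.py",
#                                 "video_analysis_analyze")
```

The module name passed in can be any valid identifier; it does not have to match the hyphenated directory name, which is what makes this route work where a dotted import cannot.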

```python
# Both patterns return the same dict; analyze_video picks native or
# extraction mode automatically.
#
# result keys:
#   success: bool
#   mode: "native" | "extraction"
#
# If mode == "native":
#   analysis: str (model's text response)
#   model: str (which model was used)
#   tokens: {input, output, video, audio}
#
# If mode == "extraction":
#   frame_paths: list[str] (workspace-relative paths to keyframe JPEGs)
#   transcript: str | None (Whisper transcription text)
#   frame_count: int
#   duration_sec: float
```

## Using the Exports

```python
from core.skill_tools import video_analysis

# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")

# Check current config
config = video_analysis.get_config()

# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}
```

## Native Mode (small videos)

For videos at or under the size threshold, the skill sends the full video to a model
that supports native video input. The model sees every frame and hears the audio.

**Default model:** `google/gemini-3.1-flash-lite` — best price/quality for video.

**Model benchmark** (6MB clip, vs `gemini-3.1-pro-preview` baseline):

| Model                       | Tier   | Cost     | Time  | Accuracy | Notes                          |
|-----------------------------|--------|----------|-------|----------|--------------------------------|
| google/gemini-3.1-flash-lite | budget | ~$0.0014 | 8.1s  | ~88%     | ⭐ Default — cheapest + fastest |
| google/gemini-3.5-flash     | std    | ~$0.0152 | 11.8s | ~85%     | More detail, higher cost       |
| qwen/qwen3.6-plus           | budget | ~$0.0058 | 44.2s | ~95%     | Accurate but slow              |
| qwen/qwen3.6-flash          | budget | ~$0.0027 | 16.6s | ~80%     | Misreads subjects sometimes    |
| google/gemini-3.1-pro-preview | std  | ~$0.0199 | 19.7s | 100%     | Baseline (best, most expensive)|

`gemini-3.1-flash-lite` identifies the full scene, action sequence, and transitions
correctly at ~14x lower cost than the Pro baseline. For maximum accuracy
(exact character names, fine detail), set `default_model` to
`google/gemini-3.1-pro-preview` in `config/video-analysis.yaml`. In this
benchmark `google/gemini-3.5-flash` gave more descriptive detail but did not
score higher than flash-lite on accuracy.
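The ~14x figure follows directly from the per-clip costs in the benchmark table. A quick check, using only the approximate costs listed there (the function name is illustrative):

```python
# Approximate per-clip costs in USD, copied from the benchmark table.
COST_USD = {
    "google/gemini-3.1-flash-lite": 0.0014,
    "google/gemini-3.5-flash": 0.0152,
    "qwen/qwen3.6-plus": 0.0058,
    "qwen/qwen3.6-flash": 0.0027,
    "google/gemini-3.1-pro-preview": 0.0199,
}

BASELINE = "google/gemini-3.1-pro-preview"


def cost_ratio_vs_baseline(model: str) -> float:
    """How many times cheaper `model` is than the Pro baseline."""
    return COST_USD[BASELINE] / COST_USD[model]


for name in COST_USD:
    print(f"{name}: {cost_ratio_vs_baseline(name):.1f}x cheaper than baseline")
```

These ratios apply to the single 6MB benchmark clip; costs for other clips will vary with length and resolution.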

## Extraction Mode (large videos)

For videos over the size threshold, the skill extracts keyframes and transcribes audio:

- **Short videos (≤60s):** One frame every N seconds (default: 2s)
- **Long videos (>60s):** Scene-change detection picks visually distinct frames
- **Audio:** Extracted and sent to Whisper for transcription
- **Max frames:** Capped at 30 (configurable) to control cost

The agent receives frame image paths and transcript text, then feeds them
to the current chat model as image attachments + context text.
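The two frame strategies map onto standard ffmpeg filters. The sketch below builds an argument list for each case using the real `fps` and `select` filters and the `-frames:v` output option. The skill's actual command line is not shown in this document, so the helper name, the output pattern, and the way the frame cap is applied are all assumptions.

```python
def build_frame_extraction_args(video_path, out_pattern, duration_sec,
                                interval_sec=2, scene_threshold=0.3,
                                max_frames=30):
    """Build an ffmpeg argument list for the short- or long-video strategy."""
    if duration_sec <= 60:
        # Fixed-rate sampling: one frame every `interval_sec` seconds.
        video_filter = f"fps=1/{interval_sec}"
        extra = []
    else:
        # Scene-change detection: keep frames whose scene score exceeds the threshold.
        video_filter = f"select='gt(scene,{scene_threshold})'"
        extra = ["-vsync", "vfr"]  # newer ffmpeg builds prefer "-fps_mode vfr"
    return ["ffmpeg", "-i", video_path, "-vf", video_filter, *extra,
            "-frames:v", str(max_frames), out_pattern]


# Example: a 10-minute recording takes the scene-detection branch.
args = build_frame_extraction_args("talk.mp4", "frames/frame_%03d.jpg", 600)
```

Passing the list to `subprocess.run(args, check=True)` would run it. Note that `-frames:v` keeps the first `max_frames` matches rather than spreading them across the video, which may differ from how the skill enforces its cap.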

## Configuration

Edit **`config/video-analysis.yaml`** (in the workspace) to customize. This file
is created automatically on first use, only needs the keys you want to override,
and **survives skill updates**.

> Do NOT edit `skills/video-analysis/config.yaml` — that's the factory default
> and is overwritten on every skill auto-update. The user file overlays it.

Both the standalone skill and the chat "send a video" flow read this same config,
so one edit changes the model everywhere. Available keys:

```yaml
# Model for native video understanding
default_model: google/gemini-3.1-flash-lite

# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Raise it (e.g. 100) to send files up to that size natively.
native_size_limit_mb: 20

# Frame extraction settings
extraction:
  max_frames: 30                  # Max keyframes to extract
  short_video_interval_sec: 2     # Frame interval for ≤60s videos
  scene_threshold: 0.3            # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true          # Whether to Whisper-transcribe audio
```
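Since the user file only needs the keys being overridden, the effective config is the factory defaults with the user values laid on top. A minimal sketch of that overlay, assuming nested sections such as `extraction` are merged key by key rather than replaced wholesale (the document does not say which); `overlay_config` is a hypothetical name:

```python
import copy

FACTORY_DEFAULTS = {
    "default_model": "google/gemini-3.1-flash-lite",
    "native_size_limit_mb": 20,
    "extraction": {
        "max_frames": 30,
        "short_video_interval_sec": 2,
        "scene_threshold": 0.3,
        "transcribe_audio": True,
    },
}


def overlay_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, merging nested mappings key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A user file that only lowers the scene threshold:
user_overrides = {"extraction": {"scene_threshold": 0.15}}
effective = overlay_config(FACTORY_DEFAULTS, user_overrides)
```

Under this assumption, `effective["extraction"]` keeps `max_frames: 30` from the defaults while picking up the new threshold. If the skill instead replaces whole sections, a partial `extraction:` block would need every key spelled out.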

### Available Video Models

| Model                         | Alias    | Tier     | Notes              |
|-------------------------------|----------|----------|--------------------|
| google/gemini-3.1-flash-lite  | flash31  | budget   | ⭐ Default, cheapest, best price/quality |
| google/gemini-3.5-flash       | gemini35 | standard | More detail, higher cost |
| google/gemini-3.1-pro-preview | gemini   | standard | Highest quality    |
| qwen/qwen3.6-flash            | qwenf    | budget   | Good alternative   |
| qwen/qwen3.6-plus             | qwen     | budget   | —                  |
| minimax/minimax-m3             | mm3      | standard | —                  |
| meta-llama/llama-4-maverick    | maverick | standard | —                  |
| meta-llama/llama-4-scout       | scout    | budget   | —                  |
| xiaomi/mimo-v2.5               | mimo     | standard | —                  |
| z-ai/glm-5v-turbo             | glm5v    | standard | —                  |
| minimax/minimax-m2.7           | mm27     | budget   | Audio-only, no image |

## Agent Behavior

When the user provides a video file (via upload or file path) and the current
chat model does NOT support video:

1. Call `analyze_video(path, question)`.
2. If result mode is `"native"` → return `result["analysis"]` directly.
3. If result mode is `"extraction"` → use `result["frame_paths"]` as image
   references and `result["transcript"]` as context, then ask the current
   model to analyze based on the frames + transcript.
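The steps above can be sketched as a small dispatch on `result["mode"]`. The result keys come from the Quick Start section; the returned `action` / `images` / `text` shape and the name `build_followup` are hypothetical, since this document does not specify how the agent attaches images to a chat turn.

```python
def build_followup(result: dict, question: str) -> dict:
    """Turn an analyze_video result into what the agent should do next."""
    if not result.get("success"):
        return {"action": "report_error", "text": "Video analysis failed."}
    if result["mode"] == "native":
        # The video-capable model already answered; pass its text through.
        return {"action": "reply", "text": result["analysis"]}
    # Extraction mode: hand frames and transcript to the current chat model.
    transcript = result.get("transcript") or "(no transcript available)"
    prompt = (
        f"{question}\n\n"
        f"The video is {result['duration_sec']:.1f}s long; "
        f"{result['frame_count']} keyframes are attached in order.\n\n"
        f"Audio transcript:\n{transcript}"
    )
    return {"action": "ask_model", "images": list(result["frame_paths"]), "text": prompt}
```

Handling a `None` transcript explicitly matters because extraction mode returns `transcript: None` when no transcription is available.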

When the current model DOES support video, the backend handles it natively
via Phase 1 (base64 content block injection) — no need for this skill.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| "File not found" | Check path is workspace-relative (e.g. `output/videos/x.mp4`) |
| Native mode returns error | Check `default_model` in config/video-analysis.yaml is valid |
| No audio transcription | Video may have no audio track; check `has_audio` from `get_video_info()` |
| Too few frames extracted | Lower `scene_threshold` in config/video-analysis.yaml (e.g. 0.15) |
| Too many frames / high cost | Reduce `max_frames` or raise `scene_threshold` |
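For the "File not found" case, checking the path before calling the skill gives a clearer error. A minimal sketch, assuming the workspace root is known to the caller (Quick Start uses `/data/workspace`); `resolve_workspace_path` is a hypothetical helper, stricter than the skill itself may be, and `Path.is_relative_to` requires Python 3.9 or later:

```python
from pathlib import Path


def resolve_workspace_path(relative_path: str, workspace_dir: str) -> Path:
    """Resolve a workspace-relative path and fail early with a clear message."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        raise ValueError(f"expected a workspace-relative path, got {relative_path!r}")
    workspace = Path(workspace_dir).resolve()
    resolved = (workspace / candidate).resolve()
    if not resolved.is_relative_to(workspace):
        raise ValueError(f"{relative_path!r} escapes the workspace")
    if not resolved.is_file():
        raise FileNotFoundError(f"no such file in workspace: {relative_path}")
    return resolved


# Usage:
# resolve_workspace_path("output/videos/x.mp4", "/data/workspace")
```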

Source

Creator's repository · starchild-ai-agent/official-skills
