Auto-tune a skill's performance over time

Runs repeated tests on a skill, detects where it fails, and automatically rewrites its prompt or logic to improve accuracy without manual intervention.
Best for: Engineers building skills who want the system to learn from its own mistakes.
Engineering / skill-authoringmetafor-engineersexecutionneeds-integration
Skill file

Preview skill file↓↑
---
name: hermes-agent-self-evolution
description: Evolutionary self-improvement for Hermes Agent using DSPy + GEPA to optimize skills, prompts, and code
triggers:
  - evolve a hermes agent skill
  - optimize hermes agent prompts with GEPA
  - run self-evolution on agent skills
  - improve hermes agent with evolutionary search
  - generate evaluation data for skill optimization
  - use DSPy to optimize agent capabilities
  - run GEPA optimizer on tool descriptions
  - automate hermes agent improvement
---

# Hermes Agent Self-Evolution

> Skill by [ara.so](https://ara.so) — AI Agent Skills collection.

Hermes Agent Self-Evolution is an evolutionary optimization framework for [Hermes Agent](https://github.com/NousResearch/hermes-agent) that uses DSPy + GEPA (Genetic-Pareto Prompt Evolution) to automatically improve agent skills, tool descriptions, system prompts, and code through reflective evolutionary search. No GPU training required — everything operates via API calls with execution trace analysis.

## Installation

```bash
# Clone the repository
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution

# Install with development dependencies
pip install -e ".[dev]"

# Set environment variables
export HERMES_AGENT_REPO=~/.hermes/hermes-agent  # Path to your hermes-agent repo
export OPENAI_API_KEY=your_openai_api_key        # For DSPy/GEPA optimizer
```

## Key Commands

### Evolving Skills

```bash
# Basic skill evolution with synthetic eval data
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 10 \
    --eval-source synthetic

# Use real session history from agent databases
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 10 \
    --eval-source sessiondb

# Specify custom hermes-agent repo path
python -m evolution.skills.evolve_skill \
    --skill web-scraping \
    --iterations 5 \
    --hermes-repo /path/to/hermes-agent

# Use specific model for optimization
python -m evolution.skills.evolve_skill \
    --skill data-analysis \
    --iterations 8 \
    --model gpt-4-turbo
```

### Generating Evaluation Data

```bash
# Generate synthetic evaluation dataset for a skill
python -m evolution.eval.generate_eval_data \
    --skill github-code-review \
    --num-examples 50 \
    --output eval_data/github-code-review.json

# Extract real usage examples from session database
python -m evolution.eval.extract_sessions \
    --skill web-scraping \
    --min-quality 0.7 \
    --output eval_data/web-scraping-real.json
```

## Core Concepts

### Evolution Pipeline

The optimization process follows this flow:

1. **Read current artifact** (skill file, prompt, tool description)
2. **Generate evaluation dataset** (synthetic or from real sessions)
3. **GEPA optimization loop**:
   - Generate candidate variants through mutation
   - Execute variants against eval dataset
   - Analyze execution traces for failure patterns
   - Propose targeted improvements
   - Apply constraint gates (tests, size limits, benchmarks)
4. **Select best variant** based on multi-objective criteria
5. **Create PR** against hermes-agent repository

### Evaluation Sources

- **`synthetic`**: LLM-generated scenarios based on skill description
- **`sessiondb`**: Real usage traces from Claude Code, Copilot, Hermes sessions
- **`benchmark`**: Predefined test suites (SWE-bench, HumanEval, etc.)

## Python API

```python
from evolution.skills.optimizer import SkillOptimizer
from evolution.eval.dataset import EvalDataset

# Initialize optimizer
optimizer = SkillOptimizer(
    hermes_repo_path="~/.hermes/hermes-agent",
    model="gpt-4-turbo",
    temperature=0.7
)

# Load or generate evaluation data
eval_dataset = EvalDataset.from_synthetic(
    skill_name="github-code-review",
    num_examples=30
)

# Run optimization
result = optimizer.optimize(
    skill_name="github-code-review",
    eval_dataset=eval_dataset,
    iterations=10,
    population_size=8
)

print(f"Original score: {result.baseline_score}")
print(f"Optimized score: {result.best_score}")
print(f"Improvement: {result.improvement_pct}%")

# Access the optimized skill content
optimized_skill = result.best_variant.content

# Create PR with changes
result.create_pull_request(
    title="[Self-Evolution] Optimize github-code-review skill",
    branch="evolution/github-code-review"
)
```

### Custom Evaluation Functions

```python
from evolution.eval.metrics import EvalMetric

class TaskSuccessRate(EvalMetric):
    """Custom metric for task completion rate."""
    
    def evaluate(self, prediction, example):
        """Score prediction against ground truth."""
        # Extract success indicators from execution trace
        completed_steps = self.count_completed_steps(prediction.trace)
        total_steps = len(example.expected_steps)
        
        return {
            "score": completed_steps / total_steps,
            "metadata": {
                "completed": completed_steps,
                "total": total_steps
            }
        }

# Use custom metric
optimizer = SkillOptimizer(
    metrics=[TaskSuccessRate(), "accuracy", "response_quality"]
)
```

### Constraint Gates

```python
from evolution.constraints import ConstraintGate

# Define custom constraints
constraints = [
    ConstraintGate.max_size(15_000),  # 15KB max for skill files
    ConstraintGate.test_suite_passes(),  # Must pass pytest
    ConstraintGate.no_semantic_drift(threshold=0.85),  # Preserve purpose
    ConstraintGate.benchmark_threshold("swe_bench_lite", min_score=0.6)
]

optimizer = SkillOptimizer(constraints=constraints)
```

## Configuration

Create `evolution_config.yaml` in your project root:

```yaml
# Model configuration
model:
  provider: openai
  name: gpt-4-turbo
  temperature: 0.7
  max_tokens: 4096

# GEPA optimizer settings
gepa:
  population_size: 8
  mutation_rate: 0.3
  crossover_rate: 0.5
  elite_count: 2
  max_generations: 10

# Evaluation settings
evaluation:
  default_source: synthetic
  num_examples: 30
  min_quality_threshold: 0.7
  use_execution_traces: true

# Constraints
constraints:
  max_skill_size: 15360  # 15KB
  max_tool_description: 500  # chars
  require_test_pass: true
  semantic_drift_threshold: 0.85

# Repository paths
paths:
  hermes_agent: ~/.hermes/hermes-agent
  eval_data: ./eval_data
  results: ./evolution_results

# PR automation
pull_requests:
  auto_create: false
  branch_prefix: evolution/
  require_approval: true
  assignees: []
```

Load configuration:

```python
from evolution.config import load_config

config = load_config("evolution_config.yaml")
optimizer = SkillOptimizer.from_config(config)
```

## Common Patterns

### Batch Optimization

Optimize multiple skills in sequence:

```python
from evolution.batch import BatchOptimizer

skills = [
    "github-code-review",
    "web-scraping",
    "data-analysis",
    "api-integration"
]

batch = BatchOptimizer(
    skills=skills,
    hermes_repo_path="~/.hermes/hermes-agent",
    iterations=10
)

results = batch.run(
    eval_source="synthetic",
    parallel=True,
    max_workers=4
)

# Generate comparison report
batch.generate_report("evolution_results/batch_report.html")
```

### Continuous Evolution Pipeline

Set up automated continuous improvement:

```python
from evolution.continuous import ContinuousPipeline
from datetime import timedelta

pipeline = ContinuousPipeline(
    hermes_repo_path="~/.hermes/hermes-agent",
    check_interval=timedelta(days=7)
)

# Watch for new session data
pipeline.watch_sessions(
    min_sessions=100,  # Wait for 100 new sessions
    quality_threshold=0.7
)

# Run evolution when triggered
pipeline.on_trigger(
    callback=lambda skill: optimizer.optimize(skill),
    create_pr=True
)

pipeline.start()
```

### A/B Testing Variants

Compare multiple evolved variants:

```python
from evolution.testing import ABTest

# Generate multiple variants
variants = optimizer.evolve_variants(
    skill_name="github-code-review",
    num_variants=5,
    iterations=10
)

# Run A/B test
ab_test = ABTest(
    baseline=original_skill,
    variants=variants,
    eval_dataset=eval_dataset
)

results = ab_test.run(num_trials=100)

# Statistical significance testing
winner = results.get_winner(confidence=0.95)
print(f"Winner: {winner.variant_id} (p={winner.p_value})")
```

## Real-World Examples

### Example 1: Optimize GitHub Code Review Skill

```python
from evolution.skills.optimizer import SkillOptimizer
from evolution.eval.dataset import EvalDataset

# Initialize
optimizer = SkillOptimizer(
    hermes_repo_path="~/.hermes/hermes-agent",
    model="gpt-4-turbo"
)

# Generate synthetic eval data based on skill description
eval_dataset = EvalDataset.from_synthetic(
    skill_name="github-code-review",
    num_examples=40,
    scenarios=[
        "review PR with security vulnerabilities",
        "review PR with performance issues",
        "review PR with style violations",
        "review PR with breaking changes"
    ]
)

# Run optimization
result = optimizer.optimize(
    skill_name="github-code-review",
    eval_dataset=eval_dataset,
    iterations=15,
    constraints=[
        "max_size:15000",
        "test_suite_passes",
        "no_semantic_drift:0.85"
    ]
)

# Review improvements
print(f"Baseline: {result.baseline_score:.2f}")
print(f"Optimized: {result.best_score:.2f}")
print(f"\nKey improvements:")
for improvement in result.improvements:
    print(f"  - {improvement}")

# Create PR if improvement is significant
if result.improvement_pct > 10:
    result.create_pull_request(
        title="[Self-Evolution] Optimize github-code-review skill (+{:.1f}%)".format(
            result.improvement_pct
        ),
        branch="evolution/github-code-review-v2"
    )
```

### Example 2: Use Real Session Data

```python
from evolution.eval.sessions import SessionExtractor

# Extract high-quality sessions from agent database
extractor = SessionExtractor(
    session_db_path="~/.hermes/sessions.db"
)

real_examples = extractor.extract(
    skill_name="web-scraping",
    min_quality=0.8,
    min_session_length=5,  # At least 5 turns
    max_examples=50
)

# Convert to eval dataset
eval_dataset = EvalDataset.from_sessions(real_examples)

# Optimize using real user data
result = optimizer.optimize(
    skill_name="web-scraping",
    eval_dataset=eval_dataset,
    iterations=10
)
```

### Example 3: Multi-Objective Optimization

```python
from evolution.objectives import MultiObjective

# Define multiple objectives
objectives = MultiObjective([
    ("task_success", weight=0.5),
    ("response_quality", weight=0.3),
    ("efficiency", weight=0.2)
])

result = optimizer.optimize(
    skill_name="data-analysis",
    eval_dataset=eval_dataset,
    objectives=objectives,
    pareto_optimal=True  # Use Pareto frontier selection
)

# Visualize trade-offs
result.plot_pareto_frontier(
    x_axis="task_success",
    y_axis="efficiency",
    output="pareto_analysis.png"
)
```

## Troubleshooting

### Optimization Stalls

If evolution isn't improving scores:

```python
# Increase mutation diversity
optimizer = SkillOptimizer(
    mutation_rate=0.5,  # More aggressive mutations
    temperature=0.9     # More creative variations
)

# Or use trace-guided mutations
optimizer.enable_trace_analysis(
    focus_on_failures=True,
    extract_error_patterns=True
)
```

### Constraint Failures

Check which constraints are failing:

```python
result = optimizer.optimize(skill_name="github-code-review")

for variant in result.rejected_variants:
    print(f"Variant {variant.id} failed:")
    for constraint, passed in variant.constraint_results.items():
        if not passed:
            print(f"  ✗ {constraint}: {variant.constraint_errors[constraint]}")
```

### API Rate Limits

Handle rate limiting gracefully:

```python
from evolution.utils import RateLimiter

optimizer = SkillOptimizer(
    rate_limiter=RateLimiter(
        max_requests_per_minute=50,
        backoff_strategy="exponential"
    )
)
```

### Evaluation Dataset Quality

Validate dataset before optimization:

```python
from evolution.eval.validation import DatasetValidator

validator = DatasetValidator()
issues = validator.validate(eval_dataset)

if issues:
    print("Dataset issues found:")
    for issue in issues:
        print(f"  - {issue}")
    
    # Auto-fix common issues
    eval_dataset = validator.auto_fix(eval_dataset)
```

## Integration with Hermes Agent

Evolved skills automatically integrate with Hermes Agent:

```bash
# After optimization, test the evolved skill
cd $HERMES_AGENT_REPO
git checkout evolution/github-code-review

# Run hermes with the evolved skill
hermes --skill github-code-review "Review this PR: https://github.com/..."

# Compare with baseline
hermes-benchmark compare \
    --baseline main \
    --variant evolution/github-code-review \
    --skill github-code-review
```

## Cost Estimation

Typical costs per optimization run:

- **Skill optimization** (10 iterations, 30 examples): ~$2-5
- **Tool description** (5 iterations, 20 examples): ~$1-2
- **System prompt** (15 iterations, 50 examples): ~$5-10

Reduce costs:

```python
optimizer = SkillOptimizer(
    model="gpt-3.5-turbo",  # Cheaper model
    cache_predictions=True,  # Reuse evaluations
    early_stopping=True      # Stop if no improvement
)
```
Source

Creator's repository · aradotso/ai-agent-skills
View on GitHub ↗
Security

Security checks in progress
Results will appear here once audits complete
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk