trl-training

Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO and Reward Model training via CLI commands.

Skill file

Preview skill file
---
name: trl-training
description: Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO and Reward Model training via CLI commands.
license: Apache-2.0
metadata:
  version: "1.0.0"
  author: huggingface
  commands:
    - trl sft
    - trl dpo
    - trl grpo
    - trl kto
    - trl rloo
    - trl reward
  categories:
    - machine-learning
    - llm-training
    - reinforcement-learning
  tags:
    - rlhf
    - supervised-fine-tuning
    - dpo
    - grpo
    - huggingface
    - transformers
  documentation: https://huggingface.co/docs/trl/en/clis
---

# TRL Training Skill

You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.

## Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

- **SFT** (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- **DPO** (Direct Preference Optimization): Align models using preference data
- **GRPO** (Group Relative Policy Optimization): Train models by ranking multiple sampled outputs relative to each other and optimizing based on their comparative rewards.
- **RLOO** (Reinforce Leave One Out): Online RL training with generation-based rewards
- **Reward Model Training**: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

## Core Commands

### trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

**Full training:**

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```

**Train with LoRA adapters:**

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```

### trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

**Full training:**

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns
```

**Train with LoRA adapters:**

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
```

### trl grpo - Group Relative Policy Optimization

Train models using reward functions or LLM-as-a-judge for evaluating generations and providing rewards.

**Basic usage:**

```bash
trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2-0.5B-GRPO \
  --push_to_hub
```

### trl rloo - Reinforce Leave One Out

Online RL training where the model generates text and receives rewards based on custom criteria.

**Basic usage:**

```bash
trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2-0.5B-RLOO \
  --push_to_hub
```

### trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.

**Full training:**

```bash
trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048
```

**Train with LoRA adapters:**

```bash
trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
```

## Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.

**Example config (sft_config.yaml):**

```yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
```

**Launch with config:**

```bash
trl sft --config sft_config.yaml
```

**Override config values:**

```bash
trl sft --config sft_config.yaml --learning_rate 1.0e-5
```

## Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.

**Multi-GPU training:**

```bash
trl sft \
  --config sft_config.yaml \
  --num_processes 4
```

**Use predefined Accelerate configs:**

TRL provides predefined configs: `single_gpu`, `multi_gpu`, `fsdp1`, `fsdp2`, `zero1`, `zero2`, `zero3`

```bash
trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2
```

**Custom Accelerate config:**

```bash
# Generate custom config
accelerate config

# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
```

**Fully Sharded Data Parallel (FSDP):**

```bash
trl sft --config sft_config.yaml --accelerate_config fsdp2
```

**DeepSpeed ZeRO:**

```bash
trl sft --config sft_config.yaml --accelerate_config zero3
```

## Troubleshooting

### CUDA Out of Memory

- Reduce `--per_device_train_batch_size` and increase `--gradient_accumulation_steps`
- Enable `--use_peft` for LoRA training
- Use `--gradient_checkpointing` to save memory
- Try smaller model or longer sequence truncation

### Dataset Loading Issues

- Verify dataset exists: check Hugging Face Hub or local path
- Check dataset format matches expected columns
- Use `--dataset_config` for multi-config datasets
- Inspect dataset: `from datasets import load_dataset; ds = load_dataset(name)`

### Model Loading Issues

- Verify model exists on Hugging Face Hub
- Check if gated model requires authentication: `hf auth login`
- For local models, provide absolute path
- Ensure sufficient disk space and memory

### Slow Training

- Enable dataset `--packing` for short sequences
- Use larger `--per_device_train_batch_size` if memory allows
- Enable `--tf32` for faster computation on Ampere GPUs
- Use `--bf16` on supported hardware
- Consider multi-GPU training with `--num_processes`

### Generation Issues (GRPO/RLOO)

- Check prompt format in dataset
- Adjust `--temperature` and `--top_p` for generation
- Verify the reward function (for GRPO/RLOO)

## Additional Resources

- **Documentation**: https://huggingface.co/docs/trl
- **GitHub**: https://github.com/huggingface/trl
- **Examples**: https://github.com/huggingface/trl/tree/main/examples

## Best Practices

1. **Start with SFT**: Always fine-tune base models with SFT before preference alignment
2. **Use LoRA for efficiency**: Enable `--use_peft` for faster training and lower memory
3. **Monitor training**: Use `--report_to trackio` (or `--report_to wandb` or `--report_to tensorboard`) for tracking
4. **Save checkpoints**: TRL automatically saves checkpoints in `--output_dir`
5. **Test on small datasets first**: Verify pipeline works before full training
6. **Use configuration files**: Create YAML configs for reproducibility
7. **Leverage Accelerate**: Use multi-GPU training for faster iteration

When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features

Source

Creator's repository · huggingface/skills

View on GitHub

License: Apache-2.0

Security

Security checks in progress
Results will appear here once audits complete
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk