Reinforcement Learning Fine-tuning with EasyR1

This experiment demonstrates how to use FlexAI to fine-tune language models with reinforcement learning (RL) via EasyR1, a framework for training reasoning-capable models with the GRPO (Group Relative Policy Optimization), DAPO, and REINFORCE algorithms.

For illustration purposes, we'll fine-tune the Qwen2.5-7B-Instruct model on mathematical reasoning tasks using the math12k dataset with the GRPO algorithm to improve reasoning capabilities.


Note: If you haven't already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.

Quick Start

Run GRPO training on Qwen2.5-7B with this single command:

flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct

Replace <YOUR_WANDB_API_KEY> and <HF_AUTH_TOKEN_SECRET_NAME> with your actual values.

What is EasyR1?

EasyR1 is a reinforcement learning framework specifically designed for training language models with enhanced reasoning capabilities. It implements several RL algorithms optimized for LLMs:

  • GRPO (Group Relative Policy Optimization): Efficient policy optimization using group-based advantage estimation
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): GRPO variant with decoupled clipping and dynamic sample filtering for more stable training
  • REINFORCE: Classic policy gradient method for LLM fine-tuning

The framework is built on top of verl (Volcano Engine Reinforcement Learning for LLMs), providing distributed training capabilities with FSDP and vLLM integration.
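
At the core of GRPO is group-relative advantage estimation: for each prompt, the rollout workers sample a group of responses, and each response's reward is normalized against the group's mean and standard deviation instead of against a learned value function. A minimal illustration of the idea (not EasyR1's internal code):

# Sketch of GRPO's group-relative advantage estimation (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """rewards: scores for the n rollouts sampled from one prompt (e.g. n=5)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 5 rollouts for one math problem, rewarded 1.0 when the answer is correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0, 0.0]))

Because the baseline comes from the group itself, GRPO avoids training a separate critic, which keeps memory requirements closer to those of the actor and reference models alone.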

Directory Structure

The code/easyR1/ directory contains:

  • config.yaml - Main GRPO training configuration
  • format_prompt/ - Jinja templates for prompt formatting
  • reward_function/ - Custom reward scoring functions

For baseline training scripts and additional examples, refer to the EasyR1 GitHub repository.

Step 1: Understand the Configuration

EasyR1 uses a comprehensive YAML configuration file that controls all aspects of RL training. The main configuration file is located at code/easyR1/config.yaml in this repository.

Key Configuration Sections

Data Configuration

data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  format_prompt: ./code/easyR1/format_prompt/math.jinja
  max_prompt_length: 2048
  max_response_length: 2048
  rollout_batch_size: 512

Algorithm Settings

algorithm:
  adv_estimator: grpo  # GRPO, DAPO, or REINFORCE
  use_kl_loss: true
  kl_coef: 1.0e-2

Worker Configuration

worker:
  actor:
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
    optim:
      lr: 1.0e-6
  rollout:
    n: 5  # number of rollout samples per prompt
    temperature: 1.0
  reward:
    reward_type: batch
    reward_function: ./code/easyR1/reward_function/math.py:compute_score
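
The reward_function entry uses a <path/to/file>.py:<function_name> convention: a Python file path followed by the name of the callable to invoke. Conceptually it resolves like this (a simplified sketch for intuition, not EasyR1's actual loader):

import importlib.util

def load_reward_function(spec: str):
    """Resolve a '<path/to/file>.py:<function_name>' spec into a callable (illustrative sketch)."""
    path, func_name = spec.rsplit(":", 1)
    module_spec = importlib.util.spec_from_file_location("custom_reward", path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)

# Example (run from the repository root):
# compute_score = load_reward_function("./code/easyR1/reward_function/math.py:compute_score")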

Step 2: Reference Baseline Examples

For pre-configured training scripts and baseline examples, refer to the EasyR1 repository, which provides baseline configurations for a range of models and tasks:

Available Baselines (in EasyR1 repo)

  • Mathematical Reasoning: qwen2_5_7b_math_grpo.sh, qwen3_4b_math_grpo.sh
  • Geometric Reasoning (Vision-Language): qwen2_5_vl_7b_geo3k_grpo.sh, qwen2_5_vl_7b_geo3k_dapo.sh, qwen2_5_vl_7b_geo3k_reinforce.sh
  • Multi-Image Tasks: qwen2_5_vl_7b_multi_image.sh

You can adapt these examples to work with FlexAI by following the training commands in this README.

Step 3: Customize Your Configuration

For your specific use case, you may want to create a custom configuration. Here's how to customize the config.yaml:

Custom Dataset

Replace the dataset configuration:

data:
  train_files: your-username/your-dataset@train
  val_files: your-username/your-dataset@test
  prompt_key: question  # adjust based on your dataset
  answer_key: solution  # adjust based on your dataset

Custom Reward Function

Create your own reward function in code/easyR1/reward_function/custom.py:

def compute_score(prompts, responses, answers):
    """
    Args:
        prompts: List of input prompts
        responses: List of model responses
        answers: List of ground truth answers

    Returns:
        List of reward scores (float)
    """
    scores = []
    for response, answer in zip(responses, answers):
        # Your custom reward logic here
        score = your_evaluation_function(response, answer)
        scores.append(score)
    return scores
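
For instance, a toy exact-match scorer matching this signature could look like the following (a sketch only; a production math scorer would normalize and extract the final answer before comparing):

def compute_score(prompts, responses, answers):
    """Toy exact-match reward: 1.0 if the ground-truth answer appears verbatim in the response."""
    scores = []
    for response, answer in zip(responses, answers):
        scores.append(1.0 if str(answer).strip() in response else 0.0)
    return scores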

Then update the config to reference your custom reward function:

worker:
  reward:
    reward_function: ./code/easyR1/reward_function/custom.py:compute_score

Custom Prompt Format

Create a custom Jinja template in code/easyR1/format_prompt/custom.jinja:

{{ problem }}

Please solve this step by step and provide your final answer.

Update the config:

data:
  format_prompt: ./code/easyR1/format_prompt/custom.jinja
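
To sanity-check a template before launching a job, you can render it locally with the jinja2 package (a minimal sketch; the template variable is assumed to match your dataset's prompt_key, as in the examples above):

# Render the custom prompt template with a sample problem (requires `pip install jinja2`).
from jinja2 import Template

with open("code/easyR1/format_prompt/custom.jinja") as f:
    template = Template(f.read())

print(template.render(problem="What is 12 * 8?"))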

Create Secrets

To access HuggingFace models and datasets, you need a HuggingFace token.

Use the flexai secret create command to store your HuggingFace Token as a secret:

flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>

Then paste your HuggingFace token value when prompted.

Optional: Pre-Fetch the Model

To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage:

  1. Create a HuggingFace storage provider:
flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
  2. Push the model checkpoint to your storage:
flexai checkpoint push qwen25-7b-instruct --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B-Instruct

Training

For RL training with EasyR1, we recommend using 1 node (8 × H100 GPUs) for 7B models to handle the actor, reference model, and rollout workers efficiently.

Repository Note: The commands below use this repository, which contains all necessary configuration files in the code/easyR1/ directory.

Standard Training: Mathematical Reasoning with GRPO

flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct

Note: Replace <YOUR_WANDB_API_KEY> with your actual Weights & Biases API key, or use --secret WANDB_API_KEY=<SECRET_NAME> if you've stored it as a FlexAI secret.

Training with Model Pre-Fetch

flexai training run grpo-prefetched \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --checkpoint qwen25-7b-instruct \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=/input-checkpoint/qwen25-7b-instruct

Training with Custom Configuration

To use a modified configuration or different dataset, override config values:

flexai training run grpo-custom \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct \
      data.train_files=your-username/your-dataset@train \
      data.val_files=your-username/your-dataset@test \
      trainer.experiment_name=custom-experiment

Monitoring Training Progress

You can check the status and lifecycle events of your Training Job:

flexai training inspect grpo

View the logs of your Training Job:

flexai training logs grpo

Training Observability with Weights & Biases

EasyR1 supports Weights & Biases (wandb) integration for detailed training metrics visualization. The configuration already includes wandb logging:

trainer:
  logger: ["file", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo

The training commands above pass WANDB_API_KEY as an environment variable. You can provide it in one of two ways:

Option 1: Use environment variable directly (as shown in the command)

--env WANDB_API_KEY=<YOUR_WANDB_API_KEY>

Option 2: Store as a FlexAI secret (more secure)

flexai secret create WANDB_API_KEY

Then use in your command:

--secret WANDB_API_KEY=<SECRET_NAME>

Getting Training Checkpoints

Once the Training Job completes successfully, you can list all produced checkpoints:

flexai training checkpoints grpo

Look for checkpoints marked as INFERENCE READY = true - these are ready for serving.

Serving the Trained Model

Deploy your RL-trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID from an inference-ready checkpoint:

flexai inference serve easyr1-reasoning-endpoint --checkpoint <CHECKPOINT_ID>

Monitor your inference endpoint status:

# List all inference endpoints
flexai inference list

# Get detailed endpoint information
flexai inference inspect easyr1-reasoning-endpoint

# Check endpoint logs
flexai inference logs easyr1-reasoning-endpoint

Testing Your RL-Trained Model

Once the endpoint is running, you can test it with reasoning tasks. For our mathematical reasoning example, the model should demonstrate improved step-by-step reasoning and accurate problem-solving.

Before and After Training Comparison

To illustrate the improvement from RL fine-tuning, here's a comparison using a math problem:

Problem: "If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?"

Base Model Response (Qwen2.5-7B-Instruct before RL training):

The average speed is 60 mph.

Issues: Correct answer but no reasoning steps shown

RL Fine-tuned Model Response (after GRPO training on math12k):

Let me solve this step by step:

Step 1: Identify the given information
- Distance traveled = 120 miles
- Time taken = 2 hours

Step 2: Apply the speed formula
Speed = Distance / Time

Step 3: Calculate
Speed = 120 miles / 2 hours = 60 miles per hour

Therefore, the average speed of the train is 60 mph.

Improvements: Clear reasoning steps, structured approach, educational value

This demonstrates how RL training encourages the model to show its reasoning process, making it more reliable and transparent.

Example API Call

curl -X POST "https://your-endpoint-url/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
    "max_tokens": 500,
    "temperature": 0.7
  }'
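
The same request can be sent from Python with the requests library (a sketch; the endpoint URL, API key, and response parsing assume an OpenAI-compatible completions API, as in the curl example above):

import requests

ENDPOINT_URL = "https://your-endpoint-url/v1/completions"  # replace with your endpoint
API_KEY = "YOUR_API_KEY"                                    # replace with your key

payload = {
    "prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm "
              "and a width of 8 cm. What is its area?",
    "max_tokens": 500,
    "temperature": 0.7,
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
# Assuming an OpenAI-style completions response:
print(response.json()["choices"][0]["text"])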

Expected Results

After RL fine-tuning with EasyR1, your model should achieve:

  • Enhanced Reasoning: Step-by-step problem-solving with clear explanations
  • Improved Accuracy: Higher success rate on reasoning tasks
  • Better Generalization: Ability to apply learned reasoning patterns to new problems
  • Structured Outputs: More organized and educational responses

For mathematical reasoning tasks:

  • Explicit Step-by-Step Solutions: Clear breakdown of problem-solving process
  • Higher Success Rate: Improved accuracy on math benchmarks
  • Better Error Detection: Ability to identify and correct mistakes

Technical Details

Training Configuration Breakdown

Reinforcement Learning Components

  • Actor Model: The model being trained (policy network)
  • Reference Model: Frozen copy for KL divergence computation
  • Rollout Workers: Generate multiple responses for each prompt (n=5)
  • Reward Function: Evaluates response quality (custom per task)

Distributed Training

  • FSDP (Fully Sharded Data Parallel): Efficient memory usage for large models
  • vLLM Integration: Fast inference during rollout generation
  • Tensor Parallelism: For rollout workers (size=2)

Optimization

  • GRPO Algorithm: Group-based advantage estimation for stable training
  • KL Penalty: Prevents the policy from drifting too far from the base model (see the schematic loss after this list)
  • Gradient Checkpointing: Reduces memory usage during backpropagation
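
Putting these pieces together, the actor update combines a clipped policy-ratio term weighted by the group-relative advantage with a KL penalty toward the frozen reference model. A schematic per-token loss (illustrative only, not EasyR1's exact implementation):

import torch

def grpo_token_loss(logp, logp_old, logp_ref, advantage, clip_eps=0.2, kl_coef=1.0e-2):
    """Schematic GRPO-style per-token loss: clipped policy ratio plus a KL penalty
    toward the frozen reference policy (illustrative sketch)."""
    ratio = torch.exp(logp - logp_old)                        # importance ratio vs. rollout policy
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantage, clipped * advantage)
    # k3 estimator of KL(pi || pi_ref), commonly used in GRPO-style training
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1.0
    return (policy_loss + kl_coef * kl).mean()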

Resource Requirements

Recommended Configuration for Qwen2.5-7B:

  • Nodes: 1 node (sufficient for RL training with actor + reference + rollout)
  • Accelerators: 8 × H100 GPUs per node
  • Memory: ~400GB+ GPU memory total (actor, reference, and rollout workers)
  • Training Time: ~8-12 hours for 15 epochs
  • Storage: ~50GB for checkpoints

Command Line Parameters Explained:

  • FORCE_TORCHRUN=1: Ensures proper distributed training setup
  • --runtime pytorch-28-vllm-0110-nvidia: PyTorch 2.8 with vLLM 0.11.0 optimized for EasyR1
  • --repository-url: Points to the FlexAI blueprints repository
  • config=code/easyR1/config.yaml: Main configuration file path relative to repository root

Key Configuration Parameters

Data Settings

  • rollout_batch_size: 512: Number of prompts per training iteration
  • max_prompt_length: 2048: Maximum input length
  • max_response_length: 2048: Maximum output length

Algorithm Settings

  • adv_estimator: grpo: Choice of RL algorithm
  • kl_coef: 1.0e-2: Strength of KL penalty
  • use_kl_loss: true: Enable KL divergence loss

Training Settings

  • total_epochs: 15: Number of training epochs
  • n_gpus_per_node: 8: GPUs per node
  • val_freq: 5: Validation every 5 epochs
  • save_freq: 5: Save checkpoint every 5 epochs

Scaling Options

  • For faster training: Increase to 2 nodes (16 × H100)
  • For larger models: Increase tensor_parallel_size for rollout
  • For better exploration: Increase rollout.n (more samples per prompt)
  • For memory efficiency: Enable CPU offloading (enable_cpu_offload: true)
  • For different tasks: Modify reward function and prompt templates

Advanced Examples

Vision-Language Model with Geometric Reasoning

For geometric reasoning with a vision-language model, switch to a VL checkpoint and the geometry3k dataset, along with the matching prompt template and reward function:

flexai training run grpo-VL-Geo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
      data.train_files=hiyouga/geometry3k@train \
      data.val_files=hiyouga/geometry3k@test \
      data.format_prompt=./code/easyR1/format_prompt/r1v.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/r1v.py:compute_score \
      trainer.experiment_name=qwen2_5_vl_7b_geo3k_grpo

Using DAPO Algorithm

For DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), override the algorithm settings along with the matching dataset, prompt template, and reward function:

flexai training run Dapo-14B \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen3-14B \
      algorithm.adv_estimator=dapo \
      algorithm.online_filtering=true \
      data.train_files=hiyouga/dapo17k@train \
      data.val_files=hiyouga/dapo17k@test \
      data.format_prompt=./code/easyR1/format_prompt/dapo.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/dapo.py:compute_score \
      trainer.experiment_name=qwen3_14b_dapo17k_dapo

Troubleshooting Common Issues

Training Job Fails to Start

# Check FlexAI authentication
flexai auth status

# Verify repository access
git clone https://github.com/flexaihq/blueprints

Out of Memory Errors

  • Reduce rollout_batch_size from 512 to 256
  • Reduce rollout.n from 5 to 3 (fewer samples per prompt)
  • Enable CPU offloading: enable_cpu_offload: true in FSDP config
  • Reduce tensor_parallel_size for rollout workers

Reward Function Errors

  • Verify reward function path is correct in config
  • Test the reward function locally before training (see the snippet after this list)
  • Ensure reward function returns float scores for all inputs
  • Check for NaN or infinite reward values
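
A quick way to test a reward function locally (as suggested above) is to call it directly with a few hand-written examples; the import path below is hypothetical, so adjust it to wherever your reward file lives:

import math
# Sanity-check a custom reward function before launching a job
# (assumes the compute_score signature shown earlier in this README).
from reward_function.custom import compute_score  # hypothetical import path

prompts = ["What is 2 + 2?"]
responses = ["Step 1: 2 + 2 = 4. The answer is 4."]
answers = ["4"]

scores = compute_score(prompts, responses, answers)
assert len(scores) == len(responses)
assert all(math.isfinite(float(s)) for s in scores), "rewards must be finite numbers"
print(scores)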

Checkpoint Not Inference Ready

  • Wait for training to complete fully
  • Set save_model_only: false in the config so that all necessary files are included
  • Verify training completed without errors

Endpoint Deployment Issues

  • Verify checkpoint shows INFERENCE READY = true status
  • Check FlexAI cluster availability
  • Review detailed logs with flexai inference logs <endpoint-name>

Dataset Loading Issues

  • Verify dataset path format: username/dataset@split
  • Ensure HuggingFace token has access to datasets
  • Check that prompt_key and answer_key match your dataset schema (see the snippet after this list)
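
To confirm the key names, you can inspect the dataset schema locally with the datasets library (a quick check; requires pip install datasets):

# Inspect the dataset columns to confirm the prompt_key / answer_key values.
from datasets import load_dataset

ds = load_dataset("hiyouga/math12k", split="train")
print(ds.column_names)  # should include the fields referenced by prompt_key / answer_key
print(ds[0])            # eyeball one example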

vLLM Rollout Errors

  • Adjust gpu_memory_utilization (default 0.6)
  • Reduce tensor_parallel_size if GPUs are insufficient

  • Enable enforce_eager: true for debugging
