This experiment demonstrates how to use FlexAI to fine-tune language models using reinforcement learning (RL) techniques with EasyR1, a framework for training reasoning-capable models using GRPO (Group Relative Policy Optimization), DAPO, and REINFORCE algorithms.
For illustration purposes, we'll fine-tune the Qwen2.5-7B-Instruct model on mathematical reasoning tasks using the math12k dataset with the GRPO algorithm to improve reasoning capabilities.
Note: If you haven't already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
Run GRPO training on Qwen2.5-7B with this single command:
flexai training run grpo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct

Replace <YOUR_WANDB_API_KEY> and <HF_AUTH_TOKEN_SECRET_NAME> with your actual values.
EasyR1 is a reinforcement learning framework specifically designed for training language models with enhanced reasoning capabilities. It implements several RL algorithms optimized for LLMs, including GRPO, DAPO, and REINFORCE.
The framework is built on top of verl (Volcano Engine Reinforcement Learning for LLMs), providing distributed training capabilities with FSDP and vLLM integration.
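As a rough illustration of the core idea behind GRPO (a simplified sketch, not EasyR1's actual implementation): each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value/critic model is needed. The group size and reward values below are made up for the example.

import statistics

def group_relative_advantages(rewards):
    """Simplified GRPO-style advantages: normalize each reward by the
    mean and standard deviation of its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero for uniform groups
    return [(r - mean) / std for r in rewards]

# Example: 5 responses sampled for one prompt (cf. worker.rollout.n below),
# scored 1.0 for a correct final answer and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0, 0.0]))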
The code/easyR1/ directory contains the training configuration (config.yaml), the dependency list (requirements.txt), prompt templates under format_prompt/, and reward functions under reward_function/.
For baseline training scripts and additional examples, refer to the EasyR1 GitHub repository.
EasyR1 uses a comprehensive YAML configuration file that controls all aspects of RL training. The main configuration file is located at code/easyR1/config.yaml in this repository.
data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  format_prompt: ./code/easyR1/format_prompt/math.jinja
  max_prompt_length: 2048
  max_response_length: 2048
  rollout_batch_size: 512

algorithm:
  adv_estimator: grpo # GRPO, DAPO, or REINFORCE
  use_kl_loss: true
  kl_coef: 1.0e-2

Worker Configuration

worker:
  actor:
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
    optim:
      lr: 1.0e-6
  rollout:
    n: 5 # number of rollout samples per prompt
    temperature: 1.0
  reward:
    reward_type: batch
    reward_function: ./code/easyR1/reward_function/math.py:compute_score
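To make the data section concrete, here is a purely illustrative sketch of how one math12k-style record maps onto prompt_key, answer_key, and format_prompt. The record contents and the template text are placeholders, and the exact variable name EasyR1 passes to the Jinja template may differ from what is shown here.

from jinja2 import Template

# One illustrative record: math12k rows carry the fields referenced by
# data.prompt_key ("problem") and data.answer_key ("answer") in config.yaml.
record = {"problem": "What is 12 * 9?", "answer": "108"}

# Stand-in for the template referenced by data.format_prompt.
format_prompt = Template("{{ problem }}\n\nPlease reason step by step and give the final answer.")

prompt = format_prompt.render(problem=record["problem"])  # truncated to max_prompt_length tokens
ground_truth = record["answer"]                           # later compared against rollouts by the reward function
print(prompt)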
For pre-configured training scripts and baseline examples, refer to the EasyR1 repository, which provides multiple baseline configurations for different models and tasks. You can adapt these examples to work with FlexAI by following the training commands in this README.
For your specific use case, you may want to create a custom configuration. Here's how to customize the config.yaml:
Replace the dataset configuration:
data:
  train_files: your-username/your-dataset@train
  val_files: your-username/your-dataset@test
  prompt_key: question # adjust based on your dataset
  answer_key: solution # adjust based on your dataset

Create your own reward function in code/easyR1/reward_function/custom.py:
def compute_score(prompts, responses, answers):
    """
    Args:
        prompts: List of input prompts
        responses: List of model responses
        answers: List of ground truth answers
    Returns:
        List of reward scores (float)
    """
    scores = []
    for response, answer in zip(responses, answers):
        # Your custom reward logic here
        score = your_evaluation_function(response, answer)
        scores.append(score)
    return scores

Then update the config to reference your custom reward function:
worker:
  reward:
    reward_function: ./code/easyR1/reward_function/custom.py:compute_score
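As a concrete illustration, a simple exact-match reward for numeric answers could look like the sketch below; the extraction regex and the 0/1 scoring scheme are assumptions you would adapt to your task and to the response format your prompts enforce.

import re

def compute_score(prompts, responses, answers):
    """Illustrative reward: 1.0 if the last number in the response matches
    the ground-truth answer, 0.0 otherwise."""
    scores = []
    for response, answer in zip(responses, answers):
        # Take the last number-like token in the model's response as its final answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        predicted = numbers[-1] if numbers else None
        try:
            correct = predicted is not None and float(predicted) == float(answer)
        except (TypeError, ValueError):
            correct = predicted is not None and predicted.strip() == str(answer).strip()
        scores.append(1.0 if correct else 0.0)
    return scores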
Create a custom Jinja template in code/easyR1/format_prompt/custom.jinja:

{{ problem }}
Please solve this step by step and provide your final answer.

Update the config:
data:
  format_prompt: ./code/easyR1/format_prompt/custom.jinja

To access HuggingFace models and datasets, you need a HuggingFace token.
Use the flexai secret create command to store your HuggingFace Token as a secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>

Then paste your HuggingFace token value when prompted.
To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage:
flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
flexai checkpoint push qwen25-7b-instruct --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B-Instruct

For RL training with EasyR1, we recommend using 1 node (8 × H100 GPUs) for 7B models to handle the actor, reference model, and rollout workers efficiently.
Repository Note: The commands below use this repository, which contains all necessary configuration files in the code/easyR1/ directory.
flexai training run grpo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct

Note: Replace <YOUR_WANDB_API_KEY> with your actual Weights & Biases API key, or use --secret WANDB_API_KEY=<SECRET_NAME> if you've stored it as a FlexAI secret.
flexai training run grpo-prefetched \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--checkpoint qwen25-7b-instruct \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=/input-checkpoint/qwen25-7b-instruct

To use a modified configuration or different dataset, override config values:
flexai training run grpo-custom \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct \
data.train_files=your-username/your-dataset@train \
data.val_files=your-username/your-dataset@test \
trainer.experiment_name=custom-experiment

You can check the status and lifecycle events of your Training Job:
flexai training inspect grpo

View the logs of your Training Job:
flexai training logs grpo

EasyR1 supports Weights & Biases (wandb) integration for detailed training metrics visualization. The configuration already includes wandb logging:
trainer:
  logger: ["file", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo

The WANDB_API_KEY is already included in the training command as an environment variable. You can either:
Option 1: Use environment variable directly (as shown in the command)
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY>

Option 2: Store as a FlexAI secret (more secure)
flexai secret create WANDB_API_KEY

Then use in your command:
--secret WANDB_API_KEY=<SECRET_NAME>

Once the Training Job completes successfully, you can list all produced checkpoints:
flexai training checkpoints grpo

Look for checkpoints marked as INFERENCE READY = true - these are ready for serving.
Deploy your RL-trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID from an inference-ready checkpoint:
flexai inference serve easyr1-reasoning-endpoint --checkpoint <CHECKPOINT_ID>

Monitor your inference endpoint status:
# List all inference endpoints
flexai inference list
# Get detailed endpoint information
flexai inference inspect easyr1-reasoning-endpoint
# Check endpoint logs
flexai inference logs easyr1-reasoning-endpoint

Once the endpoint is running, you can test it with reasoning tasks. For our mathematical reasoning example, the model should demonstrate improved step-by-step reasoning and accurate problem-solving.
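For example, assuming the endpoint exposes an OpenAI-compatible completions API (the URL and API key below are placeholders for your deployment's values), a quick Python check might look like this:

import requests

ENDPOINT_URL = "https://your-endpoint-url/v1/completions"  # placeholder
API_KEY = "YOUR_API_KEY"                                   # placeholder

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "prompt": "If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?",
        "max_tokens": 500,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])  # assumes the standard completions response shape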
To illustrate the improvement from RL fine-tuning, here's a comparison using a math problem:
Problem: "If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?"
Base Model Response (Qwen2.5-7B-Instruct before RL training):
The average speed is 60 mph.

Issues: Correct answer but no reasoning steps shown
RL Fine-tuned Model Response (after GRPO training on math12k):
Let me solve this step by step:
Step 1: Identify the given information
- Distance traveled = 120 miles
- Time taken = 2 hours
Step 2: Apply the speed formula
Speed = Distance / Time
Step 3: Calculate
Speed = 120 miles / 2 hours = 60 miles per hour
Therefore, the average speed of the train is 60 mph.

Improvements: Clear reasoning steps, structured approach, educational value
This demonstrates how RL training encourages the model to show its reasoning process, making it more reliable and transparent.
curl -X POST "https://your-endpoint-url/v1/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
"max_tokens": 500,
"temperature": 0.7
}'

After RL fine-tuning with EasyR1, your model should show clearer step-by-step reasoning and more accurate final answers on mathematical reasoning tasks.
Recommended Configuration for Qwen2.5-7B: start from the defaults in code/easyR1/config.yaml and run on 1 node with 8 accelerators, as in the commands above.
Command Line Parameters Explained:
- --accels and --nodes: the number of accelerators and nodes to allocate for the job
- --repository-url: the Git repository FlexAI pulls the training code from (via your code registry connection)
- --env: sets an environment variable in the training environment
- --secret: exposes a stored FlexAI secret to the job
- --requirements-path: the pip requirements file installed before training starts
- --runtime: the runtime image the job runs on
- Everything after -- is the command executed inside that runtime
For vision-language models, you'll need to use a VL model and the geometry3k dataset:
flexai training run grpo-VL-Geo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
data.train_files=hiyouga/geometry3k@train \
data.val_files=hiyouga/geometry3k@test \
data.format_prompt=./code/easyR1/format_prompt/r1v.jinja \
worker.reward.reward_function=./code/easyR1/reward_function/r1v.py:compute_score \
trainer.experiment_name=qwen2_5_vl_7b_geo3k_grpo

For DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), change the algorithm settings:
flexai training run Dapo-14B \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--env WANDB_API_KEY=<YOUR_WANDB_API_KEY> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen3-14B \
algorithm.adv_estimator=dapo \
algorithm.online_filtering=true \
data.train_files=hiyouga/dapo17k@train \
data.val_files=hiyouga/dapo17k@test \
data.format_prompt=./code/easyR1/format_prompt/dapo.jinja \
worker.reward.reward_function=./code/easyR1/reward_function/dapo.py:compute_score \
trainer.experiment_name=qwen3_14b_dapo17k_dapo
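For intuition, the algorithm.online_filtering=true override corresponds to DAPO-style dynamic sampling: prompt groups whose rollouts are all correct or all incorrect provide no group-relative learning signal and can be dropped from the batch. Below is a minimal sketch of that idea, not EasyR1's actual code.

def keep_prompt_group(group_rewards, low=0.0, high=1.0):
    """Illustrative DAPO-style filter: keep only groups with mixed outcomes."""
    mean_reward = sum(group_rewards) / len(group_rewards)
    return low < mean_reward < high  # drop all-correct and all-wrong groups

print(keep_prompt_group([1.0, 1.0, 1.0, 1.0, 1.0]))  # False: no signal to learn from
print(keep_prompt_group([1.0, 0.0, 1.0, 0.0, 0.0]))  # True: informative group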
Troubleshooting

# Check FlexAI authentication
flexai auth status
# Verify repository access
git clone https://github.com/flexaihq/blueprints

For debugging, enable enforce_eager: true in the rollout settings; this makes vLLM run in eager mode and surfaces clearer error traces.
