This blueprint demonstrates how to use FlexAI to fine-tune language models on domain-specific data with Axolotl, then deploy them as production-ready inference endpoints. For illustration, we'll fine-tune the Qwen2.5-7B model on the openhermes-fr dataset to strengthen its command of French.
The process involves configuring Axolotl's training parameters, running the job on FlexAI's managed training infrastructure, and deploying the fine-tuned model as a scalable inference endpoint.
Note: If you haven't already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
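For example:

# One-time setup: link your GitHub account so FlexAI can pull code by repository URL
flexai code-registry connect

Once connected, repositories are referenced directly through the --repository-url flag used in the training commands below.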
First, ensure your domain-specific dataset is properly configured in your Axolotl YAML file. For our French language example, we'll use the openhermes-fr dataset.
Navigate to code/axolotl/qwen2/fft-7b-french.yaml and verify the dataset configuration:
datasets:
  - path: legmlai/openhermes-fr
    type:
      system_prompt: ""
      field_instruction: prompt
      field_output: accepted_completion
      format: "{instruction}"
      no_input_format: "{instruction}"

For your own use case, replace this with your domain-specific dataset. The openhermes-fr dataset is specifically designed for French language tasks and serves as an excellent example of domain specialization.
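As a sketch of that swap (the dataset path and column names below are hypothetical placeholders, not part of this blueprint), you would keep the same prompt-format mapping and point field_instruction and field_output at your own columns:

datasets:
  - path: your-org/your-domain-dataset   # hypothetical Hugging Face dataset ID
    type:
      system_prompt: ""
      field_instruction: question        # column holding the input / instruction
      field_output: answer               # column holding the target completion
      format: "{instruction}"
      no_input_format: "{instruction}"

Any dataset format supported by Axolotl (for example its built-in alpaca type) can be used instead of this explicit field mapping.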
The qwen2/fft-7b-french.yaml file contains the training configuration for domain-specific fine-tuning. Key settings include:
base_model: Qwen/Qwen2.5-7B
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: legmlai/openhermes-fr
    type:
      system_prompt: ""
      field_instruction: prompt
      field_output: accepted_completion
      format: "{instruction}"
      no_input_format: "{instruction}"
dataset_prepared_path:
val_set_size: 0.05
output_dir: /output-checkpoint
sequence_len: 2048
sample_packing: true
eval_sample_packing: true
gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1.0e-5
bf16: auto
tf32: true
gradient_checkpointing: true
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

To access the Qwen2.5-7B model and OpenHermes-FR dataset, you need a HuggingFace token.
Use the flexai secret create command to store your HuggingFace Token as a secret. Replace <HF_AUTH_TOKEN_SECRET_NAME> with your desired name for the secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>

Then paste your HuggingFace token value when prompted.
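If the Training Job needs the token at runtime, for instance to pull the model or dataset from the Hugging Face Hub, you can attach the stored secret to the run with the --secret flag, following the same pattern used in the Llama 3 and Mistral examples later in this blueprint:

--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME>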
To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage. For example, to pre-fetch the Qwen/Qwen2.5-7B model:
flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
flexai checkpoint push qwen25-7b --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B

During your training run, you can use the pre-fetched model by adding the following argument to your training command:
--checkpoint qwen25-7b

For a 7B model, we recommend using 1 node (4 × H100 GPUs) to ensure reasonable training time and avoid out-of-memory issues.
flexai training run axolotl-french-sft \
--accels 4 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--requirements-path code/axolotl/requirements.txt \
-- axolotl train code/axolotl/qwen2/fft-7b-french.yaml

To take advantage of model pre-fetching performed in the Optional: Pre-Fetch the Model section, use:
flexai training run axolotl-french-sft-prefetched \
--accels 4 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--checkpoint qwen25-7b \
--env FORCE_TORCHRUN=1 \
--requirements-path code/axolotl/requirements.txt \
-- axolotl train code/axolotl/qwen2/fft-7b-french.yaml

You can check the status and lifecycle events of your Training Job by running:
flexai training inspect axolotl-french-sft

Additionally, you can view the logs of your Training Job by running:
flexai training logs axolotl-french-sft

For advanced monitoring and visualization of training metrics, Axolotl supports Weights & Biases (wandb) integration. You can leverage wandb logging for detailed insights into training progress, loss curves, and model performance.
To enable wandb logging, update your YAML configuration:
wandb_project: your-project-name
wandb_entity: your-wandb-entity
wandb_watch: gradients
wandb_name: qwen25-7b-sft
wandb_log_model: checkpoint

For additional monitoring, you can also access FlexAI's hosted TensorBoard instance, which provides organization-wide real-time insights into your training workload progress. TensorBoard is enabled by default for every organization - simply log in using your FlexAI account credentials to track metrics and visualize performance.
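For the wandb integration, note that logging also requires your wandb API key, which the wandb client reads from the WANDB_API_KEY environment variable. A minimal sketch, assuming the --secret flag exposes the value as an environment variable in the same way as the HF_TOKEN examples later in this blueprint (the secret name and job name below are placeholders):

# Store your wandb API key as a FlexAI secret
flexai secret create <WANDB_API_KEY_SECRET_NAME>
# Expose it to the training run
flexai training run axolotl-french-sft-wandb \
--accels 4 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--requirements-path code/axolotl/requirements.txt \
-- axolotl train code/axolotl/qwen2/fft-7b-french.yaml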
For more details on observability features, see the Axolotl documentation.
Once the Training Job completes successfully, you will be able to list all the produced checkpoints:
flexai training checkpoints axolotl-french-sft

Look for checkpoints marked as INFERENCE READY = true. These are ready for serving.
Deploy your trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID from an inference-ready checkpoint:
flexai inference serve axolotl-french-sft-endpoint --checkpoint <CHECKPOINT_ID>

Note: GPU specification for inference endpoints is currently managed automatically by FlexAI. Future versions will allow explicit GPU count specification for inference workloads to optimize cost and performance based on your specific requirements.
You can monitor your inference endpoint status:
# List all inference endpoints
flexai inference list
# Get detailed endpoint information
flexai inference inspect axolotl-french-sft-endpoint
# Check endpoint logs
flexai inference logs axolotl-french-sft-endpoint

Once the endpoint is running, you can test it with domain-specific prompts. For our French language example, the model should demonstrate strong French language understanding, proper grammar and syntax, and cultural context awareness.
To illustrate the improvement from fine-tuning on French data, here's a comparison using the question: "Qui a gagné la Coupe du monde 2018 ?" ("Who won the 2018 World Cup?")
Base Model Response (Qwen/Qwen2.5-7B before training):
La Coupe du monde de football 2018 a été remportée par la Russie.
("The 2018 football World Cup was won by Russia.")

Issues: Incorrect answer (says Russia instead of France)
Fine-tuned Model Response (after full fine-tuning on openhermes-fr):
La France a remporté la Coupe du monde de football 2018, en battant le Croatie lors de la finale disputée à Moscou le 15 juillet 2018.
("France won the 2018 football World Cup, beating Croatia in the final played in Moscow on July 15, 2018.")

Improvements: Correct answer (France), excellent French grammar, accurate details, proper structure
This example demonstrates the dramatic improvement in both factual accuracy and French language quality after domain-specific fine-tuning.
curl -X POST "https://your-endpoint-url/v1/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Expliquez-moi les avantages de l'\''intelligence artificielle en français:",
"max_tokens": 200,
"temperature": 0.7
}'

Adapt the prompt and evaluation criteria to match your specific domain and use case.
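For instance, to reproduce the World Cup comparison above against your own endpoint (same placeholder URL and API key as the request above; the lower temperature is only a suggested setting to make the factual check more repeatable):

curl -X POST "https://your-endpoint-url/v1/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Qui a gagné la Coupe du monde 2018 ?",
"max_tokens": 100,
"temperature": 0.2
}'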
After fine-tuning on domain-specific data, your model should perform noticeably better on your target domain. For our French language example, this means stronger French comprehension, correct grammar and syntax, more reliable factual answers to French-language prompts, and better awareness of cultural context.
Recommended Configuration for Qwen2.5-7B
Command Line Parameters Explained
Axolotl provides extensive configuration examples for various models and training strategies:

Llama 3 8B with LoRA
flexai training run axolotl-lora-llama3-8B \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/experiments \
--env FORCE_TORCHRUN=1 \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/axolotl/requirements.txt \
--runtime nvidia-25.06 \
-- axolotl train code/axolotl/llama-3/lora-8b.yml

Mistral 7B with QLoRA
flexai training run axolotl-qlora-mistral-7B \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/experiments \
--env FORCE_TORCHRUN=1 \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/axolotl/requirements.txt \
--runtime nvidia-25.06 \
-- axolotl train code/axolotl/mistral/qlora.yml

Explore the code/axolotl/ directory for more examples including Gemma, Phi, Qwen2, multimodal models, and advanced configurations.
# Check FlexAI authentication
flexai auth status
# Verify repository access
git clone https://github.com/flexaihq/experiments
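Once the repository is cloned, you can also browse the example configurations referenced throughout this blueprint:

ls experiments/code/axolotl/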