This blueprint provides a step-by-step guide for evaluating language models on FlexAI using the LM-Evaluation-Harness framework.
LM-Evaluation-Harness is a unified, extensible toolkit for few-shot evaluation of language models across hundreds of standardized NLP benchmarks.
In this guide, you'll learn how to:
- Run single- and multi-task evaluations with the lm_eval CLI on FlexAI
- Configure models, tasks, few-shot settings, and output options
- Authenticate with HuggingFace and pre-fetch large models to FlexAI storage
- Monitor evaluation jobs and retrieve the results JSON
Note: If you haven't already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
LM-Evaluation-Harness is a comprehensive evaluation framework that provides:
- Hundreds of standardized benchmarks behind a single CLI
- A unified interface to model backends such as HuggingFace transformers
- Configurable few-shot prompting, batching, and sample limits
- Reproducible, machine-readable results output

Popular evaluation tasks include:
- hellaswag, winogrande, piqa, openbookqa (commonsense reasoning)
- mmlu (multi-subject knowledge and understanding)
- arc_easy and arc_challenge (grade-school science questions)
- gsm8k (grade-school math word problems)
- truthfulqa_mc2 (resistance to common misconceptions)
- humaneval (code generation)
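To browse the full task list supported by the harness version you have installed, recent releases can enumerate tasks from the command line; a quick sketch, assuming a local Python environment:

```bash
pip install lm-eval
lm_eval --tasks list
```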
Run a basic evaluation on HellaSwag with this single command:
flexai training run lm-eval-basic \
--accels 2 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda \
--batch_size 8

The LM-Evaluation-Harness uses command-line arguments to configure evaluations. Here are the key parameters:

Model Configuration
--model hf # Use HuggingFace backend
--model_args pretrained=MODEL_NAME # Specify model to evaluate
--model_args pretrained=MODEL_NAME,dtype=bfloat16 # With precision control

Task Selection
--tasks hellaswag # Single task
--tasks hellaswag,arc_easy,arc_challenge # Multiple tasks
--tasks mmlu_* # All MMLU subtasks
--tasks all # All available tasks (not recommended)

Evaluation Parameters
--batch_size 8 # Batch size for evaluation
--max_batch_size 32 # Maximum batch size
--device cuda # GPU device
--num_fewshot 5 # Number of few-shot examples
--limit 1000 # Limit number of samples per task

Output Configuration
--output_path /output-checkpoint/results.json # Save results JSON
--log_samples # Log individual sample results
--show_config # Display configuration
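Before committing GPU hours on FlexAI, it can be worth a local smoke test that combines these flags on a small model with a capped sample count. A minimal sketch (the model EleutherAI/pythia-160m is just an illustrative small checkpoint, not part of this blueprint):

```bash
# Quick local sanity check: 100 samples of hellaswag on a tiny model
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --limit 100 \
  --batch_size 8 \
  --output_path ./smoke_test.json \
  --log_samples
```

To access models from HuggingFace (especially gated models), you need a HuggingFace token.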
Use the flexai secret create command to store your HuggingFace Token as a secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>

Then paste your HuggingFace Token API key value.
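Later runs can expose that secret to the job as an environment variable via the --secret flag, as the examples below do:

```bash
flexai training run ... --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> ...
```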
To speed up evaluation and avoid downloading large models at runtime, you can pre-fetch your models to FlexAI storage:
flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>

flexai checkpoint push llama2-7b --storage-provider HF-STORAGE --source-path meta-llama/Llama-2-7b-hf

The pushed model can then be passed to a run with --checkpoint, which mounts it under /input-checkpoint:

flexai training run lm-eval-prefetched \
--accels 8 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--checkpoint llama2-7b \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=/input-checkpoint/llama2-7b \
--tasks mmlu,hellaswag \
--device cuda \
--batch_size 4 \
--output_path /output-checkpoint/prefetched_eval.json

For a thorough evaluation across multiple benchmarks:
flexai training run lm-eval-comprehensive \
--accels 4 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=microsoft/DialoGPT-medium \
--tasks hellaswag,arc_easy,arc_challenge,mmlu,gsm8k \
--device cuda \
--batch_size 16 \
--output_path /output-checkpoint/comprehensive_eval.json \
--log_samples

For evaluating large models (7B+ parameters):
flexai training run lm-eval-large-model \
--accels 8 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2 \
--device cuda \
--batch_size 4 \
--output_path /output-checkpoint/llama2_7b_eval.json

For evaluating code generation capabilities:
flexai training run lm-eval-code \
--accels 2 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=Salesforce/codegen-350M-mono \
--tasks humaneval \
--device cuda \
--batch_size 8 \
--output_path /output-checkpoint/code_eval.json
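Note: recent harness versions refuse to execute model-generated code unless you opt in explicitly. If a humaneval run fails with an unsafe-code error, the opt-in flag below may be needed; flag availability varies by lm-eval version, so verify it against your installed release:

```bash
# Version-dependent opt-in for executing generated code during humaneval scoring
lm_eval ... --confirm_run_unsafe_code
```

For testing few-shot learning capabilities: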
flexai training run lm-eval-fewshot \
--accels 2 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--tasks winogrande,piqa,openbookqa \
--num_fewshot 5 \
--device cuda \
--batch_size 16 \
--output_path /output-checkpoint/fewshot_eval.json

You can check the status and progress of your evaluation job:
# Check job status
flexai training inspect lm-eval-comprehensive
# View evaluation logs
flexai training logs lm-eval-comprehensive
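To poll status periodically rather than re-running the command by hand, a standard watch loop works; a convenience sketch, not a FlexAI feature:

```bash
# Re-run the inspect command every 30 seconds
watch -n 30 flexai training inspect lm-eval-comprehensive
```

Once the evaluation job completes, you can access the results: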
# List all checkpoints/outputs
flexai training checkpoints lm-eval-comprehensive
# Download results JSON
flexai checkpoint fetch <CHECKPOINT_ID> --destination ./results/

The results JSON will be saved with detailed metrics for each task. For example:
{
"results": {
"hellaswag": {
"acc": 0.6234,
"acc_stderr": 0.0048,
"acc_norm": 0.8012,
"acc_norm_stderr": 0.0040
},
"mmlu": {
"acc": 0.4567,
"acc_stderr": 0.0031
}
},
"config": {
"model": "hf",
"model_args": "pretrained=EleutherAI/gpt-j-6B",
"batch_size": 8,
"device": "cuda"
},
"git_hash": "abc123",
"date": 1698123456
}
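Once fetched, you can pull headline metrics out of the JSON from the shell; a small sketch assuming jq is installed and the file was downloaded to ./results/:

```bash
# Print "task: accuracy" for every task in the results file
jq -r '.results | to_entries[] | "\(.key): acc=\(.value.acc)"' ./results/comprehensive_eval.json
```

For very large models or extensive benchmark suites: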
flexai training run lm-eval-multi-node \
--accels 8 \
--nodes 2 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=meta-llama/Llama-2-13b-hf,dtype=bfloat16,device_map=auto \
--tasks mmlu_*,hellaswag,arc_challenge,truthfulqa_mc2 \
--device cuda \
--batch_size 2 \
--output_path /output-checkpoint/multi_node_eval.json
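One caveat: the hf backend runs a single evaluation process, and device_map=auto shards the model across the GPUs visible to that process; it does not coordinate work across nodes, so multi-node layouts mainly help when you split benchmark suites into separate jobs. The harness also accepts a parallelize=True model argument for the same within-node sharding (support varies by harness version, so verify against your installed release):

```bash
--model_args pretrained=meta-llama/Llama-2-13b-hf,dtype=bfloat16,parallelize=True \
```

For evaluating on custom tasks or datasets: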
flexai training run lm-eval-custom \
--accels 4 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=microsoft/DialoGPT-medium \
--include_path /path/to/custom_tasks \
--tasks <custom_task_name> \
--device cuda \
--batch_size 8 \
--output_path /output-checkpoint/custom_eval.json
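Before launching the job, you can confirm the harness discovers your custom task by listing tasks with the same include path; a sketch assuming a local lm-eval install:

```bash
lm_eval --tasks list --include_path /path/to/custom_tasks
```

For memory-efficient evaluation: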
flexai training run lm-eval-fp16 \
--accels 4 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-neox-20b,dtype=float16 \
--tasks hellaswag,mmlu \
--device cuda \
--batch_size 1 \
--output_path /output-checkpoint/fp16_eval.json

HellaSwag (Commonsense Reasoning): pick the most plausible continuation of an everyday scenario; reported as acc and acc_norm.
MMLU (Multi-task Language Understanding): multiple-choice questions across 57 academic and professional subjects.
GSM8K (Grade School Math): multi-step arithmetic word problems that test step-by-step reasoning.
HumanEval (Code Generation): Python programming problems scored by running generated code against unit tests.
Generally, larger models perform better, but with diminishing returns across the usual size tiers: Small Models (< 1B parameters), Medium Models (1B-7B parameters), Large Models (7B-70B parameters), and Very Large Models (> 70B parameters).

Suggested task combinations by focus:

Quick general suite:
--tasks hellaswag,arc_easy,arc_challenge,winogrande,piqa

Broader capabilities:
--tasks mmlu,truthfulqa_mc2,gsm8k,humaneval

Commonsense reasoning:
--tasks arc_challenge,hellaswag,winogrande,piqa,openbookqa

Knowledge and language understanding:
--tasks mmlu_*,truthfulqa_mc2,lambada_openai

Code generation:
--tasks humaneval,mbpp

Math reasoning:
--tasks gsm8k,mathqa,aqua_rat
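To sweep several of these suites without writing out each command, a simple submission loop over the documented flexai training run flags works; a sketch reusing the basic-example settings from above (job names are illustrative):

```bash
# Submit one evaluation job per task suite
i=0
for suite in "hellaswag,arc_easy,arc_challenge" "gsm8k,mathqa"; do
  i=$((i+1))
  flexai training run "lm-eval-suite-$i" \
    --accels 2 \
    --nodes 1 \
    --repository-url https://github.com/flexaihq/blueprints \
    --requirements-path code/lm-evaluation-harness/requirements.txt \
    --runtime nvidia-25.03 \
    -- lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks "$suite" \
    --device cuda \
    --batch_size 8 \
    --output_path "/output-checkpoint/suite_$i.json"
done
```

Job Monitoring:
# Check job status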
flexai training inspect <job-name>
# View logs
flexai training logs <job-name>

Results Access:
# List outputs
flexai training checkpoints <job-name>
# Download results JSON
flexai checkpoint fetch <checkpoint-id> --destination ./results/