Resuming a Training Job from Checkpoint

This blueprint continues training from a checkpoint emitted by the Training Job in the "A Simple Training Job on FlexAI" blueprint, so complete that blueprint and download its output artifacts before proceeding.

Extract the contents of the output_0.zip file into a directory named fetched_checkpoints:

unzip output_0.zip -d fetched_checkpoints

The fetched_checkpoints directory contains the checkpoints that the Training Job saved to the /output-checkpoint directory of its runtime environment during execution.
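To see which checkpoints were extracted, a small helper along these lines can print the saved step numbers in ascending order. This is a minimal sketch: `list_checkpoints` is a hypothetical helper name, and it assumes the checkpoints are stored as `checkpoint-<step>` directories under `fetched_checkpoints/output`.

```shell
# Hypothetical helper: print the step number of every checkpoint-<step>
# directory found under the given root, sorted numerically.
# Assumes the directory layout fetched_checkpoints/output/checkpoint-<step>/.
list_checkpoints() {
  root="${1:-fetched_checkpoints/output}"
  for dir in "$root"/checkpoint-*/; do
    # Skip the unexpanded glob pattern when nothing matches.
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    printf '%s\n' "${name#checkpoint-}"
  done | sort -n
}

# Example call (uses the default root):
#   list_checkpoints
```

Sorting numerically rather than alphabetically keeps step 500 ahead of step 1000, which makes it easier to pick the intended checkpoint.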

Let's use the checkpoint (saved at step 500) located in fetched_checkpoints/output/checkpoint-500/.
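Before pushing the checkpoint, it can be worth confirming that the directory looks like a resumable checkpoint. The sketch below assumes the training script relies on the Hugging Face Trainer, which writes a `trainer_state.json` file containing a `global_step` field into each checkpoint directory. `check_checkpoint` is a hypothetical helper name, not part of the FlexAI CLI.

```shell
# Hypothetical helper: verify that a checkpoint directory contains
# trainer_state.json and print the global step recorded in it.
# Assumption: the checkpoint was written by the Hugging Face Trainer.
check_checkpoint() {
  ckpt_dir="$1"
  state_file="$ckpt_dir/trainer_state.json"
  if [ ! -f "$state_file" ]; then
    echo "missing $state_file" >&2
    return 1
  fi
  # Extract the first '"global_step": <n>' entry and keep only the number.
  grep -o '"global_step": *[0-9]*' "$state_file" | head -n 1 | sed 's/[^0-9]*//'
}

# Example call:
#   check_checkpoint fetched_checkpoints/output/checkpoint-500
```

If the printed step does not match the directory name, or the file is missing, the archive was probably extracted to a different location than expected.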

Create a FlexAI checkpoint from this directory so that it can be passed to the run that resumes training:

flexai checkpoint push gpt2-ckpt500 --file fetched_checkpoints/output/checkpoint-500

Resume training from your checkpoint with the following command:

flexai training run gpt2training-resume \
  --repository-url https://github.com/flexaihq/experiments \
  --dataset gpt2-tokenized-wikitext \
  --checkpoint gpt2-ckpt500 \
  --requirements-path code/causal-language-modeling/requirements.txt \
  -- code/causal-language-modeling/train.py \
    --do_eval \
    --do_train \
    --dataset_name wikitext \
    --tokenized_dataset_load_dir /input/gpt2-tokenized-wikitext \
    --model_name_or_path /input-checkpoint \
    --resume_from_checkpoint /input-checkpoint \
    --output_dir /output-checkpoint \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_steps 50 \
    --save_steps 500 \
    --eval_steps 500 \
    --eval_strategy steps \
    --num_train_epochs 6

Compared to the blueprint that starts training from the base model, note that:

  • --checkpoint gpt2-ckpt500 has been added. It refers to the checkpoint created above; the contents of the checkpoint-500 folder will be mounted on /input-checkpoint.
  • --model_name_or_path has been updated to point to the new checkpoint location, /input-checkpoint.

Two additional Hugging Face arguments are set to resume training from the checkpoint:

  • --resume_from_checkpoint /input-checkpoint
  • --num_train_epochs 6