Streaming Large Datasets During a Training Job

Some datasets are too large to download or push to FlexAI in a reasonable amount of time, and you may prefer to put that data transfer time to better use. Streaming the dataset, so that the Training Job reads it on demand instead of waiting for a full copy, is a useful technique in those cases.

This experiment demonstrates how to stream a large dataset during a Training Job on FlexAI, using the streaming capabilities of the Hugging Face Datasets library.

Connect to GitHub (if needed)

If you haven't already connected FlexAI to GitHub, you'll need to set up a code registry connection:

flexai code-registry connect

This will allow FlexAI to pull repositories directly from GitHub using the -u flag (written in its long form, --repository-url, in the command further down) in training commands.

Running the Training Job streaming a dataset

Here is an example that uses the code/causal-language-modeling/train.py script to stream the FineWeb dataset, which is over 90 TB in size:

flexai training run gpt2training-stream --repository-url https://github.com/flexaihq/blueprints --requirements-path code/causal-language-modeling/requirements.txt \
   -- code/causal-language-modeling/train.py \
    --dataset_streaming true \
    --do_train \
    --eval_strategy no \
    --dataset_name HuggingFaceFW/fineweb \
    --dataset_config_name CC-MAIN-2024-10 \
    --dataset_group_text true \
    --dataloader_num_workers 8 \
    --max_steps 2500 \
    --model_name_or_path openai-community/gpt2 \
    --output_dir /output-checkpoint \
    --per_device_train_batch_size 8 \
    --logging_steps 50 \
    --save_steps 1000

The first line defines the 3 main components required to run a Training Job in FlexAI:

  1. The Training Job's name (gpt2training-stream).
  2. The URL of the repository containing the training script (https://github.com/flexaihq/blueprints).
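  3. The path to the requirements file listing the script's dependencies (code/causal-language-modeling/requirements.txt). No FlexAI dataset is attached in this command because the data is streamed at run time.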

The second line defines the script that will be executed when the Training Job is started (code/causal-language-modeling/train.py).

Below that, the first argument passed to the script is --dataset_streaming true, which tells the script to use the Datasets library with streaming enabled.
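
As a rough illustration of what such a flag can translate to, the sketch below maps the command-line values onto keyword arguments for the Datasets library's load_dataset function, whose streaming=True option returns an iterable dataset that fetches records on demand. The helper names build_load_kwargs and open_dataset are hypothetical and the "train" split is an assumption; the blueprint's script may wire this differently.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    dataset_name: str, config_name: Optional[str], streaming: bool
) -> Dict[str, Any]:
    """Hypothetical helper: turn CLI values into `load_dataset` keyword arguments."""
    return {
        "path": dataset_name,    # e.g. the value of --dataset_name
        "name": config_name,     # e.g. the value of --dataset_config_name
        "split": "train",        # assumption: only the training split is read
        "streaming": streaming,  # True -> records are fetched lazily
    }


def open_dataset(kwargs: Dict[str, Any]):
    """Hypothetical helper: open the dataset (needs network access)."""
    # Imported lazily so build_load_kwargs can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(**kwargs)


kwargs = build_load_kwargs("HuggingFaceFW/fineweb", "CC-MAIN-2024-10", streaming=True)
print(kwargs)
```

Calling open_dataset(kwargs) and iterating over the result would then yield one record at a time rather than downloading the whole dataset first.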

The next lines specify the arguments that will be passed to the training script during execution to adjust the Training Job's hyperparameters or customize its behavior. For instance, --max_train_samples and --max_eval_samples can be used to tweak the sample size.
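
A streamed dataset generally has no known length, so the run length is set in steps (--max_steps) rather than epochs. The sketch below works out how many samples each device would consume with the values from the command above, assuming no gradient accumulation, and shows one way a script could cap a stream at a fixed number of samples. Both function names are hypothetical and are not taken from the blueprint's script.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Optional


def samples_per_device(max_steps: int, per_device_batch_size: int) -> int:
    """Samples one device consumes, assuming one batch per optimizer step."""
    return max_steps * per_device_batch_size


def cap_stream(stream: Iterable, max_samples: Optional[int]) -> Iterator:
    """Hypothetical helper: stop reading a stream after `max_samples` items."""
    if max_samples is None:
        return iter(stream)
    return islice(stream, max_samples)


# Values from the command above: --max_steps 2500, --per_device_train_batch_size 8
print(samples_per_device(2500, 8))  # 20000

# `count()` stands in for an endless stream of records.
first_three = list(cap_stream(count(), 3))
print(first_three)  # [0, 1, 2]
```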

The Code

You will notice that the train function in the code/causal-language-modeling/train.py script makes a call to the load_and_tokenize helper function to load and tokenize the dataset using the user-provided arguments:

# 1. This is the function that will be called by the `flexai training run` command

def train(dataset_args, model_args, training_args, additional_args):
    set_wandb(training_args)
    print(f"Training/evaluation parameters {training_args}")
    
    # 2. Here the script calls the `load_and_tokenize` helper function
    # Get dataset (the result name `tokenized_datasets` is illustrative)
    tokenized_datasets = load_and_tokenize(
        tokenizer_model_name=model_args.model_name_or_path,
        do_eval=training_args.do_eval,

        # 3. These are the arguments passed to the script
        **vars(dataset_args),
    )
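
To make the **vars(dataset_args) idiom concrete, here is a minimal, self-contained sketch. It assumes the dataset arguments live in a simple dataclass; DatasetArgs and load_and_tokenize_stub are hypothetical stand-ins, not the blueprint's real classes or helper. vars() returns the object's attributes as a dictionary, and ** spreads them as keyword arguments, so a flag such as --dataset_streaming reaches the helper without being listed explicitly.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DatasetArgs:
    """Hypothetical container for the dataset-related CLI arguments."""
    dataset_name: str
    dataset_config_name: Optional[str] = None
    dataset_streaming: bool = False


def load_and_tokenize_stub(
    tokenizer_model_name: str, do_eval: bool, **dataset_kwargs: Any
) -> Dict[str, Any]:
    """Stand-in for the real helper: it only reports what it received."""
    return {
        "tokenizer_model_name": tokenizer_model_name,
        "do_eval": do_eval,
        **dataset_kwargs,
    }


dataset_args = DatasetArgs(
    "HuggingFaceFW/fineweb", "CC-MAIN-2024-10", dataset_streaming=True
)
received = load_and_tokenize_stub(
    tokenizer_model_name="openai-community/gpt2",
    do_eval=False,
    **vars(dataset_args),  # expands to dataset_name=..., dataset_streaming=True, ...
)
print(received["dataset_streaming"])  # True
```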

This is all that is needed to stream a dataset during a Training Job on FlexAI! You are no longer held back by lengthy dataset transfers and can put large datasets to work without first copying them in full.
