Fine-tuning LLMs: A minimal example

Fine-tuning large language models (LLMs) adapts a pre-trained model to perform better on specific tasks or in particular domains. Unlike training a model from scratch, fine-tuning leverages the knowledge already embedded in the pre-trained model.

In this tutorial, we will demonstrate how to fine-tune an LLM from the Huggingface transformers library. We will use pure PyTorch, fine-tuning all of the model's weights. For more advanced and efficient approaches, see techniques such as LoRA and the PEFT library.
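
For reference, a parameter-efficient alternative might look like the sketch below, which wraps the base model with a LoRA adapter via the peft library. This is not used in the rest of this tutorial; peft is an extra dependency, and the rank and target modules shown are illustrative assumptions.

# Minimal LoRA sketch (assumes `pip install peft`); values are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
lora_config = LoraConfig(
    r=8,                        # adapter rank (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    target_modules=["c_attn"],  # GPT-2-style attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights train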

Getting started

The local Python environment requires only four packages in addition to Covalent:

transformers==4.40.0
torch==2.2.2
datasets==2.19.0
accelerate==0.29.3

After installing the above, import the following.

import os
import covalent as ct
import covalent_cloud as cc
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
)
from datasets import load_dataset, load_from_disk
import torch
from pathlib import Path

Using the COVALENT_API_KEY environment variable, we save our API key and create a Covalent Cloud environment for basic fine-tuning.

cc.save_api_key(os.environ["COVALENT_API_KEY"])
cc.create_env(
    name="finetune-basic",
    conda={
        "dependencies": ["python=3.10"]
    },
    pip=[
        "transformers==4.40.0", "torch==2.2.2",
        "datasets==2.19.0", "accelerate==0.29.3"
    ],
    wait=True
)

Data and Preprocessing

We start by defining our lighter, CPU-only tasks together with a cpu_executor that specifies the relevant resources.

cpu_executor = cc.CloudExecutor(
env="finetune-basic", num_cpus=2, memory="8GB", time_limit="01:00:00"
)

@ct.electron(executor=cpu_executor)
def load_tokenizer(model_id="distilbert/distilgpt2"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.add_special_tokens({"pad_token": "<PAD>"})

    # Save the tokenizer to the volume so downstream tasks can reuse it
    tokenizer_folder = Path("/volumes") / "mydata" / "tokenizer"
    tokenizer_folder.mkdir(exist_ok=True, parents=True)

    tokenizer.save_pretrained(tokenizer_folder)
    return tokenizer_folder

Next, let’s use the datasets library to download the ELI5 category dataset of Reddit questions. Preprocessing here means tokenizing all the texts in the dataset and grouping the tokens into fixed-length blocks.

@ct.electron(executor=cpu_executor)
def load_and_preprocess_data(tokenizer_path):
    eli5 = load_dataset("eli5_category", split="train[:5000]")
    eli5 = eli5.train_test_split(test_size=0.2)
    eli5 = eli5.flatten()

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def preprocess_and_group(examples):
        # Tokenize the text
        tokenized_inputs = tokenizer(
            [" ".join(x) for x in examples["answers.text"]], truncation=True,
        )
        # Group the tokens into fixed-length blocks
        block_size = 128
        concatenated_examples = {
            k: sum(tokenized_inputs[k], []) for k in tokenized_inputs.keys()
        }
        total_length = (len(concatenated_examples["input_ids"]) // block_size) * block_size
        return {
            k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }

    lm_dataset = eli5.map(
        preprocess_and_group,
        batched=True,
        num_proc=4,
        remove_columns=eli5["train"].column_names,
    )
    save_location = Path("/volumes") / "mydata" / "datasets"
    save_location.mkdir(exist_ok=True, parents=True)
    # Save the tokenized dataset to the volume
    lm_dataset.save_to_disk(str(save_location))
    return save_location
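
If you want to sanity-check the preprocessed data, an electron along the following lines could be added to the workflow. This is a hypothetical helper, not part of the tutorial's pipeline; it simply reloads the saved dataset from the volume and confirms the block size.

@ct.electron(executor=cpu_executor)
def inspect_dataset(dataset_path):
    # Reload the tokenized dataset from the volume and report its shape
    lm_dataset = load_from_disk(str(dataset_path))
    n_train = len(lm_dataset["train"])
    block_len = len(lm_dataset["train"][0]["input_ids"])  # expected: 128
    return {"train_examples": n_train, "block_size": block_len}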

Fine-Tuning a Language Model

Huggingface provides a neat interface for fine-tuning via Trainer and TrainingArguments. We define a standard fine-tuning task below. This task concludes with saving the model on a persistent volume in Covalent Cloud.

gpu_executor = cc.CloudExecutor(
env="finetune-basic", num_cpus=2, num_gpus=1, gpu_type="h100",
memory="8GB", time_limit="01:00:00"
)

@ct.electron(executor=gpu_executor)
def train_model(model_id, tokenizer_path, dataset_path):
    tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_path))
    lm_dataset = load_from_disk(str(dataset_path))

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Account for the <PAD> token added to the tokenizer
    model.resize_token_embeddings(len(tokenizer))

    training_args = TrainingArguments(
        output_dir="/tmp/my_awesome_eli5_clm-model",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_dataset["train"],
        eval_dataset=lm_dataset["test"],
        data_collator=data_collator,
    )
    trainer.train()

    # Save the fine-tuned model on a Covalent Cloud volume
    save_path = Path("/volumes") / "mydata" / "model"
    save_path.mkdir(exist_ok=True, parents=True)
    model.save_pretrained(save_path)
    return save_path

Dispatch

Here’s the workflow (lattice) that connects the three electrons defined above.

@ct.lattice(workflow_executor=cpu_executor, executor=cpu_executor)
def workflow(model_id):
    tokenizer_path = load_tokenizer(model_id)
    dataset_path = load_and_preprocess_data(tokenizer_path)
    return train_model(model_id, tokenizer_path, dataset_path)

Run the code below to dispatch the workflow, attaching a volume to persist the fine-tuned model.

volume = cc.volume("mydata")
dispatch_id = cc.dispatch(workflow, volume=volume)("distilbert/distilgpt2")
result = cc.get_result(dispatch_id, wait=True)

result.result.load()
model_path = result.result.value

In this example, we use a distilled version of the GPT2 model with ~82 million parameters.
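
If you want to verify that figure, a quick local check with transformers (a hypothetical snippet, not part of the workflow) is:

# Count the parameters of distilgpt2; the total is roughly 82 million.
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")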

Tip

Feel free to specify a different model to fine-tune! Just make sure to scale the resources in the gpu_executor, selecting an appropriate gpu_type with sufficient VRAM, as in the sketch below.
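
For example, a larger model might call for a heavier executor along these lines. The resource values here are assumptions; adjust them to the model you choose.

# Hypothetical executor for a larger model -- scale memory, CPUs, and time as needed.
big_gpu_executor = cc.CloudExecutor(
    env="finetune-basic", num_cpus=4, num_gpus=1, gpu_type="h100",
    memory="32GB", time_limit="03:00:00"
)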

After the workflow completes, the fine-tuned model is available on the volume at the path returned as model_path.
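
To try the result, you could add one more electron along these lines (hypothetical, not part of the workflow above) that loads the saved tokenizer and model from the volume and generates a completion:

@ct.electron(executor=gpu_executor)
def generate_sample(model_path, tokenizer_path, prompt="Why is the sky blue?"):
    # Load the fine-tuned artifacts from the volume and generate a short completion
    tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_path))
    model = AutoModelForCausalLM.from_pretrained(str(model_path))
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)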

Fine-tuning a Large Language Model: workflow graph in Covalent Cloud

Conclusion

In this guide, we combined Covalent and Huggingface to fine-tune a language model on Reddit questions. We demonstrated how easily tasks can move between CPU and GPU resources, and how results can be persisted with volumes. The cost of running this workflow is approximately $0.23.

The full code can be found below:

Full Code
import os
import covalent as ct
import covalent_cloud as cc
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
)
from datasets import load_dataset, load_from_disk
import torch
from pathlib import Path

cc.save_api_key(os.environ["COVALENT_API_KEY"])
cc.create_env(
name="finetune-basic",
conda={
"dependencies": ["python=3.10"]
}, pip=[
"transformers==4.40.0", "torch==2.2.2",
"datasets==2.19.0", "accelerate==0.29.3"
],
wait=True
)
VOLUME_NAME = "finetune"
cpu_executor = cc.CloudExecutor(
env="finetune-basic", num_cpus=2,
memory="8GB", time_limit="01:00:00"
)
gpu_executor = cc.CloudExecutor(
env="finetune-basic", num_cpus=2, num_gpus=1,
gpu_type="h100", memory="8GB", time_limit="01:00:00"
)

@ct.electron(executor=cpu_executor)
def load_tokenizer(model_id="distilbert/distilgpt2"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.add_special_tokens({"pad_token": "<PAD>"})

    # Save the tokenizer to the volume so downstream tasks can reuse it
    tokenizer_folder = Path("/volumes") / VOLUME_NAME / "tokenizer"
    tokenizer_folder.mkdir(exist_ok=True, parents=True)

    tokenizer.save_pretrained(tokenizer_folder)
    return tokenizer_folder

@ct.electron(executor=cpu_executor)
def load_and_preprocess_data(tokenizer_path):
    eli5 = load_dataset("eli5_category", split="train[:5000]")
    eli5 = eli5.train_test_split(test_size=0.2)
    eli5 = eli5.flatten()

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def preprocess_and_group(examples):
        # Tokenize the text
        tokenized_inputs = tokenizer(
            [" ".join(x) for x in examples["answers.text"]], truncation=True,
        )
        # Group the tokens into fixed-length blocks
        block_size = 128
        concatenated_examples = {
            k: sum(tokenized_inputs[k], []) for k in tokenized_inputs.keys()
        }
        total_length = (len(concatenated_examples["input_ids"]) // block_size) * block_size
        return {
            k: [
                t[i:i + block_size]
                for i in range(0, total_length, block_size)
            ]
            for k, t in concatenated_examples.items()
        }

    lm_dataset = eli5.map(
        preprocess_and_group,
        batched=True,
        num_proc=4,
        remove_columns=eli5["train"].column_names,
    )
    save_location = Path("/volumes") / VOLUME_NAME / "datasets"
    save_location.mkdir(exist_ok=True, parents=True)
    # Save the tokenized dataset to the volume
    lm_dataset.save_to_disk(str(save_location))
    return save_location

@ct.electron(executor=gpu_executor)
def train_model(model_id, tokenizer_path, dataset_path):
    tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_path))
    lm_dataset = load_from_disk(str(dataset_path))

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Account for the <PAD> token added to the tokenizer
    model.resize_token_embeddings(len(tokenizer))

    training_args = TrainingArguments(
        output_dir="/tmp/my_awesome_eli5_clm-model",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_dataset["train"],
        eval_dataset=lm_dataset["test"],
        data_collator=data_collator,
    )
    trainer.train()

    # Save the fine-tuned model on a Covalent Cloud volume
    save_path = Path("/volumes") / VOLUME_NAME / "model"
    save_path.mkdir(exist_ok=True, parents=True)
    model.save_pretrained(save_path)
    return save_path

@ct.lattice(workflow_executor=cpu_executor, executor=cpu_executor)
def workflow(model_id):
    tokenizer_path = load_tokenizer(model_id)
    dataset_path = load_and_preprocess_data(tokenizer_path)
    model_path = train_model(model_id, tokenizer_path, dataset_path)
    return model_path

if __name__ == "__main__":
    volume = cc.volume(VOLUME_NAME)
    dispatch_id = cc.dispatch(workflow, volume=volume)("distilbert/distilgpt2")
    result = cc.get_result(dispatch_id, wait=True)

    result.result.load()
    model_path = result.result.value
    print(f"Model saved at: {model_path}")