Hugging Face LLM Course - Chapter 2 & Chapter 3

2025-04-15

Behind the pipeline:

The vector output by the Transformer module is usually large. It generally has three dimensions:

  • Batch size: The number of sequences processed at a time.
  • Sequence length: The length of the numerical representation of the sequence.
  • Hidden size: The vector dimension of each model input.

The hidden size can be very large. 768 is common for smaller models, and in larger models this can reach 3072 or more.
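
A quick way to see all three dimensions (a minimal sketch, using the distilbert-base-uncased-finetuned-sst-2-english checkpoint that appears later in this post):

from transformers import AutoTokenizer, AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size), e.g. torch.Size([1, 16, 768])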

Model heads

The model heads take the high-dimensional vector of hidden states as input and project it onto a different dimension. They are usually composed of one or a few linear layers.
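
For example, a sequence-classification head maps each sequence's hidden states down to one logit per class (continuing the sketch above; this checkpoint has two labels, NEGATIVE and POSITIVE):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): the 768-dimensional hidden states are projected to 2 logits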

Model

from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

This loads BERT's architecture with randomly initialized weights. Used as-is, the model would output gibberish; it needs to be trained first, which takes a lot of time and resources. To avoid that, we can load a pre-trained model instead:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
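
Saving works the same way in reverse: save_pretrained() writes the configuration and weights to a directory (the path here is just an example):

model.save_pretrained("directory_on_my_computer")  # creates config.json and the model weights file in that folder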

Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
>>> ['Using', 'a', 'Transform', '##er', 'network', 'is', 'simple']

From tokens to input IDs

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
>>> [7993, 170, 11303, 1200, 2443, 1110, 3014]

Decoding

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
>>> 'Using a Transformer network is simple'

Send to Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

>>> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Why did this fail?

Transformers models expect multiple sequences (a batch) by default: the model wants a 2-D tensor of shape (batch_size, sequence_length), but we passed it a 1-D tensor of IDs.

So, adding a batch dimension by wrapping the list of IDs in another list is the right way to send it to the model:

input_ids = torch.tensor([ids])
model(input_ids)
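
In practice, calling the tokenizer directly handles all of this in one step: it adds the special tokens, builds the attention mask, and returns batched tensors:

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
output = model(**tokenized_inputs)
print(output.logits)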

I will cover the Datasets library later, so this post doesn't go into it in detail.

Fine-tuning a model with the Trainer API:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer")  # the only required argument: the output directory

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

Trainer will then start fine-tuning and report the training loss every 500 steps.

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

>>> (408, 2) (408,)

import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1)

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

>>> {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
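
To get these metrics at the end of every epoch during training instead of only afterwards, the course wraps the same logic in a compute_metrics() function:

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    metric = evaluate.load("glue", "mrpc")
    return metric.compute(predictions=predictions, references=labels)

# pass compute_metrics=compute_metrics when constructing the Trainer (and
# evaluation_strategy="epoch" in TrainingArguments) to evaluate once per epoch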

But for the same reason as with pipeline(), I won't use these high-level APIs.

A full training

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels") # model expects the argument to be named 'labels'
tokenized_datasets.set_format("torch") # the datasets now return PyTorch tensors instead of lists
tokenized_datasets["train"].column_names

>>> ["attention_mask", "input_ids", "labels", "token_type_ids"]

Define the dataloaders:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing (the sequence-length dimension varies from batch to batch, since DataCollatorWithPadding pads to the longest sequence in each batch):

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

>>> {'attention_mask': torch.Size([8, 65]),
     'input_ids': torch.Size([8, 65]),
     'labels': torch.Size([8]),
     'token_type_ids': torch.Size([8, 65])}

Next, load a pre-trained model with a sequence-classification head and pass a batch through it.
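
Per the course, the loading step is:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # MRPC has 2 labels: not_equivalent / equivalent

A forward pass on the batch we just inspected: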

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

>>> tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])

Optimizer:

from torch.optim import AdamW  # use PyTorch's AdamW; the implementation in transformers is deprecated

optimizer = AdamW(model.parameters(), lr=5e-5)

Learning rate scheduler:

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
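
As a sanity check, the course prints the total number of steps (MRPC's 3,668 training pairs give 459 batches of 8, times 3 epochs):

print(num_training_steps)
>>> 1377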

Training loop:

from tqdm.auto import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

The training flow is as follows:

  1. Compute the loss on the batch
  2. Compute the gradients via backpropagation (loss.backward())
  3. Update the model parameters with the optimizer (optimizer.step())
  4. Step the learning rate scheduler (lr_scheduler.step())
  5. Reset the gradients to zero (optimizer.zero_grad())
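
After training, the evaluation loop follows the same pattern, but wrapped in torch.no_grad() and feeding predictions into the evaluate metric (following the course, reusing eval_dataloader from above):

import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()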

With the Accelerate library, which enables distributed training on multiple GPUs or TPUs, we only need to make minor revisions to the code.

from accelerate import Accelerator

accelerator = Accelerator()

...
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
...

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

accelerator.prepare() wraps these objects in the appropriate containers so that distributed training works as intended.
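
To launch this on multiple GPUs/TPUs, the course puts the loop in a script (e.g. train.py) and runs accelerate config once, then accelerate launch train.py. From a notebook, the loop can be wrapped in a function and started with notebook_launcher (training_function here is just a name for that wrapper):

from accelerate import notebook_launcher

notebook_launcher(training_function)  # training_function: a zero-argument function containing the training loop above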


Reference

  • https://huggingface.co/learn/llm-course/chapter2/1?fw=pt
  • https://huggingface.co/learn/llm-course/chapter3/1?fw=pt