https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/af0caf6d7af0dda755f4c9d7af9ccc2c/quickstart_tutorial.ipynb

Training a neural network means:

  1. Compute predictions
  2. Compute loss
  3. Compute gradients
  4. Update weights
  5. Repeat
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
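The loop above can be exercised end to end with a toy setup. Everything below (the random dataset, the linear model, the learning rate) is an illustrative assumption, and `train` is condensed to its essential steps with the logging omitted:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Condensed version of the train() loop above (progress printing omitted)
def train(dataloader, model, loss_fn, optimizer):
    model.train()
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        loss = loss_fn(model(X), y)   # forward: predictions + loss
        loss.backward()               # backward: populate param.grad
        optimizer.step()              # update the parameters
        optimizer.zero_grad()         # clear grads for the next batch

# Toy classification data: 64 samples, 10 features, 3 classes (assumed)
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 3, (64,)))
loader = DataLoader(data, batch_size=16)

model = nn.Linear(10, 3).to(device)
before = model.weight.detach().clone()
train(loader, model, nn.CrossEntropyLoss(),
      torch.optim.SGD(model.parameters(), lr=0.1))
print(torch.equal(before, model.weight.detach()))  # False: weights moved
```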

Forward pass

pred = model(X)
loss = loss_fn(pred, y)

During the forward pass, PyTorch builds a computation graph

Each tensor remembers how it was created: what operation produced it and what tensors it depends on.
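This bookkeeping is visible through a tensor's `grad_fn` attribute; a minimal sketch:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2          # y records that it came from a power operation
z = y + 3           # z records that it depends on y

print(x.grad_fn)    # None: x is a leaf tensor, created by the user
print(y.grad_fn)    # e.g. <PowBackward0 ...>
print(z.grad_fn)    # e.g. <AddBackward0 ...>
```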

Backward pass

loss.backward() : traverses the computation graph backwards and stores each gradient in the corresponding param.grad . After that, PyTorch frees the computation graph (unless retain_graph=True is passed).
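Both behaviors can be seen on a one-parameter example: the gradient lands in `.grad`, and a second `backward()` on the same loss fails because the graph was freed.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)
loss = (w ** 2).sum()        # d(loss)/dw = 2w = 6

print(w.grad)                # None: nothing stored yet
loss.backward()              # walk the graph backwards, fill w.grad
print(w.grad)                # tensor([6.])

# The graph is freed after backward(); a second call raises RuntimeError
freed = False
try:
    loss.backward()
except RuntimeError:
    freed = True
print(freed)                 # True
```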

optimizer.step() : applies the optimizer's update rule to the model parameters, using the gradients stored in param.grad
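For plain SGD (no momentum) that rule is just p ← p − lr · p.grad; a hand-rolled sketch with an arbitrarily chosen learning rate:

```python
import torch

lr = 0.1
w = torch.tensor([1.0], requires_grad=True)

loss = (3 * w).sum()          # d(loss)/dw = 3
loss.backward()

# Manual equivalent of optimizer.step() for vanilla SGD:
with torch.no_grad():
    w -= lr * w.grad          # 1.0 - 0.1 * 3 = 0.7

print(w.detach())             # tensor([0.7000])
```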

Detach loss

loss, current = loss.item(), (batch + 1) * len(X)

Here loss.item() extracts the loss as a plain Python float, detached from the computation graph, so logging it does not keep the graph (and its memory) alive.
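The difference is easy to inspect: the loss tensor carries a `grad_fn`, while `item()` returns a bare float.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
loss = (w ** 2).sum()

print(type(loss))         # <class 'torch.Tensor'>, attached to the graph
print(loss.grad_fn)       # e.g. <SumBackward0 ...>

val = loss.item()         # plain Python float, no graph attached
print(type(val), val)     # <class 'float'> 4.0
```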

<aside> ❗

Because optimizer.zero_grad() clears param.grad each iteration, the new gradient is not accumulated onto the old one. The old gradient still matters indirectly: the new gradient is computed with the updated parameters, which were produced using the old gradient. Also, optimizers like Adam maintain internal state (running averages of gradients), so past gradients influence future updates even after zero_grad().

</aside>
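The accumulation behavior itself is worth seeing once: without zeroing, a second backward() adds to the gradient already stored.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

loss = (3 * w).sum()
loss.backward()
print(w.grad)             # tensor([3.])

# Without zeroing, a second backward ADDS to the stored gradient
loss2 = (3 * w).sum()
loss2.backward()
print(w.grad)             # tensor([6.])

w.grad.zero_()            # what optimizer.zero_grad() does per parameter
print(w.grad)             # tensor([0.])
```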