Training a neural network means repeating a simple loop: run a forward pass, measure the loss, backpropagate gradients, and update the parameters. In PyTorch that loop looks like this:
```python
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)  # `device` is assumed defined, e.g. "cuda" or "cpu"

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
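A minimal sketch of wiring this loop up end to end. The toy dataset, layer sizes, and learning rate are arbitrary assumptions, and the loop body is inlined (mirroring `train`) so the snippet runs on its own:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy dataset: 256 samples, 20 features, 3 classes (all made up)
features = torch.randn(256, 20)
labels = torch.randint(0, 3, (256,))
dataloader = DataLoader(TensorDataset(features, labels), batch_size=32)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

model.train()
for X, y in dataloader:
    X, y = X.to(device), y.to(device)
    loss = loss_fn(model(X), y)   # forward pass
    loss.backward()               # compute gradients
    optimizer.step()              # update parameters
    optimizer.zero_grad()         # clear gradients for the next batch
```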
During the forward pass,

```python
pred = model(X)
loss = loss_fn(pred, y)
```

PyTorch builds a computation graph.
Each tensor remembers how it was created: which operation produced it and which tensors it depends on.
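This bookkeeping is visible on any tensor that came out of an operation, via its `grad_fn` attribute (a small standalone sketch, unrelated to the model above):

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = (a * a).sum()  # b was produced by summing an elementwise product

# b records the operation that produced it...
print(b.grad_fn)                  # e.g. <SumBackward0 object at ...>
# ...and links back to the operations/tensors it depends on
print(b.grad_fn.next_functions)
```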
`loss.backward()`: traverses the computation graph backwards and stores each gradient in the corresponding `param.grad`. After that, PyTorch frees the computation graph.
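Both effects can be seen on a one-parameter example: `backward()` fills in `.grad`, and because the graph is freed, a second `backward()` on the same loss fails:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
loss = (w * 2) ** 2   # loss = 4w^2, so dloss/dw = 8w = 24 at w = 3

loss.backward()
print(w.grad)         # tensor(24.)

# The graph was freed after the first backward pass:
try:
    loss.backward()
except RuntimeError as e:
    print("second backward failed:", e)
```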
`optimizer.step()`: applies the gradient update rule to the model parameters.
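For plain SGD, the update rule is just `param -= lr * param.grad`. A manual sketch of the same step, with a made-up learning rate:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
loss = (w - 5.0) ** 2
loss.backward()        # w.grad = 2 * (w - 5) = -8

lr = 0.1
with torch.no_grad():  # the update itself must not be tracked by autograd
    w -= lr * w.grad   # same rule torch.optim.SGD applies to each parameter
print(w)               # w moved toward 5: 1.0 - 0.1 * (-8) = 1.8
```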
In the logging step,

```python
loss, current = loss.item(), (batch + 1) * len(X)
```

`loss.item()` extracts the scalar value as a plain Python float, detached from the computation graph.
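The difference is easy to check: the tensor still carries autograd metadata, while the `.item()` result is an ordinary number:

```python
import torch

loss = torch.tensor(0.25, requires_grad=True) * 2

print(type(loss))         # <class 'torch.Tensor'>, still attached to the graph
print(type(loss.item()))  # <class 'float'>, a plain Python number

# .item() copies the value out; the float carries no graph or grad info
val = loss.item()
```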
<aside> ❗
Gradients do not carry over from one batch to the next here, because the loop calls `optimizer.zero_grad()`. But the new gradient is still influenced by the old one indirectly: it is computed on parameters that the old gradient just updated. Also, optimizers like Adam maintain internal state (running moment estimates) across steps.
</aside>
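What `zero_grad()` actually prevents is silent accumulation: without it, each `backward()` adds into `.grad` instead of replacing it. A small demonstration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Two backward passes WITHOUT clearing grads in between: they add up
(w * 3).backward()
print(w.grad)    # tensor(3.)
(w * 3).backward()
print(w.grad)    # tensor(6.)  <- accumulated, not replaced

# Clearing the gradient makes the next step independent again
w.grad = None    # what optimizer.zero_grad(set_to_none=True) does
(w * 3).backward()
print(w.grad)    # tensor(3.)
```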