PyTorch Learning Rate Schedulers: A Comprehensive Guide
Master the art of dynamic learning rate adjustment for optimal deep learning
model training
Published: February 10, 2026 · 15 min read
Introduction
Learning rate is arguably one of the most critical hyperparameters in deep learning. It determines how
quickly or slowly a neural network learns from data. While using a fixed learning rate throughout
training can work, dynamically adjusting it often leads to better convergence, faster training, and
improved model performance.
PyTorch provides a rich collection of learning rate schedulers in the torch.optim.lr_scheduler module. These schedulers
automatically adjust the learning rate during training based on various strategies, from simple
step-wise reductions to sophisticated adaptive methods.
Why Learning Rate Scheduling Matters: In many setups, proper learning rate scheduling can cut
training time by 30-50% while reaching better final accuracy than a fixed learning rate.
Why Use Learning Rate Schedulers?
Faster Convergence
Start with higher learning rates for rapid initial progress, then reduce
to fine-tune the model as it approaches optimal weights.
Better Final Performance
Lower learning rates in later stages help the model settle into better
local minima, improving final accuracy.
Avoid Overshooting
Prevents the optimizer from bouncing around the minimum due to learning
rates that are too high.
Adaptive Training
Automatically adjust to training dynamics without manual intervention
during long training runs.
Common Learning Rate Patterns
Different schedulers create distinct learning rate patterns over time: step-wise drops, smooth
exponential decay, cosine annealing, and cyclical triangles. A simple way to compare them is to drive
each scheduler with a dummy optimizer and record the learning rate at every epoch, as sketched below.
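The sketch below records and plots a few of the schedules covered in this guide. The helper name
record_lrs, the dummy single-parameter optimizer, and the use of matplotlib are illustrative choices,
not part of PyTorch itself:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR
import matplotlib.pyplot as plt

def record_lrs(make_scheduler, epochs=100):
    # A single dummy parameter is enough to drive an optimizer/scheduler pair
    optimizer = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
    scheduler = make_scheduler(optimizer)
    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()    # step the optimizer first to avoid ordering warnings
        scheduler.step()
    return lrs

schedules = {
    'StepLR': lambda opt: StepLR(opt, step_size=30, gamma=0.1),
    'ExponentialLR': lambda opt: ExponentialLR(opt, gamma=0.95),
    'CosineAnnealingLR': lambda opt: CosineAnnealingLR(opt, T_max=100),
}

for name, make in schedules.items():
    plt.plot(record_lrs(make), label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning rate')
plt.legend()
plt.show()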
Types of PyTorch Learning Rate Schedulers
1. StepLR - Step Decay
Multiplies the learning rate by a factor gamma every step_size epochs.
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Setup optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Create StepLR scheduler
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)

    # Step the scheduler once per epoch
    scheduler.step()
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")
Best for: Simple experiments, baseline models, when you know roughly how long
training should take.
Parameters: step_size (epochs between reductions), gamma (multiplication
factor)
2. MultiStepLR - Multiple Milestones
Reduces learning rate at specific epoch milestones.
from torch.optim.lr_scheduler import MultiStepLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Reduce LR at epochs 30, 60, and 90
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
Best for: When you have domain knowledge about when to reduce LR, training
ImageNet models, following research paper schedules.
3. ExponentialLR - Exponential Decay
Exponentially decays the learning rate each epoch.
from torch.optim.lr_scheduler import ExponentialLR

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Decay LR by 1% each epoch
scheduler = ExponentialLR(optimizer, gamma=0.99)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")
Best for: Long training runs, fine-tuning pretrained models, when you want
smooth LR decay.
4. CosineAnnealingLR - Cosine Annealing
Anneals learning rate using a cosine function, popular in modern deep learning.
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Cosine annealing over 100 epochs, minimum LR = 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
Best for: Modern architectures (ResNet, DenseNet), competitive deep learning,
when you know total training epochs.
5. ReduceLROnPlateau - Adaptive Reduction
Reduces the learning rate when a monitored metric has stopped improving, making it the most adaptive
of PyTorch's built-in schedulers.
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Reduce LR when validation loss plateaus
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
                              patience=10, verbose=True, min_lr=1e-7)

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    # Step with validation loss
    scheduler.step(val_loss)

    print(f"Epoch {epoch}, Train Loss: {train_loss:.4f}, "
          f"Val Loss: {val_loss:.4f}, LR: {optimizer.param_groups[0]['lr']:.6f}")
Best for: Unknown training duration, experimental setups, when you want
hands-off training, early stopping scenarios.
6. CyclicLR - Cyclical Learning Rates
Cycles the learning rate between two boundaries with a constant frequency.
from torch.optim.lr_scheduler import CyclicLR

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Cycle between 0.001 and 0.1 every 2000 steps
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.1,
                     step_size_up=2000, mode='triangular')

for epoch in range(50):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step after each batch
        scheduler.step()
Best for: Finding optimal LR ranges, escaping local minima, when training is
stuck, super-convergence techniques.
7. OneCycleLR - One Cycle Policy
Implements the 1cycle learning rate policy for super-convergence.
from torch.optim.lr_scheduler import OneCycleLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One cycle over the entire training run
epochs = 50  # total number of training epochs
total_steps = len(train_loader) * epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=total_steps,
                       pct_start=0.3, anneal_strategy='cos')

for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step after each batch
        scheduler.step()
Best for: Fast training, when you have limited time, competitive deep learning,
following fastai practices.
Advanced Usage and Best Practices
Combining Multiple Schedulers
PyTorch allows chaining multiple schedulers for complex learning rate schedules.
from torch.optim.lr_scheduler import SequentialLR, LinearLR, ExponentialLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Warm up for 10 epochs, then exponential decay
scheduler1 = LinearLR(optimizer, start_factor=0.1, total_iters=10)
scheduler2 = ExponentialLR(optimizer, gamma=0.99)

# Combine schedulers
scheduler = SequentialLR(optimizer, schedulers=[scheduler1, scheduler2],
                         milestones=[10])

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
Learning Rate Range Test
Find optimal learning rates before training using the LR range test.
def lr_range_test(model, train_loader, criterion, start_lr=1e-7, end_lr=10, num_iter=100):
    """Perform a learning rate range test."""
    optimizer = optim.SGD(model.parameters(), lr=start_lr)
    # Grow the LR exponentially from start_lr to end_lr over num_iter batches
    scheduler = ExponentialLR(optimizer, gamma=(end_lr / start_lr) ** (1 / num_iter))

    losses = []
    lrs = []

    for i, (data, target) in enumerate(train_loader):
        if i >= num_iter:
            break

        # Training step
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Record
        losses.append(loss.item())
        lrs.append(optimizer.param_groups[0]['lr'])

        # Update LR
        scheduler.step()

    return lrs, losses

# Usage
lrs, losses = lr_range_test(model, train_loader, criterion)
# Plot results to find the optimal LR range
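To read off a usable learning rate from the test, plot the recorded loss against the learning rate on
a logarithmic x-axis and pick a value somewhat below the point where the loss starts to diverge. A
minimal plotting sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning rate (log scale)')
plt.ylabel('Training loss')
plt.title('LR range test')
plt.show()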
Saving and Loading Scheduler State
Save scheduler state for resuming training from checkpoints.
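A minimal sketch of a checkpoint that includes the scheduler; the filename 'checkpoint.pth' and the
dictionary keys are illustrative choices, while the state_dict()/load_state_dict() calls are the
standard PyTorch API:

# Saving: include the scheduler alongside the model and optimizer
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pth')

# Loading: restore all three before resuming training
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1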
Best Practices
• Use ReduceLROnPlateau when the training duration is unknown
• CosineAnnealingLR works well with modern architectures
• OneCycleLR for fast training and super-convergence
• Always validate scheduler behavior with small experiments
• Save scheduler state in checkpoints
• Use learning rate range tests to find optimal ranges
Common Pitfalls
• Don't use aggressive decay with Adam/AdamW
• Avoid stepping the scheduler before the first optimizer step (see the sketch after this list)
• Don't ignore scheduler warnings about step timing
• Avoid too-frequent LR changes with ReduceLROnPlateau
• Don't use the same schedule for different model sizes
• Avoid mixing epoch-based and step-based schedulers
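To make the step-ordering pitfall concrete: since PyTorch 1.1, scheduler.step() is expected to run
after optimizer.step(); the other way around skips the first scheduled value and triggers a warning.
A minimal sketch of the correct ordering for an epoch-based scheduler such as StepLR:

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()   # optimizer first...
    scheduler.step()       # ...then the scheduler, once per epoch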
Complete Training Example
Here's a complete example showing how to integrate learning rate scheduling into a
typical PyTorch training loop:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR, ReduceLROnPlateau
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

def train_with_scheduler(model, train_loader, val_loader, num_epochs=50):
    """Complete training example with learning rate scheduling"""
    # Setup
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

    # Choose scheduler strategy
    use_onecycle = True  # Toggle between OneCycleLR and ReduceLROnPlateau

    if use_onecycle:
        total_steps = len(train_loader) * num_epochs
        scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=total_steps)
        step_per_batch = True
    else:
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
                                      patience=5, verbose=True)
        step_per_batch = False

    # Training metrics
    train_losses = []
    val_losses = []
    learning_rates = []

    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0

        # Training phase
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

            # Step scheduler per batch if using OneCycleLR
            if step_per_batch:
                scheduler.step()
                learning_rates.append(optimizer.param_groups[0]['lr'])

        # Calculate average training loss
        avg_train_loss = epoch_loss / num_batches
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in val_loader:
                output = model(data)
                val_loss += criterion(output, target).item()
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        avg_val_loss = val_loss / len(val_loader)
        val_accuracy = 100 * correct / total
        val_losses.append(avg_val_loss)

        # Step scheduler per epoch if using ReduceLROnPlateau
        if not step_per_batch:
            scheduler.step(avg_val_loss)
            learning_rates.append(optimizer.param_groups[0]['lr'])

        # Logging
        current_lr = optimizer.param_groups[0]['lr']
        print(f'Epoch {epoch+1}/{num_epochs}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Val Loss: {avg_val_loss:.4f}')
        print(f'  Val Accuracy: {val_accuracy:.2f}%')
        print(f'  Learning Rate: {current_lr:.6f}')
        print('-' * 50)

        model.train()

    return {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'learning_rates': learning_rates
    }

# Usage
# results = train_with_scheduler(model, train_loader, val_loader)
Scheduler Performance Comparison
Scheduler | Training Speed | Final Accuracy | Ease of Use | Best Use Case
StepLR | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Simple baselines, known training duration
MultiStepLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Following research schedules, ImageNet
ExponentialLR | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Long training, fine-tuning
CosineAnnealingLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Modern architectures, competitive ML
ReduceLROnPlateau | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental work, unknown duration
OneCycleLR | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Fast training, super-convergence
CyclicLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | LR exploration, escaping local minima
Conclusion
Learning rate scheduling is a powerful technique that can significantly improve your deep learning
model's performance and training efficiency. The key is choosing the right scheduler for your specific
use case:
For Speed
Use OneCycleLR for fastest convergence and
super-convergence effects.
For Simplicity
Use ReduceLROnPlateau for hands-off training with
automatic adaptation.
For Performance
Use CosineAnnealingLR for modern architectures
and competitive results.
Remember to always experiment with different schedulers on your specific dataset and architecture. What
works best can vary significantly depending on your model, data, and training setup. The investment in
proper learning rate scheduling often pays off with better models and faster training times.
Next Steps
• Experiment with different schedulers on your current projects
• Implement learning rate range tests to find optimal ranges
• Try combining multiple schedulers for complex training schedules
• Monitor and log learning rates along with other training metrics
• Explore warm-up strategies for large batch training