PyTorch Learning Rate Schedulers: A Comprehensive Guide

Master the art of dynamic learning rate adjustment for optimal deep learning model training

Published: February 10, 2026 · 15 min read

Introduction

Learning rate is arguably one of the most critical hyperparameters in deep learning. It determines how quickly or slowly a neural network learns from data. While using a fixed learning rate throughout training can work, dynamically adjusting it often leads to better convergence, faster training, and improved model performance.

PyTorch provides a rich collection of learning rate schedulers in the torch.optim.lr_scheduler module. These schedulers automatically adjust the learning rate during training based on various strategies, from simple step-wise reductions to sophisticated adaptive methods.

Why Learning Rate Scheduling Matters: Proper learning rate scheduling can reduce training time by 30-50% while achieving better final model accuracy compared to fixed learning rates.

Why Use Learning Rate Schedulers?

Faster Convergence

Start with higher learning rates for rapid initial progress, then reduce to fine-tune the model as it approaches optimal weights.

Better Final Performance

Lower learning rates in later stages help the model settle into better local minima, improving final accuracy.

Avoid Overshooting

Prevents the optimizer from bouncing around the minimum due to learning rates that are too high.

Adaptive Training

Automatically adjust to training dynamics without manual intervention during long training runs.

Common Learning Rate Patterns

Different schedulers create distinct learning rate patterns over time, from staircase-style step decay to smooth cosine curves. You can trace and compare these patterns yourself with a short script like the one below.
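
As a minimal sketch (the hyperparameters are chosen purely for illustration and are not from this article's later examples), you can drive each scheduler with a single dummy parameter and record the learning rate it produces at every epoch:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR
import matplotlib.pyplot as plt

def trace_lr(make_scheduler, epochs=100, base_lr=0.1):
    """Record the learning rate a scheduler produces over `epochs` epochs."""
    # A single dummy parameter is enough to drive the optimizer and scheduler.
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = optim.SGD([param], lr=base_lr)
    scheduler = make_scheduler(optimizer)

    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()       # step the optimizer first ...
        scheduler.step()       # ... then advance the schedule
    return lrs

schedules = {
    'StepLR': lambda opt: StepLR(opt, step_size=30, gamma=0.1),
    'ExponentialLR': lambda opt: ExponentialLR(opt, gamma=0.95),
    'CosineAnnealingLR': lambda opt: CosineAnnealingLR(opt, T_max=100, eta_min=0.001),
}

for name, make_scheduler in schedules.items():
    plt.plot(trace_lr(make_scheduler), label=name)
plt.xlabel('Epoch')
plt.ylabel('Learning rate')
plt.legend()
plt.show()

Plotting the recorded values shows StepLR's staircase, ExponentialLR's smooth geometric decay, and CosineAnnealingLR's half-cosine curve.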

Types of PyTorch Learning Rate Schedulers

1. StepLR - Step Decay

Reduces learning rate by a multiplicative factor every few epochs.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Setup optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Create StepLR scheduler
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)

    # Step the scheduler once per epoch
    scheduler.step()
    print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")

Best for: Simple experiments, baseline models, when you know roughly how long training should take.
Parameters: step_size (number of epochs between reductions), gamma (multiplicative decay factor). For example, with lr=0.1, step_size=30, and gamma=0.1, the learning rate is 0.1 for epochs 0-29, 0.01 for epochs 30-59, and 0.001 from epoch 60 onward.

2. MultiStepLR - Multiple Milestones

Reduces learning rate at specific epoch milestones.

from torch.optim.lr_scheduler import MultiStepLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Reduce LR at epochs 30, 60, and 90
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Best for: When you have domain knowledge about when to reduce LR, training ImageNet models, following research paper schedules.

3. ExponentialLR - Exponential Decay

Exponentially decays the learning rate each epoch.

from torch.optim.lr_scheduler import ExponentialLR

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Decay LR by 1% each epoch
scheduler = ExponentialLR(optimizer, gamma=0.99)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()[0]:.6f}")

Best for: Long training runs, fine-tuning pretrained models, when you want smooth LR decay.

4. CosineAnnealingLR - Cosine Annealing

Anneals learning rate using a cosine function, popular in modern deep learning.
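
Concretely, with base learning rate eta_max and minimum eta_min, the annealed learning rate at epoch T_cur follows

    eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * T_cur / T_max)) / 2

so it starts at the base rate, decreases along a half cosine wave, and reaches eta_min at epoch T_max.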

from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Cosine annealing over 100 epochs, minimum LR = 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Best for: Modern architectures (ResNet, DenseNet), competitive deep learning, when you know total training epochs.

5. ReduceLROnPlateau - Adaptive Reduction

Reduces the learning rate when a monitored metric (typically validation loss) has stopped improving, making it the most adaptive of the built-in schedulers.

from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Reduce LR when validation loss plateaus
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
                              patience=10, min_lr=1e-7)

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    # Step with validation loss
    scheduler.step(val_loss)

    print(f"Epoch {epoch}, Train Loss: {train_loss:.4f}, "
          f"Val Loss: {val_loss:.4f}, LR: {optimizer.param_groups[0]['lr']:.6f}")

Best for: Unknown training duration, experimental setups, when you want hands-off training, early stopping scenarios.

6. CyclicLR - Cyclical Learning Rates

Cycles the learning rate between two boundaries with a constant frequency.

from torch.optim.lr_scheduler import CyclicLR

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Cycle between 0.001 and 0.1 every 2000 steps
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.1,
                     step_size_up=2000, mode='triangular')

for epoch in range(50):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step after each batch
        scheduler.step()

Best for: Finding optimal LR ranges, escaping local minima, when training is stuck, super-convergence techniques.

7. OneCycleLR - One Cycle Policy

Implements the 1cycle learning rate policy for super-convergence.

from torch.optim.lr_scheduler import OneCycleLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One cycle over entire training
epochs = 50  # total number of training epochs
total_steps = len(train_loader) * epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=total_steps,
                       pct_start=0.3, anneal_strategy='cos')

for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step after each batch
        scheduler.step()

Best for: Fast training, when you have limited time, competitive deep learning, following fastai practices.

Advanced Usage and Best Practices

Combining Multiple Schedulers

PyTorch allows chaining multiple schedulers for complex learning rate schedules.

from torch.optim.lr_scheduler import SequentialLR, LinearLR, ExponentialLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Warm up for 10 epochs, then exponential decay
scheduler1 = LinearLR(optimizer, start_factor=0.1, total_iters=10)
scheduler2 = ExponentialLR(optimizer, gamma=0.99)

# Combine schedulers
scheduler = SequentialLR(optimizer, schedulers=[scheduler1, scheduler2],
                         milestones=[10])

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Learning Rate Range Test

Find optimal learning rates before training using the LR range test.

def lr_range_test(model, train_loader, start_lr=1e-7, end_lr=10, num_iter=100):
    """Perform learning rate range test"""
    optimizer = optim.SGD(model.parameters(), lr=start_lr)
    scheduler = ExponentialLR(optimizer, gamma=(end_lr / start_lr) ** (1 / num_iter))

    losses = []
    lrs = []

    for i, (data, target) in enumerate(train_loader):
        if i >= num_iter:
            break

        # Training step
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Record
        losses.append(loss.item())
        lrs.append(optimizer.param_groups[0]['lr'])

        # Update LR
        scheduler.step()

    return lrs, losses

# Usage
lrs, losses = lr_range_test(model, train_loader)
# Plot results to find optimal LR range
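
To read a usable range off the test, a common heuristic (sketched here, not part of the function above) is to plot loss against learning rate on a logarithmic axis and choose a value somewhat below the point where the loss starts to blow up:

import matplotlib.pyplot as plt

# `lrs` and `losses` come from lr_range_test() above
plt.plot(lrs, losses)
plt.xscale('log')               # the LR grows exponentially, so use a log axis
plt.xlabel('Learning rate')
plt.ylabel('Training loss')
plt.title('LR range test')
plt.show()

# Heuristic: pick a max_lr a bit below the LR where the loss diverges,
# and roughly 1/10 of that value as base_lr for CyclicLR/OneCycleLR.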

Saving and Loading Scheduler State

Save scheduler state for resuming training from checkpoints.

# Saving checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Loading checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
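
When resuming, a simple pattern (a sketch using the names from the snippet above) is to restart the epoch loop just after the saved epoch, so the optimizer and scheduler continue exactly where they left off:

start_epoch = checkpoint['epoch'] + 1  # continue with the next epoch

for epoch in range(start_epoch, 100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()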

Common Patterns and Recommendations

Best Practices

  • Use ReduceLROnPlateau for unknown training duration
  • CosineAnnealingLR works well with modern architectures
  • OneCycleLR for fast training and super-convergence
  • Always validate scheduler behavior with small experiments
  • Save scheduler state in checkpoints
  • Use learning rate range tests to find optimal ranges

Common Pitfalls

  • Don't use aggressive decay with Adam/AdamW
  • Avoid stepping the scheduler before the first optimizer step (see the snippet after this list)
  • Don't ignore scheduler warnings about step timing
  • Avoid too frequent LR changes with ReduceLROnPlateau
  • Don't use the same schedule for different model sizes
  • Avoid mixing epoch-based and step-based schedulers
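
On the step-ordering pitfall: since PyTorch 1.1, optimizer.step() should be called before scheduler.step() within an iteration; the reverse order skips the first value of the schedule and triggers a warning. A minimal sketch of the correct ordering for an epoch-based scheduler (assuming the usual model, criterion, optimizer, and scheduler are already defined):

for epoch in range(num_epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()      # update the weights first

    scheduler.step()          # then advance the epoch-based schedule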

Complete Training Example

Here's a complete example showing how to integrate learning rate scheduling into a typical PyTorch training loop:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR, ReduceLROnPlateau

def train_with_scheduler(model, train_loader, val_loader, num_epochs=50):
    """Complete training example with learning rate scheduling"""
    # Setup
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

    # Choose scheduler strategy
    use_onecycle = True  # Toggle between OneCycleLR and ReduceLROnPlateau

    if use_onecycle:
        total_steps = len(train_loader) * num_epochs
        scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=total_steps)
        step_per_batch = True
    else:
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)
        step_per_batch = False

    # Training metrics
    train_losses = []
    val_losses = []
    learning_rates = []

    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        num_batches = 0

        # Training phase
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

            # Step scheduler per batch if using OneCycleLR
            if step_per_batch:
                scheduler.step()
                learning_rates.append(optimizer.param_groups[0]['lr'])

        # Calculate average training loss
        avg_train_loss = epoch_loss / num_batches
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in val_loader:
                output = model(data)
                val_loss += criterion(output, target).item()
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        avg_val_loss = val_loss / len(val_loader)
        val_accuracy = 100 * correct / total
        val_losses.append(avg_val_loss)

        # Step scheduler per epoch if using ReduceLROnPlateau
        if not step_per_batch:
            scheduler.step(avg_val_loss)
            learning_rates.append(optimizer.param_groups[0]['lr'])

        # Logging
        current_lr = optimizer.param_groups[0]['lr']
        print(f'Epoch {epoch+1}/{num_epochs}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Val Loss: {avg_val_loss:.4f}')
        print(f'  Val Accuracy: {val_accuracy:.2f}%')
        print(f'  Learning Rate: {current_lr:.6f}')
        print('-' * 50)

        model.train()

    return {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'learning_rates': learning_rates
    }

# Usage
# results = train_with_scheduler(model, train_loader, val_loader)

Scheduler Performance Comparison

Scheduler | Training Speed | Final Accuracy | Ease of Use | Best Use Case
StepLR | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Simple baselines, known training duration
MultiStepLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Following research schedules, ImageNet
ExponentialLR | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Long training, fine-tuning
CosineAnnealingLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Modern architectures, competitive ML
ReduceLROnPlateau | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental work, unknown duration
OneCycleLR | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Fast training, super-convergence
CyclicLR | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | LR exploration, escaping local minima

Conclusion

Learning rate scheduling is a powerful technique that can significantly improve your deep learning model's performance and training efficiency. The key is choosing the right scheduler for your specific use case:

For Speed

Use OneCycleLR for fastest convergence and super-convergence effects.

For Simplicity

Use ReduceLROnPlateau for hands-off training with automatic adaptation.

For Performance

Use CosineAnnealingLR for modern architectures and competitive results.

Remember to always experiment with different schedulers on your specific dataset and architecture. What works best can vary significantly depending on your model, data, and training setup. The investment in proper learning rate scheduling often pays off with better models and faster training times.

Next Steps

  • Experiment with different schedulers on your current projects
  • Implement learning rate range tests to find optimal ranges
  • Try combining multiple schedulers for complex training schedules
  • Monitor and log learning rates along with other training metrics
  • Explore warm-up strategies for large batch training