Understanding the Adam Optimization Algorithm in Machine Learning
Machine learning and artificial intelligence power applications across many industries, including healthcare, finance, transportation, and entertainment. Training accurate models depends on effective optimization algorithms. One of the most widely used is the Adam optimization algorithm, which has gained popularity for its effectiveness and versatility.
This article will delve into the workings of the Adam optimization algorithm, explaining its history, advantages, and applications. We will also compare it with other optimization algorithms and assess its performance on standard datasets like MNIST.
The Basics of Optimization
To understand the Adam algorithm, we first need to understand optimization in machine learning. Optimization means adjusting the parameters of a model to minimize a loss function (or, equivalently, maximize an objective). The loss function quantifies the difference between the model's predicted outputs and the actual outputs.
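For instance, a common loss function is the mean squared error (MSE). The short snippet below, using purely illustrative values, shows how it quantifies the gap between predictions and targets:

import numpy as np

# Hypothetical predictions and ground-truth values, for illustration only.
predictions = np.array([2.5, 0.0, 2.1, 7.8])
targets = np.array([3.0, -0.5, 2.0, 7.0])

# Mean squared error: average squared difference between prediction and target.
mse = np.mean((predictions - targets) ** 2)
print(f"MSE: {mse:.4f}")  # smaller is better; optimization adjusts parameters to reduce this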
The Adam algorithm (short for Adaptive Moment Estimation), proposed by Diederik Kingma and Jimmy Ba in 2015, is an adaptive optimization method that computes an individual learning rate for each parameter from estimates of the first and second moments of its gradients.
Advantages of Adam
- Adaptive learning rates: Adam adapts the learning rate for each parameter based on the history of its gradients, allowing for more efficient convergence.
- Momentum: Incorporates a momentum-like moving average of the gradient to smooth noisy updates and speed progress through flat regions and shallow local minima.
- Combination of algorithms: Merges the benefits of AdaGrad (which handles sparse gradients well) and RMSProp (which handles non-stationary objectives well).
Implementation of Adam
The Adam algorithm updates the model parameters iteratively using exponential moving averages of the gradients (the first moment) and of the squared gradients (the second moment), with a bias correction applied to both.
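Concretely, at step t, given the gradient g_t of the loss with respect to a parameter theta, Adam performs the following updates (as defined by Kingma and Ba):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)

Here alpha is the step size, beta1 and beta2 are the decay rates of the two moving averages (typically 0.9 and 0.999), and epsilon is a small constant that prevents division by zero. The bias-correction terms (1 - beta1^t) and (1 - beta2^t) compensate for the moment estimates being initialized at zero.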
Problem: Optimizing a Simple Quadratic Function
Consider the quadratic function f(x, y) = x^2 + y^2, which has its global minimum at (0, 0). This function can be visualized as a 2D contour plot with circular level sets.
Solution: Using the Adam Optimizer
The Adam optimizer efficiently finds the minimum of this function through a structured approach:
- Initialize parameters and moments.
- Iterate through gradient calculations and moment updates.
- Adjust the parameters until convergence.
Code Example
import numpy as np

def loss_function(x, y):
    return x**2 + y**2

def gradient(x, y):
    return np.array([2 * x, 2 * y])

def adam_optimizer(x, y, alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Generator that performs Adam updates on (x, y), yielding (x, y, loss) after each step."""
    params = np.array([x, y], dtype=float)
    m = np.zeros_like(params)  # first moment: moving average of gradients
    v = np.zeros_like(params)  # second moment: moving average of squared gradients
    t = 0
    while True:
        t += 1
        grad = gradient(*params)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)  # bias-corrected first moment
        v_hat = v / (1 - beta2**t)  # bias-corrected second moment
        params = params - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
        yield params[0], params[1], loss_function(*params)
This code defines the Adam optimizer as a Python generator: each step performs one update and yields the current point and loss. Experiment with different learning rates to see how they impact optimization.
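As a quick usage sketch, the snippet below runs 100 Adam steps from the starting point (3, 4) (an arbitrary choice for illustration) and plots the loss curve:

import itertools
import matplotlib.pyplot as plt

# Take 100 steps from the generator defined above, starting at the arbitrary point (3, 4).
steps = list(itertools.islice(adam_optimizer(3.0, 4.0, alpha=0.1), 100))
x_final, y_final, _ = steps[-1]
print(f"final point: ({x_final:.4f}, {y_final:.4f})")  # should approach (0, 0)

plt.plot([loss for _, _, loss in steps])
plt.xlabel("iteration")
plt.ylabel("loss")
plt.title("Adam on f(x, y) = x^2 + y^2")
plt.show()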
Train an MNIST Model Using the Adam Optimizer and PyTorch
The MNIST dataset consists of 60,000 training images and 10,000 testing images of handwritten digits (0-9).
- Each image is 28x28 pixels in grayscale.
- Commonly used for benchmarking classification algorithms.
- Developed by Yann LeCun, Corinna Cortes, and Christopher Burges.
It serves as a foundational dataset in deep learning research, often used for training and evaluating neural networks.
Sample MNIST images (image by Suvanjanprasai, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=156115980).
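Before training, the dataset has to be loaded. Below is a minimal sketch using torchvision; the normalization constants 0.1307 and 0.3081 are the commonly used MNIST mean and standard deviation, an assumption rather than something specified above:

import torchvision
from torchvision import transforms

# Convert images to tensors and normalize with the commonly used MNIST statistics
# (mean 0.1307, std 0.3081 — assumed here, not specified in the original article).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

With the datasets in place, the full training script is: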
import numpy as np
import torch
from torch import nn
from tqdm import tqdm

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        """
        Args:
            input_size: (1, 28, 28)
            num_classes: 10
        """
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(input_size[0], 32, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2))
        # After two conv+pool stages, a 28x28 input becomes 64 feature maps of size 4x4.
        self.fc1 = nn.Linear(4 * 4 * 64, num_classes)

    def forward(self, x):
        """
        Args:
            x: (N x 1 x 28 x 28) tensor
        """
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.reshape(x.size(0), -1)  # flatten to (N, 4*4*64)
        x = self.fc1(x)
        return x

model = NeuralNetwork((1, 28, 28), 10)

opts = {
    'lr': 5e-4,
    'epochs': 50,
    'batch_size': 64
}

optimizer = torch.optim.Adam(model.parameters(), opts['lr'])
criterion = torch.nn.CrossEntropyLoss()  # loss function

# train_dataset and test_dataset are the MNIST datasets loaded above.
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=opts['batch_size'], shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=opts['batch_size'], shuffle=False)

for epoch in range(opts['epochs']):
    model.train()
    train_loss = []
    for i, (data, labels) in tqdm(enumerate(train_loader), total=len(train_loader)):
        outputs = model(data)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss.append(loss.item())

    model.eval()
    test_loss = []
    test_accuracy = []
    with torch.no_grad():  # no gradients needed during evaluation
        for i, (data, labels) in enumerate(test_loader):
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            loss = criterion(outputs, labels)
            test_loss.append(loss.item())
            test_accuracy.append((predicted == labels).sum().item() / predicted.size(0))

    print('epoch: {}, train loss: {}, test loss: {}, test accuracy: {}'.format(
        epoch, np.mean(train_loss), np.mean(test_loss), np.mean(test_accuracy)))
With these settings, the above code reaches over 99% accuracy on the MNIST test set.