MNIST Hand-Written Number Classifier | Code-Uncovered

Un-coding the code

Aryan Saha
6 min read · Feb 12, 2021

*Code created from this awesome Udacity course

We’ve all done courses, classes, and tutorials of code, but somehow at the end of the day, we didn’t know what certain parts meant. With this series, “Code-Uncovered,” I’ll go line-by-line on different programs and elaborate on what they mean.

For this piece, we'll be going line-by-line through this neural network, which classifies handwritten numbers. The digits range from 0–9, and each image is 28x28 pixels.

Here’s a link to the whole file: https://github.com/aplusryan/MNSTclassifier

import torch
from torch import nn
from torchvision import datasets, transforms
from torch import optim

Here we import the tools we need.

torch:

As this file is made with PyTorch, “torch” is just a Tensor Library, similar to NumPy. This package has data structures for tensors of many dimensions, in addition to mathematical operations and many other things.

nn:

torch.nn is the part of PyTorch that helps us create and train neural networks. This module provides the containers for our neural network, in addition to the layers of our network.

torchvision:

Torchvision is a library that is part of PyTorch. Its main purpose is to handle vision-related data. Here, torchvision is used to import the images. Torchvision can download popular datasets, and it also ships with model architectures that can be reused.

- datasets:

torchvision.datasets: this is where we get our MNIST dataset.

- transforms:

torchvision.transforms: this allows us to manipulate/change our data/image characteristics.

torch.optim:

This is a package that allows us to use many different optimization algorithms to make our neural network work better.


transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

In these lines, we're defining a transform that normalizes our data. .ToTensor() converts each image into a tensor with pixel values between 0 and 1, and .Normalize((0.5,), (0.5,)) subtracts a mean of 0.5 and divides by a standard deviation of 0.5, so the values end up between -1 and 1.
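To see what the transform actually does to one image, here's a quick sanity check (not in the original file), reusing the transform and imports we just defined:

# Grab one raw training image without any transform applied (a PIL image, pixel values 0-255)
raw = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True)
pil_image, label = raw[0]

# Apply our transform: ToTensor scales pixels to [0, 1], Normalize shifts them to [-1, 1]
tensor_image = transform(pil_image)
print(tensor_image.shape)                                    # torch.Size([1, 28, 28])
print(tensor_image.min().item(), tensor_image.max().item())  # roughly -1.0 and 1.0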


trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

The first line here just downloads the MNIST dataset (trainset). "trainloader" then serves that data in batches of 64 images, shuffling their order every epoch.
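If you're curious what the loader actually hands back, one batch looks like this (a quick peek, not part of the original file, reusing the trainloader defined above):

# Pull a single batch from the loader to inspect its shape
images, labels = next(iter(trainloader))
print(images.shape)   # torch.Size([64, 1, 28, 28]) -> 64 images, 1 colour channel, 28x28 pixels
print(labels.shape)   # torch.Size([64])            -> one digit label per image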


model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

Woohoo! Now that we're done with the boring stuff (importing data and libraries), we can actually focus on the fun stuff (models)! For the basics, let's review the bare-bones building block of neural networks: the perceptron. At a high level, neural networks and perceptrons are very similar.

Perceptrons are how we get our results. We have a number of inputs, x(1)–x(n). Then we run our magic math: we multiply each input by a weight, add those products up, and add a bias (a constant) to get a weighted sum. Then we pass that sum through a step function to get a "final answer." This step function, or more generally the activation function, can differ based on what we want, but without one, neural networks would not learn. Activation functions keep values from blowing up into extreme ranges, introduce the non-linearity the network needs, and turn one layer's output into a reasonable input for the next layer, as shown in the sketch below.
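Here's a tiny perceptron sketch to make that concrete (an illustration only, not part of the classifier, using the torch import from the top of the file):

def perceptron(x, weights, bias):
    # Weighted sum: multiply each input by its weight, add them up, then add the bias
    weighted_sum = torch.dot(weights, x) + bias
    # Step activation: output 1 if the weighted sum is positive, otherwise 0
    return 1 if weighted_sum > 0 else 0

x = torch.tensor([0.5, -1.0, 2.0])         # inputs x(1)..x(3)
weights = torch.tensor([0.8, 0.2, -0.4])   # one weight per input
bias = 0.1
print(perceptron(x, weights, bias))        # 0, since 0.4 - 0.2 - 0.8 + 0.1 = -0.5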

Let’s get back to our model. nn.Sequential() is a tool that allows us to easily stack different layers. We have the line nn.Linear(784, 128). This function comes from the nn import, and it creates a linear transformation of our data. The first parameter is the number of inputs we have, and the second parameter is how many outputs we want. Right after this line, we have nn.ReLU(), our activation function.

This function takes a positive input and returns it unchanged, while any non-positive input is returned as 0.
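A tiny check of that behaviour (illustrative only, using the torch and nn imports from above):

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 3.0000])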

model code once again:

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

We continue this pattern of nn.Linear() and nn.ReLU() for one more layer, until we reach the last layer: nn.Linear(64, 10), nn.LogSoftmax(dim=1). nn.Linear() here turns our 64 inputs into the final 10 outputs, one for each digit, 0–9. nn.LogSoftmax(dim=1) is our new, and final, activation function. LogSoftmax is an activation function based on Softmax. Softmax turns logits (raw scores) into probabilities. Since we want the output of our model to be the probability of a written digit being a certain number, Softmax lets us take the output of the last layer and turn it into a probability. We then apply the log so we can compute the cross-entropy-style loss, something we'll learn about in another post. The only parameter, dim=1, is the dimension along which Softmax is computed. Our output is a batch with one row of 10 scores per image: dim=0 would compute Softmax down each column (across the batch), while dim=1 computes it across the 10 scores within each row, which is what we want.
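To see what dim=1 means in practice, here's a small check on a made-up batch of two images with ten class scores each (an illustration, not part of the original file):

log_softmax = nn.LogSoftmax(dim=1)

# Pretend output of the last Linear layer: 2 images x 10 class scores (logits)
logits = torch.randn(2, 10)

log_probs = log_softmax(logits)    # log-probabilities, one row per image
probs = torch.exp(log_probs)       # undo the log to get plain probabilities

# dim=1 means Softmax runs across the 10 scores in each row,
# so every row's probabilities sum to 1
print(probs.sum(dim=1))            # tensor([1.0000, 1.0000])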


criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)

Here we bring in the parts that help our model learn. The first line is the loss, or the criterion. nn.NLLLoss() is the negative log likelihood loss; combined with the LogSoftmax output, it behaves like cross-entropy, something we'll touch on in a different article. The optimizer is what actually changes our network based on those losses, by updating the model parameters. optim is the optim package from torch. .SGD() is stochastic gradient descent, in short, a process that nudges the parameters to lower the error. Its arguments are which parameters the optimizer should update (model.parameters()) and the learning rate (lr=0.003), which controls how big each update step is.
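To see how the criterion behaves on its own, we can feed it some made-up log-probabilities and labels (a standalone sketch, not part of the training script):

criterion = nn.NLLLoss()

# Made-up model output: log-probabilities for 3 images over 10 classes
log_probs = nn.LogSoftmax(dim=1)(torch.randn(3, 10))
labels = torch.tensor([3, 7, 0])   # pretend "true" digits for those 3 images

loss = criterion(log_probs, labels)
print(loss)  # one scalar: the average negative log-probability assigned to the true class

As a side note, LogSoftmax followed by NLLLoss is the same computation as nn.CrossEntropyLoss applied directly to the raw scores.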


epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)

        optimizer.zero_grad()

        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")

Finally! Now that we've set everything up, here's where we actually train the model.

Our program here runs 5 epochs. For those of you who don't know, epochs are the number of times the network goes through the whole dataset, updating its weights and biases based on the error.

We have a nested loop: one for running through the epochs, and one for running through the data. Let's dive into the inner loop. First we start off with images = images.view(images.shape[0], -1). As the comment says, this line flattens each image from 28x28 into a single row of 784 pixels. Our next line is optimizer.zero_grad(), which resets the gradients to zero. If we don't do this, PyTorch keeps adding new gradients on top of the old ones after every backward pass, which would throw off the updates needed to reduce the error.
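Two quick checks on those lines (illustrative only, not in the original file; the first one reuses the trainloader from earlier):

# 1) Flattening: each 1x28x28 image becomes one row of 784 values
images, labels = next(iter(trainloader))
print(images.shape)                            # torch.Size([64, 1, 28, 28])
print(images.view(images.shape[0], -1).shape)  # torch.Size([64, 784])

# 2) Why zero_grad matters: gradients pile up across backward() calls
w = torch.ones(1, requires_grad=True)
(w * 2).sum().backward()
print(w.grad)          # tensor([2.])
(w * 2).sum().backward()
print(w.grad)          # tensor([4.]) -- added on top of the old gradient
w.grad.zero_()         # this is what optimizer.zero_grad() does for every parameter
print(w.grad)          # tensor([0.])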

output = model(images)
loss = criterion(output, labels)
loss.backward()
optimizer.step()

The first line, output = model(images), takes the image data (remember we flattened it into vectors) and runs it through the model. Then we take this output, along with the labels associated with those images, and run them through the criterion, the NLLLoss, to get the loss value.

loss.backward() is just backpropagation, something we'll also cover in a different article. As a short intro: backpropagation computes the gradient of the loss with respect to each parameter. optimizer.step() then updates each weight based on those gradients.
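For intuition, a plain-SGD optimizer.step() boils down to something like this hand-written update (a conceptual sketch of the idea, not PyTorch's actual implementation; it assumes the model and the loss.backward() call above):

lr = 0.003
with torch.no_grad():                  # don't track this update in the autograd graph
    for p in model.parameters():       # the model defined earlier in the article
        if p.grad is not None:
            p -= lr * p.grad           # nudge each parameter against its gradient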


        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")

These lines just let us see how well the network is learning.
print(f"Training loss: {running_loss/len(trainloader)}") takes the total loss accumulated over the epoch and divides it by the number of batches (len(trainloader)), giving the average loss per batch.

Congrats! We finally made an actual neural network. Now let's put it to use!

Woohoo!!!! We made and trained our first ever neural network. Now we’re able to see which numbers are which quite easily!
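As a quick bonus (this part isn't in the original file), here's one way you might ask the trained model what it thinks a single image is, reusing the model and trainloader from above:

# Grab one image and flatten it the same way as during training
images, labels = next(iter(trainloader))
img = images[0].view(1, -1)                # shape [1, 784]

with torch.no_grad():                      # no gradients needed just to predict
    log_probs = model(img)                 # the model outputs log-probabilities (LogSoftmax)

probs = torch.exp(log_probs)               # undo the log to get plain probabilities
prediction = probs.argmax(dim=1).item()    # the digit with the highest probability
print(f"Predicted: {prediction}, actual: {labels[0].item()}")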

Hopefully this article went in-depth on each line of code in the program, letting you fully understand the code you're writing. I'll continue with line-by-line articles on other, more difficult programs in the future.

Thanks For Reading. I’m Aryan Saha, a 15-year-old working to use technology to help solve climate change. If you would like to contact me, reach out at aryannsaha@gmail.com. If you would like to connect, visit my linkedin: https://www.linkedin.com/in/aryan-saha-0190541a0/.
