Introduction to Image Recognition: Building a Simple Digit Detector

Digit recognition is not particularly difficult or advanced. It is the “Hello world!” of image recognition: not that cool, but it is exactly where you start. So I decided to share my work and refresh my knowledge at the same time, since it has been a long time since I last played with images.

Data Import and Exploration

We start with importing all the necessary packages.

import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline

The MNIST dataset of handwritten digits is the “Hello World” dataset for this task. A sample of it with 20 thousand digits is already preloaded in Colaboratory (cloud-based Python notebooks, a fantastic thing BTW), so we will use it. No need to reinvent the wheel here.

# load data
df = pd.read_csv('sample_data/mnist_train_small.csv', header=None)

As we can see from the head() method, the first column of the dataset contains the labels and the rest are the pixels of a 28×28 image, which is why we have 784 more columns. It is also useful to check the length of the dataset after every modification to make sure we did everything correctly.
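
The layout described above can be checked on a small stand-in array. This is only a sketch on random data rather than the real CSV, and the names `fake_rows` and `split_row` are illustrative and do not appear in the notebook. It shows how one flat row splits into a label and 784 pixel values, and how those values fold back into a 28×28 grid.

```python
import numpy as np

# Stand-in for the CSV: 5 rows, 1 label column followed by 28 * 28 = 784 pixel columns.
rng = np.random.default_rng(0)
fake_rows = np.hstack([
    rng.integers(0, 10, size=(5, 1)),     # labels 0-9
    rng.integers(0, 256, size=(5, 784)),  # grayscale pixel values 0-255
])

def split_row(row):
    """Split one flat row into its label and a 28x28 image."""
    label, pixels = row[0], row[1:]
    return int(label), pixels.reshape(28, 28)

label, img = split_row(fake_rows[0])
print(fake_rows.shape, img.shape)  # (5, 785) (28, 28)
```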

Next, let’s visualize the pixels and look at the images we have. We use randint() to select a random image every time we run the code below. We also have to convert the pixels to a numpy array (at this point they are a pandas Series) and reshape it to 28×28 to be able to plot it.

ix = random.randint(0, len(df)-1)
label, pixels = df.loc[ix][0], df.loc[ix][1:]
img = np.array(pixels).reshape((28,28))
print('label: ' + str(label))
plt.imshow(img)
label: 9
<matplotlib.image.AxesImage at 0x7ff9ac6fda20>

Data Preprocessing

Now, to make our life a little bit easier, we will transform our dataframe to have only two columns, label and image, where image is a numpy array of pixels. We will also reduce the size of the dataframe for faster computation (first we want to make sure everything works, and only then start playing with the model).

# transforming df for easier manipulation
labels, imgs = [], []
for index, row in df.iterrows():
    label, pixels = row[0], row[1:]
    img = np.array(pixels)
    labels.append(label)
    imgs.append(img)

df2 = pd.DataFrame({'label': labels, 'img': imgs})
df2 = df2[:1000]
df2.head()
0[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
1[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
2[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …7
3[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
4[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …5
# checking images using new df structure
ix = random.randint(0, len(df2)-1)
img = df2.loc[ix].img.reshape((28,28))
label = df2.loc[ix].label
print('label: ' + str(label))
plt.imshow(img)
label: 9
<matplotlib.image.AxesImage at 0x7ff9a9b997f0>

Once our data is prepared, we want to split it into two datasets: one to train our model and another to test its performance. The easiest way to do that is with sklearn. We set test_size=0.2, a standard value for this operation (usually 20-30% of the data is left for testing), which leaves 80% for training. It is also good practice to set shuffle=True, as some datasets are ordered, in which case the model would learn to recognize 0s and 1s but would have no idea that, for example, 8 exists.

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df2, test_size=0.2, shuffle=True)
print(len(train_df), len(test_df))

800 200

825[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
305[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
189[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …3
397[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
70[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …8
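
The point about ordered data can be made concrete with a toy example. The sketch below uses plain NumPy on synthetic labels sorted by class, not sklearn itself, and simply takes the last 20% of the data as the test set. Without shuffling, the training part never sees the digits 8 and 9.

```python
import numpy as np

# 1000 labels sorted by class: one hundred 0s, then one hundred 1s, and so on.
labels = np.repeat(np.arange(10), 100)
n_test = int(0.2 * len(labels))

# No shuffling: the test set is simply the tail of the ordered data.
train_ordered, test_ordered = labels[:-n_test], labels[-n_test:]

# Shuffling: permute the indices first, then split in the same way.
rng = np.random.default_rng(42)
perm = rng.permutation(len(labels))
train_shuffled, test_shuffled = labels[perm[:-n_test]], labels[perm[-n_test:]]

print(sorted(set(train_ordered.tolist())))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(sorted(set(test_ordered.tolist())))    # [8, 9]
print(len(set(train_shuffled.tolist())))     # number of classes seen after shuffling
```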

Building a Model

We checked the length and the head of both datasets, and all is good, so we can start building our model. For this we need to install PyTorch. If we go to “Code snippets” and start typing ‘pyt’, it will show us “Install [pytorch]”, which we can insert into our notebook. If PyTorch is already installed, this step can be skipped.

from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep|sed -e 's/..([0-9]).([0-9]*)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
!pip install -q{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
# importing torch and setting up the device
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Next, we have to wrap our data in a PyTorch Dataset. torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit from Dataset and override the following methods:

  • __len__ so that len(dataset) returns the size of the dataset.
  • __getitem__ to support indexing, so that dataset[i] returns the i-th sample.
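
To see why these two methods are enough, here is a plain-Python sketch of how a loader could build batches from any object that supports len() and indexing. It is a simplified illustration, not how PyTorch's DataLoader is implemented (there is no tensor collation and there are no worker processes), and `SquaresDataset` and `iterate_batches` are made-up names.

```python
import random

class SquaresDataset:
    """Toy dataset: item i is the pair (i, i * i)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, ix):
        return ix, ix * ix

def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of samples, using only len(dataset) and dataset[i]."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

batches = list(iterate_batches(SquaresDataset(10), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```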
# create torch dataset
from torch.utils.data import Dataset
class MNISTDataset(Dataset):
    def __init__(self, imgs, labels):
        super(MNISTDataset, self).__init__()
        self.imgs = imgs
        self.labels = labels
    def __len__(self):
        return len(self.imgs)
    def __getitem__(self, ix):
        img = self.imgs[ix]
        label = self.labels[ix]
        return torch.from_numpy(img).float(), label

dataset = {
    'train': MNISTDataset(train_df.img.values, train_df.label.values),
    'test': MNISTDataset(test_df.img.values, test_df.label.values)
}

# again checking image, now based on torch dataset
ix = random.randint(0, len(dataset['train'])-1)
img, label = dataset['train'][ix]
print(img.shape, img.dtype)
plt.imshow(img.reshape((28,28)))

torch.Size([784]) torch.float32
<matplotlib.image.AxesImage at 0x7ff99eeeed30>

The beauty of PyTorch is how simple it is to define a model. We define each layer by its number of inputs and outputs, add batch normalization to improve training (a technique that gives any layer in a neural network inputs with zero mean and unit variance) and an activation function, in this case ReLU.
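
The normalization step itself is easy to reproduce by hand. The sketch below is plain NumPy and only covers the core idea: PyTorch's BatchNorm1d additionally learns a scale and a shift per feature and keeps running statistics for evaluation, which are left out here. The function name `batch_norm` is illustrative.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) of a batch to zero mean and unit variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# A batch of 32 samples with 4 features, deliberately shifted and scaled.
batch = rng.normal(loc=50.0, scale=10.0, size=(32, 4))
normalized = batch_norm(batch)

print(normalized.mean(axis=0).round(3))  # close to 0 for every feature
print(normalized.std(axis=0).round(3))   # close to 1 for every feature
```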

The first layer has 784 inputs (one per pixel) and 512 outputs (this number is almost arbitrary: I tried a few different values and this one performed pretty well, so I kept it). The next layer has 512 inputs (input_layer[n+1] == output_layer[n]) and 256 outputs, the one after that 256 inputs and 128 outputs, and the last one 128 inputs and 10 outputs (each output neuron represents one of the 10 digits).

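# create model
import torch.nn as nn
def block(in_f, out_f):
    return nn.Sequential(
        nn.Linear(in_f, out_f),
        nn.BatchNorm1d(out_f),
        nn.ReLU(),
    )
model = nn.Sequential(
    block(784, 512),
    block(512, 256),
    block(256, 128),
    nn.Linear(128, 10),
).to(device)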
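
The layer sizes described above can be sanity-checked without PyTorch. This NumPy sketch pushes a random batch through four linear layers with random weights (batch normalization is left out for brevity) to show how the shape changes from 784 to 10 values per image, and counts the weights and biases of the linear layers.

```python
import numpy as np

layer_sizes = [784, 512, 256, 128, 10]

# Each layer's input size must equal the previous layer's output size.
pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))  # a batch of 32 flattened images
for i, (n_in, n_out) in enumerate(pairs):
    w = rng.normal(scale=0.01, size=(n_in, n_out))
    b = np.zeros(n_out)
    x = x @ w + b
    if i < len(pairs) - 1:
        x = np.maximum(x, 0)  # ReLU on hidden layers only
    print(n_in, '->', n_out, x.shape)

# Weights plus biases of the four linear layers (batch norm not counted).
n_params = sum(n_in * n_out + n_out for n_in, n_out in pairs)
print(n_params)  # 567434
```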

Now we need to create a few additional objects for our model:

  • criterion – the loss function, in our case CrossEntropyLoss
  • optimizer – updates the model weights; this is also where we set the learning rate
  • scheduler – lowers the learning rate if the model stops improving (quite a powerful technique that lets us tweak the system on the go)
  • dataloader – a PyTorch class that provides single- or multi-process iterators over the dataset
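
As a rough illustration of what the criterion computes, here is cross-entropy written out in NumPy. Like CrossEntropyLoss, it takes raw scores (logits) and integer class labels and applies a log-softmax internally. The function name and the toy numbers are made up for this sketch.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two samples, three classes. The first prediction is confident and right,
# the second spreads its scores evenly.
logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])

print(cross_entropy(logits, targets))
```

The confident, correct sample contributes almost nothing to the loss, while the undecided one contributes log(3), the loss of a uniform guess over three classes.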
from torch.utils.data import DataLoader
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, 'max', factor=0.1, patience=3, min_lr=0.0001, verbose=True)

dataloader = {
    'train': DataLoader(dataset['train'], batch_size=32, shuffle=True, num_workers=4),
    'test': DataLoader(dataset['test'], batch_size=32, shuffle=False, num_workers=4),
}
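
To get a feeling for what the scheduler does with the settings above (factor=0.1, patience=3, min_lr=0.0001, mode 'max'), here is a simplified pure-Python simulation. It ignores the improvement threshold and the cooldown that the real ReduceLROnPlateau also supports, so treat it as a sketch of the idea rather than a replica. The function name and the accuracy values are invented.

```python
def reduce_on_plateau(metrics, lr=0.1, factor=0.1, patience=3, min_lr=0.0001):
    """Return the learning rate in effect after each epoch's metric (higher is better)."""
    best, bad_epochs, history = float('-inf'), 0, []
    for metric in metrics:
        if metric > best:
            best, bad_epochs = metric, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # more than `patience` epochs without improvement: shrink the rate
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history

# Accuracy improves for three epochs, then stalls for four, then improves again.
accs = [0.70, 0.80, 0.85, 0.85, 0.84, 0.85, 0.85, 0.86]
print(reduce_on_plateau(accs))
```

In this run the learning rate stays at 0.1 for the first six epochs and drops by a factor of ten on the seventh, after four epochs in a row without a new best accuracy.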

Training and Evaluating the Model

With all of the above in place, we can start training and evaluating our model. Although we define 100 epochs, it is useful to stop the loop if the model stops improving. Here we set early_stop = 10, so if the accuracy doesn’t improve for 10 epochs in a row, we stop the training process.

Training process: we iterate through the training data, moving each batch of images and labels to the device defined previously. We give the model the images and it tries to find the correct class (preds). We then clear all gradients (zero_grad()), calculate the loss (loss) and its gradients, perform an optimizer step and append the new loss value to the total_loss list.
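
The same sequence of steps (forward pass, loss, gradients, weight update) can be written out by hand for a single linear layer. This NumPy sketch uses random data and plain gradient descent instead of Adam, so it only mirrors the structure of the loop, not the actual model or optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))       # 64 samples, 20 features
y = rng.integers(0, 3, size=64)     # 3 classes
W = np.zeros((20, 3))               # weights of a single linear layer
lr = 0.1

def forward(X, W, y):
    """Return the mean cross-entropy loss and the softmax probabilities."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    return loss, probs

losses = []
for step in range(20):
    loss, probs = forward(X, W, y)          # predictions and loss
    grad_logits = probs.copy()              # gradients are rebuilt from scratch each step
    grad_logits[np.arange(len(y)), y] -= 1  # d(loss)/d(logits) = softmax - one_hot
    grad_W = X.T @ grad_logits / len(y)     # backward pass
    W -= lr * grad_W                        # optimizer step (plain gradient descent)
    losses.append(loss)

print(losses[0], losses[-1])
```

With all weights at zero the first loss equals log(3), the loss of a uniform guess, and it goes down as the weights are updated.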

Testing process: we iterate through the test data, make predictions, and calculate the loss and the accuracy of the model. With torch.max() we look for the index of the maximum value, as it represents the predicted digit class and, in our case, can be compared directly with the labels. Then, by comparing labels and predictions, we calculate the accuracy of our model.
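
The accuracy calculation boils down to an argmax and a comparison. Here is the same idea on a tiny hand-made example in NumPy (the scores and labels are invented for illustration):

```python
import numpy as np

# Scores for 4 samples over 3 classes (one row per sample).
scores = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.0, 0.4, 0.9],
                   [0.8, 0.7, 0.1]])
labels = np.array([1, 0, 2, 1])

preds = scores.argmax(axis=1)          # index of the largest score per row
correct = int((preds == labels).sum())
accuracy = correct / len(labels)

print(preds, accuracy)  # [1 0 2 0] 0.75
```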

Every time we find a better model we save it, and if we hit early_stop we exit and report the results. Usually training won’t need all 100 epochs.
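
The early stopping logic can be tried in isolation on a made-up accuracy history. This is a small sketch of the counter only; the function name and the numbers are illustrative.

```python
def run_with_early_stopping(accuracies, early_stop=10):
    """Return (best accuracy, number of epochs actually run)."""
    best_acc, stop = 0.0, 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            best_acc, stop = acc, 0   # a new best resets the counter
        else:
            stop += 1
        if stop >= early_stop:
            return best_acc, epoch
    return best_acc, len(accuracies)

# Accuracy peaks at epoch 3 and never improves afterwards.
history = [0.80, 0.88, 0.91] + [0.90] * 97
print(run_with_early_stopping(history))  # (0.91, 13)
```

Training stops at epoch 13, ten epochs after the last improvement, instead of running through all 100.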

# train
best_acc, stop, early_stop = 0, 0, 10
for e in range(100):

    model.train()
    total_loss = []
    for imgs, labels in tqdm(dataloader['train']):
        imgs, labels = imgs.to(device), labels.to(device)
        preds = model(imgs)
        optimizer.zero_grad()
        loss = criterion(preds, labels)
        loss.backward()
        optimizer.step()
        total_loss.append(loss.item())

    model.eval()
    val_loss, acc = [], 0.
    with torch.no_grad():
        for imgs, labels in tqdm(dataloader['test']):
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(imgs)
            loss = criterion(preds, labels)
            val_loss.append(loss.item())
            _, preds = torch.max(preds, 1)
            acc += (preds == labels).sum().item()

    acc /= len(dataset['test'])
    if acc > best_acc:
        print('\n Best model ! saved.')
        # 'best_model.pt' is an assumed file name
        torch.save(model.state_dict(), 'best_model.pt')
        best_acc = acc
        stop = -1

    stop += 1
    if stop >= early_stop:
        break

    # lower the learning rate when the accuracy stops improving
    scheduler.step(acc)

    print('\n Epoch {}, Training loss: {:4f}, Val loss: {:4f}, Val acc: {:4f}'.format(
        e + 1, np.array(total_loss).mean(), np.array(val_loss).mean(), acc))

print('\n Best model with acc: {}'.format(best_acc))

Epoch 30, Training loss: 0.015759, Val loss: 0.397337, Val acc: 0.910000
100%|██████████| 25/25 [00:01<00:00, 22.10it/s]
100%|██████████| 7/7 [00:00<00:00, 73.41it/s]
Best model with acc: 0.91

Now that we have found and saved our best model, we can play with it by feeding it new data and seeing how it performs.

# test
model.eval()  # batch norm needs evaluation mode to handle a single image
ix = random.randint(0, len(dataset['test'])-1)
img, label = dataset['test'][ix]
pred = model(img.unsqueeze(0).to(device)).cpu()
pred_label = torch.argmax(pred)
print('Ground Truth: {}, Prediction: {}'.format(label, pred_label))
plt.imshow(img.reshape((28,28)))

Ground Truth: 5, Prediction: 5
<matplotlib.image.AxesImage at 0x7ff9a9ced748>

As I said at the beginning, this is a “Hello World” for image recognition. We didn’t use a convolutional neural network, which is what is normally used for tasks like this; this is just an entry-level example to understand the flow. I don’t usually work with images, so if there are any mistakes, please let me know. It was a nice refresher for me, and hopefully it helped someone else.

You can find the Python notebook with the code on GitHub.

Photo by Toa Heftiba on Unsplash
