Building Custom Deep Learning Based OCR models

OCR provides us with different ways to see an image, find and recognize the text in it. When we think about OCR, we inevitably think of lots of paperwork - bank cheques and legal documents, ID cards and street signs. In this blog post, we will try to predict the text present in number plate images.

What we are dealing with is an optical character recognition library that leverages deep learning and attention mechanisms to predict what a particular character or word in an image is, if there is one at all. Lots of big words thrown around there, so we'll take it step by step and explore the state of OCR technology and the different approaches used for these tasks.

You can always directly skip to the code section of the article or check the github repository if you are familiar with the big words above.

OCR - Optical Character Recognition

Optical character recognition, or OCR, refers to a set of computer vision problems that require us to convert images of digital or hand-written text into machine-readable text that your computer can process, store and edit as a text file or as part of data entry and manipulation software. The images can include documents, invoices, legal forms and ID cards, or OCR in the wild, like reading street signs, shipping container numbers or vehicle number plates.

optical character recognition using attention ocr

People have tried solving the OCR problem with several conventional computer vision techniques like image filters, contour detection and image classification. These performed well on narrow, template-based datasets that did not vary much in orientation, image quality and so on. To make our models robust to these variations, so that a business can deploy its machine learning applications at scale, new methods have to be explored.

There are a lot of services and products that perform differently on different kinds of OCR tasks. If you are interested, here's a blog post about where these OCR APIs might fail and how they can improve.

Deep Learning and OCR

Deep learning approaches have improved over the last few years, reviving interest in the OCR problem, where neural networks can be used to combine the tasks of localizing text in an image and understanding what the text is. Deep convolutional architectures, attention mechanisms and recurrent networks have gone a long way in this regard.

One of these deep learning approaches is the basis of Attention - OCR, the library we are going to be using to predict the text in number plate images.

Think of it like this. The overall pipeline for many OCR architectures follows this template - a convolutional network extracts image features as encoded vectors, followed by a recurrent network that uses these encoded features to predict where each of the letters in the image text might be and what they are.

Let's try to understand what's going on under the hood.

Attention Mechanisms

You might be aware of RNNs or LSTMs, neural network architectures that predict an output at each time step, giving us the sequence generation we need for language. This breed of neural networks is intended to learn patterns in sequential data by iteratively modifying its current state based on the current input and previous states. But due to limitations on memory and issues like vanishing gradients, RNNs and LSTMs turned out not to capture the influence of words farther away very well.
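To make the state update concrete, here is a minimal NumPy sketch of a vanilla RNN step. The weight names, the sizes and the tanh nonlinearity are assumptions chosen for illustration; they are not taken from Attention OCR or any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_h.shape[0])  # initial state h_0
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4  # illustrative sizes
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

Because every state is squeezed through the same recurrence, information from early inputs has to survive many multiplications to influence late outputs, which is where the long-range problems come from.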

The attention mechanism tries to fix this. It is a way to get your model to learn long-range dependencies in a sequence, and it has found several applications in natural language processing and machine translation.

BERT attention visualisation - source

In a nutshell, attention is a feed-forward layer with trainable weights that help us capture the relationships between different elements of sequences. It works by using query, key and value matrices, passing the input embeddings through a series of operations and getting an encoded representation of our original input sequence.  

Calculating encoded representations of our input embeddings (x) with key, value and query matrices - source
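The query, key and value computation can be sketched in a few lines of NumPy. This is the scaled dot-product form of self-attention popularised by transformers, shown under assumed sizes and random weights; the variable names are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every pair of positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 6  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

encoded, weights = self_attention(x, W_q, W_k, W_v)
print(encoded.shape, weights.shape)  # (4, 6) (4, 4)
```

Each row of `weights` says how much one position attends to every other position, so distant elements can influence each other in a single step.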

There are different flavors of attention mechanisms. They can be hard or soft attention, depending on whether the entire image is available to the attention or only a patch. Soft attention, which lays each patch smoothly over the sequence, is differentiable, but it increases the computation time. A better explanation can be found here.


You might have heard of BERT, GPT2 or more recently XLNet performing a little too well on language modelling and generation tasks. The secret sauce is the different ways of applying transformers.


If you understand how attention works, it shouldn't take much effort to grasp how transformers work. In essence, the paper uses multi-headed attention, which simply means using several sets of query, key and value matrices, training them independently, concatenating their outputs and then extracting a usable matrix for the following network with an additional set of weights.
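A rough NumPy sketch of the multi-head idea follows. The number of heads, the dimensions and the output projection name `W_o` are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    """One attention head with its own query, key and value matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def multi_head_attention(x, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    outputs = [attention_head(x, *head) for head in heads]
    concatenated = np.concatenate(outputs, axis=-1)  # (seq_len, n_heads * d_k)
    return concatenated @ W_o                        # project back to d_model

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 8, 2  # illustrative sizes
d_k = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))

out = multi_head_attention(x, heads, W_o)
print(out.shape)  # (5, 8)
```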

Another important addition is a positional embedding that encodes the time at which an element in a sequence appears. These positional embeddings are added to our input embeddings for the network to learn time dependencies better. This article is an amazing resource to learn about the mathematics behind self-attention and transformers.
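One common choice, and the one sketched here, is the fixed sinusoidal encoding from the original transformer paper; learned positional embeddings are an equally valid alternative. The sizes and the stand-in embeddings are assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dims, cosine on odd dims."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 6, 8
embeddings = np.ones((seq_len, d_model))  # stand-in for real input embeddings
pe = positional_encoding(seq_len, d_model)
inputs_with_position = embeddings + pe    # added, not concatenated
print(inputs_with_position.shape)  # (6, 8)
```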

Visual Attention

Though attention and transformer networks evolved for applications in the NLP domain, they have been adapted for convolutional networks to replicate the attention mechanisms of the human brain and how it processes vision. To learn more, check this link or this study. The fundamental idea is to replicate how the human eye works.

When you open your eyes to a new scene, some parts of the picture directly catch your 'attention'. You focus on those parts of the picture first, extract information from it and comprehend it. This information also guides your search for the next point of attention.

This method of watering down an image into its most important components is the basis of visual attention models. The process of finding the next attention point is seen as a sequential task on convolutional features extracted from the image.

RAM - Recurrent Attention Model

This paper approaches the problem of attention by using reinforcement learning to model how the human eye works. It defines a glimpse vector that extracts features of an image around a certain location.

Several such glimpse vectors extracting features from a different sized crop of the image around a common centre are then resized and converted to a constant resolution. These glimpse vectors are flattened and passed through the glimpse network to obtain a vector representation based on visual attention.
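The multi-scale glimpse can be sketched with plain NumPy. This is an illustrative approximation only: it assumes a 2-D grayscale image, doubles the crop size at each scale and downsamples by average pooling, none of which is claimed to match the paper's exact implementation.

```python
import numpy as np

def extract_glimpse(image, center, base_size=8, num_scales=3):
    """Crop patches of doubling size around center and downsample each
    one to base_size x base_size, then flatten everything into a vector."""
    cy, cx = center
    patches = []
    for scale in range(num_scales):
        size = base_size * (2 ** scale)
        half = size // 2
        # Zero-pad so that crops near the border stay in bounds.
        padded = np.pad(image, half, mode='constant')
        crop = padded[cy:cy + size, cx:cx + size]
        # Downsample to the base resolution by average pooling.
        factor = size // base_size
        pooled = crop.reshape(base_size, factor, base_size, factor).mean(axis=(1, 3))
        patches.append(pooled)
    return np.stack(patches).ravel()

image = np.ones((32, 32))
glimpse = extract_glimpse(image, center=(16, 16))
print(glimpse.shape)  # (192,)
```

The small patch keeps fine detail near the centre while the larger ones give coarse context, all at the same fixed resolution.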

Recurrent Models of Visual Attention
A) Glimpse sensor. B) The glimpse network takes an image and location coordinates, extracts crops of different sizes around the location and resizes them for further processing. C) These resized, fixed-length feature vectors are passed to an RNN, which generates the next location to pay attention to. source

Following this, there is a Location Network which utilises an RNN to predict which part of the image our algorithm should pay attention to next. This predicted location becomes the next input for the glimpse network. The location is sampled stochastically, which helps us balance exploration and exploitation while we train the network to maximize our rewards. Because sampling is not differentiable, the gradients are estimated using the REINFORCE policy gradient on the log-likelihood of the attention score.
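As a hedged illustration of the REINFORCE step, the sketch below assumes the location policy is a Gaussian with a fixed standard deviation centred on the location network's output. The function names, the reward value and the learning rate are hypothetical.

```python
import numpy as np

def sample_location(mean, std, rng):
    """Draw the next glimpse location from a Gaussian centred on the
    location network's output (the stochastic part of the policy)."""
    return rng.normal(loc=mean, scale=std)

def reinforce_gradient(location, mean, std, reward, baseline=0.0):
    """REINFORCE estimate of d(expected reward) / d(mean):
    (reward - baseline) * d log N(location | mean, std^2) / d mean."""
    grad_log_prob = (location - mean) / (std ** 2)
    return (reward - baseline) * grad_log_prob

rng = np.random.default_rng(0)
mean = np.array([0.0, 0.0])  # predicted (x, y) in normalised coordinates
std = 0.2                    # fixed exploration noise (assumed)
location = sample_location(mean, std, rng)

# Suppose the glimpse at this location led to a correct prediction (reward 1).
grad = reinforce_gradient(location, mean, std, reward=1.0)
mean = mean + 0.01 * grad    # gradient ascent on the expected reward
```

Rewarded samples pull the mean toward the sampled location, while a reward equal to the baseline leaves it unchanged.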

DRAM - Deep Recurrent Attention Model

Instead of using a single RNN, DRAM uses two RNNs - a location RNN to predict the next glimpse location and a classification RNN dedicated to predicting the class labels, that is, guessing which character in the text we are looking at. A context network is used to downsample image inputs for more generalisable RNN states. It also refers to the location network in RAM as the Emission Network. Training is done using an accumulated reward and optimizing the sequence log-likelihood loss function using the REINFORCE policy gradient.

visual attention model using deep learning
The DRAM model - source 

CRNN - Convolutional Recurrent Neural Networks

CRNNs don't treat our OCR task as a reinforcement learning problem but as a machine learning problem with a custom loss. The loss used is called CTC loss - Connectionist Temporal Classification. The convolutional layers are used as feature extractors that pass these features to the recurrent layers - bi-directional LSTMs. These are followed by a transcription layer that uses a probabilistic approach to decode our LSTM outputs. Each frame generated by the LSTM is decoded into a character, and these characters are fed into a final decoder/transcription layer which outputs the final predicted sequence.

Neural Network for Image-based Sequence Recognition
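One simple way to turn per-frame predictions into a final string is greedy (best-path) CTC decoding: take the most likely label for each frame, merge consecutive repeats, then drop the blank symbol. The sketch below assumes the per-frame labels are already computed; the blank id and the tiny charset are made up for illustration.

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse repeated labels and remove blanks from per-frame predictions."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            decoded.append(idx)
        previous = idx
    return decoded

# Hypothetical charset: id 0 is reserved for the CTC blank.
charset = {1: 'K', 2: 'A', 3: '0', 4: '1'}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 3, 4, 4]

text = ''.join(charset[i] for i in ctc_greedy_decode(frames))
print(text)  # KA001
```

The blank between the two `3` frames is what lets the decoder emit a genuinely repeated character instead of merging it.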

Spatial Transformer Networks

Spatial Transformer Networks, introduced in this paper, augment input images by applying affine transformations so that the trained model is robust to variations in data.  


The network consists of a localisation net, a grid generator and a sampler. The localisation net takes an input image and gives us the parameters for the transformation we want to apply on it. The grid generator uses a desired output template, multiplies it with the parameters obtained from the localisation net and brings us the location of the point we want to apply the transformation at to get the desired result. A bilinear sampling kernel is finally used to generate our transformed feature maps.
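The grid generator and sampler can be sketched in NumPy as follows. The localisation net is left out: `theta` is supplied by hand here, whereas in a real STN it would be predicted from the input. Normalised coordinates in [-1, 1], a single-channel image and border clamping are assumptions of this sketch.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map every output pixel to source coordinates in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing='ij')
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # (3, N)
    source = theta @ target                                              # (2, N)
    return source[0].reshape(out_h, out_w), source[1].reshape(out_h, out_w)

def bilinear_sample(image, grid_x, grid_y):
    """Sample a 2-D image at the grid locations with bilinear interpolation."""
    h, w = image.shape
    # Convert normalised coordinates to pixel coordinates.
    x = (grid_x + 1) * (w - 1) / 2
    y = (grid_y + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx = np.clip(x - x0, 0, 1)  # out-of-range points clamp to the border
    wy = np.clip(y - y0, 0, 1)
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

image = np.arange(16.0).reshape(4, 4)
# theta would come from the localisation net; here it is a horizontal flip.
theta = np.array([[-1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
grid_x, grid_y = affine_grid(theta, 4, 4)
flipped = bilinear_sample(image, grid_x, grid_y)
```

Every operation here is differentiable with respect to `theta`, which is what lets the transformation parameters be learned end to end.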

Attention OCR

Attention-OCR is an OCR project implemented in TensorFlow, based on this paper, which came into being as a way to solve the image captioning problem. It can be thought of as a CRNN followed by an attention decoder.

Tensorflow Attention OCR

First we use layers of convolutional networks to extract encoded image features. These extracted features are then encoded to strings and passed through a recurrent network for the attention mechanism to process. The attention mechanism used in the implementation is borrowed from the Seq2Seq machine translation model. We use this attention based decoder to finally predict the text in our image.


We will use attention-ocr to train a model on a set of images of number plates along with their labels - the text present in the number plates and the bounding box coordinates of those number plates. The dataset was acquired from here.

The steps followed are summarized here:

  1. Gather annotated training data
  2. Get crops for each frame of each video where the number plates are.
  3. Generate tfrecords for all the cropped files.
  4. Place them in models/research/attention_ocr/python/datasets as required (in the FSNS dataset format). Follow this link or the following sections of this blog.
  5. Train the model using Attention OCR.
  6. Make predictions on your own cropped images.

Or you can explore the Nanonets API where all you have to do is upload annotated images and let the platform handle the rest for you. More about this in the final section.

This blog will run you through everything you need to train and make predictions using attention-ocr. Full code available here.

Getting training data

We have images of number plates but we do not have the text in them or the bounding box numbers of the number plates in these images. Use an annotation tool to get your annotations and save them in a .csv file.

Mahindra Taxi for extracting number from number plate

Get crops

We have stored our bounding box data as a .csv file. The .csv file has the following fields:

  1. files
  2. text
  3. xmin
  4. xmax
  5. ymin
  6. ymax

To crop the images and get only the cropped window we have to deal with different sized images. To do this we read the csv data in as a pandas dataframe and get our coordinates in such a way that we don't miss any information about the number plates while also maintaining a constant size of the crops. This will prove helpful when we are training our OCR model.

import os
import cv2
import pandas as pd

# The annotation file consists of image names, text labels and
# bounding box information: xmin, ymin, xmax and ymax.
ANNOTATION_FILE = 'data/annot_file.csv'
df = pd.read_csv(ANNOTATION_FILE)

# Image directory path
IMG_DIR = 'data/images'
# The cropped images will be stored here
CROP_DIR = 'data/crops'
os.makedirs(CROP_DIR, exist_ok=True)

files = df['files']
size = (200, 200)

for file in files:
    print(file)
    img = cv2.imread(IMG_DIR + '/' + file)
    # Take the first annotation row for this file.
    annot_data = df[df['files'] == file].iloc[0]
    xmin = int(annot_data['xmin'])
    ymin = int(annot_data['ymin'])
    xmax = int(annot_data['xmax'])
    ymax = int(annot_data['ymax'])
    crop = img[ymin:ymax, xmin:xmax]
    new_crop = cv2.resize(crop, dsize=size, interpolation=cv2.INTER_CUBIC)
    # Save the resized crop as a PNG named after the source image.
    cv2.imwrite(CROP_DIR + '/' + file.split('.')[0] + '.png', new_crop)
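The crop script uses each annotated box exactly as given. If your annotations are tight, one optional tweak is to widen the box by a small margin and clamp it to the image before cropping, so that no part of the plate is cut off. The helper below is a hypothetical addition, not part of the original script, and the 10% margin is an arbitrary choice.

```python
def padded_box(xmin, ymin, xmax, ymax, img_w, img_h, margin=0.1):
    """Expand a bounding box by a relative margin and clamp it to the image."""
    pad_x = int(round((xmax - xmin) * margin))
    pad_y = int(round((ymax - ymin) * margin))
    return (max(0, xmin - pad_x), max(0, ymin - pad_y),
            min(img_w, xmax + pad_x), min(img_h, ymax + pad_y))

# A 100 x 40 box near the right edge of a 160 x 120 image.
print(padded_box(50, 40, 150, 80, img_w=160, img_h=120))  # (40, 36, 160, 84)
```

The returned coordinates can be used in place of the raw `xmin`, `ymin`, `xmax` and `ymax` values when slicing the image.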

Generate tfrecords

Having stored our equally sized cropped images in a separate directory, we can begin using those images to generate the tfrecords that we will use to train our model. Here's a script to generate tfrecords. These tfrecords, along with the label mapping, have to be stored inside the TensorFlow models repository, in the following directory -

# The dataset has to be in the FSNS dataset format.
# For this, your test and train tfrecords along with the
# charset labels text file are placed inside a folder named
# 'fsns' inside the 'datasets' directory.
# You can change this to another folder and upload your
# tfrecord files and charset-labels.txt here. You'll
# have to change the path in multiple places accordingly.
# I have used a directory called 'number_plates' inside
# the datasets/data directory.
DATA_PATH = 'models/research/attention_ocr/python/datasets/data/number_plates'

Now generate tf records by running the following script.

import os
import cv2
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from helpers import get_char_mapping
from tensorflow.python.platform import gfile

# Paths from the earlier steps.
ANNOTATION_FILE = 'data/annot_file.csv'
CROP_DIR = 'data/crops'
DATA_PATH = 'models/research/attention_ocr/python/datasets/data/number_plates'

MAX_STR_LEN = 20


def read_image(img_path):
    return cv2.imread(img_path)


# Null ID depends on your charset label map.
null = 43


def padding_char_ids(char_ids_unpadded, null_id=null, max_str_len=MAX_STR_LEN):
    return char_ids_unpadded + [null_id for x in range(max_str_len - len(char_ids_unpadded))]


def get_bytelist_feature(x):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=x))


def get_floatlist_feature(x):
    return tf.train.Feature(float_list=tf.train.FloatList(value=x))


def get_intlist_feature(x):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=x))


def get_tf_example(img_file, annotation, num_of_views=1):
    img_array = read_image(img_file)
    img = gfile.FastGFile(img_file, 'rb').read()
    char_map, _ = get_char_mapping()
    text = annotation['text'].values[0]
    split_text = [x for x in text]
    char_ids_unpadded = [char_map[x] for x in split_text]
    char_ids_padded = padding_char_ids(char_ids_unpadded)
    char_ids_unpadded = [int(x) for x in char_ids_unpadded]
    char_ids_padded = [int(x) for x in char_ids_padded]
    features = tf.train.Features(feature={
        'image/format': get_bytelist_feature([b'png']),
        'image/encoded': get_bytelist_feature([img]),
        'image/class': get_intlist_feature(char_ids_padded),
        'image/unpadded_class': get_intlist_feature(char_ids_unpadded),
        'image/width': get_intlist_feature([img_array.shape[1]]),
        # Integer division: Int64List does not accept floats.
        'image/orig_width': get_intlist_feature([img_array.shape[1] // num_of_views]),
        # BytesList needs bytes, not str.
        'image/text': get_bytelist_feature([text.encode('utf-8')])
    })
    example = tf.train.Example(features=features)
    return example


def get_tf_records():
    train_file = DATA_PATH + '/' + 'train.tfrecord'
    test_file = DATA_PATH + '/' + 'test.tfrecord'
    if os.path.exists(train_file):
        os.remove(train_file)
    if os.path.exists(test_file):
        os.remove(test_file)
    # On older TF 1.x releases use tf.python_io.TFRecordWriter instead.
    train_writer = tf.io.TFRecordWriter(train_file)
    test_writer = tf.io.TFRecordWriter(test_file)
    annot = pd.read_csv(ANNOTATION_FILE)
    annot['files'] = CROP_DIR + '/' + annot['files']
    files = list(annot['files'].values)
    random.shuffle(files)
    for i, file in enumerate(files):
        print('writing file:', file)
        annotation = annot[annot['files'] == file]
        example = get_tf_example(file, annotation)
        # The first 250 shuffled examples go to train, matching the
        # train split size in the dataset config; the rest go to test.
        if i < 250:
            train_writer.write(example.SerializeToString())
        else:
            test_writer.write(example.SerializeToString())
    train_writer.close()
    test_writer.close()


# Generate tfrecords!
if __name__ == '__main__':
    get_tf_records()
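The script imports `get_char_mapping` from a `helpers` module that is not shown here. As a rough idea of what such a mapping and the padding step do, here is a hypothetical stand-in. It assumes the charset file holds one tab-separated `id<TAB>character` pair per line with a `<nul>` entry for padding; check this against your own charset-labels.txt.

```python
# Hypothetical charset contents; a real file would list every character
# that can appear on a number plate.
SAMPLE_CHARSET = "0\tK\n1\tA\n2\t0\n3\t1\n4\t<nul>"

def parse_charset(text):
    """Build a character -> id map from 'id<TAB>char' lines."""
    char_to_id = {}
    for line in text.splitlines():
        if line:
            code, char = line.split('\t', 1)
            char_to_id[char] = int(code)
    return char_to_id

def encode_label(label, char_to_id, max_len, null_token='<nul>'):
    """Convert a label to character ids and pad it with the null id."""
    ids = [char_to_id[c] for c in label]
    if len(ids) > max_len:
        raise ValueError('label is longer than max_len')
    return ids + [char_to_id[null_token]] * (max_len - len(ids))

char_to_id = parse_charset(SAMPLE_CHARSET)
print(encode_label('KA01', char_to_id, max_len=6))  # [0, 1, 2, 3, 4, 4]
```

Whatever id your charset assigns to the null character is the value that `null` in the script and `null_code` in the dataset config must both use.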

Setting our Attention-OCR up

Once we have our tfrecords and charset labels stored in the required directory, we need to write a dataset config script that will help us split our data into train and test for the attention OCR training script to process.

Make a python file and name it '' and place it inside the following directory:


The contents of the file are as follows.

import datasets.fsns as fsns

DEFAULT_DATASET_DIR = 'models/research/attention_ocr/python/datasets/data/number_plates'

# Must match the MAX_STR_LEN used when generating the tfrecords.
MAX_STR_LEN = 20

DEFAULT_CONFIG = {
    'name': 'number_plates',  # you can change the name if you want.
    'splits': {
        'train': {
            'size': 250,  # change according to your own train-test split
            'pattern': 'train.tfrecord'
        },
        'test': {
            'size': 49,  # change according to your own train-test split
            'pattern': 'test.tfrecord'
        }
    },
    'charset_filename': 'charset-labels.txt',
    'image_shape': (200, 200, 3),  # change this according to crop images size.
    'num_of_views': 1,
    'max_sequence_length': MAX_STR_LEN,  # TO BE CONFIGURED
    'null_code': 43,
    'items_to_descriptions': {
        'image': 'A (200X200) 3 channel color image.',
        'label': 'Characters codes.',
        'text': 'A unicode string.',
        'length': 'A length of the encoded text.',
        'num_of_views': 'A number of different views stored within the image.'
    }
}


def get_split(split_name, dataset_dir=None, config=None):
    if not dataset_dir:
        dataset_dir = DEFAULT_DATASET_DIR
    if not config:
        config = DEFAULT_CONFIG
    return fsns.get_split(split_name, dataset_dir, config)

Train the model

Move into the following directory:


Open the file named '' and specify where you'd want to log your training.

# The train logs directory defaults to /tmp/attention_ocr/train.
# You can change it to whatever you like.
LOGS_DIR = 'models/research/attention_ocr/number_plates_model_logs'
flags.DEFINE_string('train_log_dir', LOGS_DIR, 'Directory where to write event logs.')

and run the following command on your terminal:

# change this if you changed the dataset name in the
# script or if you want to change the
# number of epochs
python --dataset_name=number_plates --max_number_of_steps=3000
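Note that `--max_number_of_steps` counts training steps (batches), not epochs. A quick way to convert between the two is sketched below. The batch size of 32 is an assumption, so substitute whatever value your training run actually uses; 250 is the train split size from the dataset config.

```python
import math

def steps_for_epochs(num_train_examples, batch_size, epochs):
    """Number of training steps needed to see the data `epochs` times."""
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    return steps_per_epoch * epochs

def epochs_for_steps(num_train_examples, batch_size, steps):
    """Approximate number of passes over the data for a given step count."""
    return steps * batch_size / num_train_examples

# 3000 steps over 250 training images with an assumed batch size of 32.
print(epochs_for_steps(250, 32, 3000))  # 384.0
print(steps_for_epochs(250, 32, 10))    # 80
```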

Evaluate the model

Run the following command from terminal.

python --dataset_name='number_plates'
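When reading evaluation results for OCR, two measures are commonly reported: how many full strings are exactly right, and how many individual characters are right. The sketch below only illustrates the general idea on made-up predictions; it is not claimed to match the exact definitions used by the evaluation script.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of predictions that match the target string exactly."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

def character_accuracy(predictions, targets):
    """Position-wise character match rate over all target characters."""
    matched = total = 0
    for p, t in zip(predictions, targets):
        total += len(t)
        matched += sum(pc == tc for pc, tc in zip(p, t))
    return matched / total

# Made-up plates: the second prediction gets the last digit wrong.
preds = ['KA01AB1234', 'MH12DE1433']
truth = ['KA01AB1234', 'MH12DE1438']

print(sequence_accuracy(preds, truth))   # 0.5
print(character_accuracy(preds, truth))  # 0.95
```

For number plates the sequence-level figure is usually the one that matters, since a single wrong character makes the whole reading unusable.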

Get predictions

Now from the same directory run the following command on your shell.

python --dataset_name=number_plates --batch_size=8, \
--checkpoint='models/research/attention_ocr/number_plates_model_logs/model.ckpt-6000', \

We learned about attention mechanisms, transformers and the different ways visual attention is applied - RAM, DRAM and CRNNs. We learned about STNs. Finally, we learned about the deep learning approach we used - Attention OCR.

From a programming perspective, we learnt how to train Attention OCR on our own dataset and run inference using the trained model. The code can be found here and in my attention-ocr fork.

There's of course a better, much simpler and more intuitive way to do this.

OCR with Nanonets

The Nanonets OCR API allows you to build OCR models with ease. You can upload your data, annotate it, set the model to train and get predictions through a browser-based UI without writing a single line of code, worrying about GPUs or finding the right architectures for your deep learning models.

Using the GUI:

You can also use the Nanonets-OCR API by following the steps below:

Using NanoNets API

Below, we will give you a step-by-step guide to training your own model using the Nanonets API, in 9 simple steps.

Step 1: Clone the Repo

git clone
cd nanonets-ocr-sample-python
sudo pip install requests
sudo pip install tqdm

Step 2: Get your free API Key

Get your free API Key from

Step 3: Set the API key as an Environment Variable


Step 4: Create a New Model

python ./code/

Note: This generates a MODEL_ID that you need for the next step

Step 5: Add Model Id as Environment Variable


Step 6: Upload the Training Data

Collect the images of the object you want to detect. Once you have the dataset ready in the folder images (image files), start uploading the dataset.

python ./code/

Step 7: Train Model

Once the images have been uploaded, begin training the model.

python ./code/

Step 8: Get Model State

The model takes ~30 minutes to train. You will get an email once the model is trained. In the meantime, you can check the state of the model.

watch -n 100 python ./code/

Step 9: Make Prediction

Once the model is trained, you can make predictions using the model.

python ./code/ PATH_TO_YOUR_IMAGE.jpg