Tips & tricks for using Google Vision API for text detection.

By Monark Unadkat

The Google Cloud Vision API enables developers to build vision-based machine learning applications, such as object detection and OCR, without any actual background in machine learning.

The Google Cloud Vision API takes incredibly complex machine learning models centered around image recognition and wraps them in a simple REST API. It encompasses a broad selection of tools to extract contextual data from your images with a single API request. It uses a model trained on a large dataset of images, similar to the models that power Google Photos, so there is no need to develop and train your own custom model.

For the purpose of this article, I’ll focus only on the OCR capability of the Google Cloud Vision API and share some tips and tricks for using the API for text detection.

The request type used is DOCUMENT_TEXT_DETECTION, made through the Python client library.
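A minimal sketch of such a request, assuming the google-cloud-vision package (v2 or later) is installed, credentials are configured via GOOGLE_APPLICATION_CREDENTIALS, and "image.jpg" is a placeholder path for your input image:

import io
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Read the image file from disk ("image.jpg" is a placeholder path).
with io.open("image.jpg", "rb") as image_file:
    content = image_file.read()

image = vision.Image(content=content)

# Ask for dense document text detection (DOCUMENT_TEXT_DETECTION).
response = client.document_text_detection(image=image)
texts = response.text_annotations
full_text = response.full_text_annotation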

The response is an AnnotateImageResponse, a JSON structure consisting of a list of image annotation results.

  1. First is text_annotations, which contains only word-level information, i.e., the detected words and their location coordinates. A simple text_annotations element looks like this:
{
  "responses": [
    {
      "textAnnotations": [
        {
          "locale": "en",
          "description": "Wake up human!\n",
          "boundingPoly": {
            "vertices": [
              { "x": 29, "y": 394 },
              { "x": 570, "y": 394 },
              { "x": 570, "y": 466 },
              { "x": 29, "y": 466 }
            ]
          }
        },
        ...
      ]
    }
  ]
}

2. The other one, which we’ll be using, is full_text_annotation, which contains character-level descriptions. The full_text_annotation holds a structured representation of the OCR-extracted text with the following hierarchy: full_text_annotation -> Page -> Block -> Paragraph -> Word -> Symbol. A short traversal sketch follows the list below.

  • A group of symbols forms a word, a group of words forms a paragraph, and so on. Each level of the representation has properties such as language and bounding_box.
  • You can use this representation to divide the text content and process each part separately as you want.
  • Suppose you are extracting information from text documents with a fixed format; this representation will make your work much easier.
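As a minimal sketch of walking this hierarchy (reusing the response object from the earlier request), something like the following collects every detected word together with its bounding box:

document = response.full_text_annotation

for page in document.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                # Rebuild the word string from its symbols.
                text = "".join(symbol.text for symbol in word.symbols)
                print(text, word.bounding_box)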

The next step is to use the response to draw bounding boxes around the feature we specify, in this case a word.
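A sketch of that plotting step, assuming Pillow is installed; the draw_boxes helper and its feature parameter are my own naming, not part of the Vision API:

from PIL import Image, ImageDraw

def draw_boxes(image_path, document, feature="word"):
    # Draw a red box around each detected feature (word or block).
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    for page in document.pages:
        for block in page.blocks:
            if feature == "block":
                bounds = [block.bounding_box]
            else:
                bounds = [word.bounding_box
                          for paragraph in block.paragraphs
                          for word in paragraph.words]
            for bound in bounds:
                draw.polygon([(v.x, v.y) for v in bound.vertices],
                             outline="red")
    return img

draw_boxes("image.jpg", response.full_text_annotation, feature="word").show()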

The image on the left is the original and the image on the right shows the plotted bounding boxes for each detected word.

If we do the same at the block level, we get:
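With the hypothetical draw_boxes helper sketched above, this is just draw_boxes("image.jpg", response.full_text_annotation, feature="block").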

The block-wise separation helps to identify different parts of the image and to extract the important information while ignoring unwanted text. It may not look very useful here, but suppose you are extracting information from an image of an identification document.

Suppose that in the above image you want to extract the amount of ‘Loans, including overdrafts’. For that you need the location of a keyword, in this case ‘Overdrafts’. You can search for the location of that word and, based on its coordinates, extract that particular amount.
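A minimal sketch of that keyword search, reusing the traversal from above; the find_word_location helper name is an assumption:

def find_word_location(document, keyword):
    # Return the bounding box of the first word matching `keyword`.
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(s.text for s in word.symbols)
                    if text.lower() == keyword.lower():
                        return word.bounding_box
    return None

location = find_word_location(response.full_text_annotation, "Overdrafts")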

Now you have the coordinates of the word ‘Overdrafts’. To extract the amount, assume a box starting from the right of the word ‘Overdrafts’ and having the same width as that word; we need to extract the text inside this box.
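A minimal sketch of that extraction, reusing the location found above; the text_within_box helper and the full-containment check are assumptions, not part of the Vision API:

def text_within_box(document, x_min, y_min, x_max, y_max):
    # Collect words whose bounding boxes lie fully inside the given region.
    words = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    xs = [v.x for v in word.bounding_box.vertices]
                    ys = [v.y for v in word.bounding_box.vertices]
                    if (min(xs) >= x_min and max(xs) <= x_max
                            and min(ys) >= y_min and max(ys) <= y_max):
                        words.append("".join(s.text for s in word.symbols))
    return " ".join(words)

# Box to the right of 'Overdrafts', with the same width and height.
xs = [v.x for v in location.vertices]
ys = [v.y for v in location.vertices]
width = max(xs) - min(xs)
amount = text_within_box(response.full_text_annotation,
                         max(xs), min(ys), max(xs) + width, max(ys))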

This approach comes in handy in many scenarios, for example:

  • Extracting data from user forms or identification documents.
  • Extracting text from scanned images containing text.
  • Scanning user passbooks, and many such use cases.