Building a Chat Bot With Object Detection and OCR

By Mitchell A. Carroll

In part 1 of this series, we gave our bot the ability to detect sentiment from text and respond accordingly. But that’s about all it can do, and admittedly, that’s quite boring.

Of course, in a real chat, we send all kinds of media: text, images, videos, GIFs, and more. So for the next step in our journey, let’s give our bot vision. The goal of this tutorial is to allow our bot to receive images, reply to them, and eventually give us a crude description of the main object in each image.

Let’s get started!

If you haven’t followed along, you can find the latest code here:

So the code we want to modify is in our event response cycle method, here:

Our bot already responds to images, but it has no idea what they are and responds in a rather bland way.

We can try it out, and see for ourselves. Let’s fire up our server (and ngrok), and send our bot an image.

So far so good. Our bot at least knows when it receives an image.

In this series, we have been using Google Cloud APIs, so for our image detection, we’ll be using Google Cloud Vision. Follow the quick-start here to get your project all set up: Remember to use the same project that we set up in part 1.

Once you have completed that, it’s time to get back to coding. Let’s add the following to our Gemfile and run bundle install:

gem 'google-cloud-vision'

Let’s require it in main.rb by adding the following:

require 'google/cloud/vision'

Next, we want to create an instance of the Cloud Vision API:

You can find your project ID in your Google Cloud console.

The vision API feature that we want to use is called annotation. Given a file path to an image on your local machine, it will attempt to identify the image based on values we pass to the method call.

Consider the following example (from Google’s documentation):

require "google/cloud/vision"

vision =
image = vision.image "path/to/face.jpg"

annotation = vision.annotate image, faces: true, labels: true
annotation.faces.count #=> 1
annotation.labels.count #=> 4
annotation.text #=> nil

We are telling the Vision API to attempt to recognize faces and labels. “Labels” are essentially objects that the API determines it has identified. Provided a picture of a dog, we would potentially be given the following labels:

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/0bt9lr",
          "description": "dog",
          "score": 0.97346616
        },
        {
          "mid": "/m/09686",
          "description": "vertebrate",
          "score": 0.85700572
        },
        {
          "mid": "/m/01pm38",
          "description": "clumber spaniel",
          "score": 0.84881884
        },
        {
          "mid": "/m/04rky",
          "description": "mammal",
          "score": 0.847575
        },
        {
          "mid": "/m/02wbgd",
          "description": "english cocker spaniel",
          "score": 0.75829375
        }
      ]
    }
  ]
}

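To make the shape of that label response a little more concrete, here’s a small sketch that parses a trimmed copy of it and keeps only the labels above a score threshold. The confident_labels helper and the 0.8 cutoff are hypothetical and only for illustration; they are not part of the Vision gem.

```ruby
require "json"

# A trimmed copy of the label response for a picture of a dog.
RESPONSE_JSON = <<~JSON
  {
    "responses": [
      {
        "labelAnnotations": [
          { "mid": "/m/0bt9lr", "description": "dog", "score": 0.97346616 },
          { "mid": "/m/09686", "description": "vertebrate", "score": 0.85700572 },
          { "mid": "/m/02wbgd", "description": "english cocker spaniel", "score": 0.75829375 }
        ]
      }
    ]
  }
JSON

# Hypothetical helper: returns the descriptions of all labels whose score
# is at least min_score, in the order the API listed them.
def confident_labels(json, min_score)
  labels = JSON.parse(json)["responses"].first["labelAnnotations"]
  labels.select { |label| label["score"] >= min_score }
        .map { |label| label["description"] }
end

puts confident_labels(RESPONSE_JSON, 0.8).inspect
# prints ["dog", "vertebrate"]
```
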
Let’s create the following method to utilize this functionality:

The above (admittedly naive) method takes a file path and returns a string based on the results from the Google Cloud Vision API. In the annotate method, we are passing a few parameters, which tell the API what we want to try to detect.

The return value of this method follows a cascading short-circuit flow: it first checks for famous landmarks, then for any text, and finally for any objects (labels) it has detected. This flow is purely arbitrary and simplified for the purposes of this tutorial (i.e. don’t email me about how it can be improved).
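
Here’s a minimal sketch of just that short-circuit flow, not the exact method. The FakeAnnotation, FakeText, and FakeEntity structs and the describe_annotation name are hypothetical stand-ins so the example runs without calling the API. It also assumes the real annotation object exposes landmark, text, and labels readers in roughly this shape.

```ruby
# Hypothetical stand-ins for the annotation object returned by the API.
# They only mimic the three readers used below.
FakeEntity = Struct.new(:description)
FakeText = Struct.new(:text)
FakeAnnotation = Struct.new(:landmark, :text, :labels)

# Returns the first thing found, checking in order: landmark, text, labels.
def describe_annotation(annotation)
  return annotation.landmark.description if annotation.landmark
  return annotation.text.text if annotation.text
  return annotation.labels.map(&:description).join(", ") unless annotation.labels.empty?

  "I couldn't make anything out."
end

# No landmark and no text, so we fall through to the labels.
sushi = FakeAnnotation.new(nil, nil, [FakeEntity.new("cuisine"), FakeEntity.new("sushi")])
puts describe_annotation(sushi)
# prints: cuisine, sushi
```

Because each check returns early, a picture of a landmark that also contains text would only ever report the landmark.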

Let’s try it on the following picture:

And the results (truncated):

description: "cuisine", score: 0.9247923493385315, confidence: 0.0, topicality: 0.9247923493385315, bounds: 0, locations: 0, properties: {}
description: "sushi", score: 0.9149415493011475, confidence: 0.0, topicality: 0.9149415493011475, bounds: 0, locations: 0, properties: {}
description: "food", score: 0.899940550327301, confidence: 0.0, topicality: 0.899940550327301, bounds: 0, locations: 0, properties: {}
description: "japanese cuisine", score: 0.8769422769546509, confidence: 0.0, topicality: 0.8769422769546509, bounds: 0, locations: 0, properties: {}

Since there were no landmarks or text, we received the labels the API was able to detect, and “sushi” is among them. In my experience with label detection results, the second label (the one with the second-highest topicality) tends to be how an average person would identify the picture.
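
If we wanted to lean on that rule of thumb, picking the second label could look something like the sketch below. The TopicLabel struct and the everyday_label helper are hypothetical, and the values are copied from the truncated output above. This is only a heuristic, not something the API guarantees.

```ruby
# Hypothetical stand-in for a label, holding only the fields we need.
TopicLabel = Struct.new(:description, :topicality)

SUSHI_LABELS = [
  TopicLabel.new("cuisine", 0.9247923493385315),
  TopicLabel.new("sushi", 0.9149415493011475),
  TopicLabel.new("food", 0.899940550327301),
  TopicLabel.new("japanese cuisine", 0.8769422769546509)
].freeze

# Sorts labels by topicality (highest first) and returns the description of
# the second one, falling back to the first when only one label exists.
# Returns nil when there are no labels at all.
def everyday_label(labels)
  ranked = labels.sort_by { |label| -label.topicality }
  (ranked[1] || ranked[0])&.description
end

puts everyday_label(SUSHI_LABELS)
# prints: sushi
```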

Let’s give it another go on the following:

The output (again truncated):

description: "wildlife", score: 0.9749518036842346, confidence: 0.0, topicality: 0.9749518036842346, bounds: 0, locations: 0, properties: {}
description: "lion", score: 0.9627781510353088, confidence: 0.0, topicality: 0.9627781510353088, bounds: 0, locations: 0, properties: {}
description: "terrestrial animal", score: 0.9247941970825195, confidence: 0.0, topicality: 0.9247941970825195, bounds: 0, locations: 0, properties: {}

And there we see it, with “lion” being the second hit.

OK, one more for good measure. Let’s try some text extraction:

Just a screenshot of my text editor

And let's see what we get:

2.4.2 :022 > puts analyze_image("major_general.png")
I am the very model of a modern Major-General,
I've information vegetable, animal, and mineral,
I know the kings of England, and I quote the fights historical
From Marathon to Waterloo, in order categorical;
I'm very well acquainted, too, with matters mathematical,
I understand equations, both the simple and quadratical,
About binomial theorem I'm teeming with a lot o' news, (bothered for a rhyme)
With many cheerful facts about the square of the hypotenuse.
=> nil
2.4.2 :023 >

Not bad.

Ok, last one for completeness’ sake. Let’s try a landmark:

And our method gives us:

2.4.2 :030 > puts analyze_image("statue_of_liberty.jpg")
Statue of Liberty

OK, so our method is working as intended. Now let’s actually use it with our chatbot.

When we send an image to our chatbot through our client (Line), the client returns the image data in the response body (along with other relevant information) to our callback. Because our image recognition method needs a file path, we will have to save the aforementioned image data to our local machine.

Let’s modify our method to do that. Change the relevant parts of your callback method to the following:

There is a bit going on here. First, we are creating a new Tempfile and using the response body (image data) as its content. We’re then passing the tempfile’s path to the analyze_image method we just tested in the console. Let’s try it with our bot, just as a sanity check.
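
In isolation, the Tempfile step might look something like the sketch below. The with_image_tempfile helper is hypothetical, and the three stand-in bytes take the place of the response body we would get from the client. In the bot, the block would call analyze_image with the path instead of checking the file size.

```ruby
require "tempfile"

# Hypothetical helper: writes raw image bytes to a temporary file, yields
# the file's path to the block, and cleans the file up afterwards.
def with_image_tempfile(image_data)
  tempfile = Tempfile.new(["chat_image", ".jpg"])
  tempfile.binmode            # image data is binary, not text
  tempfile.write(image_data)
  tempfile.flush              # make sure the bytes are on disk before reading
  yield tempfile.path
ensure
  if tempfile
    tempfile.close
    tempfile.unlink
  end
end

# Stand-in bytes; in the bot this would be the image data from the response body.
size = with_image_tempfile("\xFF\xD8\xFF".b) { |path| File.size(path) }
puts size
# prints: 3
```

Wrapping the work in a block keeps the cleanup in one place, so the temporary file is removed even if the analysis raises an error.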

Such a nice bot…

And it was able to successfully identify a landmark for us.

Our bot is now just working as a glorified console print line, and that’s not very chatty at all. We want this thing to sound more natural, so let’s clean up our method a bit to make it sound more “human”.

In fact, it is.

Let’s make the necessary changes to our code. We’ll be modifying the existing analyze_image method and creating a new method, get_analyze_image_response. Here they are below:

Again, this is a tutorial about concepts rather than Ruby; still, let’s go over what we’ve just done. In analyze_image, we simply removed the string reply and replaced it with a call to our new method, get_analyze_image_response. This method takes an annotation object and, based on the type of object identified in the image, builds a sentence (string) using the annotation object’s description values.
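
As a rough illustration of the idea, a sentence builder along these lines could look like the sketch below. It is not the exact method. The sketch_image_response name, the Sketch structs, and the reply wording are all hypothetical, and the sketch assumes the annotation object exposes landmark, text, and labels readers.

```ruby
# Hypothetical stand-ins for the annotation object, so the sketch runs
# without calling the API.
SketchEntity = Struct.new(:description)
SketchText = Struct.new(:text)
SketchAnnotation = Struct.new(:landmark, :text, :labels)

# Builds a conversational reply based on what kind of thing was found.
def sketch_image_response(annotation)
  if annotation.landmark
    "Is that the #{annotation.landmark.description}? I'd love to visit someday!"
  elsif annotation.text
    "I can read that! It says: #{annotation.text.text}"
  elsif annotation.labels.any?
    "That looks like #{annotation.labels.first.description} to me."
  else
    "I'm not sure what that is, but thanks for sharing!"
  end
end

liberty = SketchAnnotation.new(SketchEntity.new("Statue of Liberty"), nil, [])
puts sketch_image_response(liberty)
# prints: Is that the Statue of Liberty? I'd love to visit someday!
```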

Let’s try it out!

A classic:

It’s a bacon cheeseburger, but I’ll give you that one.

And now a landmark:

It indeed is!

And that’s it! Our bot now extracts text from images using optical character recognition, and also gives us a basic description of objects it finds in any image we send it.

Currently, our bot can only reply to one-off messages. But what if it had a “memory” and was actually able to have a real conversation? We will cover multi-step communication in Part 3.

Below is all of our code up to this point: