Handwritten digit recognizer on iOS with Keras and Core ML using the MNIST dataset

By Nathan Hubens

The goal of this tutorial is to show the full process of creating and training a Deep Learning model and implementing it in an iOS app. The use case is the “Hello World” of Deep Learning: recognizing digits from a dataset of handwritten digits, the MNIST dataset. The model is created and trained with the Keras framework and then converted into a Core ML model so that it can be used in an iOS app. The project code can be found on my GitHub.

  • Python 2.7
  • Keras 2 (version 2.0.4)
  • TensorFlow (version 1.1)
  • CoreMLTools (version 0.4.0)
  • macOS Sierra
  • Xcode 9
  • iOS 11 (only if you want to test it on a mobile device)

Keras is a high-level neural networks API released in 2015 and developed by François Chollet, an AI researcher at Google. It is a Python library capable of running on top of either TensorFlow, CNTK or Theano, and it was developed with a focus on enabling fast experimentation. It is a very user-friendly framework, really easy to use even for those who don’t have much experience in Deep Learning.

Core ML is a new machine learning framework developed by Apple, officially launched with the iOS 11 release in September 2017. This framework lets you run a trained model on an Apple mobile device. It is also very easy to use and compatible with both Swift and Objective-C.

The Keras framework already comes with the MNIST dataset, ready to be downloaded. All you need to do is import it:

from keras.datasets import mnist

The MNIST dataset contains 60,000 images of handwritten digits that we can use to train our neural network. Here are some examples from the MNIST dataset.

Handwritten Digits from the MNIST dataset
Generally, creating a neural network is done in three steps: load the data, create the model, train the model.

After importing the dataset, we can now load it. We first load X_train and y_train, respectively the training images and the training labels (the training set), as well as X_test and y_test, respectively the test images and the test labels (we will use them as a validation set, i.e. the set used during training to check how well the network is doing). All of the images are 28 pixels wide and 28 pixels high and have a single channel (i.e. greyscale images; we would have 3 channels for RGB images). We must therefore reshape X_train and X_test to the shape (number of images, 28, 28, 1) expected by the convolutional layers.
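A minimal sketch of that loading and reshaping step (standard Keras calls; the variable names match those used above):

from keras.datasets import mnist

# Load the training set and the test set
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Add the channel dimension: (number of images, 28, 28, 1)
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)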

We then normalize the image pixel values to between 0 and 1 and One Hot encode the classes. One Hot encoding represents each class by a vector in which every value is 0 except the one at the position of the actual label, which is 1. For example, instead of the label ‘4’, we will have the vector [0,0,0,0,1,0,0,0,0,0]. One Hot encoding is a common step when you deal with several classes.
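As a sketch, using the standard Keras utilities:

from keras.utils import np_utils

# Scale the pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255

# One Hot encode the labels: '4' becomes [0,0,0,0,1,0,0,0,0,0]
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)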

The model that we will implement is a Convolutional Neural Network (also called a CNN or ConvNet) with five layers. The first three layers are convolutions with a ReLU activation function and take charge of extracting the features. The output of the third layer is then flattened and fed into fully-connected layers that handle the classification part of the network. A dropout layer has also been added: dropout randomly disables a given fraction of the neurons at each training iteration, which helps the network avoid overfitting. Note that the last activation function is not a ReLU but a softmax, so that the output values can be interpreted as probabilities.
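One possible layout following that description is sketched below; the filter counts, kernel sizes and dropout rate are illustrative choices, not necessarily the exact ones used in the original project:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense, Dropout

model = Sequential()
# Three convolutional layers with ReLU activations extract the features
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
# Flatten the feature maps and classify with fully-connected layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # randomly disables half of the neurons during training
model.add(Dense(10, activation='softmax'))  # one probability per digit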

The model is then compiled using the categorical crossentropy loss function (because the classes are one-hot encoded) and the Adam optimizer.
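In Keras, this is a single call:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])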

The MNIST dataset is composed of images that have been cropped, centered and scaled to similar digit sizes. This can be a problem when you draw digits yourself in the iOS application, because your digit may not be centered, aligned or of a similar size, which could confuse the model. To make the model more robust, we can use what is called data augmentation, a technique for generating new training data. Data augmentation dynamically creates new images from the original ones simply by applying small rotations, zooms and vertical/horizontal shifts. By training the model on these more diverse images, it becomes accustomed to inputs that differ slightly from the perfectly placed ones.

Data augmentation is also a very effective way to avoid overfitting.

Remark: it is important to think about which operations make sense for the application at hand. Here, for digit recognition, it would be a bad idea to flip the images or to apply overly large rotations, as it could lead the model to confuse a 6 with a 9, for example. For other cases, such as Cat vs Dog classification, we could imagine using a horizontal flip, but again not a vertical flip, since dogs and cats are rarely photographed upside down. Always keep this in mind!

In this case, we only apply rotations of at most 10°, zooms of at most 10% and vertical/horizontal shifts of at most 10% of the image width/height.
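With Keras, this can be expressed with an ImageDataGenerator configured with those limits:

from keras.preprocessing.image import ImageDataGenerator

# Random rotations up to 10 degrees, zooms up to 10%,
# and shifts up to 10% of the image width/height
datagen = ImageDataGenerator(rotation_range=10,
                             zoom_range=0.1,
                             width_shift_range=0.1,
                             height_shift_range=0.1)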

The model is now ready to be trained.

We can now use the fit_generator method, which trains the model on the new images generated by the data augmentation step. I used a batch size of 100 for 10 epochs, which led to an accuracy of 99.3% on the validation set. This could still be improved by tweaking the model or the data augmentation phase, but it is sufficient for recognizing our handwritten digits.
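A sketch of that training call, reusing the datagen defined above:

# Train on augmented batches of 100 images for 10 epochs,
# using the test set as the validation set
model.fit_generator(datagen.flow(X_train, y_train, batch_size=100),
                    steps_per_epoch=len(X_train) // 100,
                    epochs=10,
                    validation_data=(X_test, y_test))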

Once the model is trained, you don’t want to have to retrain it every time, so you can save it (i.e. its architecture and learned weights) to be able to reuse it later. This can be done as follows:
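For example (the file name here is an arbitrary choice):

# Save the architecture and the learned weights to a single HDF5 file
model.save('mnist_cnn.h5')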

Now we can convert the trained model into a Core ML model. To do this, we specify the output labels and save the converted model. We use a scale factor of 1/255 because we trained on images whose pixel values were between 0 and 1, while the app will provide pixel values between 0 and 255. We can also add some additional information about the model.
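A sketch of that conversion with CoreMLTools; the input/output names, the file name and the metadata strings are illustrative choices:

import coremltools

output_labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# image_scale=1/255. matches the 0-1 normalization used during training
coreml_model = coremltools.converters.keras.convert(model,
                                                    input_names='image',
                                                    image_input_names='image',
                                                    output_names='output',
                                                    class_labels=output_labels,
                                                    image_scale=1/255.)

# Optional metadata displayed by Xcode
coreml_model.author = 'Nathan Hubens'
coreml_model.short_description = 'Recognizes a handwritten digit (trained on MNIST)'
coreml_model.input_description['image'] = '28x28 greyscale image of a handwritten digit'
coreml_model.output_description['output'] = 'Probability of each digit'

coreml_model.save('mnistCNN.mlmodel')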

After the Core ML model has been created, drag and drop it into the Xcode project.

Select the file and wait a moment. An arrow will appear next to the model class name once Xcode has generated the model class.

At the top, we can see some information about the model, such as its size. At the bottom, we can see the model’s expected inputs and outputs.

The Core ML model is now ready to be used in an app. We can create an app in which we draw a digit, run the inference and try to guess which digit it is.

We do this by creating a model instance, asking the model to predict the current image, and reading the resulting label. This is as simple as shown below:
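Here is a sketch of that call in Swift, inside the view controller, assuming Xcode generated a class named mnistCNN from the model, that the drawing has already been converted to a 28x28 greyscale CVPixelBuffer called pixelBuffer, and that predictionLabel is a UILabel in the view (all of these names are hypothetical):

import CoreML

let model = mnistCNN()

// Run the inference on the drawn digit and display the predicted label
if let output = try? model.prediction(image: pixelBuffer) {
    predictionLabel.text = output.classLabel
}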

In this app, we draw in white on a black background, because that is how the digits appear in the MNIST dataset.