Smile Detection: A “Hello World” for Computer Vision

Reuben Sinha
Feb 10, 2019

Humans are hardwired to read emotions off faces. This ability has been ingrained in us because of the enormous evolutionary advantage it provides. A smiling face implies agreeableness, and from an angry face we infer danger! Thus, our minds immediately evaluate a person’s underlying emotion merely by looking at their face. Fascinating! Let’s make machines do that.

A little theory first

It’s fairly simple; in fact, you’ll be capable of tinkering with simple image processing kernels within minutes of reading this article. Consider it a fair introduction to Convolutional Neural Networks and Batch Normalization.

Convolutional Neural Networks

A CNN is a special Deep Learning technique that has proven highly effective when data can be expressed as multidimensional inputs, such as images. Put simply, this type of neural network learns to identify distinctive patterns even when they shift or distort. This feat is achieved by Convolution and Pooling.

Robust feature extraction

What is convolution?

Convolution is performed by the convolution layer, which accepts a multidimensional input and produces a multidimensional output known as a convolved feature or feature map. This output is usually supplied to the subsequent layers for further processing.

Convolution is performed using matrices of weights called filters or kernels. Think of filters as digital sieves that selectively allow information to pass through. These filters are characterized by their size, stride and channels. Size is the shape of the filter, stride is the number of units by which the filter shifts at each step, and channels is the number of convolved features produced by stacking filters. The weights of these filters are tuned using back propagation to enable robust feature extraction. In the example displayed below, the filter (blue matrix) moves from top to bottom, left to right, with a stride of one, multiplying the input image pixels by the weights of the corresponding cells and summing the products to create each cell of the convolved feature/feature map.

Source: From Bits to Brains
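To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a single-channel convolution with no padding. The function name convolve2d, the 5x5 input and the 3x3 filter are all made up for illustration; like most deep learning libraries, it slides the filter without flipping it (strictly speaking, a cross-correlation).

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Multiply the patch by the filter weights cell by cell, then sum
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical input and filter, chosen only for illustration
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3)
```

A 3x3 filter moving over a 5x5 input with a stride of one fits in three positions along each axis, which is why the feature map comes out 3x3. In a real CNN the filter weights are learned rather than written by hand.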

What is Pooling?

Pooling is the aggregation of neighboring cells in a feature map based on a specified aggregation rule and serves the purpose of reducing the size of the feature map. Pooling is characterized by the size and stride of the Pooling layer. The size determines the number of cells to be considered for aggregation, while the stride determines the number of units by which focus is shifted to the next set of cells.

Please note that Pooling layers differ from filters: they use aggregation rules such as the Maximum or Average of cells, and no weights are involved. Hence, the layer has nothing for back propagation to adjust. The example given below demonstrates Max Pooling: the maximum of each block of four cells is taken, with a stride of two, moving from top to bottom and left to right to create the shrunken feature map.

Source: O’Reilly Media
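Max Pooling is just as easy to sketch in NumPy. The helper name max_pool2d and the 4x4 feature map below are invented for illustration; the window size and stride of two match the description above.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window of the feature map."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            pooled[i, j] = window.max()  # aggregation rule: no weights involved
    return pooled

# Hypothetical 4x4 feature map, chosen only for illustration
feature_map = np.array([[1.0, 3.0, 2.0, 4.0],
                        [5.0, 6.0, 1.0, 2.0],
                        [7.0, 2.0, 9.0, 0.0],
                        [1.0, 8.0, 3.0, 4.0]])

pooled = max_pool2d(feature_map)
print(pooled)
```

Each 2x2 block collapses to its largest value, so the 4x4 map shrinks to 2x2. Swapping window.max() for window.mean() would give Average Pooling instead.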

Standard CNN Model

Most established CNN models start by producing a small number of large convolved features, which grow smaller in size and larger in number as we approach the output layer, which is usually a fully connected layer. The convolution layers are interleaved with pooling layers and sometimes dropout layers to curb overfitting.

Source: Towards Data Science

Batch Normalization

Normalizing data rescales it so that the mean is zero and the standard deviation is one. The performance of neural networks declines when the distribution of the data varies, and normalizing the input helps stabilize this undesired fluctuation. Now think of every layer in the neural network as an input layer to the subsequent layers; wouldn’t it be superb to normalize the data before it is supplied to the next layer? (Oh yeah!) As the weights of the nodes change, the resulting output also changes. Thus, the distribution of the data supplied as input to subsequent layers is constantly changing due to back propagation. This results in longer training periods and possibly even stagnation of the network. Well, some folks at Google thought about this and released a paper on Batch Normalization.

The major advantages of Batch Normalization include tolerance of higher learning rates, faster convergence (the paper reports matching its baseline’s accuracy with 14 times fewer training steps), a reduced need for dropout layers and improved accuracy.

Normalization for attribute k is performed using the formula shown below, computed over mini-batches of data obtained from the previous layer.

Normalization formula
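In the Batch Normalization paper, each attribute is normalized as x_hat = (x - mean) / sqrt(variance + eps), with the mean and variance taken over the mini-batch, and the result is then scaled and shifted by two learned parameters, gamma and beta. Here is a rough NumPy sketch of that training-time transform. The function name batch_normalize and the toy batch are my own, and gamma and beta are fixed at 1 and 0 instead of being learned.

```python
import numpy as np

def batch_normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each attribute (column) over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-attribute mini-batch mean
    var = x.var(axis=0)                       # per-attribute mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * x_hat + beta               # gamma and beta are learned in practice

# Toy mini-batch: 3 examples, 2 attributes on very different scales
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

normalized = batch_normalize(batch)
print(normalized.mean(axis=0))  # close to 0 for both attributes
print(normalized.std(axis=0))   # close to 1 for both attributes
```

Both columns end up on the same scale even though the raw values differ by two orders of magnitude. At prediction time, frameworks typically use running averages of the mean and variance collected during training rather than the statistics of the current batch.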

Activation Function

Activation functions are applied to the outputs of nodes in a neural network. Without a non-linear activation function, a stack of layers collapses into a single linear transformation no matter how deep it is, so the network could not learn complex patterns, rendering all our precious computation power useless.

In this project, we’ll use Rectified Linear Unit (ReLU) as the activation function for hidden layers and Sigmoid function for the output layer.

ReLU(x) returns max(0, x)

Sigmoid(x) returns 1/(1+exp(-x))
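Both functions are one-liners in NumPy. This is only a sketch to show their behaviour (Keras ships its own implementations, which we’ll use later):

```python
import numpy as np

def relu(x):
    # Negative values are clipped to zero, positive values pass through
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

values = np.array([-2.0, 0.0, 3.0])
print(relu(values))   # [0. 0. 3.]
print(sigmoid(0.0))   # 0.5
```

Because the sigmoid output always lies between 0 and 1, it can be read as the probability of the positive class, which is why it suits the output layer of a binary classifier.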

Time to school Machines

Alright! We’ve learnt a lot in these few minutes, enough for this project. It’s about time we start coding. We shall construct a model capable of detecting smiling faces in images.

Data

The smile face data is obtained from here. It consists of images (size 64x64) of faces depicting smiles and frowns. It is the perfect dataset for our machine to take its baby steps into the realm of emotion!

The training set has 600 examples. The testing set has 150 examples.

Source: Happy House Dataset

Modules required

NumPy is the fundamental package for scientific computing with Python. It’s perfect for storing and operating on high dimensional data efficiently.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Keras is a deep learning library capable of running on top of TensorFlow. It basically turns TensorFlow into a cakewalk.

H5py is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data and easily manipulate that data from NumPy.

Sklearn (scikit-learn) is Python’s machine learning module, with numerous useful functionalities and models.

import numpy as np # linear algebra
import pandas as pd # data processing
from tensorflow import keras # neural network
import h5py # extract data
from sklearn.metrics import confusion_matrix # confusion matrix

Loading the data

Copy the code snippet to load the data from the HDF5 file. The function load_dataset() returns a tuple, neatly segregating the data into training and testing data.

def load_dataset(path_to_train, path_to_test):
    # Open each HDF5 file read-only and copy its arrays into memory
    with h5py.File(path_to_train, 'r') as train_dataset:
        train_x = np.array(train_dataset['train_set_x'][:])
        train_y = np.array(train_dataset['train_set_y'][:])
    with h5py.File(path_to_test, 'r') as test_dataset:
        test_x = np.array(test_dataset['test_set_x'][:])
        test_y = np.array(test_dataset['test_set_y'][:])
    return train_x, train_y, test_x, test_y

# Adjust these paths to wherever your copies of the files live
(train_x, train_y, test_x, test_y) = load_dataset('../input/train_happy.h5', '../input/test_happy.h5')
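Once the arrays are loaded, a quick sanity check on shapes and label balance is cheap insurance. The helper below is hypothetical (summarize_split is not part of any library), and it runs here on stand-in arrays with the shapes described in this article rather than on the real data:

```python
import numpy as np

def summarize_split(x, y):
    """Return the image array shape and the fraction of positive (smiling) labels."""
    y = np.asarray(y).reshape(-1)
    return x.shape, float(y.mean())

# Stand-in arrays (not the real dataset): 600 RGB images of 64x64 pixels
fake_x = np.zeros((600, 64, 64, 3), dtype=np.uint8)
fake_y = np.array([0, 1] * 300)

shape, positive_rate = summarize_split(fake_x, fake_y)
print(shape, positive_rate)  # (600, 64, 64, 3) 0.5
```

With the real data you would call summarize_split(train_x, train_y) and summarize_split(test_x, test_y). A positive rate far from 0.5 would mean that plain accuracy is a misleading metric.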

Neural Network Model

The Neural network we implement is a simple sequential model with Convolution layers, Batch Normalization Layers, Max Pooling layers and Dense Layers.

model = keras.Sequential([
    # Input layer
    keras.layers.BatchNormalization(input_shape=(64, 64, 3)),

    # First convolution block
    keras.layers.Conv2D(filters=32, kernel_size=5, padding='same', activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=2, strides=2),

    # Second convolution block
    keras.layers.Conv2D(filters=64, kernel_size=5, padding='same', activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=2, strides=2),

    # Fully connected layers
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(84, activation='relu'),

    # Output layer
    keras.layers.Dense(1, activation='sigmoid')
])

Two 2D convolution layers are used, with 32 and 64 filters respectively, a filter size of 5x5 and ReLU activation. Padding is included to prevent the dimensions of the resulting convolved features from shrinking. It is a common practice to avoid loss of information in the initial layers of the network.

A Batch Normalization layer is used after every convolution layer in order to normalize the convolved features. I also included batch normalization as the input layer to transform the images and feed normalized input to the subsequent CNN.

The Max Pooling layer aggregates cells in the resulting convolved features, thereby shrinking their size.

The Flattening layer converts the stacked 2D convolved features into a single 1D vector so that the fully connected layers can work on it.

Dense layers are fully connected layers. They can only accept inputs expressed in one dimension. The inner layers use ReLU activation. However, the output layer uses the Sigmoid function, as the model performs binary classification (smiling or frowning).

The input layer has an argument, input_shape = (64, 64, 3). This is because the images supplied have a width and height of 64 units and 3 color channels (RGB).

The output layer is a dense layer with a single node. This node is only required to produce values between 1 (Smiling) and 0 (Not smiling) using the Sigmoid activation function.
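To see how the feature maps shrink in size and grow in number, we can trace the shapes through the network by hand. This is only back-of-the-envelope arithmetic with two hypothetical helpers. It assumes same-padding convolutions with a stride of one (which keep width and height) and 2x2 pooling with a stride of two, as in the model above. Batch normalization leaves the shape untouched, so it is skipped.

```python
def conv_same(shape, filters):
    """Same-padded, stride-1 convolution: width/height kept, channels = number of filters."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, pool_size=2, stride=2):
    """Unpadded pooling: each spatial dimension shrinks, channels are kept."""
    height, width, channels = shape
    return ((height - pool_size) // stride + 1,
            (width - pool_size) // stride + 1,
            channels)

shape = (64, 64, 3)            # input image
shape = conv_same(shape, 32)   # (64, 64, 32)
shape = max_pool(shape)        # (32, 32, 32)
shape = conv_same(shape, 64)   # (32, 32, 64)
shape = max_pool(shape)        # (16, 16, 64)

flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)  # (16, 16, 64) 16384
```

So the first dense layer receives a vector of 16,384 values. Calling model.summary() in Keras prints the same kind of trace, along with the parameter counts.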

Model Compiling

Before training a model, you need to configure the learning process, which is done via the compile() method. It accepts optimizer, loss and metrics as arguments.

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

The optimizer argument accepts the optimization algorithm. These are usually derived from classic gradient descent. We use one such optimizer called Adam, which is capable of adapting the learning rate for every parameter.

The loss function measures how well the model is performing and is the quantity that back propagation minimizes as training iterates through the epochs. It provides a quantitative representation of the difference between the expected output and the observed output. For this project, we use Binary Cross Entropy as the task involves binary classification.

Metrics are used to track the performance of our model through every epoch. Usually, accuracy (the fraction of correct predictions) is the metric used to evaluate the performance of a neural network during training.

Source: ResearchGate
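For intuition, Binary Cross Entropy can be written in a few lines of NumPy. This is a simplified sketch, not Keras’s actual implementation; the function name and the clipping constant eps are my own choices.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

# Made-up predictions for one smiling (1) and one non-smiling (0) face
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss stays small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is exactly the signal back propagation needs.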

Model Training

The model is trained by calling the fit() function. Several arguments are passed to populate certain training parameters.

callback_list = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, verbose=1)]
fit_data = model.fit(train_x, train_y, epochs=32, validation_split=0.2, callbacks=callback_list)

Epochs is the number of complete passes over the training data for which the model will be trained.

First Epoch of training

Validation split divides the training dataset into two sets, where one is used for training and the other is used to compute the validation metrics after every epoch. Note that Keras does not pick the validation samples at random: it holds out the last fraction of the data, before any shuffling. In our case, the last 20% of the data is used for validation.
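One detail worth knowing: according to the Keras documentation, validation_split takes the held-out samples from the end of the arrays, before any shuffling. A minimal sketch of that slicing, with a hypothetical helper name (split_like_keras) and stand-in data in place of the real images:

```python
import numpy as np

def split_like_keras(x, y, validation_split=0.2):
    """Hold out the LAST fraction of the samples for validation."""
    split_at = int(len(x) * (1 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Stand-in data: the sample index doubles as the "image"
x = np.arange(600)
y = np.arange(600) % 2

(train_part, _), (val_part, _) = split_like_keras(x, y)
print(len(train_part), len(val_part))  # 480 120
```

If a dataset happens to be sorted by label, the last 20% could contain a single class, so shuffling the arrays yourself before calling fit() is a sensible precaution.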

Early stopping is a callback that monitors the validation loss (the loss measured on the validation set). Since we’ve set patience to 5, the training will stop if the validation loss hasn’t improved for 5 consecutive epochs.

Early stopping has occurred, stopping the training at the 26th epoch
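The patience logic can be mimicked in plain Python. The sketch below is a simplified stand-in for what the EarlyStopping callback does (the real callback has more options, such as min_delta), and the list of validation losses is invented for illustration:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value (0.60) is reached at epoch 3
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.66, 0.63, 0.50]
print(early_stopping_epoch(losses))  # 8
```

Training stops at epoch 8, five epochs after the last improvement, so the better loss waiting at epoch 9 is never seen. That is the trade-off patience controls: a larger value wastes more epochs but is less likely to quit just before a recovery.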

Visualization of model metrics

Phew! We’ve come a long way. Let’s sit back and visualize the die-hard training we’ve subjected our model to. It’s interesting to note the gap between training accuracy and validation accuracy. Did you notice how the validation accuracy swings? (Behold life! Peaks and valleys.) Swings like these are typical when back propagation updates the weights on noisy mini-batches and the validation set is small (here, 20% of 600 images, i.e. 120).

Validation accuracy tends to be lower than the training accuracy

Training accuracy vs Validation accuracy

Let’s compare the training loss and the validation loss. Again, the oscillations just won’t stop. But as training proceeds, the validation loss decreases drastically until it almost coincides with the training loss.

Validation loss tends to be higher than the Training loss

Training Loss vs Validation Loss

Evaluation of our Model

Boom! Our model is mighty and has performed quite well on the test data with an accuracy of 95.33%.

Accuracy of our model

However, merely tracking accuracy isn’t sufficient for this project. Since this is a classification task, we must check which classes the errors fall into, which is why a Confusion Matrix has to be produced. And surprise, surprise! (Not really.) Our model has performed quite well, misclassifying only 7 of the 150 test images.

Confusion Matrix
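A confusion matrix for a binary classifier is simple enough to build by hand. The sketch below thresholds the sigmoid outputs at 0.5 (my assumption, though it is the usual choice) and counts each (actual, predicted) pair. The labels and probabilities are made up, and the layout of rows for actual classes and columns for predicted classes follows the convention of sklearn’s confusion_matrix.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_prob, threshold=0.5):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true).reshape(-1).astype(int)
    # Turn sigmoid probabilities into hard 0/1 predictions
    y_pred = (np.asarray(y_prob).reshape(-1) > threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual, predicted] += 1
    return matrix

# Made-up labels and model outputs, for illustration only
y_true = [1, 0, 1, 1, 0]
y_prob = [0.92, 0.10, 0.35, 0.80, 0.60]

matrix = binary_confusion_matrix(y_true, y_prob)
accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(accuracy)  # 0.6
```

The diagonal holds the correct predictions and the off-diagonal cells hold the two kinds of mistakes: a frown read as a smile and a smile read as a frown. On the real model you would pass test_y and the output of model.predict(test_x).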

Conclusion

And there you have it: the machine knows what we feel, and you’ve just plunged head first into Deep Learning. For a relatively simple CNN, our model has performed considerably well. However, for complex tasks you should utilize tried and tested architectures, from the classic LeNet to top performers of the ImageNet challenge such as VGG, GoogLeNet and ResNet, which are significantly more powerful and far more accurate.

Post word

Thanks for reading! I wanted this to be a simple introduction to Computer Vision and hope I’ve remained true to that. I’ll be exploring more of deep learning, and everything I learn will be documented. Stay tuned and let’s embark on this journey together.

Further Study

[1] Convolutional Neural Networks, Coursera

[2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
