Deep learning and vision, from simple manipulation to image classification: Part 2


Now that we have revisited some basic concepts related to computer vision in our previous post, it is time to move forward and explore more sophisticated algorithms that will recognize either a dog or cat in a given image.

Throughout this post, we will work with the Dogs vs. Cats problem from Kaggle and its data, which is available from the competition page on Kaggle. You’ll need to register with Kaggle in order to download the train and test data.

After you register and download the data, we’ll perform an exploratory analysis and then build, train, and evaluate a convolutional neural network for binary classification. Following the label encoding used in the code later in this post, the model will output 0 when it determines that the image contains a cat and 1 when it contains a dog.

[Step 1] Data exploration:

As stated before, data exploration is, most of the time, the first step we need to take before we even try to come up with preliminary experiments. By just looking at the contents of the two archives we’ve downloaded, train.zip and test1.zip, we can spot the following details:

Table 1: Initial dataset observations

As our test set is not labeled, it will not be possible for us to use it for getting performance metrics. The files will, therefore, be only used to generate the final submission file for the Kaggle judge.

Another important observation we can make by opening some of the images in the test and train sets is that they seem to be different in size and aspect ratio. In order to confirm this, we’ll randomly plot and compare some of them.

Snippet 1: Randomly plot images from the train set

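import glob, os
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))

plt.figure(figsize=(16, 8))
for index in range(6):
    plt.subplot(2, 3, index+1)
    # Pick a random image from the train set and display it
    img_index = np.random.randint(len(images))
    plt.imshow(Image.open(images[img_index]))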

Figure 1: Sample images from the training set

As we run the above script several times, we observe that our intuition was right: images differ from each other in size and aspect ratio. Normalization seems to be needed, but several questions arise almost immediately: What size should we use when resizing and normalizing all the images so they can later be used to train our model? Shouldn’t the new size be chosen so that it works for both larger and smaller images? Finally, what proportion of images are small, medium, or large?

To address those questions, we prepare the following script to get the distribution over height and width (in 100-pixel ranges) for each image in the train set:

Snippet 2: Distribution over size in the training set


arr_h = [0] * 10
arr_w = [0] * 10
max_w = max_h = 0
min_w = min_h = float("inf")

for img_index in range(len(images)):
    # PIL reports the size as (width, height)
    img_w, img_h = Image.open(images[img_index]).size

    # 100-pixel ranges: 0-99, 100-199, ...; the last range also holds anything larger
    arr_w[min(img_w // 100, len(arr_w) - 1)] += 1
    arr_h[min(img_h // 100, len(arr_h) - 1)] += 1

    max_w, min_w = max(max_w, img_w), min(min_w, img_w)
    max_h, min_h = max(max_h, img_h), min(min_h, img_h)

print("Max Width: %i - Min Width: %i \nMax Height: %i - Min Height: %i" % (max_w, min_w, max_h, min_h))

If we plot the arr_w and arr_h vectors containing the number of images with width and height ranging from 0 to 1,000 pixels (in 100-pixel intervals), we observe that the majority of them are smaller than 400 x 400 pixels.
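
To make the binning concrete, here is a minimal, self-contained sketch of how a width or height can be mapped to a 100-pixel range. The helper name size_bin and the sample sizes are made up for illustration and are not part of the original code. The sketch assumes zero-based ranges (range 0 covers 0 to 99 pixels) and puts anything of 900 pixels or more into the last range.

```python
def size_bin(pixels, bin_width=100, n_bins=10):
    """Return the index of the 100-pixel range that `pixels` falls into."""
    # Integer division picks the range; min() keeps very large sizes in the last one.
    return min(pixels // bin_width, n_bins - 1)


# Hypothetical (width, height) pairs, not taken from the dataset.
sample_sizes = [(499, 375), (320, 240), (1050, 702), (60, 40)]

width_hist = [0] * 10
for width, _height in sample_sizes:
    width_hist[size_bin(width)] += 1

print(width_hist)  # [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
```

Plotting a list like width_hist as a bar chart gives the kind of distribution shown in the figure that follows.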

Figure 2: Height and width distributions


We can now come up with a strategy for resizing and padding our images. This is the only preprocessing task we’ll do before training our convolutional neural network. The resizeImg and padImg functions will maintain the original aspect ratio for each image while padding, if necessary, for images with different aspect ratios:

Snippet 3: Resizing and padding functions


resize_default = 64

def resizeImg(image):
    img_w, img_h = image.size

    # Scale the longer side to resize_default and keep the aspect ratio
    if img_w >= img_h:
        new_size = (resize_default, max(1, int(resize_default * img_h / img_w)))
    else:
        new_size = (max(1, int(resize_default * img_w / img_h)), resize_default)

    # Image.LANCZOS is the filter formerly exposed as Image.ANTIALIAS
    return image.resize(new_size, Image.LANCZOS)

def padImg(image):
    img_w, img_h = image.size

    # Pad to a square canvas, centering the image on a black background
    if img_w > resize_default or img_h > resize_default:
        side = max(img_w, img_h)
        new_size = (side, side)
    else:
        new_size = (resize_default, resize_default)

    img = Image.new("RGB", new_size)
    img.paste(image, ((new_size[0] - img_w) // 2, (new_size[1] - img_h) // 2))

    return img

#testImage = Image.open(images[int(np.random.uniform(0, 24999))])
testImage = Image.open(images[468])
resized = resizeImg(testImage)
padded = padImg(resized)

# Show the original, resized and padded versions side by side
plt.figure(figsize=(12, 8))
plt.subplot(1, 3, 1)
plt.imshow(testImage)
plt.subplot(1, 3, 2)
plt.imshow(resized)
plt.subplot(1, 3, 3)
plt.imshow(padded)

Calling both functions produces the following output:

Figure 3: Padding and resizing of images


All images will be resized to 64×64 pixels and padded vertically or horizontally, if necessary. We can batch process all images as a preliminary step or include the functions right before we provide the samples to the trainer when fitting the model.
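
The arithmetic behind resizing and padding can be checked without loading any image. The following sketch only computes the sizes and offsets involved. The function names resized_size and paste_offset and the 500×375 example are illustrative assumptions, not part of the original code.

```python
def resized_size(img_w, img_h, target=64):
    """Scale the longer side to `target`, keeping the aspect ratio."""
    if img_w >= img_h:
        return target, max(1, int(target * img_h / img_w))
    return max(1, int(target * img_w / img_h)), target


def paste_offset(img_w, img_h, target=64):
    """Top-left corner that centers an img_w x img_h image on a target x target canvas."""
    return (target - img_w) // 2, (target - img_h) // 2


new_w, new_h = resized_size(500, 375)  # a hypothetical landscape image
print((new_w, new_h), paste_offset(new_w, new_h))  # (64, 48) (0, 8)
```

In this example the landscape image becomes 64×48 pixels and is pasted 8 pixels from the top, which leaves an 8-pixel band above and below it on the 64×64 canvas.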

[Step 2] Building the convolutional neural network:

Up to this point, we’re familiar with convolutions for image processing. We’ve also explored the data we have available and decided that padding and resizing are needed in order to provide our model with a normalized input pattern. Each 64×64 pixel image has three color channels, so it amounts to 64 × 64 × 3 = 12,288 input features, which we will feed into a 2-class classifier. For every image we feed into the convolutional network, it’ll try to predict whether the input belongs to the class cat or dog.

In addition to the two functions we’ve already seen for resizing and padding, we’ll need some other ones before we train the network. The get_label and getXYBatch functions shown in Snippet 4 are explained below:

get_label: as we’ll get an output vector for every input pattern (or image), each label is a 2-element vector. There are only two possible values for it: [1, 0] and [0, 1]. The first one counts as “cat” whereas the second one counts as “dog” in terms of the result the network is predicting.

getXYBatch: given that our computers don’t have infinite memory, loading all 25,000 images at once for training is impractical. Instead, we will resize and pad batches of 60 to 500 images and feed them to the trainer in the training steps.
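
Both ideas can be sketched with the standard library alone. The example below follows the mapping used in the code of Snippet 4 (cat is [1, 0], dog is [0, 1]) and assumes file names that start with "cat" or "dog". The helper names label_from_name, decode, and sample_batch, as well as the file names, are hypothetical.

```python
import random


def label_from_name(filename):
    """One-hot label, assuming names like 'cat.0.jpg' or 'dog.0.jpg'."""
    return [1, 0] if filename.startswith("cat") else [0, 1]


def decode(one_hot):
    """Index of the largest entry: 0 means cat, 1 means dog."""
    return one_hot.index(max(one_hot))


def sample_batch(items, batch_size, rng=random):
    """Pick batch_size distinct items, i.e. sampling without replacement."""
    return rng.sample(items, batch_size)


names = ["cat.0.jpg", "dog.0.jpg", "cat.1.jpg", "dog.1.jpg", "dog.2.jpg"]
batch = sample_batch(names, 3, random.Random(0))
labels = [label_from_name(name) for name in batch]
print(batch, [decode(label) for label in labels])
```

Sampling without replacement means an image cannot appear twice in the same batch, although it can be drawn again in a later batch.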

Snippet 4: get_label and getXYBatch functions

# extract labels
# positives (cat) = [1, 0], negatives (dog) = [0, 1]
def get_label(path):
    if os.path.basename(path).startswith('cat'):
        return np.array([1, 0])
    else:
        return np.array([0, 1])

def getXYBatch(X_input, Y_input, batch_size):
    # Sample batch_size distinct images, then resize, pad and scale each one to [0, 1]
    choice = np.random.choice(len(X_input), batch_size, replace=False)
    X_array = np.array([np.array(padImg(resizeImg(Image.open(X_input[item])))) / 255
                        for item in choice])
    Y_array = np.array([Y_input[item] for item in choice])

    X_array = X_array.reshape([-1,resize_default,resize_default,3])

    return X_array, Y_array

Now we split the train set into two parts: one for actual training and one for validation. We’ll hold out 20% of the training images to measure how well the model is performing after, let’s say, 100 iterations. The following code will do it for us:

Snippet 5: Splitting the training set

from sklearn.model_selection import train_test_split

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))

# extract pixels
data_images = images

data_labels = np.array([get_label(p) for p in images])
data_labels_out = np.argmax(data_labels, 1)

print("Positive samples: %i\nNegative samples: %i \n" % (len(data_labels_out)-np.count_nonzero(data_labels_out)
                                                      , np.count_nonzero(data_labels_out)))
#Split Data Sets
X_train, X_test, y_train, y_test = train_test_split(data_images, data_labels, test_size=0.2)
y_train_out = np.argmax(y_train, 1)
y_test_out = np.argmax(y_test, 1)
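
For readers who want to see what a hold-out split does conceptually, here is a minimal standard-library sketch. It is not scikit-learn’s implementation, only an illustration of shuffling and slicing. The function name holdout_split, the fixed seed, and the file names are assumptions made for this example.

```python
import random


def holdout_split(items, labels, test_size=0.2, seed=0):
    """Shuffle items and labels together, then set aside a validation share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical file names; even indices are class 0, odd ones are class 1.
files = ["img%d.jpg" % i for i in range(10)]
classes = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = holdout_split(files, classes)
print(len(X_tr), len(X_te))  # 8 2
```

The important property is that every file keeps its own label after shuffling and that no file ends up in both parts.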

Finally, before jumping into the model’s code itself, assuming we’re excited about it, we’ll define some convenience functions to simplify the layers’ construction:

dropout: turn off hidden neurons given a probability (only in the training phase).
weight_variable: variables for the neurons’ weights.
bias_variable: variables for the neurons’ biases.
conv2d: convolution between the input and weights, with strides 1 and padding ‘SAME’
max_pool_2x2: max pooling operation, keeps only the maximum elements after each convolutional layer.
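
To illustrate the last item, the sketch below applies 2×2 max pooling with stride 2 to a small feature map using plain Python lists. It is a simplified stand-in for what the TensorFlow operation does on each channel. The function name max_pool_2x2_plain and the input values are made up for this example, and the input is assumed to have even height and width.

```python
def max_pool_2x2_plain(feature_map):
    """2x2 max pooling with stride 2 on a 2-D list with even height and width."""
    pooled = []
    for row in range(0, len(feature_map), 2):
        pooled_row = []
        for col in range(0, len(feature_map[0]), 2):
            window = [feature_map[row][col], feature_map[row][col + 1],
                      feature_map[row + 1][col], feature_map[row + 1][col + 1]]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2_plain(feature_map))  # [[4, 2], [2, 8]]
```

Each output value is the maximum of one 2×2 window, so the height and width are halved while the strongest activations are kept.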

Snippet 6: Common TensorFlow methods

# The snippets in this post use the TensorFlow 1.x API (placeholders and sessions)
import tensorflow as tf

def dropout(x, prob, train_phase):
    # Apply dropout only while training; pass x through unchanged otherwise
    return tf.cond(train_phase,
                   lambda: tf.nn.dropout(x, prob),
                   lambda: x)

def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

Now, let’s build the layers of the network. Our model will have an input layer followed by convolution and max-pooling layers. In the last part of the network architecture, we will flatten the feature maps and have a fully connected layer. A representation of the model is shown in Figure 4.

Figure 4: Neural Network Architecture


We define two placeholders, x and y: x holds the 64×64 pixel images and y holds their 2-element labels. As the images use the RGB color model (3 channels), the final shape for the input layer will be 64x64x3.

Snippet 7: Network implementation

sess = tf.InteractiveSession()

# tf Graph Input
x = tf.placeholder(tf.float32, [None,64,64,3])
y = tf.placeholder(tf.float32, [None, 2])

# dropout placeholder
keep_prob = tf.placeholder(tf.float32)

# train flag placeholder
train_phase = tf.placeholder(tf.bool) # For Batch Normalization

# Set model weights
W1 = weight_variable([3, 3, 3, 32])
b1 = bias_variable([32])

W2 = weight_variable([3, 3, 32, 64])
b2 = bias_variable([64])

W3 = weight_variable([3, 3, 64, 64])
b3 = bias_variable([64])

W4 = weight_variable([16 * 16 * 64, 512])
b4 = bias_variable([512])

W5 = weight_variable([512, 2])
b5 = bias_variable([2])

# hidden layers
conv1 = tf.nn.relu(conv2d(x, W1) + b1)
maxp1 = max_pool_2x2(conv1)

conv2 = tf.nn.relu(conv2d(maxp1, W2) + b2)
#maxp2 = max_pool_2x2(conv2)

conv3 = tf.nn.relu(conv2d(conv2, W3) + b3)
maxp3 = max_pool_2x2(conv3)

# fully connected
maxp3_flat = tf.reshape(maxp3, [-1, 16 * 16 * 64])

full1 = tf.nn.relu(tf.matmul(maxp3_flat, W4) + b4)
drop1 = tf.nn.dropout(full1, keep_prob)

output = tf.matmul(drop1, W5) + b5

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y))

all_variables = tf.trainable_variables()
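
The 16 * 16 * 64 in the shape of W4 follows from the layers above it. With 'SAME' padding, a convolution or pooling layer with stride s turns a spatial size n into ceil(n / s). The short sketch below traces the shapes through the layers of Snippet 7 using only that rule. The helper name same_padding_output and the layers list are illustrative assumptions.

```python
import math


def same_padding_output(size, stride):
    """Spatial size after a 'SAME'-padded convolution or pooling: ceil(size / stride)."""
    return math.ceil(size / stride)


# (layer name, stride, output channels); None keeps the channel count unchanged.
layers = [
    ("conv1", 1, 32),
    ("maxp1", 2, None),
    ("conv2", 1, 64),
    ("conv3", 1, 64),
    ("maxp3", 2, None),
]

size, channels = 64, 3
for name, stride, out_channels in layers:
    size = same_padding_output(size, stride)
    channels = out_channels or channels
    print(f"{name}: {size} x {size} x {channels}")

flat_features = size * size * channels
print(flat_features)  # 16384, i.e. 16 * 16 * 64
```

Because the second pooling layer (maxp2) is commented out, the image is halved only twice, from 64 to 32 to 16, which is why the flattened vector has 16 * 16 * 64 entries.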

Describing each of the functions and methods used would be tedious and would make this post very long, so feel free to browse the official TensorFlow documentation for those that you are interested in: https://www.tensorflow.org/api_docs/.

You may also want to revisit some concepts related to learning and optimization such as Loss Functions, Stochastic Gradient Descent, and Cross Entropy.
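
As a quick refresher on the loss used in Snippet 7, the following sketch computes softmax and cross entropy for a single image in plain Python. It is a simplified illustration of the math, not TensorFlow’s implementation, and the logits are hypothetical values.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [value - max(logits) for value in logits]  # improves numerical stability
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(probabilities, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -sum(label * math.log(p) for label, p in zip(one_hot_label, probabilities))


logits = [2.0, 0.5]  # hypothetical network output for one image
probs = softmax(logits)
print(probs, cross_entropy(probs, [1, 0]), cross_entropy(probs, [0, 1]))
```

The loss is small when the network puts most of the probability on the correct class and grows as the probability of the correct class approaches zero. The training loss is the mean of this value over a batch.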

[Step 3] Training time:

Now we just need to define some hyperparameters and let the trainer fit the model to our training data. We’ll display the model accuracy at the first step and then after every 100 steps. Running the snippet below will show the training progress as shown in Figure 5.

Snippet 8: Trainer

# Hyper-parameters
training_steps = 2000
batch_size = 500
display_step = 100

# Mini-batch Gradient Descent
# accuracy, train_step, learning_rate and saver come from the model definition
# (not shown here; see the full code linked at the end of the post)
training_accuracy = []
training_loss     = []

for i in range(training_steps):

    X,Y = getXYBatch(X_train,y_train,batch_size)

    batch_accuracy, batch_loss, _ = sess.run([accuracy, loss, train_step],
                                             feed_dict={x:X, y:Y, train_phase: True, keep_prob: 1.0})
    # Keep the history so it can be plotted after training
    training_accuracy.append(batch_accuracy)
    training_loss.append(batch_loss)

    # Displaying info
    if (i+1)%display_step == 0 or i == 0:
        print("Step %05d: accuracy=%.4f\tloss=%.6f\tlearning rate=%.6f" %
              (i+1, batch_accuracy, batch_loss, learning_rate.eval()))

save_path = saver.save(sess, "./saved/model2K.ckpt")
print("Model saved in file: %s" % save_path)

plot_titles = ["Training accuracy", "Training Loss"]
for i, plot_data in enumerate([training_accuracy, training_loss]):
    plt.subplot(1, 2, i+1)
    plt.title(plot_titles[i])
    plt.plot(plot_data)

Figure 5: Progress while training


We can also plot the accuracy and loss at each training step. In an ideal scenario, the accuracy will increase over time, whereas the loss will decrease.
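
The definition of the accuracy value reported by the trainer is not shown in this post. As a point of reference, classification accuracy is commonly computed as the share of samples whose highest-scoring class matches the label. The sketch below does this in plain Python. The function name batch_accuracy_plain and the example scores are hypothetical.

```python
def batch_accuracy_plain(outputs, labels):
    """Share of samples whose highest-scoring class matches the one-hot label."""
    hits = 0
    for scores, label in zip(outputs, labels):
        if scores.index(max(scores)) == label.index(max(label)):
            hits += 1
    return hits / len(labels)


# Hypothetical network outputs and one-hot labels for a batch of four images.
outputs = [[2.1, -0.3], [0.2, 1.7], [0.9, 1.1], [1.5, 0.4]]
labels = [[1, 0], [0, 1], [1, 0], [1, 0]]
print(batch_accuracy_plain(outputs, labels))  # 0.75
```

Here three of the four predictions match their labels, so the accuracy for the batch is 0.75.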

Figure 6: Training accuracy and Loss


[Step 4] Using the model with different images:

Our final test consists of using the model with an image it hasn’t seen during training. The code below picks a random image from the held-out split; we can also browse for cats or dogs on the internet and pass those images to the classifier in the same way:

Snippet 9: Using the model

# Pick a random image from the held-out split
test_img = Image.open(X_test[np.random.randint(len(X_test))])

input_array = np.array(padImg(resizeImg(test_img))).reshape([-1]) / 255
input_array = input_array.reshape([-1,64,64,3])

# Take the argmax over the network output (logits), not over the label placeholder y
prediction = sess.run(tf.argmax(output, 1), feed_dict={x: input_array, train_phase: False, keep_prob: 1.0})
print("Predicted: " + ("Cat" if prediction[0]==0 else "Dog"))

Figure 7: Model output with an unseen image


Hopefully, the model will accurately predict the class (cat or dog) for each image we input. However, there are several other techniques we can use from this point in order to make the model more accurate.


In this post, we’ve built a convolutional neural network model capable of classifying images based on whether they contain a cat or dog. While we didn’t revisit all the terms and concepts required to fully understand what we coded and why, it’s a good starting point to see how these techniques can be used in real-life scenarios. Have you ever seen a captcha asking you to click on images containing, let’s say, cars, in order to verify you are not a bot? Can you think of other possible use cases for this type of binary classification?

Find the full code of this post at: https://github.com/gariem/samples/tree/master/meetup/santex-machinelearning/cats-dogs
