Deep learning and vision, from simple manipulation to image classification: Part 2

Introduction:

After we revisited some basic concepts related to computer visión in our previous post, it is time to move forward and explore more sophisticated algorithms that will recognize either a dog or cat in a given image.

Through this post, we will work with the Dogs vs Cats problem from Kaggle and its data, which can be found here. You’ll need to register with Kaggle in order to download the train and test data.

After you register and download the data, we’ll perform an exploratory analysis and then build, train and evaluate a convolutional neural network for binomial classification. The model will output 0 or 1 for the cases when it determines that the image contains a dog or cat respectively.

[Step 1] Data exploration:

As stated before, data exploration is -most of the time- the first step before we even try to come with preliminary experiments. By just looking at the files in each of the files train.zip and test1.zip we’ve downloaded, we can spot the following details:

Table 1: Initial dataset observations

As our test set is not labeled, it will not be possible for us to use it for getting performance metrics. The files will, therefore, be only used to generate the final submission file for the Kaggle judge.

Another important observation we can make by opening some of the images in the test and train sets is that they seem to be different in size and aspect ratio. In order to confirm this, we’ll randomly plot and compare some of them.

Snippet 1: Randomly plot images from the train set

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))

plt.figure(figsize=(16, 8))
for index in range(6):
    plt.subplot(2, 3, index+1)
    img_index=int(np.random.uniform(0, 24999))
    plt.imshow(plt.imread(images[img_index]))
    plt.title(images[img_index])

Figure 1: Sample images from the training set

As we run the above script several times, we observe that our intuition was right: images differ from each other in size and aspect ratio. Normalization seems to be needed but immediately several questions arise: What would be the size we will use for resizing and normalizing all the images so they can later be used to train our model? Wouldn’t the new size need to be determined so it works for both larger and smaller images? Finally, what proportion of images are small, medium or large?

To address those questions, we prepare the following script to get the distribution over height and width (in 100-pixel ranges) for each image in the train set:

Snippet 2: Distribution oversize in the training set

max_w=0
max_h=0
min_w=2048
min_h=2048

arr_h=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
arr_w=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for img_index in range(len(images)):
    img=Image.open(images[img_index]).size
    img_w=img[0]
    img_h=img[1]
    
    arr_w[int(img_w / 100)-1] += 1
    arr_h[int(img_h / 100)-1] += 1

    if img_w > max_w: max_w = img_w
    elif img_w < min_w: min_w = img_w if img_h > max_h: max_h = img_h
    elif img_h < min_h: min_h = img_h

print("Max Width: %i - Min Width: %i \nMax Height: %i - Min Height: %i" % (max_w, min_w, max_h, min_h))

If we plot the arr_w and arr_h vectors containing the number of images with width and height ranging from 0 to 1,000 pixels (in 100-pixel intervals), we observe that the majority of them are smaller than 400 x 400 pixel.

Figure 2: Height and width distributions

We can now come up with a strategy for resizing and padding our images as the only preprocessing task we’ll do before training our convolutional neural network. The resizeImg and padImg functions will maintain the original aspect ratio for each image while padding if necessary for images with different aspect ratios:

Snippet 3: Resizing and padding functions

resize_default=64

def resizeImg(image):
    
    img_w=image.size[0]
    img_h=image.size[1]
    slot = int(img_w/100) +1 if img_w > img_h else int(img_h/100) +1 
    
    if slot!=0:
        if img_w >= img_h:
            img = image.resize((resize_default,int(resize_default*img_h/img_w)), Image.ANTIALIAS)
        else:
            img = image.resize((int(resize_default*img_w/img_h),resize_default), Image.ANTIALIAS)
        
    return img;

def padImg(image):
    
    img_w=image.size[0]
    img_h=image.size[1]
    
    if img_w > resize_default or img_h > resize_default:
        if img_w >= img_h:
            new_size = (img_w, img_w)
        else:
            new_size = (img_h, img_h)
    else:
        new_size = (resize_default, resize_default)
        
    img = Image.new("RGB", new_size)
    img.paste(image, (int((new_size[0]-img_w)/2),int((new_size[1]-img_h)/2)))
        
    return img;


#testImage = Image.open(images[int(np.random.uniform(0, 24999))])
testImage = Image.open(images[468])
resized = resizeImg(testImage)
padded = padImg(resized)

plt.figure(figsize=(12, 8))
plt.subplot(1, 3, 1)
plt.imshow(testImage)
plt.title("Original")
plt.subplot(1, 3, 2)
plt.imshow(resized)
plt.title("Resized")
plt.subplot(1, 3, 3)
plt.imshow(padded)
plt.title("Padded")

Calling both functions will have the following output:

Figure 3: Padding and resizing of images

All images will be resized to 64×64 pixels and if padded vertically or horizontally if necessary. We can batch process all images as a preliminary step or include the functions right before we provide the samples to the trainer when fitting the model.

[Step 2] Building the convolutional neural network:

Up to this point, we’re familiar with convolutions for image processing. We’ve also explored the data we have available and decided that padding and resizing are needed in order to provide our model a normalized input pattern. The 64×64 pixel image equals to 4,096 features (input neurons) we need to fit into a 2-class classifier. It means that for every 64×64 pixel image we feed into the convolutional network, it’ll try to predict whether the input data belong to the classes cat or dog.

In addition to the two functions we’ve already seen for resizing and padding, we’ll need some other ones before we train the network. The get_label and getXYBatch functions shown in Snippet 4 are explained below:

Get_label: as we’ll get an output vector for every input pattern (or image), it will have a 2-element vector shape. There are only two possible values for the resulting vector: [0, 1] and [1, 0]. The first one will count as “cat” whereas the second one as “dog” in terms of the result the network is predicting.

getXYBatch: given our computer don’t have infinite memory, allocating all the 25,000 images for training is just not possible. We’ll resize and pad batches of 60-to-500 images and then feed the trainer with them in the training steps.

Snippet 4: get_label and getXYBatch functions

# extract labels
# positives = [1, 0], negatives = [0, 1]
def get_label(path):
    if path.split('/')[-1:][0].startswith('cat'): 
        return np.array([1, 0])
    else:
        return np.array([0, 1])

def getXYBatch(X_input, Y_input, batch_size):   
    X_array = np.array(padImg(resizeImg(Image.open(X_input[0])))).reshape([-1]) / 255
    Y_array = Y_input[0]

    choice = np.random.choice(range(len(X_input)), batch_size, replace=False)
    for item in choice:
        tmpimg = np.array(padImg(resizeImg(Image.open(X_input[item])))).reshape([-1]) / 255
        X_array = np.vstack((X_array, tmpimg))
        Y_array = np.vstack((Y_array,Y_input[item]))

    X_array = X_array[1:]
    Y_array = Y_array[1:]
    
    X_array = X_array.reshape([-1,resize_default,resize_default,3])
    
    return X_array, Y_array;

Now we split the train set into two parts for actual training but also for validation. We’ll use a 10% of the training images to measure how well the model is performing after, let say, 100 iterations. The following code will do it for us:

Snippet 5: Splitting the training set

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))
random.shuffle(images)

# extract pixels
data_images = images
        
data_labels = np.array([get_label(p) for p in images])
data_labels_out = np.argmax(data_labels, 1)

print("Positive samples: %i\nNegative samples: %i \n" % (len(data_labels_out)-np.count_nonzero(data_labels_out)
                                                      , np.count_nonzero(data_labels_out)))
#Split Data Sets
X_train, X_test, y_train, y_test = train_test_split(data_images, data_labels, test_size=0.2)
y_train_out = np.argmax(y_train, 1)
y_test_out = np.argmax(y_test, 1)

Finally, before jumping into the model’s code itself (assuming we’re excited about it). We’ll define some convenience functions to simplify the layers construction:

dropout: turn off hidden neurons given a probability (only in the training phase).
weight_variable: variables for the neurons’ weights.
bias_variable: variables for the neurons’ biases.
conv2d: convolution between the input and weights, with strides 1 and padding ‘SAME’
max_pool_2x2: max pooling operation, keeps only the maximum elements after each convolutional layer.

Snippet 6: Common tensorflow methods

def dropout(x, prob, train_phase):
    return tf.cond(train_phase, 
                   lambda: tf.nn.dropout(x, prob),
                   lambda: x)

def weight_variable(shape):
  return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
  return tf.Variable(tf.constant(0.1, shape=shape))

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1], padding='SAME')

Now let’s build the layers of the network. Our model will have an input layer followed by convolution and max-pooling layers. In the last part of the network architecture, we will flatten the feature maps and have a fully connected layer. A representation of the model is shown in Figure 4.

Figure 4: Neural Network Architecture

We define two x and y variables for the 64×64 pixel images. As they use the RGB schema (3 channels), the final shape for the input layer will be 64x64x3.

Snippet 7: Network implementation

sess = tf.InteractiveSession()

# tf Graph Input
x = tf.placeholder(tf.float32, [None,64,64,3]) 
y = tf.placeholder(tf.float32, [None, 2])

# dropout placeholder
keep_prob = tf.placeholder(tf.float32)

# train flag placeholder
train_phase = tf.placeholder(tf.bool) # For Batch Normalization

# Set model weights
W1 = weight_variable([3, 3, 3, 32])
b1 = bias_variable([32])

W2 = weight_variable([3, 3, 32, 64])
b2 = bias_variable([64])

W3 = weight_variable([3, 3, 64, 64])
b3 = bias_variable([64])

W4 = weight_variable([16 * 16 * 64, 512])
b4 = bias_variable([512])

W5 = weight_variable([512, 2])
b5 = bias_variable([2])

# hidden layers
conv1 = tf.nn.relu(conv2d(x, W1) + b1)
maxp1 = max_pool_2x2(conv1)

conv2 = tf.nn.relu(conv2d(maxp1, W2) + b2)
#maxp2 = max_pool_2x2(conv2)

conv3 = tf.nn.relu(conv2d(conv2, W3) + b3)
maxp3 = max_pool_2x2(conv3)

# fully connected
maxp3_flat = tf.reshape(maxp3, [-1, 16 * 16 * 64])

full1 = tf.nn.relu(tf.matmul(maxp3_flat, W4) + b4)
drop1 = tf.nn.dropout(full1, keep_prob)

#output
output = tf.matmul(drop1, W5) + b5
softmax=tf.nn.softmax(output)

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y))

all_variables = tf.trainable_variables() 

As describing each function and method used will be tedious and make this post super long, feel free to browse in the Tensor Flow official documentation those you are interested in: https://www.tensorflow.org/api_docs/

You may also want to revisit some concepts related to earning and optimization such as Loss Functions, Stochastic Gradient Descent and Cross Entropy.

[Step 3] Training time:

Now we just need to define some hyperparameters and let the trainer fit the model to our training data. We’ll display the model accuracy after every 50 steps. Running the snippet below will show the training progress as shown in Figure 5.

Snippet 8: Trainer

# Hyper-parameters
training_steps = 2000
batch_size = 500
display_step = 100

# Mini-batch Gradient Descent
training_accuracy = []
training_loss     = []

for i in range(training_steps):
    
    X,Y = getXYBatch(X_train,y_train,batch_size)
    
    batch_accuracy, batch_loss, _ = sess.run([accuracy, loss, train_step],
                                             feed_dict={x:X, y:Y, train_phase: True, keep_prob: 1.0})
    training_accuracy.append(batch_accuracy)
    training_loss.append(batch_loss)
    # Displaying info
    if (i+1)%display_step == 0 or i == 0:
        print("Step %05d: accuracy=%.4f\tloss=%.6f\tlearning rate=%.6f" %
              (i+1, batch_accuracy, batch_loss, learning_rate.eval()))

save_path = saver.save(sess, "./saved/model2K .ckpt")
print("Model saved in file: %s" % save_path)      
        
plt.figure(figsize=(10,4))
plot_titles = ["Training accuracy", "Training Loss"]
for i, plot_data in enumerate([training_accuracy, training_loss]):
    plt.subplot(1, 2, i+1)
    plt.plot(plot_data)
    plt.title(plot_titles[i])

Figure 5: Progress while training

We can also plot the accuracy and loss at each training step. In an ideal scenario, the accuracy will increment over time whereas the loss will decrease.

Figure 6: Training accuracy and Loss

[Step 4] Using the model with different images:

Our final test consists of using the model with a completely new image that the model hasn’t seen before. We can browse for cats or dog on the internet and pass the images to the classifier using the following code:

Snippet 9: Using the model

y=tf.nn.softmax(output)
test_img = Image.open(X_test[int(np.random.uniform(0, len(X_test)))])

input_array = np.array(padImg(resizeImg(test_img))).reshape([-1]) / 255
input_array = input_array.reshape([-1,64,64,3])

softmax=tf.nn.softmax(output)
prediction = sess.run(tf.argmax(y, 1), feed_dict={x: input_array, train_phase: False, keep_prob: 1.0})
print("Predicted: " + ("Cat" if prediction[0]==0 else "Dog"))
test_img

Figure 7: Model output with an unseen image:

Hopefully, the model will predict accurately the class (cat or dog) for each image we input. However, there are several other techniques we can use from this point in order to make a more precise model.

Discussion:

In this post, we’ve built a convolutional neural network model capable of classifying images based on if they contain a cat or dog. While we didn’t revisit all the term and concepts required to fully understand what we coded and why it’s a good starting point to see how these techniques can be used in real-life scenarios. Have you ever seen a captcha asking you to click on images containing, let’s say, cars in order to verify you are not a bot? Can you think of other possible use cases for this type of binary classification?

Find the full code of this post at: https://github.com/gariem/samples/tree/master/meetup/santex-machinelearning/cats-dogs

Leave a reply