Tag Archives: deep learning

Deep learning and vision, from simple manipulation to image classification: Part 2


After revisiting some basic concepts related to computer vision in our previous post, it is time to move forward and explore more sophisticated algorithms that can recognize either a dog or a cat in a given image.

Throughout this post, we will work with the Dogs vs. Cats problem from Kaggle and its data, which can be found here. You’ll need to register with Kaggle in order to download the train and test data.

After you register and download the data, we’ll perform an exploratory analysis and then build, train and evaluate a convolutional neural network for binomial classification. The model will output 0 when it determines that the image contains a cat and 1 when it contains a dog.

[Step 1] Data exploration:

As stated before, data exploration is -most of the time- the first step before we even try to come up with preliminary experiments. Just by looking at the files in the train.zip and test1.zip archives we’ve downloaded, we can spot the following details:

Table 1: Initial dataset observations

As our test set is not labeled, it will not be possible for us to use it for getting performance metrics. The files will, therefore, be only used to generate the final submission file for the Kaggle judge.

Another important observation we can make by opening some of the images in the test and train sets is that they seem to be different in size and aspect ratio. In order to confirm this, we’ll randomly plot and compare some of them.

Snippet 1: Randomly plot images from the train set

import os, glob
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))

plt.figure(figsize=(16, 8))
for index in range(6):
    plt.subplot(2, 3, index+1)
    img_index = int(np.random.uniform(0, 24999))
    plt.imshow(Image.open(images[img_index]))

Figure 1: Sample images from the training set

As we run the above script several times, we observe that our intuition was right: images differ from each other in size and aspect ratio. Normalization seems to be needed, but several questions immediately arise: what size should we use when resizing and normalizing the images so they can later be used to train our model? Does the new size need to work for both larger and smaller images? Finally, what proportion of the images are small, medium or large?

To address those questions, we prepare the following script to get the distribution over height and width (in 100-pixel ranges) for each image in the train set:

Snippet 2: Distribution over size in the training set


arr_h = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
arr_w = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
max_w = max_h = 0
min_w = min_h = 10000

for img_index in range(len(images)):
    img_w, img_h = Image.open(images[img_index]).size
    arr_w[int(img_w / 100) - 1] += 1
    arr_h[int(img_h / 100) - 1] += 1

    if img_w > max_w: max_w = img_w
    elif img_w < min_w: min_w = img_w
    if img_h > max_h: max_h = img_h
    elif img_h < min_h: min_h = img_h

print("Max Width: %i - Min Width: %i \nMax Height: %i - Min Height: %i" % (max_w, min_w, max_h, min_h))

If we plot the arr_w and arr_h vectors containing the number of images with widths and heights ranging from 0 to 1,000 pixels (in 100-pixel intervals), we observe that the majority of them are smaller than 400 × 400 pixels.
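The distribution plots can be produced with a few lines of matplotlib. Note that the counts below are illustrative placeholders; the real values are the arr_w and arr_h vectors computed by Snippet 2:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt

# illustrative placeholder counts -- replace with the arr_w and arr_h
# vectors computed by Snippet 2
arr_w = [120, 3900, 14200, 5100, 1200, 300, 100, 50, 20, 10]
arr_h = [100, 4200, 14800, 4300, 1100, 280, 120, 60, 25, 15]
labels = ["%d-%d" % (i * 100, (i + 1) * 100) for i in range(10)]

plt.figure(figsize=(14, 4))
for index, (title, data) in enumerate([("Width", arr_w), ("Height", arr_h)]):
    ax = plt.subplot(1, 2, index + 1)
    ax.bar(range(10), data)
    ax.set_xticks(range(10))
    ax.set_xticklabels(labels, rotation=45)
    ax.set_title("%s distribution (pixels)" % title)
plt.tight_layout()
```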

Figure 2: Height and width distributions

We can now come up with a strategy for resizing and padding our images as the only preprocessing task we’ll do before training our convolutional neural network. The resizeImg and padImg functions maintain each image’s original aspect ratio, padding when necessary:

Snippet 3: Resizing and padding functions


resize_default = 64

def resizeImg(image):
    img_w, img_h = image.size
    # scale the larger dimension down to resize_default, keeping the aspect ratio
    if img_w >= img_h:
        img = image.resize((resize_default, int(resize_default*img_h/img_w)), Image.ANTIALIAS)
    else:
        img = image.resize((int(resize_default*img_w/img_h), resize_default), Image.ANTIALIAS)
    return img

def padImg(image):
    img_w, img_h = image.size
    # pad the smaller dimension so the final image is a square
    if img_w > resize_default or img_h > resize_default:
        new_size = (img_w, img_w) if img_w >= img_h else (img_h, img_h)
    else:
        new_size = (resize_default, resize_default)
    img = Image.new("RGB", new_size)
    img.paste(image, (int((new_size[0]-img_w)/2), int((new_size[1]-img_h)/2)))
    return img

#testImage = Image.open(images[int(np.random.uniform(0, 24999))])
testImage = Image.open(images[468])
resized = resizeImg(testImage)
padded = padImg(resized)

plt.figure(figsize=(12, 8))
plt.subplot(1, 3, 1)
plt.imshow(testImage)
plt.subplot(1, 3, 2)
plt.imshow(resized)
plt.subplot(1, 3, 3)
plt.imshow(padded)

Calling both functions will have the following output:

Figure 3: Padding and resizing of images

All images will be resized to 64×64 pixels and padded vertically or horizontally if necessary. We can batch-process all the images as a preliminary step, or apply the functions right before feeding samples to the trainer when fitting the model.

[Step 2] Building the convolutional neural network:

Up to this point, we’re familiar with convolutions for image processing. We’ve also explored the data we have available and decided that padding and resizing are needed in order to give our model a normalized input pattern. A 64×64 pixel RGB image amounts to 12,288 input values (64 × 64 × 3) that we need to fit into a 2-class classifier. For every 64×64 pixel image we feed into the convolutional network, it will predict whether the input belongs to the class cat or dog.

In addition to the two functions we’ve already seen for resizing and padding, we’ll need some other ones before we train the network. The get_label and getXYBatch functions shown in Snippet 4 are explained below:

get_label: every input pattern (or image) gets a 2-element label vector, and there are only two possible values: [1, 0] and [0, 1]. The first counts as “cat” whereas the second counts as “dog” in terms of the result the network is predicting.

getXYBatch: since our computer doesn’t have unlimited memory, allocating all 25,000 images for training is simply not possible. We’ll resize and pad batches of 60 to 500 images and feed those to the trainer during the training steps.

Snippet 4: get_label and getXYBatch functions

# extract labels
# positives = [1, 0], negatives = [0, 1]
def get_label(path):
    if path.split('/')[-1:][0].startswith('cat'):
        return np.array([1, 0])
    return np.array([0, 1])

def getXYBatch(X_input, Y_input, batch_size):
    # seed the arrays with the first sample so we can vstack onto them;
    # the seed row is dropped at the end
    X_array = np.array(padImg(resizeImg(Image.open(X_input[0])))).reshape([-1]) / 255
    Y_array = Y_input[0]

    choice = np.random.choice(range(len(X_input)), batch_size, replace=False)
    for item in choice:
        tmpimg = np.array(padImg(resizeImg(Image.open(X_input[item])))).reshape([-1]) / 255
        X_array = np.vstack((X_array, tmpimg))
        Y_array = np.vstack((Y_array, Y_input[item]))

    X_array = X_array[1:]
    Y_array = Y_array[1:]
    X_array = X_array.reshape([-1, resize_default, resize_default, 3])
    return X_array, Y_array

Now we split the train set into two parts: one for actual training and one for validation. We’ll use 20% of the training images to measure how well the model is performing after, let’s say, 100 iterations. The following code does it for us:

Snippet 5: Splitting the training set

train_path = "data/train"
images = glob.glob(os.path.join(train_path, "*.jpg"))

# extract pixels
data_images = images
data_labels = np.array([get_label(p) for p in images])
data_labels_out = np.argmax(data_labels, 1)

print("Positive samples: %i\nNegative samples: %i \n" % (len(data_labels_out)-np.count_nonzero(data_labels_out)
                                                      , np.count_nonzero(data_labels_out)))
#Split Data Sets
X_train, X_test, y_train, y_test = train_test_split(data_images, data_labels, test_size=0.2)
y_train_out = np.argmax(y_train, 1)
y_test_out = np.argmax(y_test, 1)

Finally, before jumping into the model’s code itself (assuming we’re excited about it), we’ll define some convenience functions to simplify the construction of the layers:

dropout: turns off hidden neurons given a probability (only in the training phase).
weight_variable: variables for the neurons’ weights.
bias_variable: variables for the neurons’ biases.
conv2d: convolution between the input and weights, with strides of 1 and ‘SAME’ padding.
max_pool_2x2: max pooling operation, keeps only the maximum elements after each convolutional layer.

Snippet 6: Common tensorflow methods

def dropout(x, prob, train_phase):
    return tf.cond(train_phase,
                   lambda: tf.nn.dropout(x, prob),
                   lambda: x)

def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

Now let’s build the layers of the network. Our model will have an input layer followed by convolution and max-pooling layers. In the last part of the network architecture, we will flatten the feature maps and have a fully connected layer. A representation of the model is shown in Figure 4.

Figure 4: Neural Network Architecture

We define two placeholders, x and y, for the 64×64 pixel images and their labels. As the images use the RGB scheme (3 channels), the final shape of the input layer will be 64x64x3.

Snippet 7: Network implementation

sess = tf.InteractiveSession()

# tf Graph Input
x = tf.placeholder(tf.float32, [None,64,64,3]) 
y = tf.placeholder(tf.float32, [None, 2])

# dropout placeholder
keep_prob = tf.placeholder(tf.float32)

# train flag placeholder
train_phase = tf.placeholder(tf.bool) # For Batch Normalization

# Set model weights
W1 = weight_variable([3, 3, 3, 32])
b1 = bias_variable([32])

W2 = weight_variable([3, 3, 32, 64])
b2 = bias_variable([64])

W3 = weight_variable([3, 3, 64, 64])
b3 = bias_variable([64])

W4 = weight_variable([16 * 16 * 64, 512])
b4 = bias_variable([512])

W5 = weight_variable([512, 2])
b5 = bias_variable([2])

# hidden layers
conv1 = tf.nn.relu(conv2d(x, W1) + b1)
maxp1 = max_pool_2x2(conv1)

conv2 = tf.nn.relu(conv2d(maxp1, W2) + b2)
#maxp2 = max_pool_2x2(conv2)

conv3 = tf.nn.relu(conv2d(conv2, W3) + b3)
maxp3 = max_pool_2x2(conv3)

# fully connected
maxp3_flat = tf.reshape(maxp3, [-1, 16 * 16 * 64])

full1 = tf.nn.relu(tf.matmul(maxp3_flat, W4) + b4)
drop1 = tf.nn.dropout(full1, keep_prob)

output = tf.matmul(drop1, W5) + b5

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y))

all_variables = tf.trainable_variables()

# optimizer, accuracy metric and saver referenced later by the training loop
# (a decaying learning rate with Adam is one reasonable choice)
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(1e-4, global_step, 1000, 0.96, staircase=True)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

As describing each function and method used would be tedious and make this post super long, feel free to browse the official TensorFlow documentation for those you are interested in: https://www.tensorflow.org/api_docs/

You may also want to revisit some concepts related to learning and optimization, such as loss functions, stochastic gradient descent and cross entropy.
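As a quick refresher on the last of those, the loss above can be reproduced with plain NumPy. This is a sketch of the same computation tf.nn.softmax_cross_entropy_with_logits performs for a single sample; the logit and label values are made up for illustration:

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_label):
    # softmax turns raw logits into probabilities...
    exps = np.exp(logits - np.max(logits))  # shift for numerical stability
    probs = exps / np.sum(exps)
    # ...and cross entropy penalizes low probability on the true class
    return -np.sum(one_hot_label * np.log(probs))

# logits leaning towards "cat" ([1, 0]) with the correct label "cat"
loss = softmax_cross_entropy(np.array([2.0, 1.0]), np.array([1.0, 0.0]))
print(round(loss, 4))  # 0.3133
```

The closer the predicted probability of the true class gets to 1, the closer this loss gets to 0, which is exactly what the trainer minimizes.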

[Step 3] Training time:

Now we just need to define some hyperparameters and let the trainer fit the model to our training data. We’ll display the model accuracy after every 100 steps. Running the snippet below will show the training progress as shown in Figure 5.

Snippet 8: Trainer

# Hyper-parameters
training_steps = 2000
batch_size = 500
display_step = 100

# Mini-batch Gradient Descent
training_accuracy = []
training_loss     = []

for i in range(training_steps):
    X, Y = getXYBatch(X_train, y_train, batch_size)
    batch_accuracy, batch_loss, _ = sess.run([accuracy, loss, train_step],
                                             feed_dict={x: X, y: Y, train_phase: True, keep_prob: 1.0})
    training_accuracy.append(batch_accuracy)
    training_loss.append(batch_loss)

    # Displaying info
    if (i+1) % display_step == 0 or i == 0:
        print("Step %05d: accuracy=%.4f\tloss=%.6f\tlearning rate=%.6f" %
              (i+1, batch_accuracy, batch_loss, learning_rate.eval()))

save_path = saver.save(sess, "./saved/model2K.ckpt")
print("Model saved in file: %s" % save_path)

plot_titles = ["Training accuracy", "Training Loss"]
plt.figure(figsize=(12, 4))
for i, plot_data in enumerate([training_accuracy, training_loss]):
    plt.subplot(1, 2, i+1)
    plt.plot(plot_data)
    plt.title(plot_titles[i])

Figure 5: Progress while training

We can also plot the accuracy and loss at each training step. In an ideal scenario, the accuracy will increase over time whereas the loss will decrease.

Figure 6: Training accuracy and Loss

[Step 4] Using the model with different images:

Our final test consists of using the model with completely new images it hasn’t seen before. We can browse for cats or dogs on the internet and pass the images to the classifier using the following code:

Snippet 9: Using the model

test_img = Image.open(X_test[int(np.random.uniform(0, len(X_test)))])

input_array = np.array(padImg(resizeImg(test_img))).reshape([-1]) / 255
input_array = input_array.reshape([-1,64,64,3])

prediction = sess.run(tf.argmax(output, 1), feed_dict={x: input_array, train_phase: False, keep_prob: 1.0})
print("Predicted: " + ("Cat" if prediction[0] == 0 else "Dog"))

Figure 7: Model output with an unseen image:

Hopefully, the model will predict accurately the class (cat or dog) for each image we input. However, there are several other techniques we can use from this point in order to make a more precise model.
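Recall from the data-exploration step that the unlabeled test images are only useful for producing the submission file for the Kaggle judge. A minimal sketch of such a writer is shown below; it assumes the judge expects an “id,label” CSV with one row per test image, so check the competition’s sample submission for the exact format before using it:

```python
import csv
import os
import tempfile

def write_submission(predictions, path):
    """predictions: list of (image_id, label) pairs, one per test image.

    Assumes an "id,label" CSV header -- verify against the competition's
    sample submission file.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for image_id, label in predictions:
            writer.writerow([image_id, label])

# hypothetical usage with two predicted labels
path = os.path.join(tempfile.gettempdir(), "submission.csv")
write_submission([(1, 1), (2, 0)], path)
```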


In this post, we’ve built a convolutional neural network model capable of classifying images based on whether they contain a cat or a dog. While we didn’t revisit all the terms and concepts required to fully understand what we coded and why, it’s a good starting point to see how these techniques can be used in real-life scenarios. Have you ever seen a captcha asking you to click on images containing, let’s say, cars in order to verify you are not a bot? Can you think of other possible use cases for this type of binary classification?

Find the full code of this post at: https://github.com/gariem/samples/tree/master/meetup/santex-machinelearning/cats-dogs

Deep learning and vision: From simple manipulation to image classification – Part 1


When the MANIAC I computer defeated a human in a chess-like game for the first time in 1956, it created a fork in history, and the idea of machines completing tasks by replicating how the human brain works started to gain traction. Those, however, were tough times for achieving even modest performance in other tasks due to the lack of data and computing power available.

Since then, a series of so-called “AI winters” took place one right after another, and the dream of computers performing at levels similar to humans had almost vanished. It was not until around 2005 that AI started to regain attention, with deep learning being a singular force propelling its growth.

Today, companies are pouring billions of dollars into AI development, and intelligent machines take on more real-world activities every day.

In this post series, we will review basic concepts about image manipulation, convolutional neural networks, and deep learning. We will then dive deeper into the computer vision field and train a convolutional neural network to recognize cats and dogs in arbitrary images, all of this using the Python programming language, TensorFlow, and several other convenience packages. If you are new to TensorFlow, you can browse its site and examples at https://www.tensorflow.org

[Getting ready] Setting up our environment

Throughout this tutorial we will make use of the Python programming language as well as several other packages and tools, including Anaconda (conda) as our environment manager. Follow these instructions to get your environment ready:

  1. Download and install Anaconda. It is available for free at https://www.anaconda.com/download/. Make sure it is registered in your PATH environment variable.
  2. Once you have Anaconda installed, you will need to create a conda environment and add some packages to it. Let’s call it “deeplearning”. The following commands will complete the task:
  • Update Anaconda packages: conda update --all
  • Create an Anaconda environment: conda create -n deeplearning python=3.6 jupyter scikit-learn scikit-image
  • Activate the environment: source activate deeplearning
  • Update setuptools: pip install --upgrade -I setuptools
  • Find the right tfBinaryURL for you in this url and run: pip install --upgrade tfBinaryURL

Now your environment should be ready. Let’s test it by running the Jupyter Notebook and executing a simple hello world command:

  • Open a terminal and run: jupyter notebook
  • In the browser window opened by the Jupyter server, create a new python 3 notebook using the buttons at the top right section of the page.
  • In the new notebook input box, enter print("hello world") and then press the shift and enter keys to execute the command. You should see the “hello world” message printed right after the command input box.
  • You can also give a name to your notebook by editing the default “untitled” name in the top left section.

Figure 1: Notebook with “hello world” message

[Warm up] Image processing: Filters and Convolutions

As we will be training and using a “convolutional” neural network, it is a good idea to understand why these networks are called that way. So before we build our CNN model, we will recap some concepts from image processing.

A convolution is basically a mathematical operation on two functions that produces a third function as its result. Convolutions are applied in several fields, including image processing and computer vision.

In the field of image processing, a convolution matrix is used for image manipulation such as blurring, sharpening or edge detection. The original image is treated as a matrix with values from 0 to 255 according to the color intensity of each pixel. For grayscale images, this matrix has only two dimensions, W×H (Width × Height). For color images using the RGB scheme, however, there is a third dimension, and the matrix becomes a structure with shape W×H×3 (Width × Height × 3 RGB channels).
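We can verify this structure with a quick check. Here a small in-memory image stands in for a real photo; note that NumPy reports the shape with the height (rows) first:

```python
import numpy as np
from PIL import Image

# a small in-memory RGB image standing in for a real photo
img = Image.new("RGB", (4, 3), (255, 0, 0))  # width=4, height=3
matrix = np.array(img)
print(matrix.shape)  # (3, 4, 3): rows (height) x columns (width) x 3 channels

gray = np.array(img.convert("L"))
print(gray.shape)  # (3, 4): a single intensity value per pixel
```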

Image manipulation and convolutions in practice

Despite the formal definition and all the math behind convolutions for image processing, we can understand them as a relatively simple operation, similar but not identical to matrix multiplication. Let’s see two classic examples with a grayscale image:

Image Equalization:
Before we start with convolutions, let’s warm up by doing some basic image manipulation: equalizing a grayscale image.

The image on the left has been acquired with a sensor (a camera or telescope) and suffers from overexposure: it looks like there is too much lightness in the whole image. We need to enhance the image up to the point where it looks like the one on the right side below:

Figure 2: Galaxy image before and after equalization

Exploring the data we have is a recommended practice in almost any discipline. As we are given an image, calculating and visualizing its histogram is a natural task to start with. To obtain our grayscale image’s histogram, we just need to count how many pixels have an intensity of 0, 1, 2 and so on, up to 255, where 0 is a totally black pixel and 255 a completely white one. Let’s code it:

import imageio
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# read the image in grayscale mode “L”
img_matrix = imageio.imread('images/galaxia.jpg')
img = Image.fromarray(img_matrix, 'L')

# count pixels at each value in the 0-255 range
rows, cols = img_matrix.shape
histogram = np.zeros(256)
for x in range(0, cols):
    for y in range(0, rows):
        histogram[img_matrix[y, x]] += 1

# plot the histogram using pyplot
plt.figure(figsize=(14, 4))
ax = plt.subplot(1, 2, 1)
ax.bar(range(256), histogram)
plt.title("Image histogram")

After running the script above, we can see the histogram with a notable skew to the right: almost all of the pixels have values higher than 100. Our goal is to make the histogram look more evenly distributed or, more properly, to equalize it.

Figure 3: Image histograms before and after equalization

The snippet below shows a simple algorithm to achieve the histogram equalization:

histogram_eq = histogram / (rows * cols)
accum = np.zeros(256)
accum[0] = histogram_eq[0]
for i in range(1, 256):
    accum[i] = accum[i - 1] + histogram_eq[i]

image_new = np.zeros((rows, cols), dtype=np.uint8)
for x in range(rows):
    for y in range(cols):
        image_new[x, y] = np.abs(np.floor(255 * accum[img_matrix[x, y]]))

Now, to visualize the new equalized image, we just need to convert the image_new array back to a grayscale image. Try it yourself!
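If you want a hint, converting the array back is a one-liner with Pillow. Here a small synthetic gradient stands in for the image_new matrix computed above:

```python
import numpy as np
from PIL import Image

# a 16x16 synthetic gradient standing in for the equalized image_new matrix
image_new = np.tile(np.arange(0, 256, 16, dtype=np.uint8), (16, 1))

# "L" tells Pillow the array holds single-channel 8-bit grayscale values
equalized_img = Image.fromarray(image_new, "L")
print(equalized_img.size)  # (16, 16)
```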

Want a challenge? You can try to apply a similar algorithm to equalize the following colorful image. While the principle is similar, you cannot just compute and equalize the three RGB channels independently.

Figure 4: Overexposed colorful image

Test yourself and show us what approach you used to address this challenge. We’d love to see your code and discuss your solution!

Image convolutions:

What we did in our previous exercise was a linear transformation on an image, simply modifying how the gray tones were distributed. That’s not all we can do with an image, however. Another modification is to replace each pixel with the median value of its neighbors, which is what we call a median filter. A median filter is a non-linear filter often used to reduce noise in images. You can read more about the median filter in this Wikipedia article.
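As a sketch of the idea, here is a naive 3×3 median filter in plain NumPy (edge pixels are handled by replicating the border). Note how it removes an isolated noise spike entirely:

```python
import numpy as np

def median_filter_3x3(img):
    """Naive 3x3 median filter: each pixel becomes the median of its
    3x3 neighborhood (edges handled by replicating border pixels)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.median(padded[i:i + 3, j:j + 3])
    return out

# a flat image with one bright "salt" noise pixel: the filter removes the spike
noisy = np.full((5, 5), 10, dtype=np.uint8)
noisy[2, 2] = 255
print(median_filter_3x3(noisy)[2, 2])  # 10
```

A mean (averaging) filter would instead smear the spike into its neighbors, which is why the median is preferred for salt-and-pepper noise.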

As we previously stated, a convolution is an operation between two functions to obtain a third one. In the image processing domain, the first function is our original image and the second one a convolution matrix (also called a kernel or filter matrix) with shape N×N, where N is an odd number, frequently 3, 5 or 7.

The animation below shows how we compute an output matrix as a result of performing a convolution with an input image matrix and a 3×3 kernel:

Animation 1: Convolution with a 3×3 Kernel

To obtain the modified image (in the shape of a numerical matrix containing grayscale values), we start by taking a subsection of the input image with the same shape as our kernel; then we perform element-wise multiplications between the input sample and the kernel. Finally, we add the 9 products and divide the result by the sum of all the values in the kernel. The initial 320 value in our output matrix is the result of exactly this multiply-and-accumulate operation.
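The step can be reproduced in a few lines of NumPy; the 3×3 patch values below are made up for illustration, with the all-ones averaging kernel from the animation:

```python
import numpy as np

# a hypothetical 3x3 patch of grayscale values and a 3x3 averaging kernel
patch = np.array([[40, 42, 46],
                  [46, 50, 55],
                  [52, 56, 58]])
kernel = np.ones((3, 3), dtype=int)

# element-wise multiply, sum the 9 products, divide by the kernel's sum
value = np.sum(patch * kernel) / np.sum(kernel)
print(value)  # 49.444..., the mean of the patch for this kernel
```

Sliding the patch window over the whole input and repeating this computation fills the output matrix.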

But why do we have a top row and first column with zero values? The answer is that, in order to perform the element-wise operation described before for the input elements at the border of our input matrix, we need to pad the original image with as many rows and columns as the size of our kernel matrix minus one, divided by two. So in our example, the image is padded with 1 row and 1 column because our kernel size is 3: (3-1)/2 = 1.

An important element is also introduced here and must be remembered for later: the strides. As we perform the operation on each pixel, we need some approach to visit all the pixels in the original image, such as a sliding window. The stride determines how many pixels we move right and down at each step. Most of the time the strides are 1 for both horizontal and vertical displacements.
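Kernel size, padding and stride together determine the output's spatial size via the well-known formula out = floor((W − K + 2P) / S) + 1. This small helper (a convenience for checking shapes, not part of the post's code) computes it for one dimension:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension:
    floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 64-pixel input, 3x3 kernel, stride 1, 1-pixel padding: the size is preserved
print(conv_output_size(64, 3, stride=1, padding=1))  # 64

# without padding the output shrinks
print(conv_output_size(5, 3))  # 3
```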

You may also be wondering how one determines the values of the convolution kernel. There are several well-known kernels you can use for different purposes. Below you can see some of them and the results they produce when applied:

Now let’s code it. Try yourself this simple function for padding an image and performing a convolution with any kernel matrix:

def filter_simple(source, kernel, mask_rows, mask_cols):
    padding_rows = int((mask_rows - 1) / 2)
    padding_cols = int((mask_cols - 1) / 2)
    rows, cols = source.shape
    padded = np.zeros((rows + 2 * padding_rows, cols + 2 * padding_cols), dtype=np.uint8)
    padded[padding_rows:padding_rows + rows, padding_cols:padding_cols + cols] = source
    result = np.zeros((rows + 2 * padding_rows, cols + 2 * padding_cols), dtype=np.uint8)
    for i in range(padding_rows, padding_rows + rows):
        for j in range(padding_cols, padding_cols + cols):
            aux = padded[i - padding_rows:i + padding_rows + 1, j - padding_cols:j + padding_cols + 1]
            out_value = 0
            for x in range(mask_rows):
                for y in range(mask_cols):
                    out_value = out_value + (aux[x, y] * kernel[x, y]) / np.sum(kernel)
            result[i, j] = out_value
    result = result[padding_rows:padding_rows + rows, padding_cols:padding_cols + cols]
    return result

Now you can call the function as simply as:

# define our kernel
kernel_blur = np.matrix([[1,1,1],[1,1,1],[1,1,1]])
# kernel_edge = np.matrix([[-1,-1,-1],[-1,9,-1],[-1,-1,-1]])
image_new = filter_simple(img_matrix,kernel_blur, 3, 3)

Figure 5: Simple edge detection with convolution filters

There are a couple of places where the code can still be improved. Have you noticed that the execution is not so fast? Do you think you can improve it? Motivate yourself and read about optimizations for convolution operations. Again, we’d love to see your code and discuss any questions you may have.

Most image processing and numeric libraries in languages such as Python also offer ready-to-use, optimized functions for 2D and 3D convolutions. Check out this example in the SciPy documentation!
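For instance, SciPy's convolve2d applies the same averaging kernel we built by hand, orders of magnitude faster than our nested loops (the small input array here is just a placeholder):

```python
import numpy as np
from scipy.signal import convolve2d

# a placeholder 5x5 "image" and the 3x3 averaging (blur) kernel, pre-normalized
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0

# mode='same' keeps the output the same shape as the input;
# boundary='symm' mirrors the border pixels instead of zero-padding
blurred = convolve2d(image, kernel, mode='same', boundary='symm')
print(blurred.shape)  # (5, 5)
```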

[Finally] Discussion

Convolutions are a key concept to understand before we move on to convolutional neural networks, as the kernel, strides and other parameters matter a great deal when dealing with them. In the next post, we will see how a neural network can learn its own kernels instead of using predefined ones. The ability to perform transformations such as noise reduction and edge detection that filters provide is probably one of the most important reasons CNNs have become so popular and accurate.


  1. Bengio, Y. (2016). Machines Who Learn. Scientific American, 314(6), 46-51. doi:10.1038/scientificamerican0616-46
  2. https://www.tensorflow.org/
  3. https://en.wikipedia.org/wiki/Kernel_(image_processing)