U-NET Implementation from Scratch using TensorFlow

Underlying concepts and step by step Python code explanation

By Vidushi Bhatia, published in Geek Culture on Jul 7, 2021

Larry Roberts, in his Ph.D. thesis at MIT (circa 1960), discussed the possibility of extracting 3D geometrical information from 2D images and is widely considered to have laid the foundation for Computer Vision research. Since then, researchers have made tremendous progress, especially within the last decade, putting Computer Vision at the forefront of real-world AI applications such as facial recognition, medical imaging, self-driving cars, and many more.

In this blog, I take a deep dive into one such computer vision model: the U-Net. The blog explains the operations used in the U-Net architecture (Convolution, Max Pooling, Transposed Convolution and Skip Connections) and shows how to implement them from scratch using TensorFlow.

By the end of this blog, you will have built the following architecture (fig-2) to classify image pixels into segments (like fig-1).

Overview of U-Net
Understanding the Key Operations used in U-Net
Processing the Data
Defining the U-Net Architecture
Training the Model
Evaluating the Model
Prediction!

The U-Net architecture was introduced by Olaf Ronneberger, Philipp Fischer and Thomas Brox in 2015 for biomedical image segmentation, and it has since proved useful across many industries. As an image segmentation tool, the model classifies each pixel as one of the output classes, producing an output resembling fig-1.

Many neural nets attempted ‘image segmentation’ before, but U-Net improves on its predecessors by being less computationally expensive and by minimizing information loss. Let’s dive deeper to see how U-Net does this.

Before we create a U-Net, let’s understand the key operations used in the architecture (bottom-right corner of fig-3).

If we used only fully connected layers to build networks for high-resolution images, the models would become extremely computationally expensive. Hence, the mathematical operation called ‘convolution’ is a white knight in the Computer Vision story. In a convolution, each output value depends only on a small neighborhood of input pixels, and the same filter weights are reused at every position, which keeps the number of parameters and the computation cost low.
To perform a convolution operation, repeat the following steps until the filter has covered the entire input image matrix:

Step 1: Take a filter matrix K that is smaller than the input image matrix I and overlay it on the top-left corner of I. Multiply the overlaid elements element-wise and add the products to create a single value in the output matrix.
Step 2: Move the filter to the right by the defined stride and repeat Step 1. At the end of a row, return to the first column, move down by the stride and continue.
Example: If we started the operation at column 1 and the stride is 3, we move to column 4 and repeat Step 1.
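
The two steps can be sketched in plain NumPy. This is a minimal illustration for a single-channel image without padding, not the optimized routine TensorFlow uses; the function name conv2d_single_channel and the toy 4x4 input are assumptions made for the example.

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the output matrix."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Step 2: the filter position advances by `stride` each time
            top, left = row * stride, col * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Step 1: element-wise multiply, then add up to a single value
            output[row, col] = np.sum(patch * kernel)
    return output

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
kernel = np.ones((2, 2))                          # toy 2x2 filter
result = conv2d_single_channel(image, kernel, stride=2)
print(result)
```

With a 2x2 filter and a stride of 2, the 4x4 input shrinks to a 2x2 output, one value per filter position.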

Change in dimensions for the Convolution Operation:
Input Matrix: A x B x C, where A is the height, B is the width and C is the number of channels/depth (e.g. RGB images have 3 channels)
Filter Matrix: D x E x C x G, where D is the filter height, E is the filter width, C is the number of channels/depth (same as the input image) and G is the number of filters applied
Output Matrix: H x W x G, where the height H and width W can be computed using the formula shown below and G is the number of filters applied to the input

H = floor((A + 2P - D) / S) + 1 and W = floor((B + 2P - E) / S) + 1, where P is the padding added on each side and S is the stride.
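
As a quick check of the output size, the standard convolution arithmetic can be written as a small helper. The function name conv_output_size and the 128-pixel input are assumptions chosen for illustration.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output height (or width) of a convolution along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# 128-pixel side, 3x3 filter, stride 1, no padding: the side shrinks by 2
size_valid = conv_output_size(128, 3)
# Same input with one pixel of zero padding per side: the size is preserved
size_same = conv_output_size(128, 3, padding=1)
# 2x2 window with stride 2: the side is halved
size_halved = conv_output_size(128, 2, stride=2)
print(size_valid, size_same, size_halved)
```

The same arithmetic applies independently to the height and the width.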

The elements of the filter matrix act as the ‘weight’ parameters and are optimized while training the model. Please refer to this article for more information on the Conv operation and ConvNets.

To assign a class to each pixel in an image, Image Segmentation requires the downscaled image (shrunk by the convolution and pooling steps) to be upscaled to a size close to that of the original image. This can be done using fully connected layers, but that becomes very computationally expensive. Instead, U-Net uses the transposed convolution operation: every input value is multiplied by the whole filter, and the results are written into the output at positions spaced by the stride, with overlapping values summed. With a stride greater than 1, the output is larger than the input.
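
To make the upscaling concrete, here is a minimal NumPy sketch of a single-channel transposed convolution without padding. The function name conv2d_transpose_single_channel and the toy values are assumptions for illustration; Keras' Conv2DTranspose additionally handles channels, padding and learned weights.

```python
import numpy as np

def conv2d_transpose_single_channel(x, kernel, stride=2):
    """Transposed convolution (no padding) for a single channel."""
    k_h, k_w = kernel.shape
    out_h = (x.shape[0] - 1) * stride + k_h
    out_w = (x.shape[1] - 1) * stride + k_w
    output = np.zeros((out_h, out_w))
    for row in range(x.shape[0]):
        for col in range(x.shape[1]):
            top, left = row * stride, col * stride
            # Each input value scales the whole filter; overlaps are summed
            output[top:top + k_h, left:left + k_w] += x[row, col] * kernel
    return output

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((2, 2))
upsampled = conv2d_transpose_single_channel(x, kernel, stride=2)
print(upsampled)
```

Here a 2x2 input becomes a 4x4 output: each input value is spread over its own 2x2 region because the stride equals the filter size.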

Please refer to this article to find out more about Transposed Convolutions

Pooling serves a purpose similar to strided convolution: it shrinks the height and width of the feature maps, which reduces the computation (and the number of parameters needed) in later layers. It also provides a small amount of regularization as a side effect. Two pooling operations are typical: average and max. In both, we slide a window of size ‘f’ with stride ‘s’ over the input matrix and apply the function (max or average) to each window.

Unlike convolutions, pooling operations have no weight parameters to learn.
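
A minimal NumPy sketch of both pooling variants, using the window size f and stride s from the description above. The function name pool2d and the toy 4x4 matrix are assumptions for illustration.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Apply max or average pooling with window size f and stride s."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    reduce_fn = np.max if mode == "max" else np.mean
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            window = x[row * s:row * s + f, col * s:col * s + f]
            # No weights involved: the window is reduced by a fixed function
            output[row, col] = reduce_fn(window)
    return output

x = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [3., 4., 5., 6.]])
max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")
print(max_pooled)
print(avg_pooled)
```

With f = 2 and s = 2, the windows do not overlap and the height and width are halved.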

Skip Connections in U-Net copy the feature maps from the earlier layers (LHS layers of fig-3) and use them as part of the input to the later layers (RHS layers). This lets the model preserve the finer detail of the higher-resolution maps and prevents information loss. Many popular Computer Vision architectures use skip connections to make the output richer.
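
In terms of array shapes, a skip connection of this kind is a concatenation along the channel axis. The sketch below uses random NumPy arrays as stand-ins for feature maps; the 64x64x32 shapes are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical feature maps in (height, width, channels) layout
encoder_features = np.random.rand(64, 64, 32)  # saved on the encoder side
decoder_features = np.random.rand(64, 64, 32)  # produced by upsampling

# Height and width must match; the channels are stacked
merged = np.concatenate([decoder_features, encoder_features], axis=-1)
print(merged.shape)
```

The merged array keeps the spatial size and has as many channels as both inputs combined, so the following convolutions see the upsampled and the earlier, higher-resolution information side by side.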

Now that we have brushed up on the underlying concepts, let’s implement the model and get some hands-on practice using The Oxford-IIIT Pet Dataset. The files in this dataset vary in size, so we’ll use resize and reshape to bring them all to one consistent size. We will also normalize the image matrix by dividing the pixel values by 256. Please note that the values in the ‘mask’ matrix represent the classes, so we won’t normalize them.

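import os
import numpy as np
from PIL import Image

# Assumed to be defined earlier: img and mask (sorted lists of file names),
# path1 and path2 (image and mask folders), and the target sizes
# i_h, i_w, i_c (image) and m_h, m_w, m_c (mask).
X = np.zeros((len(img), i_h, i_w, i_c), dtype=np.float32)
y = np.zeros((len(img), m_h, m_w, m_c), dtype=np.int32)

for index, file in enumerate(img):
    # Load the image, force 3 channels, resize and scale the pixel values
    path = os.path.join(path1, file)
    single_img = Image.open(path).convert('RGB')
    single_img = single_img.resize((i_w, i_h))  # PIL expects (width, height)
    single_img = np.reshape(single_img, (i_h, i_w, i_c))
    single_img = single_img / 256.
    X[index] = single_img

    # Load the matching mask; class ids are kept as-is (no normalization)
    single_mask_ind = mask[index]
    path = os.path.join(path2, single_mask_ind)
    single_mask = Image.open(path)
    # Nearest-neighbour resampling avoids blending neighbouring class ids
    single_mask = single_mask.resize((m_w, m_h), Image.NEAREST)
    single_mask = np.reshape(single_mask, (m_h, m_w, m_c))
    single_mask = single_mask - 1  # shift the labels so classes start at 0
    y[index] = single_mask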

Congratulations! Our folder of images has been converted to X (dims: # images, img height, img width, img channels) and y (dims: # masks, mask height, mask width, mask channels). We can now proceed with designing the architecture of U-Net!

The number of images in X should be equal to the number of masks in y; the other dimensions of the two datasets can differ.
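
The training step later in this blog uses X_train, y_train, X_valid and y_valid. One possible way to produce them is a shuffled split in NumPy; the helper name split_train_valid, the 80/20 ratio and the seed are assumptions, and the small random arrays only stand in for the real X and y.

```python
import numpy as np

def split_train_valid(X, y, valid_fraction=0.2, seed=123):
    """Shuffle the sample indices once and split X and y consistently."""
    assert len(X) == len(y), "X and y must contain the same number of samples"
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_valid = int(len(X) * valid_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

# Stand-in arrays: 10 images of 8x8 pixels with 3 channels, 10 masks
X_demo = np.random.rand(10, 8, 8, 3).astype(np.float32)
y_demo = np.random.randint(0, 3, size=(10, 8, 8, 1))
X_train, X_valid, y_train, y_valid = split_train_valid(X_demo, y_demo)
print(X_train.shape, X_valid.shape)
```

Using the same shuffled indices for X and y keeps every image paired with its own mask.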

While coding the U-Net architecture, I divided it into 2 parts: encoder and decoder. Each can be further divided into a sequence of repeated mini-blocks (encoder mini-blocks and decoder mini-blocks).

To design a U-Net, we will have to design reusable mini-blocks and simply string them together.

We will develop a function for the encoder mini-block, which lets us create all encoder layers dynamically. In the above diagram, each mini-block has two conv 3x3 operations followed by a max pool operation (the latter is not present in the ‘bottleneck’ block).

The function below implements this, with Batch Normalization and optional dropout to make the model more robust. We use ‘He initialization’ together with ReLU, a pairing that generally trains well. Before we apply max pool, we save the output for the skip connection that we’ll use later in the decoder.

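import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization

def EncoderMiniBlock(inputs, n_filters=32, dropout_prob=0.3, max_pooling=True):
    # Two 3x3 convolutions with ReLU; 'same' padding keeps height and width
    conv = Conv2D(n_filters,
                  3,  # filter size
                  activation='relu',
                  padding='same',
                  kernel_initializer='HeNormal')(inputs)
    conv = Conv2D(n_filters,
                  3,  # filter size
                  activation='relu',
                  padding='same',
                  kernel_initializer='HeNormal')(conv)

    # Let Keras switch between batch statistics (training) and
    # moving averages (inference) instead of forcing training=False
    conv = BatchNormalization()(conv)

    if dropout_prob > 0:
        conv = tf.keras.layers.Dropout(dropout_prob)(conv)

    # Max pooling halves height and width; the bottleneck block skips it
    if max_pooling:
        next_layer = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv)
    else:
        next_layer = conv

    # Saved before pooling and reused by the matching decoder block
    skip_connection = conv
    return next_layer, skip_connection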

To complete the encoder, we’ll stack these mini-blocks, doubling the number of filters in each subsequent block (as shown in fig-10).

The decoder first increases image dimensions using transposed convolutions and then merges the results with the information from skip connection (stored in the encoder code block). With 2 more convolution operations, our mini-block would be ready. Note that we are using ‘same’ padding in convolutions to ensure our image size doesn’t decrease.

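from tensorflow.keras.layers import Conv2D, Conv2DTranspose, concatenate

def DecoderMiniBlock(prev_layer_input, skip_layer_input, n_filters=32):
    # Transposed convolution with stride 2 doubles height and width
    up = Conv2DTranspose(
        n_filters,
        (3, 3),
        strides=(2, 2),
        padding='same')(prev_layer_input)

    # Merge with the saved encoder output along the channel axis
    merge = concatenate([up, skip_layer_input], axis=3)

    conv = Conv2D(n_filters,
                  3,  # filter size
                  activation='relu',
                  padding='same',
                  kernel_initializer='HeNormal')(merge)
    conv = Conv2D(n_filters,
                  3,  # filter size
                  activation='relu',
                  padding='same',
                  kernel_initializer='HeNormal')(conv)
    return conv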

After stacking 4 decoder mini-blocks, we top off the decoder with a conv 1x1 operation, which converts the mini-block output to the desired dimensions.

The number of filters used in the output layer equals the number of output classes. Hence, our output will have the dimensions: H * W * # classes
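
Putting the pieces together, the stacking described above could look roughly like the sketch below. It is a sketch under stated assumptions rather than the exact model from this blog: encoder_block and decoder_block are compact stand-ins for the two mini-blocks so that the snippet runs on its own, and the function name build_unet, the 128x128x3 input, the 32 base filters, the 3 classes and the dropout placement are assumptions. The output layer returns raw scores (no softmax), which pairs with a loss created with from_logits=True.

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     BatchNormalization, Conv2DTranspose,
                                     concatenate)

def encoder_block(inputs, n_filters, dropout_prob=0.0, max_pooling=True):
    """Compact encoder mini-block: 2 x conv 3x3, batch norm, optional pool."""
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(inputs)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(conv)
    conv = BatchNormalization()(conv)
    if dropout_prob > 0:
        conv = Dropout(dropout_prob)(conv)
    next_layer = MaxPooling2D(pool_size=(2, 2))(conv) if max_pooling else conv
    return next_layer, conv  # conv doubles as the skip connection

def decoder_block(prev_layer_input, skip_layer_input, n_filters):
    """Compact decoder mini-block: upsample, merge skip, 2 x conv 3x3."""
    up = Conv2DTranspose(n_filters, (3, 3), strides=(2, 2),
                         padding='same')(prev_layer_input)
    merge = concatenate([up, skip_layer_input], axis=3)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(merge)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(conv)
    return conv

def build_unet(input_size=(128, 128, 3), n_filters=32, n_classes=3):
    inputs = Input(shape=input_size)

    # Encoder: the number of filters doubles in each subsequent block
    e1, skip1 = encoder_block(inputs, n_filters)
    e2, skip2 = encoder_block(e1, n_filters * 2)
    e3, skip3 = encoder_block(e2, n_filters * 4)
    e4, skip4 = encoder_block(e3, n_filters * 8, dropout_prob=0.3)
    # Bottleneck: same block without max pooling
    bottleneck, _ = encoder_block(e4, n_filters * 16, dropout_prob=0.3,
                                  max_pooling=False)

    # Decoder: 4 mini-blocks, each merging the matching skip connection
    d1 = decoder_block(bottleneck, skip4, n_filters * 8)
    d2 = decoder_block(d1, skip3, n_filters * 4)
    d3 = decoder_block(d2, skip2, n_filters * 2)
    d4 = decoder_block(d3, skip1, n_filters)

    # conv 1x1 with one filter per class, no activation (raw scores)
    outputs = Conv2D(n_classes, 1, padding='same')(d4)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

unet = build_unet()
print(unet.output_shape)
```

Because every pooling step halves the height and width and every transposed convolution doubles them, the input height and width should be divisible by 16 for the four skip connections to line up.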

After assembling all the mini-blocks shown in the previous section, we now need to choose an optimizer, a loss function and an accuracy metric for the model. We can then use model.fit() for training. Below, I have used the Adam optimizer along with Sparse Categorical Cross Entropy. Because the loss is created with from_logits=True, the output layer must return raw scores without a softmax activation.

If your output labels are one-hot encoded, use Categorical Cross Entropy instead of Sparse Categorical Cross Entropy.
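
The two losses compute the same quantity and differ only in the label format they expect. The NumPy sketch below illustrates this on made-up scores for 4 pixels and 3 classes; it mirrors the idea behind the Keras losses rather than their implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up raw scores (logits) for 4 pixels and 3 classes
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.0, 1.0]])
probs = softmax(logits)

sparse_labels = np.array([0, 1, 2, 1])     # one class id per pixel
one_hot_labels = np.eye(3)[sparse_labels]  # same labels, one-hot encoded

# Sparse form: look up the predicted probability of the true class
sparse_loss = -np.log(probs[np.arange(4), sparse_labels]).mean()
# One-hot form: sum over classes; only the true class contributes
categorical_loss = -(one_hot_labels * np.log(probs)).sum(axis=-1).mean()
print(sparse_loss, categorical_loss)
```

Both numbers agree, so the choice between the two losses comes down to how the masks are stored. Integer class ids, as in y above, take less memory than one-hot masks.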

unet.compile(optimizer=tf.keras.optimizers.Adam(),
             loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
             metrics=['accuracy'])
results = unet.fit(X_train, y_train, batch_size=32, epochs=20,
                   validation_data=(X_valid, y_valid))

First, we check whether the model is learning at a suitable rate by plotting the loss for each epoch. If the learning rate is too large, the ‘train loss’ oscillates; otherwise, we see a consistently decreasing loss.
Second, we look for high bias or underfitting, i.e. both the training and the validation accuracy are very low. This means the model hasn’t been trained well and needs tuning. Some options for reducing high bias are a bigger network, more training iterations or more features. A better optimization algorithm and better weight initialization might also help.
Lastly, we check for high variance or overfitting, i.e. the train accuracy is high but the validation accuracy is low. This means the model is fitted too tightly to the train data and is not general enough to predict new data. To address this, we can use regularization, which shrinks the influence of the weights, or add more examples to the train set.

After evaluation, tune the model to get the best results on the criteria shown above.
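
The three checks can be turned into a rough helper. The function names and all thresholds below (0.7, 0.1 and 0.3) are illustrative assumptions, not fixed rules, and the numbers passed in are made up. With a real run you would read the values from results.history instead, where Keras typically stores them under keys such as 'loss', 'accuracy' and 'val_accuracy' (check results.history.keys() for your version).

```python
def loss_oscillates(train_loss, max_increase_share=0.3):
    """True if the loss rises between epochs more often than allowed."""
    steps = list(zip(train_loss, train_loss[1:]))
    increases = sum(1 for prev, cur in steps if cur > prev)
    return increases / max(len(steps), 1) > max_increase_share

def diagnose_fit(train_acc, valid_acc, low_acc=0.7, max_gap=0.1):
    """Rough bias/variance check from final train and validation accuracy."""
    if train_acc < low_acc and valid_acc < low_acc:
        return "high bias (underfitting)"
    if train_acc - valid_acc > max_gap:
        return "high variance (overfitting)"
    return "reasonable fit"

# Made-up numbers standing in for values from a training run
print(loss_oscillates([1.0, 0.8, 0.9, 0.7, 0.85, 0.6]))
print(diagnose_fit(train_acc=0.98, valid_acc=0.75))
```

A helper like this only flags what to look at; the loss and accuracy plots remain the more reliable way to judge a run.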

Now that we have checked how our model performs in numbers, we can also visualize its predictions using model.predict(). Make sure the dimensions of your input match the input dimensions of the trained model. Also, to visualize the predicted mask, adjust its axes to match the output dimensions.

import matplotlib.pyplot as plt

def VisualizeResults(index):
    # Add a batch axis: the model expects (batch, height, width, channels)
    img = X_valid[index]
    img = img[np.newaxis, ...]
    pred_y = unet.predict(img)
    # Pick the highest-scoring class per pixel, then restore a channel axis
    pred_mask = tf.argmax(pred_y[0], axis=-1)
    pred_mask = pred_mask[..., tf.newaxis]
    fig, arr = plt.subplots(1, 3, figsize=(15, 15))
    arr[0].imshow(X_valid[index])
    arr[0].set_title('Processed Image')
    arr[1].imshow(y_valid[index, :, :, 0])
    arr[1].set_title('Actual Masked Image ')
    arr[2].imshow(pred_mask[:, :, 0])
    arr[2].set_title('Predicted Masked Image ')
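
The axis handling inside VisualizeResults can be checked without a trained model. In the sketch below, a random array stands in for the output of unet.predict(); the 4x4 size and the 3 classes are assumptions chosen to keep the example small.

```python
import numpy as np

# Stand-in for unet.predict(img): batch of 1, 4x4 pixels, 3 class scores
rng = np.random.default_rng(0)
pred_y = rng.normal(size=(1, 4, 4, 3))

# A single image needs a leading batch axis before prediction
single_img = np.zeros((4, 4, 3))
batched_img = single_img[np.newaxis, ...]  # (4, 4, 3) -> (1, 4, 4, 3)

# Highest-scoring class per pixel, then a trailing channel axis
pred_mask = np.argmax(pred_y[0], axis=-1)  # shape (4, 4)
pred_mask = pred_mask[..., np.newaxis]     # shape (4, 4, 1), like the masks in y
print(batched_img.shape, pred_mask.shape)
```

The resulting mask holds one class id per pixel, which is the same layout as the masks stored in y.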

The images below compare the actual mask with the predicted mask from the U-Net model. Try using the model we have created to predict the outline and background of an image of your choice!