The U-Net : A Complete Guide (2024)

Alejandro Ito Aramendia

Block 1

An input image with dimensions 572² is fed into the U-Net. This input image consists of only 1 channel, likely a grayscale channel.
Two 3x3 convolution layers (unpadded) are then applied to the input image, each followed by a ReLU layer. At the same time the number of channels is increased to 64 in order to capture more features. Because the convolutions are unpadded, each one trims 2 pixels from the height and width, taking the feature map from 572² to 570² and then to 568².
A 2x2 max pooling layer with a stride of 2 is then applied. This halves the 568² feature map to 284².
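
The sizes in block 1 follow from simple arithmetic: an unpadded 3x3 convolution trims one pixel from every border, and a 2x2 max pool with a stride of 2 halves what is left. Below is a minimal sketch of that arithmetic. The helper names `conv_out` and `pool_out` are made up for this illustration and do not come from any library.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Output side length of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output side length of a max pooling layer on a square input."""
    return (size - kernel) // stride + 1


size = 572
after_conv1 = conv_out(size)           # first unpadded 3x3 convolution
after_conv2 = conv_out(after_conv1)    # second unpadded 3x3 convolution
after_pool = pool_out(after_conv2)     # 2x2 max pooling, stride 2

print(after_conv1, after_conv2, after_pool)  # 570 568 284
```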

Block 2

Just like in block 1, two 3x3 convolution layers (unpadded) are applied to the output of block 1, each followed again by a ReLU layer. At each new block the number of feature channels is doubled, now to 128, while the two convolutions shrink the feature map from 284² to 280².
Next, a 2x2 max pooling layer is again applied to the resulting feature map, reducing the spatial dimensions by half to 140².
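
To make the pooling step concrete, here is a small sketch of 2x2 max pooling with a stride of 2 on a toy 4x4 feature map. The function name `max_pool_2x2` and the input values are invented for this example, and it assumes the height and width are even.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            # Keep only the largest value in each non-overlapping 2x2 window.
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


toy_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
pooled = max_pool_2x2(toy_map)
print(pooled)  # [[4, 5], [3, 7]]
```

The 4x4 input becomes a 2x2 output, which is the same halving that takes 280² to 140² in block 2.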

Block 3

Block 3 follows the same procedure as blocks 1 and 2, so it will not be repeated.

Block 4

Same as block 3.

Block 5

In the final block of the contracting path, the number of feature channels reaches 1024 after being doubled at each block.
This block also contains two 3x3 convolution layers (unpadded), each followed by a ReLU layer. However, for symmetry, I have included only the first layer here and placed the second layer at the start of the expanding path.
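
Putting the five blocks together, the channel counts and spatial sizes of the whole contracting path can be traced with a few lines of arithmetic. This is only a sketch: the function name `contracting_path` is hypothetical, and it counts both convolutions of block 5 in one place even though this article splits them across the two paths.

```python
def contracting_path(size=572, channels=(64, 128, 256, 512, 1024)):
    """Return one (block, channels, size after convs, size after pooling) row per block."""
    trace = []
    for block, ch in enumerate(channels, start=1):
        size = size - 2 - 2                      # two unpadded 3x3 convolutions
        is_last = block == len(channels)
        pooled = None if is_last else size // 2  # no pooling after the last block
        trace.append((block, ch, size, pooled))
        if not is_last:
            size = pooled
    return trace


trace = contracting_path()
for row in trace:
    print(row)
# (1, 64, 568, 284)
# (2, 128, 280, 140)
# (3, 256, 136, 68)
# (4, 512, 64, 32)
# (5, 1024, 28, None)
```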

After complex features and patterns have been extracted, the feature map moves on to the expanding path.

The expanding path uses both convolution and up-convolution operations to combine learnt features and upsample the input feature map until it generates a segmentation map.

Much like with the contracting path, each block will be discussed below.

Before reading further: skip connections are used to send feature maps directly from the contracting path to the expanding path without them having to go through the deeper blocks. This allows both high and low level features to be preserved and learnt, reducing the spatial information that is lost to downsampling in the contracting path.

Block 5

Continuing on from the contracting path, a second 3x3 convolution (unpadded) is applied with a ReLU layer after it.
Then a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold and also halving the number of channels to 512.

Block 4

Using skip connections, the corresponding feature map from the contracting path is then concatenated, doubling the feature channels to 1024. Note that the feature map from the contracting path must first be cropped to match the expanding path’s dimensions, because the unpadded convolutions have left the two with different sizes.
Two 3x3 convolution layers (unpadded) are applied, each followed by a ReLU layer, reducing the channels to 512.
Afterwards, a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold and also halving the number of channels to 256.
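
The crop-and-concatenate step of a skip connection can be sketched on toy data. In block 4 the real sizes would be a 512-channel 64² map from the contracting path, cropped to 56² and joined with the 512-channel 56² upsampled map to give 1024 channels. The example below uses much smaller stand-ins, and the names `center_crop` and `crop_and_concat` are made up for illustration. It also assumes a centre crop, which is a common choice.

```python
def center_crop(channel, target):
    """Crop a square 2D list to target x target around its centre."""
    start = (len(channel) - target) // 2
    return [row[start:start + target] for row in channel[start:start + target]]


def crop_and_concat(skip, upsampled):
    """Crop every skip channel to the upsampled size, then stack along the channel axis."""
    target = len(upsampled[0])
    return [center_crop(ch, target) for ch in skip] + upsampled


# Toy stand-ins: a feature map is a list of channels, each channel a 2D list.
skip = [[[r * 8 + c for c in range(8)] for r in range(8)] for _ in range(2)]  # 2 channels, 8x8
upsampled = [[[0] * 4 for _ in range(4)] for _ in range(2)]                   # 2 channels, 4x4

merged = crop_and_concat(skip, upsampled)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 4 4 4
print(merged[0][0])  # [18, 19, 20, 21]
```

The channel count doubles (2 + 2 = 4) while the spatial size is set by the upsampled map, mirroring the 512 + 512 = 1024 channels described above.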

Block 3

Block 3 repeats the procedure used in block 4, so it will not be covered again.

Block 2

Same as block 3.

Block 1

In the final block of the expanding path, there are 128 channels after concatenating the skip connection.
Next, two 3x3 convolution layers (unpadded) are applied to the feature map, each followed by a ReLU layer, reducing the number of feature channels to 64.
Finally, a 1x1 convolution layer followed by an activation layer is used to reduce the number of channels to the desired number of classes. In this case there are 2 classes, as binary classification is often used in medical imaging. The original paper applies a pixel-wise softmax over these 2 channels, while a sigmoid is the usual choice when a single output channel is used instead.
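
The same kind of arithmetic used for the contracting path also traces the expanding path. The sketch below starts from the 1024-channel 28² map left after both convolutions of block 5 and follows it to the final segmentation map. The function name `expanding_path` is hypothetical.

```python
def expanding_path(size=28, channels=1024, num_classes=2):
    """Trace (channels, size) at the end of each expanding block, then the output."""
    trace = []
    for _ in range(4):       # blocks 4, 3, 2 and 1
        size *= 2            # 2x2 up-convolution with stride 2 doubles the size
        channels //= 2       # ... and halves the channels
        # Concatenating the cropped skip map doubles the channels again,
        # and the two 3x3 convolutions bring them back down.
        size -= 4            # two unpadded 3x3 convolutions
        trace.append((channels, size))
    trace.append((num_classes, size))  # the 1x1 convolution keeps the spatial size
    return trace


up_trace = expanding_path()
print(up_trace)
# [(512, 52), (256, 100), (128, 196), (64, 388), (2, 388)]
```

Under these assumptions the 572² input ends up as a 388² segmentation map, which is smaller than the input because every unpadded convolution trims the borders.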

After upsampling the feature map in the expanding path, a segmentation map should be generated, with each pixel classified individually.
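
Classifying each pixel individually can be illustrated with a tiny example. The sketch assumes the network outputs one score per class for every pixel, applies a softmax across the classes, and picks the most likely class. The scores are invented and the function names are only for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def segment(logits):
    """logits[class][row][col] -> 2D map holding the most likely class per pixel."""
    n_classes = len(logits)
    rows, cols = len(logits[0]), len(logits[0][0])
    seg = []
    for r in range(rows):
        seg_row = []
        for c in range(cols):
            probs = softmax([logits[k][r][c] for k in range(n_classes)])
            seg_row.append(probs.index(max(probs)))
        seg.append(seg_row)
    return seg


logits = [
    [[2.0, 0.1], [0.3, -1.0]],  # scores for class 0 (background)
    [[0.5, 1.2], [0.2, 3.0]],   # scores for class 1 (foreground)
]
seg_map = segment(logits)
print(seg_map)  # [[0, 1], [0, 1]]
```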

In this section I would like to discuss what up-convolutions are and how changing the number of feature channels is possible. Convolutions, pooling, strides and padding were discussed in my previous CNN article, so I have chosen not to cover them again. If necessary, please recap these concepts there.

Now let’s get into it.

Up-Convolution

An up-convolution, also known as a transposed convolution (and sometimes, loosely, as a deconvolution), is a method used to upsample feature maps and recover spatial information.

Let’s look at the example below and briefly discuss what’s happening.

An intuitive way to picture an up-convolution is to expand and duplicate each element from the input feature map to the same size as the filter. This process up-samples the input. The filter is then multiplied element-wise with each of these expanded regions.

For example, the expanded green input above is initially just composed of four 1s. Likewise, the expanded red, yellow and grey regions are initially filled with just 2s, 3s and 4s, respectively. Next, the filter is applied over each of these regions and the results are combined to form the output feature map, with any overlapping positions summed.

In the U-Net described above, the spatial dimensions were doubled, which means that a 2x2 filter was used with a stride of 2.
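
Here is a minimal sketch of a 2x2 up-convolution with a stride of 2, written in plain Python. The input values 1, 2, 3 and 4 follow the example above, while the filter weights are made up for illustration. With a 2x2 filter and a stride of 2 the regions do not overlap, so each input value simply fills its own 2x2 patch of the output.

```python
def up_conv_2x2(inp, kernel):
    """Transposed convolution with a 2x2 kernel and stride 2 (no padding)."""
    rows, cols = len(inp), len(inp[0])
    out = [[0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            for kr in range(2):
                for kc in range(2):
                    # Each input value scales the kernel and is written into
                    # the 2x2 patch that starts at (2 * r, 2 * c).
                    out[2 * r + kr][2 * c + kc] += inp[r][c] * kernel[kr][kc]
    return out


inp = [[1, 2],
       [3, 4]]
kernel = [[1, 0],
          [2, 1]]  # hypothetical weights, learnt during training in practice

out = up_conv_2x2(inp, kernel)
for row in out:
    print(row)
# [1, 0, 2, 0]
# [2, 1, 4, 2]
# [3, 0, 4, 0]
# [6, 3, 8, 4]
```

The 2x2 input becomes a 4x4 output, which is the same doubling that takes 28² to 56² in the expanding path.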

Changing the Amount of Channels

Throughout the U-Net, the number of feature channels is constantly changing. How do convolution operations affect this?

Well, the convolution operation itself does not directly affect the number of channels present. The number of output channels is in fact determined by the number of filters used in the convolution layer. If 64 filters are applied over the input, with each attempting to extract a different feature, 64 feature maps will be generated.
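
A small sketch makes the link between filters and channels explicit. The `conv2d` function below is a bare-bones, unpadded convolution written for this illustration only (it computes a cross-correlation, as most deep learning libraries do). Each filter spans all input channels and produces exactly one output channel.

```python
def conv2d(inp, filters):
    """Unpadded convolution on nested lists.

    inp:     list of C_in channels, each a 2D list (H x W)
    filters: list of C_out filters, each a list of C_in kernels (k x k)
    """
    k = len(filters[0][0])
    h, w = len(inp[0]), len(inp[0][0])
    out = []
    for filt in filters:  # one output channel per filter
        channel = [
            [
                sum(
                    inp[ci][r + i][c + j] * filt[ci][i][j]
                    for ci in range(len(inp))
                    for i in range(k)
                    for j in range(k)
                )
                for c in range(w - k + 1)
            ]
            for r in range(h - k + 1)
        ]
        out.append(channel)
    return out


image = [[[1] * 6 for _ in range(6)]]                        # 1 channel, 6x6
filters = [[[[f] * 3 for _ in range(3)]] for f in range(4)]  # 4 filters, each 1 x 3x3

features = conv2d(image, filters)
print(len(features), len(features[0]), len(features[0][0]))  # 4 4 4
```

Four filters give four output channels, regardless of how many channels the input had. Swapping in 64 filters would give the 64 channels of block 1.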

This may seem obvious to some, but was something that stumped me while learning this.

U-Nets are often used in medical imaging. They play crucial roles in detecting and locating tumors, cysts and other abnormalities.

Below is a possible example of what an input and output of a U-Net may look like.

A medical grayscale image of a uterus was used as an input and fed into a U-Net. After being processed by the U-Net, each pixel was classified into one of two classes: tumor or not-tumor. This segmentation map can be seen in the output image.

To conclude this article, let’s summarise what we have learnt.

The U-Net is an architecture that consists of 23 convolutional layers in total. Using a combination of convolution, up-convolution, pooling and skip connections, the U-Net is able to extract and capture complex features, while also keeping and reconstructing spatial information. This allows for the localisation of features in an image, thus producing accurate segmentation maps. This is especially useful in medical image analysis where accurately locating and detecting abnormalities is vital.
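
One way to arrive at the figure of 23 is to count every convolutional layer described in the blocks above, with up-convolutions included in the count. The variable names below are chosen for this sketch only.

```python
contracting_convs = 5 * 2  # blocks 1 to 5, two 3x3 convolutions each
up_convs = 4               # one 2x2 up-convolution before each expanding block
expanding_convs = 4 * 2    # blocks 4 to 1, two 3x3 convolutions each
final_conv = 1             # 1x1 convolution that outputs the class channels

total_layers = contracting_convs + up_convs + expanding_convs + final_conv
print(total_layers)  # 23
```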

Thank you for getting this far. If you have any questions, do not hesitate to ask.

References

[1] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597.
