Skip to content

pechyony/semantic-segmentation

Repository files navigation

Semantic Segmentation

Project 1 of term 3 of Udacity self-driving car nanodegree

Table of Contents

  1. Setup

  2. Usage

  3. Files

  4. Implementation
    4.1 Dataset
    4.2 Architecture

    4.3 Training
    4.4 Scoring
    4.5 Samples of segmented test images

Setup

Follow instructions here to set up an environment for this project.

Usage

  1. Set up project environemnt as specified above.
  2. Run python main.py to train a model and segment test images

Files

  • main.py - training and scoring
  • helper.py - auxiliary functions
  • project_tests.py - unit testing functions
  • training_log.txt - log file of the training process
  • architecture.png - architecture diagram
  • scoring_results - folder of segmented test images
  • training_samples - folder of the sample training images

Implementation

The goal of semantic segmentation is to classify each pixel in the image. In this project we used a fully convolutional neural network to label all pixels that belong to a road. This network was trained and tested over KITTI Road dataset that can be downloaded from here.

Dataset

The dataset has 289 training and 290 testing images. All pixels of training images are classified into 3 classes: current road, side road and background. Here is an example of training image alt text and its labeled pixels alt text The current road is marked with pink, other road is marked with black and background is marked with red. The road pixels are a union of pink and black pixels.

Architecture

Fully convolutional network consists of encoder, 1x1 convolutions, decoder.

Encoder

We used the first 5 convolutional and max pooling layers of VGG16 network. The architecture of these layers can be found here, Table 1, column D.

1x1 Convolutions

The output of encoder is an input to the layer of 1x1 convolutions. 1x1 convolutions layer is equivalent to fully connected layers of the original VGG16 network. However, unlike fully connected layers, 1x1 convolutions layer allows the pretrained network to segment images of any size. The architecture of 1x1 convolutions layer is:

4096 convolutions of size 7x7 with stride 1
4096 RELU activation unit
Dropout with probability 0.5
4096 convolutions of size 1x1
4096 RELU activation units
Dropout with probability 0.5
2 convolutions of size 1x1 

Decoder

Decoder layer upsamples the output of 1x1 convolutions layer to the size of the original image. We used three upsampling layers:

[D1] 2 transposed convolutions of size 4x4 with stride 2 and 'same' padding  
[D2] 2 transposed convolutions of size 4x4 with stride 2 and 'same' padding  
[D3] 2 transposed convolutions of size 16x16 with stride 8 and 'same' padding

We also used skip connections to connect the output of the intermediate layers of encoder with the input of the final layers of decoder. In particular, we used the scaled outputs of POOL3 and POOL4 layers of encoder, projected to have depth 2. The complete architecture of decoder layer and its connections to previous layers are shown below. alt text

Training

We used a pre-trained VGG16 network that is available here. We trained the network with the following values of hyperparameter:

learning rate = 0.001
keep probability = 0.5
batch size = 8

We also used L2-regularization of the weights of the last convolutional and all transposed convolutional layers. The regularization weight was set to 0.0001.

We used a data-driven approach to set up the number of epochs. After each epoch we computed IOU over the training set. We stopped training when there was no improvement of IOU over the last 10 epochs. Then we chose the model that generated the highest IOU. To limit the training process we set the maximal number of epochs to 200, but as we describe below, we didn't reach this upper bound.

training_log.txt file contains a complete log of our training process, including cross-entropy error and IOU after each epoch. The training process finished after 92 epochs, with the best model being the one obtained after epoch 82. This model has cross-entropy error 0.0254 and IOU 0.9677. We chose this model to segment the test images.

Scoring

The image generated by our network has depth 2, with each pixel having two real-valued scores. The first score is for background, the second one is for the road. We used softmax function to convert these scores to probabilities. If a pixel has road probability larger than 0.5 then it is labeled as road, otherwise it is labeled as background.

Samples of segmented test images

In this section we show several images segmented using our model. The road is marked with green, all other pixels are labelled as background. The complete set of segmented images is in scoring_results folder. alt text alt text alt text alt text

About

Project 2 in Term 3 of Udacity self-driving car nanodegree

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages