- General info
- Requirements
- Project 1: Object Detection using EfficientDet
- Project 2: Water Segmentation using U-Net
- Project 3: Face key-point recognition using CNN
In this repository you will learn the basics of detecting objects and keypoints, and of segmenting images, using models such as EfficientDet, MobileNet-SSD and U-Net.
For the first project you only need a Google account with Colab and Drive (I am using Colab Pro for the training).
If you want to train locally, you need to install the TensorFlow Object Detection API manually; you'll find a good tutorial on this link
For inference you need the inference.py script shared in this repository, the saved_model directory and the label_map.pbtxt file (you can download the last two automatically by running the code in Colab).
What is EfficientDet?
I will assume you have some knowledge of computer vision and CNNs; if not, you can skip this part.
EfficientDet is a family of deep learning models designed for object detection. EfficientDet-D7 achieved state-of-the-art results on the COCO dataset. The family is both scalable and efficient, meaning that it can recognize objects at vastly different scales and needs less computation than earlier detectors of similar accuracy.
To understand EfficientDet we need to understand two key improvements made:
- Bi-directional Feature Pyramid Network (BiFPN)
- Compound Scaling
Recognizing objects at different scales is a fundamental challenge in computer vision.
Different authors have tried to solve this in different ways. Before the introduction of FPN there were three main categories of solutions: featurized image pyramids, single feature maps and pyramidal feature hierarchies.
But they all have some issues:
- Featurized image pyramid is too slow to train and is infeasible in terms of memory, because a CNN has to process every scale of each image.
- Single feature map is what Faster R-CNN uses, but by relying on one late feature map it loses the high-resolution, low-level features of the first layers, which reduces its representational capacity for object detection.
- Pyramidal feature hierarchy is used by SSD. It avoids the low-level features of the first levels of a CNN by starting from the high-level features at the end of the network and adding several new layers on top. By doing so it misses the opportunity to reuse the earlier layers, which are important for detecting small objects.
A Feature Pyramid Network combines low-resolution, semantically strong features from the later layers with high-resolution, semantically weak features from the earlier layers via a top-down pathway and lateral connections, which leads to multi-scale feature fusion.
It is somewhat similar to the architecture of U-Net when you think about it.
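The top-down pathway with lateral connections can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: feature maps are 2D lists of numbers with a single channel, each level is exactly half the size of the previous one, and the 1x1 and 3x3 convolutions that a real FPN applies around the merge are left out. The function names are made up for this example.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fpn_top_down(features):
    """Fuse backbone features ordered from high resolution (early layer)
    to low resolution (late layer).

    Returns the fused maps in the same order.
    """
    outputs = [features[-1]]  # the coarsest level is passed through as is
    for lateral in reversed(features[:-1]):
        top_down = upsample2x(outputs[0])              # top-down pathway
        outputs.insert(0, add_maps(lateral, top_down))  # lateral connection
    return outputs


# Three toy levels: 4x4, 2x2 and 1x1.
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = fpn_top_down([c3, c4, c5])
```

After the call, every cell of the finest map `p3` also carries the contribution of the two coarser levels, which is the multi-scale fusion described above.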
Feature network design and evolution of FPN
What EfficientDet, and BiFPN in particular, did was to:
- Add a bottom-up path aggregation network: the conventional top-down FPN is limited by its one-way information flow, so this makes it bidirectional.
- Remove all nodes that have only one input edge. The intuition is that if a node has only one input edge with no feature fusion, then it will have less contribution to the feature network.
- Add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.
- Treat each bidirectional path as one single layer and repeat it multiple times to enable more high-level feature fusion.
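The four points above can be put together in a toy sketch. It makes strong simplifications that are assumptions of this example and not the actual implementation: each pyramid level is represented by a single number (as if all maps had already been resized to a common shape), and fusion is a plain average, whereas the real BiFPN learns a weight for every input edge and applies convolutions after each fusion. The function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def bifpn_layer(inputs):
    """One bidirectional layer.

    `inputs` holds one value per pyramid level, ordered from the highest
    resolution (index 0) to the lowest. At least two levels are expected.
    """
    n = len(inputs)

    # Top-down pass. Only the middle levels get an intermediate node: the
    # first and last levels would have a single input edge, so those nodes
    # are removed and the raw input is used instead.
    td = list(inputs)
    for i in range(n - 2, 0, -1):
        td[i] = mean([inputs[i], td[i + 1]])

    # Bottom-up pass. Middle levels fuse three edges: the original input
    # (the extra same-level edge), the top-down node and the level below.
    out = [0.0] * n
    out[0] = mean([inputs[0], td[1]])
    for i in range(1, n - 1):
        out[i] = mean([inputs[i], td[i], out[i - 1]])
    out[n - 1] = mean([inputs[n - 1], out[n - 2]])
    return out


def bifpn(inputs, num_layers):
    """Stack several bidirectional layers, as in the last point above."""
    features = list(inputs)
    for _ in range(num_layers):
        features = bifpn_layer(features)
    return features


fused = bifpn_layer([0.0, 0.0, 4.0])
```

In this toy run the signal that only exists at the coarsest level reaches every other level after a single layer, which a one-way pass in the opposite direction could not achieve.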
The second key improvement was made by EfficientNet (the backbone of EfficientDet) with compound scaling.
Previous work mostly scaled up a baseline detector by employing bigger backbone networks (ResNets, AmoebaNet...), using larger input images or stacking more FPN layers. These methods are usually ineffective since they focus on only a single or a limited number of scaling dimensions.
The authors proposed using a single compound coefficient to jointly scale up all three dimensions of the network (depth, width and input resolution) while maintaining a balance between them.
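As a rough illustration of compound scaling, the sketch below uses the EfficientNet backbone formulation, where depth, width and resolution are multiplied by α^φ, β^φ and γ^φ for a single coefficient φ. The constants are the values reported in the EfficientNet paper as far as I recall them (α = 1.2, β = 1.1, γ = 1.15), so treat them as an assumption; EfficientDet itself uses its own rules for the BiFPN and prediction heads, which are not shown here.

```python
# Base multipliers (assumed values, see the note above).
ALPHA = 1.2   # depth
BETA = 1.1    # width
GAMMA = 1.15  # input resolution


def compound_scaling(phi):
    """Return the multipliers applied to the baseline network for a given phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate growth in FLOPs: cost scales with depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scaling(phi)
    print(phi, {name: round(value, 2) for name, value in scale.items()},
          round(flops_multiplier(phi), 2))
```

With these constants α·β²·γ² is close to 2, so each step of φ roughly doubles the computation while keeping the three dimensions balanced.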
To conclude, by combining an EfficientNet backbone, a bidirectional Feature Pyramid Network and convolutions, we get this:
- First open the file ObjectDetection.ipynb of this repository in Colab.
- Then you need images in JPG format and annotations in Pascal VOC format (xml files). (You can use FastAnnotations, a framework that I made 😄)
- Once you have them, simply put them in a zip file named data.zip. Don't bother making train/test or annotation folders; everything is handled automatically to make the process easier.
- And now you can upload data.zip to your Drive.
- Finally, just run the code: it will train an EfficientDet-D0 model on the data you uploaded to Drive.
If you want to change the model to, let's say, EfficientDet-D5 or MobileNet-SSD, you need to download the model from the TensorFlow Object Detection Model Zoo. For instance, for EfficientDet-D5 you need to change:
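If you prefer to build data.zip with a script, a minimal sketch could look like the following. It assumes that your images and Pascal VOC XML files sit together in one folder and that a flat archive (no sub-folders) is what the notebook expects; the function name and the folder name in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path

# File types that belong in the archive: images and Pascal VOC annotations.
KEPT_SUFFIXES = {".jpg", ".jpeg", ".xml"}


def build_data_zip(source_dir, zip_path="data.zip"):
    """Put every JPG image and XML annotation of `source_dir` into a flat zip.

    Returns the list of file names that were added.
    """
    source = Path(source_dir)
    files = sorted(
        path for path in source.iterdir()
        if path.is_file() and path.suffix.lower() in KEPT_SUFFIXES
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # arcname drops the folder so the archive has no sub-directories.
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

Calling `build_data_zip("my_dataset")` would then create data.zip in the current directory, ready to be uploaded to Drive.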
!wget http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d0_coco17_tpu-32.tar.gz
!tar -xzvf efficientdet_d0_coco17_tpu-32.tar.gz
to
!wget http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d5_coco17_tpu-32.tar.gz
!tar -xzvf efficientdet_d5_coco17_tpu-32.tar.gz
and
ModelName = 'efficientdet_d0_coco17_tpu-32'
to
ModelName = 'efficientdet_d5_coco17_tpu-32'
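Since the download URL, the archive name and ModelName all derive from the same string, the change can be expressed once. The helper below is a hypothetical convenience and not part of the notebook; it simply reuses the URL pattern shown above, which is only confirmed here for the two EfficientDet names, so check the model zoo page for other models.

```python
# Base URL taken from the wget commands above.
BASE_URL = "http://download.tensorflow.org/models/object_detection/tf2/20200711"


def model_archive(model_name):
    """Return (download URL, archive file name) for a model name."""
    archive = f"{model_name}.tar.gz"
    return f"{BASE_URL}/{archive}", archive


ModelName = "efficientdet_d5_coco17_tpu-32"
url, archive = model_archive(ModelName)
print(url)
print(archive)
```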
After training, this is the result I get on the Stanford Dogs dataset (with 9 classes only):
And these are some tests made with the newly trained model:
In order to use the model locally there are a few steps:
this is where your files are downloaded
What is segmentation?
In this project we will try to identify water in images, thanks to a dataset from Kaggle, using a technique called segmentation.
Segmentation is done here with a network structured like an autoencoder: it encodes the data by compressing it into a lower-dimensional representation (the bottleneck layer, or code) and then decodes it to construct the target mask. Unlike a classic autoencoder, which is unsupervised and reconstructs its own input, this network is trained with supervision on ground-truth masks.
A mask is an image made of numbers or colors corresponding to the different classes present in the image.
Example of a mask
Here we are going to use U-Net as our model; it is generally used in medical image segmentation to detect diseases or to locate certain parts of the body in preparation for surgery.
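To make the idea of a mask concrete, here is a tiny hand-written one for a two-class problem. The class indices and names are assumptions chosen for this example (0 for background, 1 for water); the dataset may encode its masks differently, for instance with colors.

```python
from collections import Counter

# Hypothetical class indices for this illustration.
CLASS_NAMES = {0: "background", 1: "water"}

# A 3x4 "image": each cell holds the class of the corresponding pixel.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]


def class_pixel_counts(mask):
    """Count how many pixels belong to each class."""
    return Counter(value for row in mask for value in row)


counts = class_pixel_counts(mask)
total = sum(counts.values())
for class_id, name in CLASS_NAMES.items():
    print(f"{name}: {counts[class_id]} pixels ({counts[class_id] / total:.0%})")
```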
Below you can see the U-Net architecture:
U-Net architecture
In addition, if you want to improve the performance of your model, you can use a pretrained model such as ResNet50 or VGG19 as the encoder at the start of your U-Net and then attach the decoder at the end.
- First open the file Segmentation.ipynb of this repository in Colab.
- Then you need images and masks (JPG, PNG or another image format).
- Once you have them simply upload them to your Drive.
- Finally, modify the DATASET_PATH variable and the paths used where the dataset is built, for example:
for img in os.listdir(DATASET_PATH+'/Annotations/ADE20K'):
for img2 in os.listdir(DATASET_PATH+'/JPEGImages/ADE20K'):
- Run the code and enjoy.
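When adapting those paths to your own dataset, it can help to check that every image has a matching mask before training. The sketch below is one way to do that, assuming that an image and its mask share the same base file name (for example pool.jpg and pool.png); the function name is hypothetical and the notebook may pair files differently.

```python
import os


def pair_images_and_masks(image_dir, mask_dir):
    """Return (image path, mask path) pairs for files sharing a base name."""
    # Index the masks by file name without extension.
    masks = {os.path.splitext(name)[0]: name for name in os.listdir(mask_dir)}
    pairs = []
    for name in sorted(os.listdir(image_dir)):
        stem = os.path.splitext(name)[0]
        if stem in masks:  # images without a mask are skipped
            pairs.append((os.path.join(image_dir, name),
                          os.path.join(mask_dir, masks[stem])))
    return pairs
```

With the folder layout shown above, a call could look like `pair_images_and_masks(DATASET_PATH + '/JPEGImages/ADE20K', DATASET_PATH + '/Annotations/ADE20K')`.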
And these are some results I got after running the algorithm:
The model has some difficulties around the edges of the water, but it gets the general idea. The reason is that I didn't use a pre-trained model and trained everything from scratch for only a few epochs.
Just like the object detection above, we need to repeat the same steps:
this is where your files are downloaded
How does it work?
Key points can be used for a variety of tasks:
- Apply a filter on a face
- Detect emotions on a face
- Identify someone based on their traits
- Identify what action a person is doing (used in sports, or to spot thieves in a supermarket)
In order to achieve that, we need a neural network composed of a CNN and fully connected layers that predicts (x, y) coordinates for each key point. For instance, if we have 5 key points we'll need a linear layer with 10 outputs (x1, y1, x2, y2, ..., x5, y5).
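The mapping between the flat output vector and the list of key points can be sketched as follows. The helper names are made up for this example, and the coordinates are arbitrary numbers, not real predictions.

```python
def to_points(flat_output):
    """Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...]."""
    if len(flat_output) % 2 != 0:
        raise ValueError("expected an even number of values (x, y pairs)")
    return list(zip(flat_output[0::2], flat_output[1::2]))


def to_flat(points):
    """Inverse operation, useful to build training targets from annotations."""
    return [coordinate for point in points for coordinate in point]


# 5 key points -> 10 network outputs.
prediction = [30, 40, 70, 40, 50, 60, 35, 80, 65, 80]
points = to_points(prediction)
print(points)
```

The number of outputs of the last layer is therefore always twice the number of key points.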
- First open the file FacekeyPoint.ipynb of this repository in Colab.
- Then you need images in JPG format and annotations (you might use FastAnnotations to create them).
- Once you have them simply upload them to your Drive.
- Finally, modify the paths where the dataset is built and run the notebook.