This project shows how to localize objects in images by using simple convolutional neural networks.
Before getting started, we have to download a dataset and generate a csv file containing the annotations (boxes).
- Download The Oxford-IIIT Pet Dataset
- Download The Oxford-IIIT Pet Dataset Annotations
- `tar xf images.tar.gz`
- `tar xf annotations.tar.gz`
- `mv annotations/xmls/* images/`
- `python3 generate_dataset.py`
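
`generate_dataset.py` does the actual conversion; as a rough, hypothetical sketch of what turning the XML annotations into a CSV involves (the paths, column order and output file name are assumptions, not the repository's exact code):

```python
# Hypothetical sketch of the XML -> CSV conversion, not the repository's exact script.
# Assumes the Pascal VOC-style XMLs were moved next to the images (see the steps above).
import csv
import glob
import xml.etree.ElementTree as ET

rows = []
for xml_path in glob.glob("images/*.xml"):
    root = ET.parse(xml_path).getroot()
    filename = root.find("filename").text
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    box = root.find("object/bndbox")  # the pet dataset annotates one box per image
    xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
    xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
    rows.append([filename, width, height, xmin, ymin, xmax, ymax])

with open("dataset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```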
 
First, let's look at YOLOv2's approach:
- Pretrain Darknet-19 on ImageNet (feature extractor)
- Remove the last convolutional layer
- Add three 3 x 3 convolutional layers with 1024 filters
- Add a 1 x 1 convolutional layer with the number of outputs needed for detection
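
Purely for illustration, the head described above could be stacked on top of a feature extractor like this (Darknet-19 itself is not in Keras Applications, so a placeholder input stands in for its feature map; the 125 outputs assume YOLOv2's VOC setting of 5 anchor boxes x 25 values):

```python
# Illustrative sketch of a YOLOv2-style detection head, not the original Darknet code.
from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

features = Input(shape=(13, 13, 1024))   # placeholder for the pretrained extractor's feature map

x = features
for _ in range(3):                        # three 3 x 3 convolutions with 1024 filters
    x = Conv2D(1024, kernel_size=3, padding="same", activation="relu")(x)
outputs = Conv2D(125, kernel_size=1)(x)   # 1 x 1 convolution with the detection outputs

head = Model(inputs=features, outputs=outputs)
```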
 
We proceed in the same way to build the object detector:
- Choose a model from Keras Applications (the feature extractor)
- Remove the dense layer
- Freeze some/all/no layers
- Add one/multiple/no convolution blocks (or `_inverted_res_block` for MobileNetV2)
- Add a convolution layer for the coordinates
 
The code in this repository uses MobileNetV2 because it is faster than other models and its speed/accuracy trade-off can be adjusted. For example, if `alpha = 0.35` with a 96x96 input is not good enough, one can simply increase both values (see here for a comparison). If you use another architecture, change `preprocess_input` accordingly.
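
A minimal sketch of these steps with MobileNetV2 (the constants, filter counts and the single-box output below are assumptions, not necessarily the exact architecture used in `example_1`):

```python
# Minimal sketch of the build steps above; constants and filter counts are assumptions.
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Conv2D, Reshape
from tensorflow.keras.models import Model

IMAGE_SIZE = 96   # assumed input resolution
ALPHA = 0.35      # MobileNetV2 width multiplier

# Feature extractor from Keras Applications, without the dense classifier.
backbone = MobileNetV2(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
                       include_top=False, alpha=ALPHA, weights="imagenet")

# Freeze some/all/no layers of the pretrained backbone (here: all of them).
for layer in backbone.layers:
    layer.trainable = False

# One extra convolution block, then a convolution layer for the 4 box coordinates.
x = Conv2D(96, kernel_size=3, padding="same", activation="relu")(backbone.output)
x = Conv2D(4, kernel_size=3)(x)     # collapses the 3x3 feature map to 1x1x4
outputs = Reshape((4,))(x)          # one (xmin, ymin, xmax, ymax) prediction

model = Model(inputs=backbone.input, outputs=outputs)
model.compile(loss="mean_squared_error", optimizer="adam")
```
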
- `python3 example_1/train.py`
- Adjust the `WEIGHTS_FILE` in `example_1/test.py` (given by the last script)
- `python3 example_1/test.py`
In the following images, red is the predicted box and green is the ground truth:
This time we have to run the scripts `example_2/train.py` and `example_2/test.py`.
In order to distinguish between classes, we have to modify the loss function. Here I'm using `w_1 * log((y_hat - y)^2 + 1) + w_2 * FL(p_hat, p)`, where `w_1 = w_2 = 1` are two weights and `FL(p_hat, p) = -(0.9 * (1 - p_hat)^2 * p * log(p_hat) + 0.1 * p_hat^2 * (1 - p) * log(1 - p_hat))` is the focal loss.
Instead of using all 37 classes, the code will only output class 0 (contains only class 0) or class 1 (contains class 1 to 36). However, it is easy to extend this to more classes (use categorical cross entropy instead of focal loss and try out different weights).
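
A minimal Keras sketch of this combined loss, assuming the network output packs the four box coordinates followed by one class probability (this packing, the names and the epsilon clipping are assumptions, not the repository's exact code):

```python
# Sketch of the combined loss above; the output packing (4 coordinates + 1 class
# probability) and the epsilon clipping are assumptions.
from tensorflow.keras import backend as K

W_1 = 1.0        # weight of the box term
W_2 = 1.0        # weight of the focal term
EPSILON = 1e-7   # avoid log(0)

def detection_loss(y_true, y_pred):
    box_true, p_true = y_true[..., :4], y_true[..., 4]
    box_pred, p_pred = y_pred[..., :4], y_pred[..., 4]
    p_pred = K.clip(p_pred, EPSILON, 1.0 - EPSILON)

    # w_1 * log((y_hat - y)^2 + 1), summed over the 4 coordinates
    box_loss = K.sum(K.log(K.square(box_pred - box_true) + 1.0), axis=-1)

    # focal loss FL(p_hat, p) with alpha = 0.9/0.1 and gamma = 2
    focal_loss = -(0.9 * K.square(1.0 - p_pred) * p_true * K.log(p_pred)
                   + 0.1 * K.square(p_pred) * (1.0 - p_true) * K.log(1.0 - p_pred))

    return W_1 * box_loss + W_2 * focal_loss
```
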
In this example, we use a skip-net architecture similar to U-Net. For an in-depth explanation see my blog post.
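
The full architecture is in the example code and the blog post; purely as an illustration of the idea (the layer names, filter counts and coarse mask output are assumptions), one U-Net-style skip connection on top of MobileNetV2 might look like this:

```python
# Rough sketch of a single U-Net-style skip connection, not the repository's exact model.
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Concatenate, Conv2D, UpSampling2D
from tensorflow.keras.models import Model

backbone = MobileNetV2(input_shape=(96, 96, 3), include_top=False,
                       alpha=0.35, weights="imagenet")

deep = backbone.get_layer("out_relu").output              # 3 x 3 feature map (stride 32)
skip = backbone.get_layer("block_13_expand_relu").output  # 6 x 6 feature map (stride 16)

x = UpSampling2D()(deep)        # upsample the deep features to 6 x 6
x = Concatenate()([x, skip])    # skip connection, as in U-Net
x = Conv2D(96, kernel_size=3, padding="same", activation="relu")(x)
outputs = Conv2D(1, kernel_size=1, activation="sigmoid")(x)  # coarse object map

model = Model(inputs=backbone.input, outputs=outputs)
```
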
This example is based on the three YOLO papers. For an in-depth explanation see this blog post.

Some ideas for improving the results:

- enable augmentations: see `example_4`, the same code can be added to the other examples (a sketch follows after this list)
- better augmentations: try out different values (flips, rotation etc.)
- for MobileNetV1/2: increase `ALPHA` and `IMAGE_SIZE` in `train_model.py`
- other architectures: increase `IMAGE_SIZE`
- add more layers
- try out other loss functions (MAE, smooth L1 loss etc.)
- other optimizer: SGD with momentum 0.9, adjust learning rate
- use a feature pyramid
- read keras-team/keras#9965
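
The augmentation code itself is in `example_4`; as a rough, hypothetical illustration (the function name and box format are assumptions), a horizontal flip that keeps the image and the box consistent could look like this:

```python
# Hypothetical sketch of a simple augmentation: a random horizontal flip that
# also mirrors the bounding box. Not the exact code used in example_4.
import numpy as np

def random_flip(image, box, probability=0.5):
    """image: HxWx3 array, box: (xmin, ymin, xmax, ymax) in pixels."""
    if np.random.rand() < probability:
        width = image.shape[1]
        xmin, ymin, xmax, ymax = box
        image = image[:, ::-1]                          # flip left-right
        box = (width - xmax, ymin, width - xmin, ymax)  # mirror the x coordinates
    return image, box
```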
 
To speed up training and inference:

- increase `BATCH_SIZE`
- use fewer layers and a smaller `IMAGE_SIZE` and `ALPHA`

To decide how many pretrained layers to freeze (see the sketch after this list):

- If the new dataset is small and similar to ImageNet, freeze all layers.
- If the new dataset is small and not similar to ImageNet, freeze some layers.
- If the new dataset is large, freeze no layers.
- read http://cs231n.github.io/transfer-learning/
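
A minimal Keras sketch of the freezing described above (the backbone, input size and the cut-off of 100 layers are arbitrary assumptions):

```python
# Sketch of partially freezing a pretrained backbone before fine-tuning;
# the number of frozen layers is an arbitrary example.
from tensorflow.keras.applications import MobileNetV2

backbone = MobileNetV2(input_shape=(96, 96, 3), include_top=False,
                       alpha=0.35, weights="imagenet")

# Small dataset, similar to ImageNet: freeze everything.
# Small dataset, different from ImageNet: freeze only the early layers.
# Large dataset: freeze nothing.
for layer in backbone.layers[:100]:   # e.g. freeze roughly the first 100 layers
    layer.trainable = False
```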
 




