

  • Github repo:

  • Live version: where MODEL_NAME is yolo or mobilenet and URL is a direct link to a jpg or png image.

    #Example1 of a detection request - MobileNet
    #Example2 of a detection request - YOLO

The goal of object detection is to find objects of interest in an image or a video. This is more complex than image classification.

Object detection models return a bounding box for each object of interest in an image, together with a confidence score that the object belongs to a given category.

Example of object detection.

Object detection

In recent years there has been a lot of development in deep learning models for object detection (Faster R-CNN, SSD, Inception…). In this post, we briefly describe three models that achieve acceptable performance on the MS COCO dataset while requiring only a single forward pass to predict all bounding boxes (one-stage detectors). These models are fast and suitable for real-time object detection.


YOLOv3 [1] is one of the best-known object detection models; the name stands for “You Only Look Once”. It is a fully convolutional network (FCN). YOLOv3 uses a 53-layer feature extractor called Darknet-53, trained on ImageNet. 53 more layers are stacked on top of the feature extractor, giving a 106-layer FCN.


YOLO makes detections at three different scales (three different sizes at three different places in the network) as shown in the architecture. For an input image of size 416x416, the first detection is done on a downsampled feature map with a stride of 32, which results in a 13x13 feature map (the stride is defined as the ratio between the input size and the size of the extracted feature map).
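As a quick check of this arithmetic, here is a minimal sketch that derives the grid size of each detection scale from its stride. The strides 16 and 8 are an assumption inferred from the 26x26 and 52x52 grids mentioned later in the post; only the stride of 32 is stated explicitly.

```python
# Grid sizes for the three YOLOv3 detection scales, derived from the stride.
INPUT_SIZE = 416
STRIDES = (32, 16, 8)  # 16 and 8 are assumed from the 26x26 and 52x52 grids


def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map: input side divided by the stride."""
    if input_size % stride != 0:
        raise ValueError("input size should be a multiple of the stride")
    return input_size // stride


grid_sizes = [grid_size(INPUT_SIZE, s) for s in STRIDES]
print(grid_sizes)  # [13, 26, 52]
```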

The object detection output is obtained by applying a detection kernel (convolution) of shape 1x1x(B x (4 + 1 + C)), where B is the number of bounding boxes a cell of the feature map can predict, C is the total number of classes, 4 is for the bounding box coordinates, and 1 is for the objectness score.
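To make the kernel depth concrete, the short sketch below evaluates B x (4 + 1 + C) for B = 3 boxes per cell and the C = 80 classes of MS COCO (both values appear elsewhere in this post). The function name is only illustrative.

```python
def detection_depth(num_boxes: int, num_classes: int) -> int:
    """Number of channels of the 1x1 detection kernel: B * (4 + 1 + C)."""
    return num_boxes * (4 + 1 + num_classes)


depth = detection_depth(num_boxes=3, num_classes=80)
print(depth)  # 255
```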


The cell of the input image that contains the center of a ground truth bounding box is responsible for detecting the associated object. The red cell predicts B bounding boxes, which is useful when two objects overlap and the red cell should be able to detect both of them.
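A minimal sketch of how the responsible cell could be located, assuming the box center is given in pixels of the input image and the grid is defined by a stride. The function name and the sample coordinates are hypothetical.

```python
def responsible_cell(center_x: float, center_y: float, stride: int) -> tuple:
    """(column, row) of the grid cell that contains the box center (in pixels)."""
    return int(center_x // stride), int(center_y // stride)


# A box centered at (200, 120) on the 13x13 grid (stride 32)
print(responsible_cell(200.0, 120.0, stride=32))  # (6, 3)
```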

YOLOv3 constructs its predictions by using dimension clusters as anchor boxes. The network does not predict the final bounding box of the object directly; it predicts offsets from a predetermined set of boxes with particular height-width ratios. These predetermined boxes, called anchor boxes, are obtained by running K-means clustering on the ground truth boxes (9 anchors in this case, 3 for each detection scale, in descending order). The coordinates of a box (x, y, w, h) are constructed from the predictions as follows:

$$ b_x = \sigma(t_x) + c_x ; b_y = \sigma(t_y) + c_y $$ $$ b_w = p_w e^{t_w} ; b_h = p_h e^{t_h} $$

where $b_x$, $b_y$, $b_w$, $b_h$ are the x, y center coordinates, width, and height of our prediction, $t_x$, $t_y$, $t_w$, $t_h$ are the network outputs, $c_x$ and $c_y$ are the top-left offset coordinates of the red cell, $p_w$ and $p_h$ are the dimensions of the anchor associated with the box, and $\sigma$ is the sigmoid function.
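The decoding step can be sketched in a few lines of Python. This follows the definitions above, with the cell offset added to the sigmoid output as in the YOLOv3 paper. It assumes that the cell offsets are expressed in grid units (so the center is multiplied by the stride to get pixels) and that the anchor dimensions are in pixels; the function name and the sample anchor are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, stride):
    """Turn raw outputs (t_x, t_y, t_w, t_h) into a box (b_x, b_y, b_w, b_h) in pixels.

    cell   -- (c_x, c_y), top-left offset of the cell in grid units (assumption)
    anchor -- (p_w, p_h), anchor dimensions in pixels (assumption)
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    b_x = (sigmoid(t_x) + c_x) * stride  # center stays inside the cell
    b_y = (sigmoid(t_y) + c_y) * stride
    b_w = p_w * math.exp(t_w)            # anchor rescaled by the network output
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero outputs give a box centered in the cell with the anchor's dimensions.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 3), anchor=(116, 90), stride=32)
print(box)  # (208.0, 112.0, 116.0, 90.0)
```

Because the sigmoid is bounded between 0 and 1, the predicted center cannot leave the cell that is responsible for the object.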

For a single image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes. Postprocessing is necessary to remove duplicate predictions. YOLO combines object confidence thresholding with Non-Maximum Suppression (NMS) based on an Intersection Over Union (IOU) threshold. For more details about YOLO, we refer the reader to [2].
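The postprocessing described above can be sketched as follows. This is a simplified, single-class version of confidence thresholding followed by greedy NMS, not the exact code used in the notebooks; boxes are assumed to be in (x1, y1, x2, y2) corner format, and the two threshold values are hypothetical defaults.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.5, iou_threshold=0.45):
    """Greedy NMS over a list of (box, score); returns kept detections, best first."""
    # 1. Drop low-confidence predictions, then sort by descending score.
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    # 2. Keep a box only if it does not overlap too much with a better one.
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


total_boxes = (52 * 52 + 26 * 26 + 13 * 13) * 3
print(total_boxes)  # 10647

detections = [
    ((10, 10, 110, 110), 0.9),
    ((12, 12, 112, 112), 0.8),    # near-duplicate of the first box
    ((200, 200, 300, 300), 0.7),
    ((0, 0, 50, 50), 0.3),        # below the confidence threshold
]
kept = nms(detections)
print(len(kept))  # 2
```

In a multi-class setting, the same procedure would typically be applied per class.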

MobileNetv3 - SSDLite

MobileNets are another family of models, designed for inference on mobile devices. They provide acceptable accuracy and low-latency predictions.

MobileNetv3 [3][4] uses AutoML to find the best architecture in a search space of efficient mobile building blocks. The new building block extends the previous one (from MobileNetv2) with a new non-linear activation function (h-swish) and a squeeze-and-excitation module.
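For reference, h-swish is defined in the MobileNetV3 paper as x · ReLU6(x + 3) / 6, a piecewise-linear approximation of the swish activation. The scalar sketch below is only meant to illustrate the formula; a real model would use the vectorized version from its deep learning framework.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


# Zero for x <= -3, identity for x >= 3, smooth-looking transition in between.
for x in (-4.0, 0.0, 1.0, 3.0):
    print(x, h_swish(x))
```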

MobileNetV3 extends the MobileNetV2 inverted bottleneck structure by adding h-swish and mobile-friendly squeeze-and-excitation as searchable options [4]

After defining this new building block, a neural architecture search (NAS) process is applied in two stages: platform-aware NAS (to find the global network structure), followed by another NAS to search for the optimal per-layer settings.

The resulting architecture is then trained for image classification on ImageNet and later used as a feature extractor in SSDLite.


SSDLite [5] is a variant of the SSD [6] (single-shot detection) models for one-stage detection. It uses cells for object detection as in YOLO, with some differences (different detection grids, different anchor construction). More details about SSD models are available in [6].


Better performance for computer vision models has come at the expense of increased model complexity (number of model parameters). How can we design smaller, more efficient networks while keeping an acceptable level of accuracy?

Different methods are available to tackle this question: model compression (pruning or quantization), designing new efficient blocks (MobileNet), network search, or model scaling.

EfficientDet [7] employs EfficientNet [8] as the feature extractor backbone; the latter achieves strong performance by using model scaling.

  • EfficientNet model scaling

Different approaches could be used for model scaling: change the depth of the network (number of layers), change the width of the network (channels per layer) or change the input size (resolution). For all these approaches, the accuracy saturates after a certain level. To achieve better accuracy and efficiency, one should balance between width, depth, and resolution.

The authors of EfficientNet propose a compound scaling method that uses a single coefficient $\phi$ to scale width, depth, and resolution jointly.
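A rough sketch of compound scaling, under the formulation of the EfficientNet paper: depth, width, and resolution are multiplied by $\alpha^\phi$, $\beta^\phi$, and $\gamma^\phi$ respectively, with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ so that FLOPS grow by roughly $2^\phi$. The constants below are the ones reported in the paper for the baseline network; they are not given in this post and should be read as an assumption of this example.

```python
# Base coefficients as reported in the EfficientNet paper (assumption here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi: float) -> dict:
    """Multipliers applied to the baseline network for a given phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi: float) -> float:
    """Approximate FLOPS growth: (alpha * beta^2 * gamma^2) ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

In practice the scaled depth and width are rounded to valid layer and channel counts, which this sketch leaves out.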

Figure2 in [8]

The baseline architecture is found by using a multi-objective neural architecture search that optimizes both accuracy and FLOPS, and then this baseline is improved by using compound scaling.

  • EfficientDet architecture

EfficientDet uses an EfficientNet backbone trained on ImageNet and adds a bi-directional feature pyramid network (BiFPN) and a network for box and class prediction. The final architecture is given as follows:

Figure3 in [7]

The base architecture EfficientDet-D0 is then used for compound scaling to obtain better performing models (with a higher number of parameters).

Experiments and examples

  • Inference

In this section, we will show how to use the previous models to perform object detection by using weights pre-trained on the MS COCO dataset [9].

The MS COCO dataset is a large-scale object detection dataset with 80 defined classes. YOLOv3, MobileNetv3-SSDLite, and EfficientDet(D0) were trained on this dataset, and the weights are available for inference on any image in the following notebooks.

To run these notebooks locally, clone the repo and run notebooks using the following docker container (contains all requirements).

git clone
cd Object-Detection_MobileNetv3-EfficientDet-YOLO

docker run --rm -it -p 8888:8888 -v $(pwd):/app  imadelh/opencv_tf:base jupyter lab --ip --no-browser --allow-root

Examples of inference using EfficientDet

The performance of these models is measured by mean average precision (mAP). The mAP is the mean of AP over all classes in the dataset.

To calculate AP for a specific class, the precision-recall curve $p(r)$ is computed from the model’s detections at different confidence levels (for a fixed IOU threshold). AP is the area under this curve.

$$ AP_{\text{@iou threshold}} = \int _{0} ^{1} p( r )dr $$

In practice, a discrete approximation is computed by interpolating at recall levels [0, 0.1, 0.2, …, 1.0]:

$$ AP_{\text{@iou threshold}} \approx \frac{1}{11} \sum_{r \in \{ 0, 0.1, \ldots, 1.0 \}} p(r) $$
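The 11-point approximation can be sketched as below. It assumes the usual Pascal VOC convention, where the precision at recall level r is interpolated as the maximum precision observed at any recall greater than or equal to r. The precision-recall points in the example are made up for illustration.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP from matched lists of (recall, precision) points."""
    total = 0.0
    for i in range(11):
        r = i / 10
        # Interpolated precision: best precision at any recall >= r (0 if none).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Made-up precision-recall points for a single class
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 1.0, 0.75, 0.6, 0.5]
ap = eleven_point_ap(recalls, precisions)
print(round(ap, 3))  # 0.791
```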

For the MS COCO dataset, the AP is averaged over 10 IOU thresholds from 0.5 to 0.95 in steps of 0.05 (AP@[.5:0.05:.95]) and over the 80 classes.
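As an illustration of this averaging, the sketch below takes a hypothetical table of per-class AP values (one per IOU threshold) and averages over thresholds and classes. The numbers are invented; real evaluations are usually run with the official COCO tooling.

```python
# The 10 IOU thresholds 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_table: dict) -> float:
    """Mean AP over IOU thresholds, then over classes.

    ap_table maps a class name to a list of AP values, one per threshold.
    """
    per_class = []
    for class_name, aps in ap_table.items():
        if len(aps) != len(IOU_THRESHOLDS):
            raise ValueError(f"expected {len(IOU_THRESHOLDS)} AP values for {class_name}")
        per_class.append(sum(aps) / len(aps))
    return sum(per_class) / len(per_class)


# Invented AP values for two classes
ap_table = {"cat": [0.8] * 10, "dog": [0.6] * 10}
coco_ap = coco_style_ap(ap_table)
print(round(coco_ap, 2))  # 0.7
```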

| Model | AP | Number of parameters |
|---|---|---|
| EfficientDet(D0) | 33.8 | 3.9M |
| YOLOv3 | 33.0 | 65M |
| MobileNetv3-SSDLite | 22.0 | 3.22M |

  • Deployment: serverless container

To bring the trained models to the user, we use Flask and Gunicorn to build a simple API that takes an image URL and a model name and returns detected objects.

For each model, we create a model package in /ml_models_api/* that loads the pretrained weights, applies the necessary preprocessing to the raw image, and returns a list of detected objects with their associated bounding boxes.
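The request handling can be sketched independently of the web framework. The example below is not the code from the repository: the function names, the response layout, and the `detectors` registry are hypothetical, and the detector is a stub. In the actual API this logic would sit behind a Flask route served by Gunicorn.

```python
from urllib.parse import urlparse

# Accepted image types; the post mentions jpg and png (.jpeg added as an assumption).
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png")


def validate_request(model_name, image_url, detectors):
    """Return (ok, message) for a detection request."""
    if model_name not in detectors:
        return False, f"unknown model: {model_name}"
    parsed = urlparse(image_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False, "image URL must be an absolute http(s) link"
    if not parsed.path.lower().endswith(ALLOWED_EXTENSIONS):
        return False, "image URL must point to a jpg or png file"
    return True, "ok"


def handle_detection_request(model_name, image_url, detectors):
    """Dispatch to the requested model; returns (response body, HTTP status)."""
    ok, message = validate_request(model_name, image_url, detectors)
    if not ok:
        return {"error": message}, 400
    detections = detectors[model_name](image_url)
    return {"model": model_name, "detections": detections}, 200


def fake_detector(image_url):
    """Stub standing in for a real model package (hypothetical output format)."""
    return [{"label": "cat", "score": 0.9, "box": [10, 10, 110, 110]}]


detectors = {"yolo": fake_detector, "mobilenet": fake_detector}
body, status = handle_detection_request("yolo", "https://example.com/cat.jpg", detectors)
print(status, body)
```

Keeping validation separate from the route makes it easy to test without starting a server.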

# example of a detection request

# returns


A Docker container is created for the application and then deployed on a cloud instance or a serverless container service such as Google Cloud Run.

Details of deployment are given in the Github repository


This post was a brief introduction to different computer vision models used for real-time object detection. These models can be used for transfer learning or fine-tuning for object detection on a new dataset with different classes. At the time of writing, a new model called YOLOv4 has achieved a new state of the art on the COCO dataset at real-time speed (65 FPS).