Question

What is the general architecture of the Single-Shot Detector (SSD)?

Answer 1

A Single-Shot Detector uses a backbone network, to which it adds additional convolutional feature layers.
These layers decrease in size progressively and allow predictions at multiple scales.
Attached to each feature layer (or optionally an existing feature layer from the base network) is a convolutional detection layer that produces a fixed set of predictions.
These predicted bounding boxes and scores are then processed by a non-maximum suppression step to produce the final detections.
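
The non-maximum suppression step can be illustrated with a minimal greedy sketch. It assumes boxes are given as `(xmin, ymin, xmax, ymax)` tuples; the function names and the `0.45` overlap threshold are illustrative choices, not something stated in the answer above.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is an assumed, illustrative value.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In practice this would typically be run separately per class, so that overlapping detections of different categories do not suppress each other.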

Answer 2

In a convolutional fashion, SSD evaluates a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. \(8 \times 8\) and \(4 \times 4\) in (b) and (c)).
For each default box, we predict both the shape offsets and the confidences for all object categories (\((c_1, c_2, \ldots, c_p)\)).
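
To make the size of the prediction concrete, here is a small sketch that counts the outputs of one detection layer: each default box gets 4 offsets plus one confidence per category. The function name and the choice of 21 categories are hypothetical, picked only for illustration.

```python
def num_outputs(height, width, boxes_per_location, num_classes):
    """Number of values predicted for one feature map.

    Each default box contributes 4 shape offsets plus num_classes confidences.
    """
    per_box = num_classes + 4
    return height * width * boxes_per_location * per_box


# Illustrative numbers: 4 default boxes per location, 21 categories (assumed).
print(num_outputs(8, 8, 4, 21))  # 6400
print(num_outputs(4, 4, 4, 21))  # 1600
```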

Answer 3

Anchor boxes are defined by their scale \(s\) and aspect ratio \(a\).
The width and the height of each default box are then computed as: \(w = s\sqrt{a}\), \(h = \frac{s}{\sqrt{a}}\).

Answer 4

By multiplying the scale by the square root of the aspect ratio for one side and dividing by it for the other, you still get the desired aspect ratio (\(w / h = a\)), while the area stays fixed at the square of the scale (\(w \cdot h = s^2\))\(^{[1]}\).
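
The width and height formula \(w = s\sqrt{a}\), \(h = s/\sqrt{a}\) can be checked numerically with a short sketch. The function name and the sample values of \(s\) and \(a\) are assumptions made for illustration.

```python
import math


def default_box_size(scale, aspect_ratio):
    """Return (width, height) of a default box: w = s*sqrt(a), h = s/sqrt(a)."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


# Illustrative values: scale 0.2 (relative to image size), aspect ratio 2.
w, h = default_box_size(0.2, 2.0)
print(math.isclose(w / h, 2.0))       # True: the aspect ratio is preserved
print(math.isclose(w * h, 0.2 ** 2))  # True: the area is the scale squared
```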

Answer 5

A default bounding box matches any ground truth box with which its Jaccard overlap is higher than a threshold (0.5).

Answer 6

SSD uses a technique called hard negative mining:
Instead of using all the negative examples, SSD training sorts them by the confidence loss of each default box, highest first, and picks the top ones so that the ratio between the negatives and positives is at most \(3:1\).
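
The selection step can be sketched as follows, assuming the per-box confidence losses have already been computed. The function name and its arguments are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for training.

    conf_losses: confidence loss per default box.
    is_positive: True for default boxes matched to a ground truth box.
    """
    num_pos = sum(is_positive)
    max_neg = neg_pos_ratio * num_pos
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones with the highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[:max_neg]


losses = [0.1, 2.0, 0.5, 0.9, 0.05, 1.5, 0.3]
positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(losses, positive))  # [1, 5, 3]
```

With one positive box, at most three negatives are kept, and they are the three the classifier currently gets most wrong.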

Answer 7

The loss function is the sum of a localization loss and a classification loss.
\[\mathcal{L} = \frac{1}{N}(\mathcal{L}_\text{cls} + \alpha \mathcal{L}_\text{loc})\]
where \(N\) is the number of matched default boxes and \(\alpha\) is a weight that balances the two losses, picked by cross-validation.

The localization loss is a smooth L1 loss between the predicted bounding box correction and the true values.
The classification loss is a softmax loss over multiple classes.
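
A minimal, unvectorized sketch of this loss is given below. It assumes that class index 0 denotes background, that `selected` holds the positives plus whichever negatives were kept for training, and that the loss is defined as 0 when no box is matched; all names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def ssd_loss(pred_offsets, true_offsets, logits, labels, selected, alpha=1.0):
    """(L_cls + alpha * L_loc) / N over the selected default boxes.

    labels[i] == 0 is assumed to mean background (an unmatched box).
    """
    positives = [i for i in selected if labels[i] != 0]
    n = len(positives)
    if n == 0:
        return 0.0  # assumed convention when nothing is matched
    # Localization loss only applies to matched (positive) boxes.
    loc = sum(
        smooth_l1(p - t)
        for i in positives
        for p, t in zip(pred_offsets[i], true_offsets[i])
    )
    # Classification loss applies to positives and kept negatives.
    cls = sum(softmax_cross_entropy(logits[i], labels[i]) for i in selected)
    return (cls + alpha * loc) / n
```

For a single matched box with perfect offsets and uniform two-class logits, the result reduces to \(\ln 2\), the cross-entropy of a 50/50 guess.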

Answer 8

SSD first matches each ground truth box to the default box with the best Jaccard overlap, and then matches default boxes to any ground truth box with which their Jaccard overlap is higher than a threshold (0.5).
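
The two matching rules can be sketched as below, assuming the pairwise Jaccard overlaps are already available as a nested list. The function name is hypothetical, and assigning a default box to the ground truth with the highest overlap when several exceed the threshold is a simplification of this sketch.

```python
def match_default_boxes(overlaps, threshold=0.5):
    """Return, per default box, the matched ground truth index or None.

    overlaps[g][d] is the Jaccard overlap between ground truth g
    and default box d.
    """
    if not overlaps:
        return []
    num_gt = len(overlaps)
    num_default = len(overlaps[0])
    matches = [None] * num_default

    # Threshold rule: a default box matches its best ground truth if the
    # overlap exceeds the threshold.
    for d in range(num_default):
        best_g = max(range(num_gt), key=lambda g: overlaps[g][d])
        if overlaps[best_g][d] > threshold:
            matches[d] = best_g

    # Best-overlap rule, applied last so it takes priority: every ground
    # truth gets its best default box, even below the threshold.
    for g in range(num_gt):
        best_d = max(range(num_default), key=lambda d: overlaps[g][d])
        matches[best_d] = g

    return matches


overlaps = [
    [0.6, 0.4, 0.1, 0.55],  # ground truth 0
    [0.2, 0.3, 0.45, 0.1],  # ground truth 1
]
print(match_default_boxes(overlaps))  # [0, None, 1, 0]
```

Ground truth 1 never exceeds the threshold, yet it still receives default box 2 through the best-overlap rule, so every ground truth box has at least one positive.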

SSD: Single Shot MultiBox Detector