Paper Reviews on Object Detection Methods and Progression

  • RCNN

    • Two-stage model

    • Uses selective search to generate high-quality region proposals on the input image

    • Each proposal is cropped from the input image and warped to a fixed size, yielding one candidate object image per proposal

    • Features are extracted from each warped object image using a pre-trained CNN

    • These features are used for both bounding box regression and object classification

    • More of a brute-force approach: the CNN runs separately on every proposal, so much of the computation is repeated unnecessarily

    • Overall procedure: proposals are obtained by selective search on the image -> each proposal is warped to a fixed CNN input size -> features are extracted with a pre-trained CNN for every warped proposal -> these features are used for classification and bounding box regression

    • Results: 66% mAP at 0.02 fps
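The crop-and-warp step can be sketched in a few lines. This is a minimal illustration only, assuming a 2D image stored as a list of rows, a box given as (x1, y1, x2, y2) with exclusive right and bottom edges, and nearest-neighbour resizing; the helper name `warp_proposal` is hypothetical and not taken from the paper.

```python
def warp_proposal(image, box, out_h, out_w):
    """Crop `box` from `image` and resize it to (out_h, out_w).

    Assumptions for this sketch: `image` is a list of rows, `box` is
    (x1, y1, x2, y2) with x2 and y2 exclusive, and resizing uses
    nearest-neighbour sampling.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for r in range(out_h):
        src_r = y1 + (r * crop_h) // out_h  # source row for output row r
        row = []
        for c in range(out_w):
            src_c = x1 + (c * crop_w) // out_w  # source column for output column c
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped


# Toy 4x6 "image" whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
print(warp_proposal(image, (1, 0, 5, 2), 2, 2))  # [[1, 3], [11, 13]]
```

Every proposal produces its own warped image, and each one needs a separate CNN forward pass, which is where the repeated computation comes from.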


  • Fast RCNN

    • Two-stage model

    • Uses selective search to generate region proposals on the input image, which are then mapped onto the extracted CNN features

    • Removes the repeated feature extraction step of RCNN by extracting features only once per image

    • Proposals are mapped onto the shared CNN feature map rather than cropped from the input image as in RCNN

    • The features of each mapped proposal are used for both bounding box regression and object classification

    • Still involves some repeated computation, since classification and regression run once per proposal

    • Overall procedure: image features are extracted once using a pre-trained CNN -> proposals obtained by selective search on the image are mapped onto these features -> the features of each proposal are used for classification and bounding box regression

    • Results: 70% mAP at 0.4 fps
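Mapping a proposal onto the shared feature map can be sketched as follows. The Fast R-CNN paper pools each mapped region to a fixed size (RoI pooling); the version below is a simplified single-channel illustration, and the helper names, the integer `stride`, and the bin-splitting rule are assumptions made for this sketch.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def project_box(box, stride):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.

    `stride` is the assumed total downsampling factor of the CNN.
    """
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, ceil_div(x2, stride), ceil_div(y2, stride))


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the feature cells inside `roi` into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for r in range(out_h):
        r0 = y1 + (r * h) // out_h
        r1 = y1 + ceil_div((r + 1) * h, out_h)
        row = []
        for c in range(out_w):
            c0 = x1 + (c * w) // out_w
            c1 = x1 + ceil_div((c + 1) * w, out_w)
            row.append(max(feature[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled


# One 4x4 feature channel computed once for the whole image.
feature = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
roi = project_box((16, 0, 64, 64), stride=16)  # (1, 0, 4, 4)
print(roi_max_pool(feature, roi, 2, 2))        # [[7, 8], [15, 16]]
```

The feature map is computed a single time; only the cheap pooling step is repeated per proposal.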


  • Faster RCNN

    • Two-stage model

    • Uses a Region Proposal Network (RPN) for proposal extraction instead of selective search

    • The RPN is a CNN that produces a reduced number of proposals while still keeping their quality high

    • The RPN estimates proposals along with an objectness score, which indicates how likely a proposal is to contain an object

    • The features of proposals with higher objectness scores are used for bounding box regression and classification

    • Overall procedure: image features are extracted using a pre-trained CNN -> the features are used to estimate proposals and objectness scores -> the features of proposals selected by objectness score are used for classification and bounding box regression

    • Results: 73% mAP at 7 fps
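The selection by objectness score can be sketched as below. This is a simplified illustration: the function name `select_proposals`, the `top_k` cut-off, and the `min_score` threshold are assumptions for this sketch, and the overlap-based filtering that a full pipeline would typically apply is left out.

```python
def select_proposals(proposals, objectness, top_k, min_score=0.0):
    """Keep the `top_k` proposals with the highest objectness score.

    `proposals` is a list of boxes and `objectness` the matching list of
    scores. Proposals scoring below `min_score` are dropped first.
    """
    ranked = sorted(zip(objectness, proposals), key=lambda p: p[0], reverse=True)
    kept = [box for score, box in ranked if score >= min_score]
    return kept[:top_k]


proposals = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
objectness = [0.2, 0.9, 0.6]
print(select_proposals(proposals, objectness, top_k=2))
# [(5, 5, 20, 20), (30, 30, 40, 40)]
```

Only the surviving proposals are passed to the classification and regression stage, which keeps the second stage cheap.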


  • YOLO

    • Single-stage model

    • Removes the two separate steps of objectness scoring and object classification by merging them into a single prediction

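    • Directly estimates class scores instead of relying on a separate objectness stage. In the original YOLO paper, each predicted box carries a confidence score that flags boxes containing no object, rather than an extra no-object class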

    • Uses a single pass to estimate both classification scores and bounding boxes, hence a single-stage model

    • Overall procedure: image features are extracted using a pre-trained CNN -> these features are used directly for classification and bounding box estimation

    • Results: 66% mAP at 21 fps
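Single-pass decoding can be sketched as follows. The sketch assumes the grid formulation of the original YOLO paper, simplified to one box per cell: each cell outputs a box centre offset within the cell, a width and height as fractions of the image, a confidence value, and class probabilities. The function name `decode_grid`, the tuple layout, and the threshold are illustrative assumptions.

```python
def decode_grid(cells, grid_size, image_size, threshold=0.5):
    """Turn per-cell predictions into scored boxes in one sweep.

    cells[row][col] = (cx, cy, w, h, confidence, class_probs), where
    cx, cy are offsets inside the cell in [0, 1] and w, h are fractions
    of the (square) image size. This layout is an assumption of the sketch.
    """
    cell_size = image_size / grid_size
    detections = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx, cy, w, h, confidence, class_probs = cells[row][col]
            best = max(range(len(class_probs)), key=lambda k: class_probs[k])
            score = confidence * class_probs[best]  # class-specific score
            if score < threshold:
                continue  # treated as "no object" in this cell
            center_x = (col + cx) * cell_size
            center_y = (row + cy) * cell_size
            box_w, box_h = w * image_size, h * image_size
            box = (center_x - box_w / 2, center_y - box_h / 2,
                   center_x + box_w / 2, center_y + box_h / 2)
            detections.append((best, score, box))
    return detections
```

There is no proposal stage here: classification and localization both come out of the same output tensor, and a simple score threshold discards the empty cells.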


  • SSD

    • Single-stage model

    • The main drawback of YOLO was its poor detection of small objects

    • SSD uses feature maps from multiple layers, which combine fine-grained detail from earlier layers with more aggregated features from deeper layers

    • Including this multi-scale feature information boosts the performance of SSD compared to YOLO

    • The method uses multiple classifier and detector blocks: each block estimates N objects (N is a hyperparameter) in an image along with their bounding boxes

    • This method also uses an additional class for no object

    • Overall procedure: multi-scale features are extracted using a pre-trained CNN -> the features at each scale are fed to a classifier and detector block -> detection and classification are performed at every feature scale -> non-maximum suppression merges the bounding boxes from all scales into the final detections and classifications

    • Results: 74% mAP at 46 fps
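The final non-maximum suppression step can be sketched as a greedy loop over boxes pooled from all scales. This is a minimal single-class version; the function names and the default IoU threshold of 0.5 are assumptions of this sketch, not values taken from the slides.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Boxes gathered from every feature scale; the first two cover the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In a multi-class detector this loop would typically run once per class, so that overlapping boxes of different classes do not suppress each other.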


  • CornerNet

    • Single-stage model
    • Very interesting idea: eliminates the use of multiple anchor boxes
    • Proposes a new layer, corner pooling, to help estimate whether pixel (i, j) is the top-left or bottom-right corner of an object
    • Reduces the number of candidates from O(w^2h^2) (possible anchor boxes) to O(wh) (possible corners)
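Corner pooling can be sketched as below for the top-left case. This is a minimal single-channel illustration based on my reading of the CornerNet paper: for each pixel, take the maximum of one feature map looking downward and the maximum of a second feature map looking rightward, then add the two. The function name and the plain list-of-rows inputs `f_t` and `f_l` are assumptions of this sketch.

```python
def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling on two equally sized 2D feature maps.

    f_t is pooled vertically (max over the pixel and everything below it),
    f_l is pooled horizontally (max over the pixel and everything to its
    right); the two pooled maps are summed.
    """
    h, w = len(f_t), len(f_t[0])
    t = [row[:] for row in f_t]
    l = [row[:] for row in f_l]
    # Running maximum from the bottom row upward.
    for r in range(h - 2, -1, -1):
        for c in range(w):
            t[r][c] = max(t[r][c], t[r + 1][c])
    # Running maximum from the rightmost column leftward.
    for r in range(h):
        for c in range(w - 2, -1, -1):
            l[r][c] = max(l[r][c], l[r][c + 1])
    return [[t[r][c] + l[r][c] for c in range(w)] for r in range(h)]


f_t = [[0, 1], [2, 0]]
f_l = [[0, 3], [1, 0]]
print(top_left_corner_pool(f_t, f_l))  # [[5, 4], [3, 0]]
```

The intuition is that a top-left corner usually has no object evidence at its own location, so the layer gathers evidence from the object's edges below and to the right of the pixel. The running-maximum scans touch each pixel a constant number of times, consistent with the O(wh) figure above.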

References
  • Selective Search: https://link.springer.com/article/10.1007/s11263-013-0620-5
  • SSD slides: http://faculty.iitmandi.ac.in/~aditya/cs671/index.html
