After the emergence of AlexNet in 2012, convolutional neural networks (CNNs) became the most effective and widely used method for image recognition tasks, proving far superior to traditional image processing techniques.
ConvNets have shown remarkable performance on image classification tasks, in which, given an input image and a set of categories, the network decides the strongest category present in the image. Convolutional neural network architectures can easily be trained to classify images. However, classifying images is not enough for the task of object detection.
For object detection purposes, each object in the image has to be classified and localized, which requires an algorithm on top of the ConvNet. This section serves as an introduction to the algorithms frequently used on top of ConvNets to detect, localize and classify objects in images, along with a detailed discussion of the algorithm we selected for our task, the Single Shot Multibox Detector (SSD).

1.1 R-CNN

One of the first techniques developed by researchers to handle the tasks of object detection, localization and classification was the R-CNN. An R-CNN [33] is a special type of CNN that has the ability to locate and detect objects in images.
The goal of an R-CNN is to take an image as input and correctly identify, via bounding boxes, where the main objects in the image are.

The image below shows the output of a typical R-CNN:

Figure 2.2-5: An example output of R-CNN. Image Source: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab

How does the R-CNN find out where to place the bounding boxes? It proposes a set of candidate boxes in the image and checks to see whether any of them fit correctly. Once the region proposals for bounding boxes have been generated, the image regions inside the boxes are passed through a pre-trained AlexNet model and then through a Support Vector Machine (SVM), which classifies the image in each box into one of the given classes.
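As a rough illustration, the stages above (region proposals, CNN features, SVM classification, followed by the bounding-box regression described next) might be sketched as below. All helper names and the toy "features" are hypothetical stand-ins: a real implementation would use Selective Search for proposals, a pretrained AlexNet for features, per-class SVMs, and a learned regressor.

```python
# Hypothetical sketch of the R-CNN pipeline; all helpers are toy stand-ins.

def propose_regions(image):
    # Stand-in for Selective Search (~2000 candidate boxes per image).
    h, w = image["height"], image["width"]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w, h)]

def cnn_features(image, box):
    # Stand-in for a forward pass of a pretrained AlexNet on the cropped box.
    x1, y1, x2, y2 = box
    return [(x2 - x1) * (y2 - y1)]  # toy 1-D "feature": box area

def svm_classify(features):
    # Stand-in for the per-class SVMs: label larger regions as "object".
    return "object" if features[0] > 5000 else "background"

def regress_box(features, box):
    # Stand-in for the linear regressor that tightens each box slightly.
    x1, y1, x2, y2 = box
    return (x1 + 1, y1 + 1, x2 - 1, y2 - 1)

def rcnn_detect(image):
    detections = []
    for box in propose_regions(image):     # one CNN forward pass PER proposal:
        feats = cnn_features(image, box)   # this is why R-CNN is slow
        label = svm_classify(feats)
        if label != "background":
            detections.append((label, regress_box(feats, box)))
    return detections

print(rcnn_detect({"height": 200, "width": 300}))  # two toy detections
```

Note how the per-proposal forward pass in `rcnn_detect` makes the cost scale with the number of proposals, which is the bottleneck the later variants attack.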
Once the object has been classified, the bounding box is run through a linear regression model that improves the bounding boxes by making them tighter. R-CNN works reasonably well as far as the accuracy of the bounding boxes is concerned, but it is quite slow, as it requires a forward pass for every single region proposal in each image (~2000 region proposals per image). It is also very hard to train, as it requires three different models to be trained separately: the CNN which generates the features for every image, the classifier which predicts the class, and the linear regression model which tightens the bounding boxes.

Figure 2.2-5: R-CNN Workflow. Image Source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

1.2 Fast R-CNN

The problems stated above were solved by the introduction of Fast R-CNN [34].
Fast R-CNN built on the previous work to classify object proposal regions much more efficiently. The key idea that makes Fast R-CNN faster is a technique known as Region of Interest (RoI) Pooling. It works by swapping the order of generating region proposals and running the CNN: the image is first passed through the CNN, and the features for each region proposal are obtained from the last feature map of the CNN. Also, in Fast R-CNN the CNN, the classifier and the bounding box regressor are trained jointly, where previously there were three different models to extract image features, classify them and further tighten the bounding boxes. All three are computed in a single network in Fast R-CNN.
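The essential point of RoI pooling is that a region of the shared feature map, whatever its size, is divided into a fixed output grid and max-pooled per bin, so every proposal yields a fixed-size feature without a separate CNN pass. A minimal pure-Python sketch (toy sizes, single channel, even bin split; a real implementation handles fractional bins and many channels):

```python
# Minimal sketch of RoI pooling on a 2-D single-channel feature map.

def roi_pool(feature_map, roi, out_size=2):
    """feature_map: 2-D list; roi: (r1, c1, r2, c2), end-exclusive rows/cols."""
    r1, c1, r2, c2 = roi
    h, w = r2 - r1, c2 - c1
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Bin boundaries inside the RoI (simple even split).
            rs, re = r1 + i * h // out_size, r1 + (i + 1) * h // out_size
            cs, ce = c1 + j * w // out_size, c1 + (j + 1) * w // out_size
            row.append(max(feature_map[r][c]
                           for r in range(rs, max(re, rs + 1))
                           for c in range(cs, max(ce, cs + 1))))
        pooled.append(row)
    return pooled

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_pool(fmap, (0, 0, 4, 4)))  # -> [[6, 8], [14, 16]]
```

Because the output is always `out_size x out_size`, proposals of different shapes can all feed the same downstream classifier and box regressor.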
Effectively, this was significantly faster than R-CNN.

1.3 Faster R-CNN

Faster R-CNN [25] further improved upon the speed of the previous techniques by addressing one of the remaining bottlenecks: the region proposer.
It speeds up the region proposal mechanism by inserting a region proposal network (RPN) after the last convolutional layer. Effectively, region proposals are produced by looking only at the last convolutional feature map. From there onwards, the same pipeline is used as in R-CNN.

Figure 2.2-5: Faster R-CNN Workflow.
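One way to see how proposals can come "from the feature map alone" is that each feature-map cell corresponds to a stride-spaced location in the image, and a small set of reference boxes (anchors) at several scales and aspect ratios is centred there for the RPN to score and refine. A hedged sketch of the anchor layout only (the sizes below are illustrative, not the paper's exact configuration):

```python
# Illustrative anchor generation over a conv feature map (toy settings).

def make_anchors(fmap_h, fmap_w, stride=16, scales=(64, 128), ratios=(1.0, 2.0)):
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            # Centre of this feature-map cell, mapped back to image pixels.
            cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    h, w = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

# A 3x3 feature map with 2 scales x 2 ratios -> 36 candidate boxes,
# produced without any external region proposer.
print(len(make_anchors(3, 3)))  # 36
```

The RPN then classifies each anchor as object/background and regresses offsets, which is what replaces the slow external proposal step.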
1.4 Single Shot Multibox Detector (SSD)

We now present the object detection and localization technique we used for our task of drowsiness detection. The technique is known as the Single Shot Multibox Detector (SSD) [35], and has been evaluated to offer much better performance and precision on object detection tasks. To begin our understanding of SSD, we start with an explanation of the name:

· Single Shot: object detection and localization are done in a single forward pass of the network.
· MultiBox: the name of the technique developed by the authors for the task of bounding box regression (i.e. making the bounding boxes tighter).
· Detector: the network is an object detector which also classifies the detected objects.

1.4.1 Architecture

Figure 2.2-5: SSD architecture. Image Source: https://arxiv.org/pdf/1512.02325.pdf

As shown in the figure above, the architecture of SSD builds on VGG-16, but does away with the fully connected layers.
VGG-16 is used as the base network because it has very strong performance on image classification tasks and is used very widely for transfer learning.

1.4.2 MultiBox

MultiBox is a bounding box regression technique developed by the authors of the paper. In MultiBox, the researchers use "priors": pre-computed, fixed-size bounding boxes that closely match the distribution of the original ground truth boxes. These priors are selected in such a way that their Intersection over Union (IoU) ratio with a ground truth box is greater than 0.5. MultiBox starts with the priors as predictions and attempts to regress closer to the ground truth bounding boxes.

The resulting architecture contains 11 priors per feature map cell (8×8, 6×6, 4×4, 3×3, 2×2) and only one on the 1×1 feature map, resulting in a total of 1,420 priors per image, thus enabling robust coverage of input images at multiple scales to detect objects of various sizes.

In the end, MultiBox only retains the top K predictions that have minimised both the location (LOC) and confidence (CONF) losses.
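Two parts of the description above can be made concrete in a few lines: the IoU criterion used to match priors against ground truth, and the prior count quoted for the listed feature maps (11 per cell on the 8×8, 6×6, 4×4, 3×3 and 2×2 maps, plus one on the 1×1 map):

```python
# IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prior is treated as a match when its IoU with a ground-truth box > 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 ~ 0.33 -> not a match

# Prior count across the feature maps listed above:
total = 11 * (8 * 8 + 6 * 6 + 4 * 4 + 3 * 3 + 2 * 2) + 1
print(total)  # 1420, matching the figure quoted above
```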