Tracking people by learning their appearance

In the early s, object detection was carried out using template-matching algorithms [9], in which a template of the specific object is slid over the input image to find the best possible match.
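The sliding-template idea can be sketched in a few lines of Python. This is a minimal illustration and not the method of [9]: the function name `match_template`, the sum-of-squared-differences score, and the toy image are all assumptions chosen for brevity.

```python
def match_template(image, template):
    """Slide `template` over `image` and return the best (row, col) and its score.

    Both arguments are 2-D lists of numbers. The score is the sum of squared
    differences (SSD), an assumed similarity measure; lower means a closer match.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best_score, best_pos = None, None
    for r in range(img_h - tpl_h + 1):
        for c in range(img_w - tpl_w + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score


# Toy 4 x 5 image containing an exact copy of the 2 x 2 template.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 8],
    [7, 9],
]
position, score = match_template(image, template)
```

Here `position` is the top-left corner of the window that matches the template most closely. The exhaustive scan over every window is what makes this approach costly on large images.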

Appearance-Based Human Tracking

In the late s, the focus shifted toward geometric appearance-based object detection [10, 11]. These methods concentrated on height, width, angles, and other geometric properties. In the s, the object detection paradigm moved to low-level features combined with statistical classifiers, such as the local binary pattern (LBP) [12], histogram of oriented gradients [13], scale-invariant feature transform [14], and covariance [15].
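Of the low-level features listed above, LBP is the simplest to illustrate. The sketch below computes the basic 3 x 3 LBP code of a pixel by thresholding its eight neighbours against the centre value. The neighbour ordering and the function names are assumptions made here, not details taken from [12].

```python
# Offsets of the 8 neighbours, clockwise from the top-left pixel.
# The starting point and direction are a convention assumed for this sketch.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the pixel at (row, col)."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (the texture descriptor)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
```

The histogram of codes, rather than the individual codes, is what would typically be passed to a statistical classifier.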

Feature extraction-based object detection and classification involves training a machine on extracted features. For many years, the computer vision field relied on traditional handcrafted features for object detection. However, since deep learning achieved remarkable performance in the image classification challenge [16], convolutional neural networks have been used for this purpose. Following the success of object classification in [16], researchers turned their attention toward object detection and classification.

Deep convolutional neural networks work exceptionally well for extracting local and global features in terms of edges, texture, and appearance. In recent years, the research community has moved toward region-based networks for object detection. This type of object detection is used in applications such as video description [17]. In region-based algorithms, convolutional features are extracted over proposed regions, and each region is then categorized into a specific class. With the attractive performance of AlexNet [16], Girshick et al.

They employed selective search to propose the regions where potential objects can be found [19]. They called their object detection network the region convolutional neural network (R-CNN).

The basic flow of R-CNN can be described as follows. First, regions are proposed for each object in the input image using selective search [19]. Second, the proposed regions are resized to a common size and classified into predefined classes based on the CNN features extracted from them. Although R-CNN was a major breakthrough in the field of object detection, it has some significant weaknesses. Training is quite slow because R-CNN has several separate stages to train.

Detection is slow because CNN features are extracted separately for every proposal in each test image. To overcome the feature extraction issue for each proposal, Kaiming He et al. The basic idea was that convolutional layers accept input of any size, whereas fully connected layers force the input to a fixed size to make matrix multiplication possible.

They used a spatial pyramid pooling (SPP) layer after the last convolutional layer to obtain fixed-size features to feed into the fully connected layers. SPPNet extracts convolutional features from the input image only once for proposals of different sizes.
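The way an SPP layer turns feature maps of any size into a fixed-length vector can be sketched as follows. This is a simplified single-channel illustration under assumed pyramid levels of 1 x 1 and 2 x 2; the function name `spp_max_pool` and the bin-boundary rule are assumptions, not the SPPNet implementation.

```python
def spp_max_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2-D feature map into a fixed-length list.

    For each pyramid level n, the map is divided into an n x n grid of bins
    and the maximum of each bin is kept. The output length is the sum of
    n * n over all levels, independent of the input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n                  # floor of the bin start
            r1 = ((i + 1) * h + n - 1) // n    # ceiling of the bin end
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                pooled.append(
                    max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled


small = [[1, 2, 3], [4, 5, 6]]                            # 2 x 3 map
large = [[r * 5 + c for c in range(5)] for r in range(4)]  # 4 x 5 map
small_vec = spp_max_pool(small)
large_vec = spp_max_pool(large)
```

Both `small_vec` and `large_vec` have five entries (1 + 4 bins), which is why the fully connected layers that follow can keep a fixed input size.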

This network speeds up testing, but it does not speed up the training of R-CNN. Furthermore, the weights of the convolutional layers before the SPP layer cannot be updated, which limits fine-tuning.

Fast R-CNN shares the convolution computation across the proposed regions. It adds a region of interest (ROI) pooling layer after the last convolutional layer to generate fixed-size features for individual proposals. The fixed-size features from the ROI-pooling layer are fed to a stack of fully connected layers that then splits into two branches: one for object classification and the other for bounding box regression.
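ROI pooling can be sketched as max pooling restricted to one proposal on a shared feature map. The version below is a single-channel simplification with an assumed 2 x 2 output; the name `roi_max_pool` and the `(top, left, bottom, right)` box convention are illustrative choices, not the Fast R-CNN code.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Pool one region of a shared feature map to an out_h x out_w grid.

    `roi` is (top, left, bottom, right) in feature-map coordinates, with
    bottom and right exclusive.
    """
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    output = []
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ((j + 1) * w + out_w - 1) // out_w
            row.append(
                max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
            )
        output.append(row)
    return output


# The feature map is computed once and shared by every proposal.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
small_roi = roi_max_pool(feature_map, (0, 0, 2, 2))  # 2 x 2 region
large_roi = roi_max_pool(feature_map, (1, 1, 6, 5))  # 5 x 4 region
```

Both proposals yield a 2 x 2 output even though their sizes differ, and neither requires the convolutional features to be recomputed.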

They claimed that, compared with R-CNN, training is three times faster and testing is ten times faster. Modern advancements in object localization using deep neural networks [22] motivated Ren et al. They proposed an efficient region proposal network (RPN) for generating object proposals. Faster R-CNN is a purely convolutional neural network without any handcrafted features; it employs a fully convolutional network (FCN) for region proposal.

Redmon et al. They completely dropped the region proposal step; YOLO splits the whole image into a grid and predicts detections on the basis of candidate regions. Each grid cell predicts C class probabilities, B bounding box locations, and a probability for each box. Removing the RPN step speeds up detection; YOLO can detect objects in real time at about 45 fps.
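The per-cell prediction layout can be made concrete with a small sketch. The grid size of 7, B = 2, and C = 20 used below are illustrative values assumed here, as are the function names and the toy cell; the text above does not specify them.

```python
def yolo_output_size(grid_size, boxes_per_cell, num_classes):
    """Number of values predicted for one image.

    Each box contributes 5 numbers (x, y, w, h, confidence); each cell also
    carries one probability per class.
    """
    per_cell = boxes_per_cell * 5 + num_classes
    return grid_size * grid_size * per_cell


def best_detection(boxes, class_probs):
    """Return (score, box_index, class_index) for one grid cell.

    The class-specific score of a box is its confidence multiplied by the
    cell's class probability.
    """
    class_index = max(range(len(class_probs)), key=lambda k: class_probs[k])
    box_index = max(range(len(boxes)), key=lambda k: boxes[k][4])
    score = boxes[box_index][4] * class_probs[class_index]
    return score, box_index, class_index


# Assumed configuration: a 7 x 7 grid, 2 boxes per cell, 20 classes.
total_values = yolo_output_size(7, 2, 20)

# One hypothetical cell with B = 2 boxes (x, y, w, h, confidence) and C = 3 classes.
cell_boxes = [(0.5, 0.5, 0.2, 0.4, 0.75), (0.4, 0.6, 0.3, 0.3, 0.125)]
cell_class_probs = [0.25, 0.5, 0.25]
score, box_index, class_index = best_detection(cell_boxes, cell_class_probs)
```

Because every cell is predicted in one forward pass, there is no separate proposal stage to wait for, which is the source of the speed advantage described above.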

In the current era, biometric identification systems are required more than ever because of heightened security requirements around the globe. Researchers have devoted considerable effort to face recognition technology (FRT). FRT can be broadly divided into traditional handcrafted feature-based identification and deep learning-based identification. Eigenface [25] and Fisherface [26] were commonly used approaches for face identification in the last decade.

Eigenfaces reduce the number of feature points, capturing the maximum variation in face features with a minimum set of features. The reduction is performed using principal component analysis (PCA). With Eigenfaces, a face can be recognized based on its linear structure. In contrast, Fisherfaces are a supervised learning-based face identification method based on traditional texture features. Fisherfaces employ linear discriminant analysis to find the most discriminative data points. Both methods compare extracted features using Euclidean distance to identify the face.

Researchers have also used LBP for facial recognition [27, 28].
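The final identification step, matching by Euclidean distance, can be sketched as a nearest-neighbour lookup. The sketch assumes the feature vectors have already been extracted (for example, as PCA projections); the gallery contents, labels, and function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(probe, gallery):
    """Return (label, distance) of the gallery entry closest to `probe`."""
    return min(
        ((label, euclidean(probe, features)) for label, features in gallery.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical gallery of already-extracted, low-dimensional face features.
gallery = {
    "alice": [0.0, 0.0, 1.0],
    "bob": [3.0, 4.0, 1.0],
}
label, distance = identify([0.0, 1.0, 1.0], gallery)
```

In practice a distance threshold would usually be added so that a probe far from every gallery entry is rejected rather than assigned to the nearest identity.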

Hadid et al. They worked on face detection and recognition.

Face detection was achieved by training a second-degree support vector machine on the extracted features. Face recognition was achieved using an LBP-based texture descriptor, on which a classifier was trained. Advances in convolutional neural networks have brought remarkable gains in accuracy and efficiency. The basic assumption in deep neural networks is that feeding in as much data as possible yields better results. This need for huge amounts of data makes deep learning-based approaches data hungry.

Lu et al. They divided their complete network into three networks: one backbone network, called the trunk network, and two other networks, called branch networks, that branch off from the trunk network. The central (trunk) network is trained once to learn deep features for face identification and is built from residual blocks. Resolution-specific coupled mapping is employed in the branch networks for training. The input image and the comparison image from the gallery are transformed to the same representation for comparison.

The decision about the identified face is made based on distance. Schroff et al.

Their proposed system maps faces into a Euclidean feature space. They optimized the feature mapping of facial structure using a deep convolutional neural network. Their proposed system, FaceNet, generates a feature vector of dimensions that is optimized using triplet loss. Each triplet comprises three face images: two of the same individual and one of a different individual. The loss function tries to separate faces of the same individual from faces of different individuals.

Their triplet loss function is trained to minimize the distance between faces of the same identity and maximize the distance between faces of different identities. A slightly modified Inception model is employed in FaceNet for extracting convolutional features. They tested their system on the LFW dataset [31]. A research group from Facebook, Taigman et al.
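A triplet loss of this kind can be sketched as a hinge on the gap between two squared distances. The margin value of 0.2, the two-dimensional embeddings, and the function names are assumptions for illustration; real embeddings would come from the network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther from the anchor
    than the positive by at least `margin` (an assumed hyperparameter)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 1.0]
positive = [0.0, 0.5]       # same individual as the anchor
far_negative = [2.0, 1.0]   # different individual, already well separated
near_negative = [0.5, 1.0]  # different individual, as close as the positive

easy_loss = triplet_loss(anchor, positive, far_negative)
hard_loss = triplet_loss(anchor, positive, near_negative)
```

Only `hard_loss` is non-zero, so only triplets that still violate the margin contribute to training.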


They used a deep convolutional neural network with nine convolutional layers for extracting facial features. Facial landmarks, estimated using a support vector regressor (SVR), are used in their system for face alignment. Features extracted by the nine-layer network are passed to a Softmax layer for classification. They minimized the cross-entropy loss on the correct labels. They also proposed a large face recognition dataset named the Social Face Dataset [32], which they used to train the system for face identification.

Several researchers have focused on motion and spatial features for tracking multiple objects [33, 34].
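The Softmax classification and cross-entropy loss mentioned above can be sketched as follows. The logits and the three-identity setup are made-up values, and the function names are assumptions; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[true_index])


# Hypothetical scores for three identities; identity 0 is the correct one.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss_if_correct_is_0 = cross_entropy(probs, 0)
loss_if_correct_is_2 = cross_entropy(probs, 2)
uniform_loss = cross_entropy(softmax([0.0, 0.0, 0.0]), 0)
```

The loss is small when the network assigns high probability to the correct identity and grows as that probability falls, which is what drives training toward the correct labels.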