Dynamic kernels and semantic representations for recognition of varying size scene image (PhD)

Gupta, Shikha

URI: http://hdl.handle.net/123456789/329

Date: 2020-07-15

Abstract:

This thesis addresses the task of scene image classification. Real-world scene images come in different sizes and are composed of complex semantic concepts. Typical scene recognition techniques resize the images to a fixed standard size and then extract features. However, this leads to a significant loss of information, since image sizes vary widely, in the range of 10^4 to 10^6 pixels. This thesis addresses this issue by processing varying size images at their true resolution, which avoids the loss of information due to resizing. True size images result in varying size feature representations of scene images. To build a support vector machine (SVM) based classification model for such representations, two novel dynamic kernels are proposed. Since a scene is composed of complex semantic concepts, obtaining a concept-based representation is quite challenging. This thesis addresses this second issue with a proposed framework for generating semantic concept-based representations of varying size scene images. The two proposed dynamic kernels, namely the spatial probabilistic sequence kernel (SPSK) and the deep spatial pyramid match kernel (DSPMK), are designed for the classification of varying length patterns of scene images. Dynamic kernels are similarity functions that take two inputs of varying size and compute a similarity score. In SPSK, images are represented by sets of low-level local feature vectors. SPSK incorporates the spatial configuration of local feature vectors in the computation of the probabilistic sequence kernel. The low-level features used in the computation of SPSK are local descriptors, which fail to capture the complex geometric structure of scene images. For a better feature representation, the low-level features are replaced by learned convolutional neural network (CNN) based features. The main challenge is that using a CNN requires bringing different sized input images to a fixed predefined size by reducing, enlarging, or cropping.
To handle this, the proposed work provides a mechanism to pass images to the CNN at their original resolution. This results in varying size sets of deep activation maps as the image representation. To build an SVM-based classifier for such a representation, DSPMK is proposed as a novel dynamic kernel. DSPMK operates over sets of activation maps at different pyramid levels. At each level, the activation maps are divided into a fixed number of spatial regions, and the final similarity score between two examples is obtained as a weighted combination of the intermediate matching scores. To capture the constituent concept information of scene images, a scene image is represented in semantic concept space by the posterior probabilities of the concepts present in it; such a representation is known as the semantic multinomial (SMN) representation. The SMN representation requires a concept-annotated dataset with concept-specific features for concept modeling, which is infeasible to generate manually because of the large size of the databases. The proposed research work focuses on building the concept models via pseudo-concepts in the absence of true concept-annotated data. For the sets-of-local-feature-vectors representation of scene images, clusters of the local feature vectors of all database images are proposed as cues to the pseudo-concepts. The pseudo-concept models are then built using the proposed dynamic kernel-based SVM framework. The low-level feature-based SMN representation has two disadvantages: the concept models are built using features from the complete image instead of concept-specific features, and the handcrafted features used for building the concept models are local descriptors that do not capture much of the semantic information. To overcome these limitations, a novel deep CNN-based SMN representation is proposed that uses the filter responses of the deeper convolutional layers of pre-trained CNNs as cues to pseudo-concepts.
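The pyramid matching described above for DSPMK (a fixed number of spatial regions per level and a weighted combination of per-level matching scores) can be illustrated with a minimal sketch. The abstract does not give the kernel's exact definition, so everything specific here is an assumption: the function names region_sums and pyramid_kernel, the 2^level x 2^level grid, sum pooling with L1 normalization, histogram intersection as the matching score, and the level weights borrowed from the standard spatial pyramid match kernel. Activation maps are assumed non-negative (for example, post-ReLU).

```python
import numpy as np

def region_sums(maps, level):
    """Sum-pool a (C, H, W) activation stack over a 2^level x 2^level grid.

    The grid is fixed per level, so inputs with different H and W
    yield arrays of the same shape (regions, C).
    """
    c, h, w = maps.shape
    n = 2 ** level
    rows = np.linspace(0, h, n + 1).astype(int)  # region boundaries along height
    cols = np.linspace(0, w, n + 1).astype(int)  # region boundaries along width
    pooled = np.empty((n * n, c))
    for i in range(n):
        for j in range(n):
            block = maps[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            pooled[i * n + j] = block.sum(axis=(1, 2))
    total = pooled.sum()
    # L1-normalize so images of different sizes are comparable (assumption).
    return pooled / total if total > 0 else pooled

def pyramid_kernel(maps_a, maps_b, levels=2):
    """Weighted sum of per-level histogram-intersection scores (assumed form)."""
    score = 0.0
    for level in range(levels + 1):
        pa = region_sums(maps_a, level)
        pb = region_sums(maps_b, level)
        match = np.minimum(pa, pb).sum()        # intermediate matching score
        weight = 1.0 / 2 ** (levels - level)    # finer levels weigh more
        score += weight * match
    return score

# Two hypothetical activation stacks with the same channel count
# but different spatial sizes.
rng = np.random.default_rng(0)
a = rng.random((8, 13, 17))
b = rng.random((8, 20, 11))
print(pyramid_kernel(a, b))
```

Because the number of regions depends only on the pyramid level, the two inputs never need to be resized to a common resolution; only the channel count must agree.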
Convolutional layer filters are considered concept detectors, but ground truth information about the filters (i.e., which filter is learning what concept) is not known during the training process of CNNs. Hence, the true concept identity of a particular filter cannot be inferred from its activation. However, activation map responses can be visualized using different visualization techniques. The non-significant pseudo-concepts are removed using the proposed filter-specific threshold-based approach, and similar pseudo-concepts are grouped using subspace modeling. Pseudo-concept models are built using a linear kernel-based SVM to generate the novel SMN representation. The proposed procedure for building pseudo-concept models is weakly supervised, as an image may contain multiple pseudo-concepts. To further improve the pseudo-concept modeling, the thesis work focuses on the semantic analysis of the filter responses of true size images. A strategy is proposed in which the filter responses of true resolution images act as cues for pseudo-concepts in the absence of true concept-labeled data. A procedure to select prominent pseudo-concepts and group similar ones is proposed. Finally, pseudo-concept models are built using the proposed modified DSPMK-based framework to generate the SMN representation of varying size images. The potential of the proposed approaches is evaluated using standard scene recognition datasets such as MIT8 scene, Vogel Schiele, MIT67 indoor, and SUN397.
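As an illustration of the SMN representation described in the abstract (an image expressed as posterior probabilities over concepts), the following is a minimal sketch. It assumes linear pseudo-concept models whose scores are turned into a probability vector with a softmax; the abstract does not say how the thesis maps SVM outputs to posteriors, and the function name semantic_multinomial, the weights, and the biases are hypothetical placeholders.

```python
import numpy as np

def semantic_multinomial(features, weights, biases):
    """Map an image feature vector to a probability vector over pseudo-concepts.

    features: (D,) image-level feature vector
    weights:  (K, D) one hypothetical linear model per pseudo-concept
    biases:   (K,)  per-concept offsets
    """
    scores = weights @ features + biases   # one score per pseudo-concept
    shifted = scores - scores.max()        # subtract max for numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()             # softmax: entries are positive, sum to 1

# Hypothetical example: 4 pseudo-concepts, 6-dimensional features.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 6))
biases = np.zeros(4)
features = rng.normal(size=6)
smn = semantic_multinomial(features, weights, biases)
print(smn, smn.sum())
```

The resulting vector has one entry per pseudo-concept and a fixed length regardless of the original image size, which is what allows a standard classifier to be trained on top of it.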