Abstract:
This thesis addresses the task of scene image classification. Real-world scene images
vary in size and are composed of complex semantic concepts. Typical techniques
for scene recognition resize the images to a fixed standard size and then extract features.
However, this leads to a significant loss of information, since image sizes vary widely,
from 10^4 to 10^6 pixels. This thesis addresses this issue by considering
varying-size images at their true resolution, which avoids the loss of information due to resizing.
Using true-size images results in varying-size feature representations of scene images. To build the
support vector machine (SVM) based classification model for such representations, two novel
dynamic kernels are proposed. Since a scene is composed of complex semantic concepts,
obtaining a concept-based representation is quite challenging. This thesis further addresses
this issue by proposing a framework for generating semantic concept-based representations
of varying-size scene images.
This thesis proposes two dynamic kernels, namely the spatial probabilistic sequence kernel
(SPSK) and the deep spatial pyramid match kernel (DSPMK), for the classification of
varying-length patterns of scene images. Dynamic kernels are similarity functions that take two
inputs of varying size and compute a similarity score between them. In SPSK, images are represented
by sets of low-level local feature vectors. SPSK incorporates the spatial configuration of local
feature vectors into the computation of the probabilistic sequence kernel. The low-level features used
in the computation of SPSK are local descriptors, which fail to capture the complex
geometric structure of scene images. For a better feature representation, the low-level features
are replaced by learned convolutional neural network (CNN) based features. The main
challenge is that using a CNN requires bringing input images of different sizes to a fixed, predefined
size by reducing, enlarging, or cropping them. To handle this, the proposed work
provides a mechanism to pass images to the CNN at their original resolution. This results
in varying-size sets of deep activation maps as the image representation. To build an SVM-based
classifier for such a representation, DSPMK is proposed as a novel dynamic kernel.
DSPMK operates over sets of activation maps at different pyramid levels. At each level,
the activation maps are divided into a fixed number of spatial regions, and the final similarity score
between two examples is obtained as a weighted combination of the intermediate
matching scores.
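The pyramid matching idea described above can be illustrated with a minimal sketch. It is not the thesis's DSPMK implementation: the sum pooling, L1 normalisation, histogram-intersection matching, the 2^level x 2^level grid, and the level weights are all assumptions chosen for illustration, as are the function names. Activation maps are assumed to be non-negative (for example, post-ReLU) nested lists of shape channels x height x width, with the same number of channels in both images.

```python
def region_bounds(size, parts):
    """Split range(size) into `parts` contiguous, near-equal intervals."""
    return [(i * size // parts, (i + 1) * size // parts) for i in range(parts)]


def pooled_histogram(maps, level):
    """Sum-pool every activation map over a 2^level x 2^level grid.

    The pooled values are L1-normalised so that images of different
    spatial size yield comparable, fixed-length vectors (assumption).
    """
    parts = 2 ** level
    height, width = len(maps[0]), len(maps[0][0])
    hist = []
    for r0, r1 in region_bounds(height, parts):
        for c0, c1 in region_bounds(width, parts):
            for fmap in maps:
                hist.append(sum(fmap[r][c]
                                for r in range(r0, r1)
                                for c in range(c0, c1)))
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


def pyramid_match_kernel(maps_a, maps_b, levels=2):
    """Weighted sum of per-level histogram-intersection scores."""
    score = 0.0
    for level in range(levels + 1):
        hist_a = pooled_histogram(maps_a, level)
        hist_b = pooled_histogram(maps_b, level)
        match = sum(min(x, y) for x, y in zip(hist_a, hist_b))
        weight = 2.0 ** (level - levels)  # assumed: finer levels weigh more
        score += weight * match
    return score


# Two single-channel "images" of different spatial size.
small = [[[1.0] * 4 for _ in range(4)]]
large = [[[1.0] * 6 for _ in range(8)]]
similarity = pyramid_match_kernel(small, large)
```

Because every level produces a vector whose length depends only on the grid and the number of channels, the two inputs never need to be resized to a common resolution before they are compared.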
To capture the constituent concept information of scene images, a scene image is represented
in a semantic concept space by the posterior probabilities of the concepts present in it;
this representation is known as the semantic multinomial (SMN) representation. The SMN
representation requires a concept-annotated dataset with concept-specific features for concept
modeling, which is infeasible to generate manually for a large database. The proposed
research work focuses on building the concept models via pseudo-concepts in the
absence of true concept-annotated data. For the set-of-local-feature-vectors representation
of scene images, clusters of the local feature vectors of all database images are proposed as
cues to the pseudo-concepts. The pseudo-concept models are then built using the proposed
dynamic kernel-based SVM framework. The low-level feature-based SMN
representation has two disadvantages: the concept models are built using features from the complete image
instead of concept-specific features, and the handcrafted features used for building the concept
models are local descriptors, which do not capture much of the semantic information.
To overcome these limitations, a novel deep CNN-based SMN representation is proposed
that uses the filter responses of the deeper convolutional layers of pre-trained CNNs as cues to
pseudo-concepts. Convolutional layer filters can be regarded as concept detectors, but the
ground-truth information about the filters (i.e., which filter learns which concept) is not known during
the training of CNNs. Hence, the true concept identity of a particular filter cannot be inferred
from its activation. However, activation map responses can be inspected using various
visualization techniques. Non-significant pseudo-concepts are removed using the
proposed filter-specific threshold-based approach, and similar pseudo-concepts are grouped
using subspace modeling. Pseudo-concept models are built using a linear kernel-based SVM
to generate the novel SMN representation. The proposed procedure for building pseudo-concept
models is weakly supervised, as an image may contain multiple pseudo-concepts.
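As a rough illustration of how an SMN vector could be formed from pseudo-concept models, the sketch below turns one SVM decision score per pseudo-concept into a probability vector. The sigmoid mapping, the `slope` parameter, and the final normalisation are assumptions made for this example; the thesis does not state here how the posterior probabilities are estimated.

```python
import math


def semantic_multinomial(scores, slope=1.0):
    """Map per-concept SVM decision scores to a vector that sums to 1.

    Each score is squashed with a sigmoid (an assumed stand-in for a
    calibrated posterior) and the results are normalised across concepts.
    """
    posteriors = [1.0 / (1.0 + math.exp(-slope * s)) for s in scores]
    total = sum(posteriors)
    return [p / total for p in posteriors]


# Hypothetical decision scores of three pseudo-concept SVMs for one image.
smn = semantic_multinomial([2.0, 0.0, -2.0])
```

The resulting vector has one entry per pseudo-concept, so every image maps to a fixed-length representation regardless of its original size.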
To further improve pseudo-concept modeling, the thesis focuses on
the semantic analysis of the filter responses of true-size images. A strategy is proposed in which
the filter responses of true-resolution images act as cues for pseudo-concepts in the absence
of true concept-labeled data. A procedure to select prominent pseudo-concepts and group
similar ones is proposed. Finally, pseudo-concept models are built using the proposed
modified DSPMK-based framework to generate the SMN representation of varying-size images.
The potential of the proposed approaches is evaluated using standard scene recognition
datasets such as MIT8 scene, Vogel Schiele, MIT67 indoor, and SUN397.