Abstract:
Visual recognition is a challenging problem which depends on the discriminative
nature and robustness of the features used in recognition techniques. These techniques
are mainly focused on adapting hand-designed local features such as SIFT,
HOG, and SURF, which do not scale to other modalities. Hence
there is a paradigm shift from hand-designed local features to unsupervised learning
in order to extract features directly from the raw data. Visual signals (images)
can be modeled using independent subspace analysis (ISA), an extension of the general
independent component analysis (ICA) model, which yields invariant features. ISA has
been extended to large datasets to deliver a hierarchy of features by using convolution
and stacking multiple ISA layers on top of each other. Although its performance is
good, it takes a significant amount of time on large datasets because of its high
computational complexity and sequential implementation.
Two different methods are proposed to speed up feature learning in
multilayered ISA. The first method exploits the parallelism present in the data.
MapReduce, a scalable programming model, is used to estimate the ISA model parameters
by running multiple map-reduce functions over equal, disjoint subsets of the distributed
data. The second method uses spatio-temporal interest point detectors to extract
informative blocks from the video and discard irrelevant ones. This second method
not only increases speed but also improves classification
accuracy. Several input-level modifications that increase classification performance
are also proposed. A dataset of human-water activities is also created for
surveillance near water bodies, and the ISA network is applied to it for
feature extraction and classification.
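
The data-parallel idea described above can be illustrated with a minimal sketch. It assumes the common ISA formulation in which each pooled unit is the square root of a weighted sum of squared linear responses; the function names (`isa_responses`, `split_equal`, `map_shard`, `reduce_partials`, `distributed_objective`) are hypothetical, and plain Python stands in for an actual MapReduce framework. Only the objective is evaluated here; because a gradient over the whole dataset is also a sum over samples, it could be aggregated across shards in the same way.

```python
import math
from functools import reduce


def isa_responses(x, W, V):
    """Pooled ISA activations: p_i = sqrt(sum_k V[i][k] * (W[k] . x)^2)."""
    squared = [sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W]
    return [math.sqrt(sum(v * s for v, s in zip(v_row, squared))) for v_row in V]


def split_equal(data, n_shards):
    """Split data into equal, disjoint shards (any remainder is dropped)."""
    size = len(data) // n_shards
    return [data[i * size:(i + 1) * size] for i in range(n_shards)]


def map_shard(shard, W, V):
    """Map step: partial objective and sample count for one shard."""
    total = sum(sum(isa_responses(x, W, V)) for x in shard)
    return total, len(shard)


def reduce_partials(a, b):
    """Reduce step: add up partial objectives and sample counts."""
    return a[0] + b[0], a[1] + b[1]


def distributed_objective(data, W, V, n_shards):
    """Mean ISA objective, computed shard by shard and then combined."""
    partials = [map_shard(shard, W, V) for shard in split_equal(data, n_shards)]
    total, count = reduce(reduce_partials, partials)
    return total / count
```

Since each shard is processed independently, the map calls could run on separate workers; only the small (sum, count) pairs need to be communicated to the reducer.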
Multilayered ISA is used to extract features for fine-grained recognition of similar
objects, i.e., categorizing various types of leaves, butterflies, and birds into their
subcategories such as breeds and species. This architecture has three ISA layers to
extract features from large image patches. Filters learned by applying ISA to small
image patches are convolved over a larger spatial region (image patch). Further, the
ISA network is trained on discriminative patches that correspond to SIFT key points
and whose size is chosen based on classification accuracy. Adding more ISA layers
significantly increases the percentage of true positives, while the computational cost
is not affected because of the reduction in data size. The proposed approach is tested
on leaf, butterfly, and bird datasets.
Most of the techniques applied to leaves have focused on structural features, since
leaves have fine edges. These fine edges are enhanced by applying contrast limited
adaptive histogram equalization (CLAHE) to the leaf images. The hybrid technique
that works best for the leaf dataset is a wavelet transform of patches taken around SIFT
key points of the enhanced image. It is hypothesized that adding another
ISA layer captures a larger spatial region and hence the complex structure present
there. Together, these steps improve the percentage of true positives in classification
by a significant amount.
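
The stacking step described above, in which filters learned on small patches are convolved over a larger patch, can be sketched as follows. This is an illustration under stated assumptions, not the exact architecture: the function names (`isa_layer`, `convolve_first_layer`, `two_layer_features`) and the window size and stride parameters are hypothetical, only one stacking step is shown, and any dimensionality reduction between layers is omitted.

```python
import math


def isa_layer(x, W, V):
    """One ISA layer: square the linear responses, pool them with V, take the root."""
    squared = [sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W]
    return [math.sqrt(sum(v * s for v, s in zip(v_row, squared))) for v_row in V]


def convolve_first_layer(patch, W1, V1, size, stride):
    """Apply the first-layer ISA to every size x size window of a larger
    square patch and concatenate the pooled responses."""
    features = []
    n = len(patch)
    for r in range(0, n - size + 1, stride):
        for c in range(0, n - size + 1, stride):
            # Flatten the window row by row into a single input vector.
            window = [patch[r + i][c + j] for i in range(size) for j in range(size)]
            features.extend(isa_layer(window, W1, V1))
    return features


def two_layer_features(patch, W1, V1, W2, V2, size, stride):
    """Feed the concatenated first-layer responses into a second ISA layer."""
    first = convolve_first_layer(patch, W1, V1, size, stride)
    return isa_layer(first, W2, V2)
```

The second layer sees responses from the whole large patch while the first layer only ever operates on small windows, which is how a larger spatial region can be covered without learning filters directly on large inputs.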
Features learned from ISA are also used for action recognition in RGB and depth
videos, where cuboids are extracted around spatio-temporal interest points after
normalizing the frame size of different videos. The resulting cuboids are concatenated
to train a two-layer ISA model. Several datasets are used to test the framework,
including the MSR-Action3D, MSR Daily Activity 3D, and UTD multimodal human action
datasets, which have 20, 16, and 27 activities, respectively.
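
The cuboid extraction described above can be sketched as follows. The sketch assumes that interest points are given as (frame, row, column) triples in the coordinates of the size-normalized video, that nearest-neighbour resizing is an acceptable stand-in for the frame normalization, and that points too close to the video boundary are skipped. The function names (`resize_nearest`, `extract_cuboid`, `build_training_set`) and the half-size parameters are hypothetical, and the interest point detector itself is not shown.

```python
def resize_nearest(frame, out_h, out_w):
    """Resize a 2D frame (list of rows) by nearest-neighbour sampling."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def extract_cuboid(video, point, half_t, half_s):
    """Return the flattened cuboid centred on point, or None if it would
    extend past the video boundary."""
    t, y, x = point
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    inside = (half_t <= t < n_t - half_t
              and half_s <= y < n_y - half_s
              and half_s <= x < n_x - half_s)
    if not inside:
        return None
    return [video[tt][yy][xx]
            for tt in range(t - half_t, t + half_t + 1)
            for yy in range(y - half_s, y + half_s + 1)
            for xx in range(x - half_s, x + half_s + 1)]


def build_training_set(videos, points_per_video, frame_size, half_t, half_s):
    """Normalize frame sizes, then gather cuboids from all videos into one list."""
    out_h, out_w = frame_size
    samples = []
    for video, points in zip(videos, points_per_video):
        normalized = [resize_nearest(frame, out_h, out_w) for frame in video]
        for point in points:
            cuboid = extract_cuboid(normalized, point, half_t, half_s)
            if cuboid is not None:
                samples.append(cuboid)
    return samples
```

Because every video is resized to the same frame size first, all cuboids have the same length and can be stacked directly as training vectors for the ISA model.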