Significance of sparse representation for speech recognition and speech synthesis (PhD)

dc.contributor.advisor Dr. Anil Kumar Sao
dc.contributor.author Sharma, Pulkit
dc.date.accessioned 2020-06-29T10:05:45Z
dc.date.available 2020-06-29T10:05:45Z
dc.date.issued 2018-03-20
dc.identifier.uri http://hdl.handle.net/123456789/235
dc.description A thesis submitted for the award of the degree of Doctor of Philosophy under the guidance of Dr. Anil Kumar Sao (Faculty, SCEE) en_US
dc.description.abstract In this thesis, sparse representation (SR) based signal processing is employed to derive features for speech recognition and to reduce the footprint of unit selection based speech synthesis (USS) systems. The objective of speech recognition is to convert a speech utterance into text, whereas the objective of speech synthesis is to generate speech corresponding to a given text. In this work, SR in speech recognition is employed to effectively discriminate among different speech units, while in USS systems it is used to compress the speech corpus. In SR based signal processing, a speech signal is decomposed into a dictionary and the corresponding representation, which has only a few significant coefficients. The use of SR for a particular application is greatly influenced by the choice of the dictionary. The data belonging to two confusing classes may lie in overlapping subspaces, and thus a single dictionary may not effectively discriminate between them. Hence, in this work, class-specific principal component analysis (PCA) based multiple dictionaries are used for speech recognition tasks. Here, speech frames belonging to each speech class/unit are grouped into different clusters, and a sub-dictionary is learned for each cluster. Our experiments reveal that coefficients corresponding to intermediate principal components (PCs) provide more discrimination among confusing speech units. Thus, a transformation function known as weighted decomposition is employed to emphasize the discriminative information present in the intermediate PCs of the PCA-based dictionary. The performance of the proposed features is evaluated using continuous density hidden Markov model based classifiers on various speech unit classification tasks. The use of class-specific dictionaries in the proposed SR based features results in increased computational complexity. To address this issue, we propose a deep sparse representation (DSR) based unified model that learns a single multi-level dictionary for all the speech classes. The proposed DSR model alternates between a sparse and a dense layer, and it has been observed that the representations obtained at different sparse layers carry complementary information. That is, a set of speech classes that is confusing at one layer is discriminative at another layer, and vice versa. Thus, we propose to concatenate the representations obtained at different sparse layers to derive the final feature representation for speech recognition. GMM-HMM and DNN-HMM systems are used to evaluate the performance of the proposed features on various speech recognition tasks. Experimental studies reveal that the deep dictionary derived using the proposed DSR model outperforms both a single overcomplete dictionary and multiple sub-dictionaries. The issue of speech recognition in noisy environments is addressed by enhancing the noisy speech before deriving features for recognition. In particular, we propose a novel SR based method for speech enhancement, which exploits the observation that, given an appropriate dictionary, the SR of a speech signal is easier to estimate than that of noise. The objective of speech synthesis can be seen as the converse of speech recognition, and USS systems yield the best quality of synthesized speech compared to contemporary approaches. In USS systems, speech units from a pre-recorded speech database are selected and concatenated to synthesize speech.
Thus, the size of the speech database limits the use of USS systems on low-resource devices. In this thesis, SR based signal processing is explored to compress the speech database stored in USS systems. The proposed method reduces the size of the speech corpus by storing only the significant coefficients of the sparse vector. It is also observed that the behavior of the SR varies across different speech sounds (e.g., voiced, unvoiced, etc.). Hence, for efficient compression, different numbers of significant coefficients of the SR are stored for different speech sounds. USS systems built for two Indian languages (Hindi and Rajasthani) are used to evaluate the performance of the proposed compression methods. It is shown that multiple dictionaries learned for individual speech units result in better compression in USS systems. In addition, the discriminating ability of SR is used in a kernel sparse representation based classifier (KSRC) for speech emotion recognition, where a given speech sample is classified into various categories of emotions. Further, a group sparsity constraint is also imposed on the KSRC to improve its performance. This is achieved by considering the cooperation among training samples of the same class while estimating the SR.
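The feature-extraction pipeline described in the abstract (cluster the frames of a speech class, learn a PCA sub-dictionary per cluster, then sparse-code each frame over the stacked atoms) can be illustrated with a minimal sketch. This is not the thesis implementation: the frame dimensionality, cluster count, atom count, and sparsity level are illustrative assumptions, and the helper names learn_class_dictionary and sr_features are hypothetical.

    # Minimal sketch of class-specific PCA sub-dictionaries and SR features.
    # Assumes speech frames are already parameterised (e.g. 39-dim MFCC vectors);
    # all sizes below are illustrative, not the values used in the thesis.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA, sparse_encode

    def learn_class_dictionary(frames, n_clusters=4, n_atoms=32):
        """Cluster the frames of one speech class/unit and learn a PCA
        sub-dictionary per cluster; stack the atoms into one class dictionary."""
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frames)
        atoms = []
        for c in range(n_clusters):
            cluster = frames[labels == c]
            n_comp = min(n_atoms, len(cluster), frames.shape[1])
            pca = PCA(n_components=n_comp).fit(cluster)
            atoms.append(pca.components_)   # rows are unit-norm dictionary atoms
        return np.vstack(atoms)

    def sr_features(frames, dictionary, n_nonzero=8):
        """Sparse-code each frame over the dictionary (OMP); the sparse
        coefficient vector serves as the feature for recognition."""
        return sparse_encode(frames, dictionary,
                             algorithm="omp", n_nonzero_coefs=n_nonzero)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        frames = rng.standard_normal((500, 39))   # stand-in for one class's frames
        D = learn_class_dictionary(frames)
        features = sr_features(frames, D)
        print(D.shape, features.shape)

In the thesis the discriminative information is further emphasised via a weighted decomposition over the intermediate PCs; the sketch omits that step and only shows the dictionary learning and sparse coding stages.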
dc.language.iso en_US en_US
dc.publisher IITMandi en_US
dc.subject Greedy Methods en_US
dc.subject Synthesis en_US
dc.title Significance of sparse representation for speech recognition and speech synthesis (PhD) en_US
dc.type Thesis en_US