Abstract:
One of the major challenges in text-to-speech (TTS) synthesis systems is the incorporation of prosody in the synthesised speech. Various techniques based on linguistic characteristics have been proposed in the literature for improving prosody. However, prosody in TTS remains an open problem, and this thesis addresses it.
In this work, approaches are proposed for improving the naturalness, intelligibility and prosody of synthesised speech. This is done by selecting the sequence of sound units from a large corpus in such a way that acoustic features are made consistent at (a) the segmental level and (b) the supra-segmental level. At the segmental level, the differences in acoustic features of sound units are reduced over the entire utterance to improve naturalness and intelligibility. At the supra-segmental level, units are selected so that the differences in acoustic features of adjacent syllables are consistent at the phrase level. In this method, consistency of acoustic features is also maintained at the utterance level. Unlike existing unit selection synthesis (USS) based TTS systems, which rely mostly on linguistic information for improving prosody, the proposed approach makes use of acoustic information. Probabilistic approaches are proposed for selecting units based on an acoustic framework. Since the context is specified only by acoustic features, the proposed approaches can be applied to any language and perhaps even to multilingual synthesis. Experimental results for the proposed approaches are demonstrated using five Indian languages. Subjective evaluation tests showed that fundamental frequency (f0) contributed to the naturalness of the systems, whereas duration and energy helped in improving their intelligibility. In addition, ensuring consistency of energy and f0 across syllables in a phrase, and of duration across syllables in an utterance, further improved the prosody.
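
As a rough illustration of the segmental-level idea described above, the following minimal sketch selects one candidate unit per position so that the summed difference in acoustic features between adjacent units is smallest over the whole utterance. It is not the probabilistic method of this thesis: the function name `select_units`, the feature layout (f0, duration, energy), the Euclidean distance and the dynamic-programming search are all assumptions made for illustration, and the features are assumed to be normalised to comparable scales beforehand.

```python
import math


def select_units(candidates):
    """Pick one candidate per position, minimising summed adjacent differences.

    candidates[i] is a non-empty list of feature tuples, here assumed to be
    (f0, duration, energy), for the i-th sound unit of the utterance.
    Returns (chosen_indices, total_cost).
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # best[j]: lowest cost of any path ending in candidate j of the current position.
    best = [0.0] * len(candidates[0])
    back = []  # back[i][j]: best predecessor index for candidate j at position i + 1

    for prev, cur in zip(candidates, candidates[1:]):
        new_best, pointers = [], []
        for unit in cur:
            # Hypothetical join cost: Euclidean distance between feature vectors.
            costs = [best[k] + math.dist(p, unit) for k, p in enumerate(prev)]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min])
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total
```

A supra-segmental variant could apply the same search to syllable-level features within each phrase; that extension is not shown here.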