Prediction of voice aperiodicity based on spectral representations in HMM speech synthesis

Examples of English Speech Synthesis

Interspeech 2011, Florence, Italy

[pdf][slides]


Hanna Silén, Elina Helander, Moncef Gabbouj

Tampere University of Technology, Finland


This page contains synthesis samples for the prediction of voice aperiodicities based on synthetic spectral representations in the framework of hidden Markov model (HMM) based speech synthesis [1]. In the proposed approach, instead of modeling all speech features with HMMs, only the spectral and intonation features are synthesized using conventional HMM modeling and model clustering, while the voice aperiodicities (bandwise aperiodicities and voicing decisions) are predicted from the spectral representation using multivariate regression based on Gaussian mixture modeling. A more detailed description of the method is provided in the paper.
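
The regression itself is not spelled out on this page; purely as an illustration, the sketch below shows joint-density GMM regression of the kind referred to above, using scikit-learn and SciPy. The feature dimensions, number of mixture components, and all function and variable names are assumptions made for the example, not the configuration used in the paper.

    # Sketch: predict aperiodicity features y from spectral features x with a
    # joint-density GMM (illustrative only; not the paper's exact procedure).
    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def fit_joint_gmm(X, Y, n_components=8, seed=0):
        """Fit a full-covariance GMM on joint frames z = [x, y]."""
        Z = np.hstack([X, Y])                      # X: (N, dx), Y: (N, dy)
        return GaussianMixture(n_components=n_components,
                               covariance_type="full",
                               random_state=seed).fit(Z)

    def predict_y_from_x(gmm, X, dx):
        """Minimum mean-square-error estimate E[y | x] under the joint GMM."""
        mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
        dy = mu.shape[1] - dx
        num = np.zeros((len(X), dy))
        den = np.zeros(len(X))
        for m in range(len(w)):
            mu_x, mu_y = mu[m, :dx], mu[m, dx:]
            S_xx, S_yx = S[m, :dx, :dx], S[m, dx:, :dx]
            A = S_yx @ np.linalg.inv(S_xx)         # per-component regression matrix
            cond_mean = mu_y + (X - mu_x) @ A.T    # E[y | x, component m]
            resp = w[m] * multivariate_normal.pdf(X, mu_x, S_xx)  # unnormalized p(m | x)
            num += resp[:, None] * cond_mean
            den += resp
        return num / den[:, None]

Voicing decisions could be obtained in a similar spirit, for example by thresholding a regressed continuous voicing measure; the actual voicing prediction is described in the paper.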


Speech parameterization

The following speech parameterization scheme, with an analysis update interval of 5 ms, is employed in the evaluations:

- spectral analysis with STRAIGHT [2]
- spectrum represented as mel-cepstral coefficients (MCC) [3]
- fundamental frequency (F0)
- bandwise aperiodicities (BAP) and voicing decisions for mixed excitation [4]
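
STRAIGHT itself is not freely scriptable, so purely as an illustration of frame-level parameterization at a 5 ms update interval, the sketch below uses WORLD (pyworld) and pysptk as stand-ins; the mel-cepstral order, all-pass constant, and file name are assumptions, not the settings used in the evaluations.

    # Illustration: frame-level analysis every 5 ms with WORLD/pysptk stand-ins.
    import numpy as np
    import pyworld
    import pysptk
    import soundfile as sf

    x, fs = sf.read("sample.wav")                       # hypothetical mono input
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pyworld.harvest(x, fs, frame_period=5.0)    # F0 track, 5 ms hop
    sp = pyworld.cheaptrick(x, f0, t, fs)               # smoothed spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)                      # aperiodicity spectrum

    mcc = np.apply_along_axis(pysptk.sp2mc, 1, sp, 24, 0.42)  # mel-cepstral coefficients
    bap = pyworld.code_aperiodicity(ap, fs)             # bandwise aperiodicity (BAP)
    vuv = (f0 > 0).astype(int)                          # frame-level voicing decisions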


Evaluation systems

The aperiodicity features of the samples are generated using three different systems:

- Baseline: bandwise aperiodicities and voicing decisions modeled and generated with HMMs
- Proposed I: bandwise aperiodicities (BAP) predicted from the spectral representation
- Proposed II: both BAP and voicing decisions predicted from the spectral representation

A more detailed description of the evaluation systems is given in the associated paper. For all systems, synthetic MCC and fundamental frequency trajectories are generated using the traditional HMM-based approach.
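
For reference, the traditional HMM-based generation of such trajectories follows the maximum-likelihood parameter generation (MLPG) solution c = (W' U^-1 W)^-1 W' U^-1 mu. The sketch below shows it for a single one-dimensional feature stream with static and delta features; the delta window and boundary handling are assumptions made for the example, not taken from the HTS configuration used here.

    # Sketch of maximum-likelihood parameter generation (MLPG) for one stream
    # with static + delta features: c = (W' U^-1 W)^-1 W' U^-1 mu.
    import numpy as np

    def mlpg_1d(mu, var):
        """mu, var: (T, 2) arrays of [static, delta] means and variances per frame."""
        T = mu.shape[0]
        W = np.zeros((2 * T, T))                 # maps statics c to [static, delta] observations
        for t in range(T):
            W[2 * t, t] = 1.0                    # static coefficient
            W[2 * t + 1, max(t - 1, 0)] -= 0.5   # delta window 0.5 * (c[t+1] - c[t-1])
            W[2 * t + 1, min(t + 1, T - 1)] += 0.5
        U_inv = np.diag(1.0 / var.reshape(-1))   # diagonal precision matrix
        m = mu.reshape(-1)
        A = W.T @ U_inv @ W
        b = W.T @ U_inv @ m
        return np.linalg.solve(A, b)             # smooth static trajectory of length T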

The HMM training employs the HMM-based speech synthesis system HTS [5] (version 2.1) and the English speech database CMU ARCTIC (speakers slt and rms) available at [6]. Half of the data was used for training and the rest as test data. Here, postfiltering was used instead of global variance [7].
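
The exact postfilter is not specified on this page; the sketch below is a simplified cepstral-domain formant-enhancement postfilter of the kind commonly distributed with HTS/SPTK, reconstructed here only as an illustration. The all-pass constant, emphasis factor, and impulse-response lengths are assumptions and may differ from the configuration actually used.

    # Simplified formant-enhancement postfilter for one mel-cepstral vector
    # (illustrative reconstruction; not necessarily the exact filter used here).
    import numpy as np
    import pysptk

    def postfilter_mcep(mc, alpha=0.42, beta=0.4):
        def energy(b):
            # energy of the impulse response implied by MLSA coefficients b
            c = pysptk.freqt(pysptk.b2mc(b, alpha), 511, -alpha)
            return np.sum(pysptk.c2ir(c, 576) ** 2)

        b = pysptk.mc2b(mc, alpha)               # mel-cepstrum -> MLSA filter coefficients
        e1 = energy(b)
        b[1] -= 0.5 * beta * alpha * b[2]        # compensate the introduced spectral tilt
        b[2:] *= 1.0 + beta                      # emphasize spectral peaks
        e2 = energy(b)
        b[0] += 0.5 * np.log(e1 / e2)            # restore the original frame energy
        return pysptk.b2mc(b, alpha)             # back to the mel-cepstral domain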


Synthesis samples

Randomly chosen synthesis samples for the baseline and proposed systems are given below. The sentences were not included in the training data.

Female speaker slt:

- Baseline (HMM-based)
- Proposed I (BAP prediction)
- Proposed II (BAP and voicing prediction)

Male speaker rms:

- Baseline (HMM-based)
- Proposed I (BAP prediction)
- Proposed II (BAP and voicing prediction)


References:

[1] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T., Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, in EUROSPEECH, 1999, pp. 2347-2350.

[2] Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A., Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication 27, 1999, pp. 187-207.

[3] Fukada, T., Tokuda, K., Kobayashi, T., and Imai, S., An adaptive algorithm for mel-cepstral analysis of speech, in ICASSP, 1992, pp. 137-140.

[4] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T., Mixed excitation for HMM-based speech synthesis, in EUROSPEECH, 2001, pp. 2263-2266.

[5] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A., and Tokuda, K., The HMM-based speech synthesis system (HTS) version 2.0, in ISCA SSW6, 2006, pp. 294-299.

[6] CMU_ARCTIC speech synthesis databases, available at http://festvox.org/cmu_arctic/.

[7] Toda, T. and Tokuda, K., Speech parameter generation algorithm considering global variance for HMM-based speech synthesis, in Interspeech, 2005, pp. 2801-2804.


last modified: 2011-09-05 (hs)