Parameterization of Vocal Fry in HMM-Based TTS

Examples of Finnish Speech Synthesis

Silén, H., Helander, E., Nurminen, J., and Gabbouj, M.

link to the publication

This page contains Finnish samples of:

  1. Analysis/synthesis of the (a) proposed and (b) baseline speech parameterization, and

  2. HMM-based text-to-speech synthesis (TTS); system trained using the (a) proposed and (b) baseline parameterization.

For analysis/synthesis samples, the corresponding recorded sample is also provided.

Speech parameterization

For both analysis/synthesis and HMM-based TTS, two alternative speech parameterizations, the proposed and the baseline, are considered.

An analysis update interval of 5 ms is used for both parameterizations.
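As a rough illustration of what a 5 ms update interval means in practice, the sketch below counts the analysis frames for a short signal. The 16 kHz sampling rate and the function names are assumptions made for this example only; the page does not state the sampling rate of the database or how frames are positioned.

```python
def hop_length(sample_rate_hz, shift_ms=5):
    """Number of samples between consecutive analysis frames."""
    return sample_rate_hz * shift_ms // 1000


def frame_times_ms(n_samples, sample_rate_hz, shift_ms=5):
    """Time stamps (in ms) of the analysis frames covering n_samples."""
    hop = hop_length(sample_rate_hz, shift_ms)
    n_frames = n_samples // hop + 1  # one frame at t = 0 plus one per hop
    return [i * shift_ms for i in range(n_frames)]


# One second of audio at an assumed 16 kHz sampling rate (hypothetical value)
times = frame_times_ms(n_samples=16000, sample_rate_hz=16000)
print(len(times), times[:4])
```

Under these assumptions, each second of speech yields about 200 parameter vectors per stream.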

1. Analysis/synthesis

Sample 1: Recorded Baseline Proposed
Sample 2: Recorded Baseline Proposed

2. HMM-based text-to-speech

Sample 1: Baseline Proposed
Sample 2: Baseline Proposed

HMM training was done using the HMM-based speech synthesis system HTS [4] (version 2.1) and an 80-minute Finnish speech database recorded by a female speaker (see [5]). Global variance mainly improves the speech spectrum [6] and may cause problems in F0, so postfiltering was used here instead of global variance.
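For readers unfamiliar with the term, global variance in the sense of [6] is the variance of each parameter dimension computed over all frames of an utterance. The following is a minimal sketch of that quantity on toy values. It is not the code used to produce the samples on this page, and the function name and the numbers are made up for illustration.

```python
def global_variance(trajectory):
    """Per-dimension variance over the frames of one utterance.

    trajectory: list of frames, each frame a list of parameter values.
    """
    n_frames = len(trajectory)
    n_dims = len(trajectory[0])
    means = [sum(frame[d] for frame in trajectory) / n_frames
             for d in range(n_dims)]
    return [sum((frame[d] - means[d]) ** 2 for frame in trajectory) / n_frames
            for d in range(n_dims)]


# Toy two-dimensional trajectories (hypothetical values)
natural = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
smoothed = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
print(gv_natural, gv_smoothed)
```

In the toy data, the smoothed trajectory has a smaller variance in its first dimension than the natural one. This is the kind of variance reduction that generation methods considering global variance aim to compensate for.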

Listeners should pay particular attention to the last word of each sentence. The database sentences typically contain utterance-final vocal fry, which causes problems in both analysis/synthesis and HMM-based TTS.

The text prompts for the synthesized sentences can be found in [7].


[1] Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A., Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication 27, 1999, pp. 187-207.

[2] Fukada, T., Tokuda, K., Kobayashi, T., and Imai, S., An adaptive algorithm for mel-cepstral analysis of speech, in ICASSP, 1992, pp. 137-140.

[3] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T., Mixed excitation for HMM-based speech synthesis, in EUROSPEECH, 2001, pp. 2263-2266.

[4] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A., and Tokuda, K., The HMM-based Speech Synthesis System (HTS) Version 2.0, in ISCA SSW6, 2006, pp. 294-299.

[5] Silén, H., Helander, E., Nurminen, J., and Gabbouj, M., Evaluation of Finnish unit selection and HMM-based speech synthesis, in Interspeech, 2008, pp. 1853-1856.

[6] Toda, T. and Tokuda, K., Speech parameter generation algorithm considering global variance for HMM-based speech synthesis, in Interspeech, 2005, pp. 2801-2804.

[7] Ojala, T., Auditory quality evaluation of present Finnish text-to-speech systems, M.Sc. Thesis, Helsinki University of Technology, 2006.

last modified: 2009-04-22
Hanna Silen