Synthetic speech detection using temporal modulation feature

Zhizheng Wu, X. Xiong, E. S. Chng, Haizhou Li

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attack and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the facts that current analysis-synthesis techniques operate on frame level and make the frame-by-frame independence assumption, we proposed to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation features derived from magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features
Original languageEnglish
Title of host publication2013 IEEE International Conference on Acoustics, Speech and Signal Processing
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Number of pages5
ISBN (Print)978-1-4799-0356-6
Publication statusPublished - 2013

Fingerprint Dive into the research topics of 'Synthetic speech detection using temporal modulation feature'. Together they form a unique fingerprint.

Cite this