Advancing Voice Authentication: Insights from Cepstral Coefficients and Recursive Feature Elimination in Speech Signal
Di Cesare M. G.; Cardone D.; Merla A.; Perpetuini D.
2024-01-01
Abstract
Voice manipulation and synthesis pose a growing threat to digital security, raising the need for effective systems to detect artificial speech. This study investigates the feasibility of distinguishing between real and synthetic voices through machine learning techniques applied to the Fake or Real (FoR) Dataset from York University. The dataset contains over 70,000 text-to-speech (TTS) recordings, balanced in gender, class, sample rate, volume, and number of channels. The approach uses Gammatone Frequency Cepstral Coefficients (GTCC) and Delta Gammatone Frequency Cepstral Coefficients (ΔGTCC) as key features for voice characterization. A logistic regression model, combined with Recursive Feature Elimination (RFE), was employed to identify the most discriminative coefficients for this task. RFE iteratively removed less significant features, enhancing both the performance and interpretability of the model. The final model achieved 70% accuracy in testing, using only five ΔGTCC features. This comparative analysis of GTCC and ΔGTCC revealed their respective strengths in voice classification tasks, offering insights for future developments in voice authentication systems. The study not only advances voice authentication technologies but also highlights the crucial role of feature selection in improving the robustness of models designed to safeguard against synthetic voice threats.
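The feature-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of scikit-learn, the 26-column feature matrix, the 80/20 split, and the helper names `make_toy_features` and `select_and_fit` are all assumptions made for illustration, and the data are random placeholders rather than GTCC or ΔGTCC values extracted from the FoR Dataset.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_toy_features(n_samples=400, n_features=26, n_informative=5, seed=0):
    """Hypothetical stand-in for a cepstral feature matrix (NOT real GTCC data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)           # 0 = real, 1 = synthetic (assumed coding)
    X = rng.normal(size=(n_samples, n_features))
    X[:, :n_informative] += 1.5 * y[:, None]          # make the first columns class-dependent
    return X, y


def select_and_fit(X, y, n_features_to_select=5, seed=0):
    """Rank features with RFE around logistic regression and score on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Standardize so coefficient magnitudes are comparable across features.
    scaler = StandardScaler().fit(X_train)
    # RFE refits the model repeatedly, dropping the weakest feature each round (step=1).
    selector = RFE(
        LogisticRegression(max_iter=1000),
        n_features_to_select=n_features_to_select,
        step=1,
    )
    selector.fit(scaler.transform(X_train), y_train)
    accuracy = selector.score(scaler.transform(X_test), y_test)
    return selector.support_, accuracy


X, y = make_toy_features()
mask, accuracy = select_and_fit(X, y)
print("selected feature indices:", np.flatnonzero(mask))
print("held-out accuracy:", round(accuracy, 3))
```

With real recordings, the placeholder matrix would be replaced by per-recording coefficient vectors. The boolean mask returned by RFE then indicates which coefficients the reduced model retains.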