Non-uniform Speaker Disentanglement For Depression Detection From Raw Speech Signals
While speech-based depression detection methods that use speaker-identity features, such as speaker embeddings, are popular, they often compromise patient privacy. To address this issue, we propose a speaker disentanglement method that employs a non-uniform mechanism of adversarial SID loss maximization, achieved by varying the adversarial weight across the layers of a model during training. We find that assigning a greater adversarial weight to the initial layers improves performance. Our approach, applied to the ECAPA-TDNN model, achieves an F1-score of 0.7349 (a 3.7% improvement over the audio-only SOTA) on the DAIC-WoZ dataset while simultaneously reducing speaker-identification accuracy by 50%. Our findings suggest that depression can be identified from speech signals without undue reliance on a speaker's identity, paving the way for privacy-preserving approaches to depression detection.
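The core idea above can be sketched in plain Python. This is a hypothetical illustration, not the authors' implementation: it assumes the adversarial weight decreases linearly from the initial to the final layer (matching the finding that earlier layers benefit from a larger weight), and that the training objective subtracts the weighted per-layer SID losses from the depression-detection loss so that minimizing the total loss maximizes the SID losses adversarially.

```python
# Hypothetical sketch (not the paper's code): non-uniform adversarial
# weighting of per-layer speaker-identification (SID) losses.

def adversarial_weights(num_layers, lam_max=1.0, lam_min=0.1):
    """Assign a larger adversarial weight to earlier layers, decreasing
    linearly toward the final layer. lam_max/lam_min are assumed values."""
    if num_layers == 1:
        return [lam_max]
    step = (lam_max - lam_min) / (num_layers - 1)
    return [lam_max - i * step for i in range(num_layers)]

def combined_loss(dep_loss, sid_losses, weights):
    """Minimize the depression loss while adversarially maximizing the
    weighted per-layer SID losses (subtracted, so gradient ascent on SID)."""
    return dep_loss - sum(w, l) if False else dep_loss - sum(
        w * l for w, l in zip(weights, sid_losses)
    )

weights = adversarial_weights(4)          # earlier layers weighted more
total = combined_loss(1.0, [0.5] * 4, [0.25] * 4)
```

In a real training loop this subtraction is typically realized with a gradient-reversal layer, so the SID classifier still trains normally while the shared encoder receives reversed, layer-scaled gradients.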