
EVALUATING IDENTITY LEAKAGE IN SPEAKER DE-IDENTIFICATION SYSTEMS
Seungmin Seo, Oleg Aulov, Afzal Godil, Kevin Mangold
National Institute of Standards and Technology, Gaithersburg, MD, USA
ABSTRACT
Speaker de-identification aims to conceal a speaker's identity while preserving the intelligibility of the underlying speech. We introduce a benchmark that quantifies residual identity leakage with three complementary metrics: equal error rate (EER), cumulative match characteristic (CMC) hit rate, and embedding-space similarity measured via canonical correlation analysis and Procrustes analysis. Evaluation results reveal that all state-of-the-art speaker de-identification systems leak identity information. The highest-performing system in our evaluation performs only slightly better than random guessing, while the lowest-performing system yields a 45% hit rate within the top 50 candidates on the CMC curve. These findings highlight persistent privacy risks in current speaker de-identification technologies.
Index Terms— speaker de-identification, voice privacy, identity leakage
1. INTRODUCTION
The speech we stream through videoconferencing platforms, voice assistants, and call-center recorders conveys far more than lexical content: it embeds biometric signatures that can single out an individual. Recent privacy statutes—most prominently the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA)—explicitly classify these signatures as personally identifiable information [1, 2].
Consequently, speaker de-identification (SDID) systems that operate on live, spontaneous speech have become a research priority. Unlike offline voice-conversion or text-to-speech pipelines, real-time SDID must satisfy millisecond-scale latency budgets and preserve intelligibility and naturalness, while withstanding attacks from state-of-the-art speaker-recognition models [3].
Individual components—e.g., disentangled speaker–content representation learning [4] and neural audio codecs [5]—have shown promise, yet the field still lacks a rigorous answer to a central question: How much identity information "leaks" through today's end-to-end SDID pipelines?
Prior studies are difficult to compare [6, 7, 8, 9, 10, 11, 12]; most rely on a single speaker-recognition back-end and a solitary metric such as equal error rate (EER). To advance beyond this fragmented landscape, we introduce a multi-view identity-leakage evaluation suite that integrates EER, cumulative match characteristic (CMC) analysis, and embedding-space similarity measured with canonical correlation analysis (CCA) followed by Procrustes alignment [13].
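
For concreteness, the sketch below shows how the first two metrics can be computed from a probe-by-gallery similarity matrix produced by an attacking speaker-recognition back-end. This is a minimal illustrative sketch, not the evaluation code used in this study: the function names are ours, scores are assumed to be higher-is-more-similar, and scikit-learn's roc_curve supplies the ROC operating points.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # EER: the operating point where the false-accept rate equals the
    # false-reject rate. labels: 1 = same-speaker trial, 0 = different.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

def cmc_hit_rate(score_matrix, gallery_ids, probe_ids, k=50):
    # CMC hit rate: fraction of probes whose true speaker appears among
    # the k gallery speakers ranked most similar to the probe.
    order = np.argsort(-score_matrix, axis=1)   # gallery ranks, descending
    ranked = np.asarray(gallery_ids)[order]     # speaker IDs by rank
    hits = (ranked[:, :k] == np.asarray(probe_ids)[:, None]).any(axis=1)
    return float(hits.mean())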
Each perspective exposes a distinct facet of residual speaker information: EER quantifies binary verification risk, CMC reflects search-rank leakage, and the embedding analysis localizes where representations converge in latent space. Each SDID system was required to meet the real-time processing budget, evaluated independently by a separate test-and-evaluation agency; the present paper concentrates on privacy metrics. Under this protocol, every system leaks identity: the best-performing system exceeds random guessing only marginally, yet still significantly, whereas the weakest reaches a 45% hit rate among the top-50 candidates on the CMC curve. These findings underscore the persistent challenge of robust, privacy-preserving speaker de-identification.
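
The embedding-space analysis can be approximated as follows. This is a hedged sketch rather than the exact procedure: it assumes row-paired per-speaker mean embeddings and uses scikit-learn's CCA together with SciPy's orthogonal_procrustes; the function name and component count are illustrative.

import numpy as np
from sklearn.cross_decomposition import CCA
from scipy.linalg import orthogonal_procrustes

def embedding_leakage(orig_emb, deid_emb, n_components=10):
    # orig_emb, deid_emb: (n_speakers, dim) mean embeddings, with row i
    # of each matrix belonging to the same speaker.
    # Canonical correlations: strength of shared directions across spaces.
    cca = CCA(n_components=n_components)
    x_c, y_c = cca.fit_transform(orig_emb, deid_emb)
    corrs = [np.corrcoef(x_c[:, i], y_c[:, i])[0, 1]
             for i in range(n_components)]

    # Procrustes: best orthogonal rotation mapping the de-identified
    # embeddings onto the originals after centering; a small residual
    # means the two spaces nearly coincide, i.e., identity leaks.
    x = orig_emb - orig_emb.mean(axis=0)
    y = deid_emb - deid_emb.mean(axis=0)
    rot, _ = orthogonal_procrustes(y, x)
    residual = np.linalg.norm(y @ rot - x) / np.linalg.norm(x)
    return float(np.mean(corrs)), float(residual)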
2. SPEAKER DE-IDENTIFICATION SYSTEMS
The five SDID systems in this study were submitted to NIST for evaluation. All were developed under the IARPA ARTS program (www.iarpa.gov/research-programs/arts) and comprise four performer systems and one baseline built by a Test & Evaluation partner. Note that no system descriptions are publicly available at the time of this writing, so the references reflect relevant work by the same researchers [14, 15, 16, 17].
Each system takes as input a streaming speech segment and outputs a streaming modified version designed to conceal the speaker's identity. The primary goals are (1) to prevent speaker-recognition models from linking original and de-identified segments, and (2) to ensure that de-identified segments generated for the same speaker (under the same or different anonymization profiles) are either consistent or distinct, as appropriate.
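
Goal (1) corresponds to a linkage attack of roughly the following shape. In this sketch the embedding extractor is left abstract and supplied by the caller (e.g., an x-vector or ECAPA-TDNN encoder); the function name is illustrative, not part of the evaluation protocol.

import numpy as np

def linkage_score(embed, original_wav, deid_wav):
    # Cosine similarity between speaker embeddings of the original segment
    # and its de-identified counterpart; values near 1 indicate that an
    # attacker could still link the two, i.e., identity leakage.
    e_orig = embed(original_wav)
    e_deid = embed(deid_wav)
    return float(np.dot(e_orig, e_deid)
                 / (np.linalg.norm(e_orig) * np.linalg.norm(e_deid)))

The same primitive, applied to pairs of de-identified segments from one speaker, can probe goal (2): similarities should be high under a shared anonymization profile and low across distinct profiles.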
3. EVALUATION
3.1. Data
The evaluation set is derived from the Mixer 3 corpus [18].
We retained only native American English speakers with at