SSDQ: Target Speaker Extraction via Semantic and Spatial Dual Querying

1Nanjing University of Aeronautics and Astronautics
2University of Science and Technology Beijing

Task


Overview of our proposed TSE via semantic and spatial dual querying. The spatial query specifies an explicit spatial region, while the semantic query describes speaker attributes and a coarse direction; together they allow precise isolation of the target speaker (Spk: speaker).
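For illustration only, the two query types might look like the snippet below; the field names and wording are hypothetical and do not reflect the dataset's actual schema.

```python
# Hypothetical dual query for one target speaker; keys and phrasing are
# illustrative, not SS-Libri's actual format.
spatial_query = {"azimuth_range_deg": (30, 60)}  # region-based spatial query
semantic_query = "the female speaker talking from roughly the front-left"  # natural-language semantic query
```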

Abstract

Target Speaker Extraction (TSE) in real-world multi-speaker environments is highly challenging. Previous works have mostly relied on pre-enrollment speech to extract the target speaker's voice. However, such methods break down in spontaneous scenarios where pre-enrollment speech is unavailable, and they typically leave spatial information unexploited. To address this, we propose Semantic and Spatial Dual Querying (SSDQ), a unified framework that integrates natural language descriptions and region-based spatial queries to guide TSE. SSDQ employs dual query encoders for semantic and spatial cues, fusing them into the audio stream via a Feature-wise Linear Modulation (FiLM)-based interaction module. A novel Controllable Feature Wrapping (CFW) mechanism further enables dynamic balancing between speaker identity and acoustic clarity. We also introduce SS-Libri, a spatialized mixture dataset designed to benchmark dual-query systems. Extensive experiments demonstrate that SSDQ achieves superior extraction accuracy and robustness under challenging conditions, yielding a Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi) of 19.63 dB, a Signal-to-Noise Ratio improvement (SNRi) of 20.30 dB, a Perceptual Evaluation of Speech Quality (PESQ) score of 1.83, and a Short-Time Objective Intelligibility (STOI) score of 0.259.

SSDQ framework


The architecture of the proposed SSDQ network consists of: (1) a Speech Encoder, which encodes the input mixture speech \(\textbf{y}^0(\tau)\) into frame-level representations \(\textbf{y}_t\) using Conv1D and ReLU layers; (2) Spatial Feature Calculation, which computes the spatial cue vector \(\textbf{c}_{spa}\) from IPD and TPD features derived from spatial samples guided by the Region Query \(Q_{reg}\); (3) a Text Encoder, which encodes the Text Query \(Q_{text}\) into a semantic embedding \(\textbf{c}_{sem}\); (4) a Fusion module, which integrates \(\textbf{y}_t\), \(\textbf{c}_{spa}\), and \(\textbf{c}_{sem}\) via FiLM; (5) Mask Estimation, which applies a Dual-Path Recurrent Neural Network to model both intra- and inter-chunk dependencies; and (6) CFW, which uses ResBlock1D and TCN blocks to generate a wrapped feature \(\textbf{w}_t\) that is added to \(\textbf{y}_t\). The final target speech \(\hat{\textbf{z}}(\tau)\) is estimated from the wrapped features.
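The FiLM-based fusion in step (4) can be illustrated with a minimal PyTorch sketch. The module layout and dimensions below are our assumptions for illustration, not the released implementation: the concatenated spatial and semantic cues predict a per-channel scale and shift that modulate the frame-level mixture features.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """Minimal FiLM-style fusion sketch: the concatenated spatial and semantic
    cues predict a per-channel scale (gamma) and shift (beta) that modulate
    the frame-level mixture features y_t."""

    def __init__(self, feat_dim: int, spa_dim: int, sem_dim: int):
        super().__init__()
        cond_dim = spa_dim + sem_dim
        self.to_gamma = nn.Linear(cond_dim, feat_dim)
        self.to_beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, y_t: torch.Tensor, c_spa: torch.Tensor, c_sem: torch.Tensor) -> torch.Tensor:
        # y_t: (batch, frames, feat_dim); c_spa: (batch, spa_dim); c_sem: (batch, sem_dim)
        cond = torch.cat([c_spa, c_sem], dim=-1)
        gamma = self.to_gamma(cond).unsqueeze(1)   # broadcast over frames
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * y_t + beta                  # FiLM: element-wise affine modulation

# Example usage with illustrative sizes:
# fusion = FiLMFusion(feat_dim=256, spa_dim=128, sem_dim=512)
# fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 128), torch.randn(2, 512))
```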

Demos

mixture

gt_spk1

Rezero_spk1

CLAPSep_spk1

Q_sem_spk1 (Ours)

Q_spa_spk1 (Ours)

DualQuery_spk1 (Ours)

mixture

gt_spk2

Rezero_spk2

CLAPSep_spk2

Q_sem_spk2 (Ours)

Q_spa_spk2 (Ours)

DualQuery_spk2 (Ours)

Results

Table 1 presents a comparative evaluation of TSE methods using different query and cue types on the SS-Libri dataset.

Fig. 3 presents the relationship between input SNR levels and four key evaluation metrics (SI-SDRi, SDRi, PESQ, and STOI); all values are min-max normalized to the range \([0, 1]\) to enable cross-metric comparison.
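For reference, min-max normalization rescales each metric's curve independently; a minimal NumPy sketch (the function name is ours) is shown below.

```python
import numpy as np

def minmax_normalize(values):
    """Rescale one metric curve to [0, 1] so SI-SDRi, SDRi, PESQ and STOI
    can be compared on a shared axis (assumes the curve is not constant)."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# e.g. normalized_pesq = minmax_normalize(pesq_per_snr_level)
```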

Fig. 4 illustrates how SI-SDRi and SDRi vary under different \(\lambda\) settings in the CFW module.
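As a rough illustration of the role of \(\lambda\), one can think of it as scaling the wrapped feature before it is added back to the encoder output. The additive-scaling form below is our assumption and covers only the blending step, not the full CFW module with its ResBlock1D and TCN layers.

```python
import torch

def cfw_blend(y_t: torch.Tensor, w_t: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend the wrapped feature w_t into the mixture representation y_t.
    lam controls the trade-off between speaker identity and acoustic clarity
    described in the paper (the scaling form here is an assumption)."""
    return y_t + lam * w_t
```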