SSDQ: Target Speaker Extraction via Semantic and Spatial Dual Querying

1Nanjing University of Aeronautics and Astronautics
2University of Science and Technology Beijing

Task


Overview of our proposed TSE via semantic and spatial dual querying. The spatial query specifies an explicit spatial region, while the semantic query describes speaker attributes and a coarse direction; together they allow precise isolation of the target speaker (Spk: speaker).
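For illustration only, the two query types might look like the snippet below; the field names and wording are hypothetical and do not reflect the dataset's actual schema.

```python
# Hypothetical dual query for one target speaker; keys and phrasing are
# illustrative, not SS-Libri's actual format.
spatial_query = {"azimuth_range_deg": (30, 60)}  # region-based spatial query
semantic_query = "the female speaker talking from roughly the front-left"  # natural-language semantic query
```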

Abstract

Target Speaker Extraction (TSE) in real-world multi-speaker environments is highly challenging. Previous works have mostly relied on pre-enrollment speech to extract the target speaker's voice. However, such methods break down in spontaneous scenarios where pre-enrollment speech is unavailable, and they typically leave spatial information unexploited. To address this, we propose Semantic and Spatial Dual Querying (SSDQ), a unified framework that integrates natural language descriptions and region-based spatial queries to guide TSE. SSDQ employs dual query encoders for semantic and spatial cues, fusing them into the audio stream via a Feature-wise Linear Modulation (FiLM)-based interaction module. A novel Controllable Feature Wrapping (CFW) mechanism further enables dynamic balancing between speaker identity and acoustic clarity. We also introduce SS-Libri, a spatialized mixture dataset designed to benchmark dual-query systems. Extensive experiments demonstrate that SSDQ achieves superior extraction accuracy and robustness under challenging conditions, yielding a Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi) of 19.63 dB, a Signal-to-Noise Ratio improvement (SNRi) of 20.30 dB, a Perceptual Evaluation of Speech Quality (PESQ) score of 1.83, and a Short-Time Objective Intelligibility (STOI) score of 0.259.

SSDQ framework


The architecture of the proposed SSDQ network consists of: (1) a Speech Encoder, which encodes the input mixture speech \(\textbf{y}^0(\tau)\) into frame-level representations \(\textbf{y}_t\) using Conv1D and ReLU layers; (2) Spatial Feature Calculation, which computes the spatial cue vector \(\textbf{c}_{spa}\) from IPD and TPD features derived from spatial samples guided by the Region Query \(Q_{reg}\); (3) a Text Encoder, which encodes the Text Query \(Q_{text}\) into a semantic embedding \(\textbf{c}_{sem}\); (4) a Fusion module, which integrates \(\textbf{y}_t\), \(\textbf{c}_{spa}\), and \(\textbf{c}_{sem}\) via FiLM; (5) Mask Estimation, which applies a Dual-Path Recurrent Neural Network to model both intra- and inter-chunk dependencies; and (6) CFW, which uses ResBlock1D and TCN blocks to generate a wrapped feature \(\textbf{w}_t\) that is added to \(\textbf{y}_t\). The final target speech \(\hat{\textbf{z}}(\tau)\) is estimated from the wrapped features.
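The FiLM-based fusion in step (4) can be illustrated with a minimal PyTorch sketch. The module layout and dimensions below are our assumptions for illustration, not the released implementation: the concatenated spatial and semantic cues predict a per-channel scale and shift that modulate the frame-level mixture features.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """Minimal FiLM-style fusion sketch: the concatenated spatial and semantic
    cues predict a per-channel scale (gamma) and shift (beta) that modulate
    the frame-level mixture features y_t."""

    def __init__(self, feat_dim: int, spa_dim: int, sem_dim: int):
        super().__init__()
        cond_dim = spa_dim + sem_dim
        self.to_gamma = nn.Linear(cond_dim, feat_dim)
        self.to_beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, y_t: torch.Tensor, c_spa: torch.Tensor, c_sem: torch.Tensor) -> torch.Tensor:
        # y_t: (batch, frames, feat_dim); c_spa: (batch, spa_dim); c_sem: (batch, sem_dim)
        cond = torch.cat([c_spa, c_sem], dim=-1)
        gamma = self.to_gamma(cond).unsqueeze(1)   # broadcast over frames
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * y_t + beta                  # FiLM: element-wise affine modulation

# Example usage with illustrative sizes:
# fusion = FiLMFusion(feat_dim=256, spa_dim=128, sem_dim=512)
# fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 128), torch.randn(2, 512))
```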

Demos

mixture

gt_spk1

Rezero_spk1

CLAPSep_spk1

Q_sem_spk1 (Ours)

Q_spa_spk1 (Ours)

DualQuery_spk1 (Ours)

mixture

gt_spk2

Rezero_spk2

CLAPSep_spk2

Q_sem_spk2 (Ours)

Q_spa_spk2 (Ours)

DualQuery_spk2 (Ours)

Results

Table 1 presents a comparative evaluation of TSE methods using different query and cue types on the SS-Libri dataset.

Fig. 3 presents the relationship between input SNR levels and four key evaluation metrics (SI-SDRi, SDRi, PESQ, and STOI); all values are min-max normalized to the range \([0, 1]\) to enable cross-metric comparison.
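For reference, min-max normalization rescales each metric's curve independently; a minimal NumPy sketch (the function name is ours) is shown below.

```python
import numpy as np

def minmax_normalize(values):
    """Rescale one metric curve to [0, 1] so SI-SDRi, SDRi, PESQ and STOI
    can be compared on a shared axis (assumes the curve is not constant)."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# e.g. normalized_pesq = minmax_normalize(pesq_per_snr_level)
```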

Fig. 4 illustrates how SI-SDRi and SDRi vary under different \(\lambda\) settings in the CFW module.
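As a rough illustration of the role of \(\lambda\), one can think of it as scaling the wrapped feature before it is added back to the encoder output. The additive-scaling form below is our assumption and covers only the blending step, not the full CFW module with its ResBlock1D and TCN layers.

```python
import torch

def cfw_blend(y_t: torch.Tensor, w_t: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend the wrapped feature w_t into the mixture representation y_t.
    lam controls the trade-off between speaker identity and acoustic clarity
    described in the paper (the scaling form here is an assumption)."""
    return y_t + lam * w_t
```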