[C93] SODA: A Unified Framework for Joint Estimation of Speaker Orientation and Direction of Arrival

Abstract

Next-generation voice user interfaces must jointly infer a speaker’s orientation (direction of voice, DOV) and direction of arrival (DOA), yet practical deployment of such joint estimators has been hindered by large microphone arrays and the lack of unified datasets. We propose SODA, a unified framework for few-channel (2 or 4) settings that addresses these barriers. SODA repurposes an existing DOV dataset to generate DOA labels, employs a multi-task architecture, and, through stepwise lightweight design, selects a practical complexity–performance balance to support deployment on resource-constrained devices. Experiments show that SODA with 2 or 4 channels surpasses the 16-channel baseline on distance and orientation metrics, demonstrating the feasibility of high-precision spatial audio sensing with few microphones.

Publication
IEEE International Conference on Acoustics, Speech and Signal Processing 2026