Learning latent representations for spatialized sound scenes

Vandœuvre-lès-Nancy

Work-study contract

Université de Lorraine

Published on April 15

Job description

Topic description

Motivations
Humans use binaural hearing to process complex sound scenes, such as speech communication in noisy environments (the cocktail party problem). Despite extensive research, this ability remains poorly understood [1]. Researchers draw inspiration from this physiological phenomenon to develop multichannel audio processing techniques, such as speech enhancement or computational auditory scene analysis. While some early work in CASA (Computational Auditory Scene Analysis) explored human perception of audio scenes, most research focuses on the acoustic properties of the signal in order to extract a source [2] or localize sounds [3]. Few studies address how this information can be integrated.
Recent advances in deep learning have shown that models can effectively structure the latent space of audio signals, with an emphasis on phonetic content. In parallel, efforts are under way to relate audio processing models to auditory perception, at the brain level [4] or the psychophysical level [5]. However, this work is often limited to a single aspect, either the content or the spatial localization of the source, whereas both are essential to understanding how humans solve the cocktail party problem. This project aims to advance audio modeling by proposing new multichannel representation models with structured latent spaces.

Objectives
This thesis explores the representation of a sound source's position independently of its content, as well as the interaction between localization and signal representation. Most existing models focus on signal content and ignore spatial localization, notably because they are single-channel and have limited access to spatial information. We will first extend these single-channel models to exploit multichannel inputs, thereby incorporating spatial information. We will then constrain the latent space of the multichannel model to explicitly encode the source's location.
Two approaches will be studied. With explicit knowledge of the source position during training: use of a multitask framework (signal reconstruction + localization) or of a direct constraint on the latent space (an additional localization task). This method has proven effective at reproducing natural brain structures related to sound identity in the latent space. Without explicit knowledge: use of student-teacher approaches (localization target provided by an auxiliary network [6]) or self-supervised approaches [7] (for example contrastive learning [8], where two versions of the same scene are fed to the model after transformations that do not affect localization).
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Motivations
Humans rely on binaural hearing to process complex sound scenes. One example is speech communication in noisy environments with many competing sources, a challenge known as the cocktail party problem. Despite numerous studies on the topic, this ability is not fully understood [1]. Nevertheless, researchers have drawn inspiration from this physiological phenomenon for decades to develop multi-channel audio processing techniques, such as speech enhancement algorithms and computational auditory scene analysis. While some early work in CASA focused on how humans perceive complex audio scenes, most efforts have concentrated on the acoustic properties of the audio signal to achieve specific goals, such as extracting a source of interest [2] or localizing sound sources in space [3]. These studies rarely address how this information could be coded and processed together.
Recent work on deep-learning-based audio representations has demonstrated that models can provide a well-structured latent space for audio signals [4], emphasizing aspects such as phonetic content. Concurrently, there has been an increasing effort to connect audio processing models (such as speech recognition or sound source localization models) to auditory perception, either at the brain level [4] or at the psychophysical level [5]. However, these works primarily focus on a single aspect of the signal, either its content or the spatial localization of sound sources, but not both at once, even though both are needed to understand how humans tackle the cocktail party problem. The aim of the project is to advance research in computer-based audio modeling and auditory modeling by proposing new multi-channel audio representation models with structured latent spaces.

Goals and Objectives
This PhD aims to explore the representation of a sound source's position independently of its content, as well as the interplay between localization and source signal representation. Most existing models focus on representing the signal content while ignoring acoustic aspects such as sound source localization. This is partly because these models are mainly single-channel models that can hardly access any spatial information. In a first step, we will extend existing single-channel audio representation models to leverage multi-channel inputs that can account for spatial information. Then we will shape the latent space of the multi-channel audio representation model to explicitly encode the spatial localization of the sound source.
We will investigate two approaches: with or without explicit knowledge of the source position during training. When the source position is explicitly known at training time, we will leverage this information either in a multitask framework (where the decoder is split into one branch that reconstructs the signal and another that localizes the source) or as a constraint applied directly to the latent space (by performing an additional source localization task on the latent representation). This kind of constraint has been shown to be effective in reproducing, for example, natural auditory cortex representation structures for sound identity in the latent space. To leverage multi-channel data without explicit position information, we will investigate student-teacher approaches, where the localization target is obtained from an auxiliary localization network [6], and self-supervised approaches [7], for example based on contrastive learning [8], where two versions of the same scene (subject to transforms that do not impact localization) are fed to the model.
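As a rough illustration of the two training signals mentioned above, the sketch below shows one possible form of a multitask objective (reconstruction plus localization) and of a contrastive objective over two views of the same scenes. It is not the project's method: the function names, the `loc_weight` and `temperature` parameters, the azimuth-only description of position, and the InfoNCE form of the contrastive loss are all assumptions chosen for illustration.

```python
import math


def multitask_loss(signal, reconstruction, azimuth_deg, predicted_azimuth_deg,
                   loc_weight=0.5):
    """Hypothetical multitask objective: reconstruction MSE + weighted angular error.

    Assumes the source position is described by a single azimuth in degrees.
    """
    mse = sum((s - r) ** 2 for s, r in zip(signal, reconstruction)) / len(signal)
    # Wrap the angular difference to [-pi, pi] so that 350 and -10 degrees match.
    diff = math.radians(predicted_azimuth_deg - azimuth_deg)
    wrapped = math.atan2(math.sin(diff), math.cos(diff))
    return mse + loc_weight * wrapped ** 2


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice of contrastive objective).

    view_a[i] and view_b[i] are embeddings of two transformed versions of
    scene i; the other scenes in the batch act as negatives.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine_similarity(view_a[i], view_b[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        # Cross-entropy with the matching scene (index i) as the target.
        total += log_sum_exp - logits[i]
    return total / n
```

In this sketch, the contrastive loss is small when the two views of each scene map to similar embeddings and large when they are confused with other scenes; the multitask loss is zero only when both the signal and the azimuth are recovered exactly.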
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

PhD start date: 01/10/

Funding category

Other public funding

Funding further details

ANR (funding from research funding agencies)
