A two-stage architecture for soundscape classification and preservation
Di Loreto, Samantha; Montelpare, Sergio
2025-01-01
Abstract
Sound classification is a fundamental task in audio signal processing with applications ranging from environmental monitoring to urban planning. We present a novel two-stage system for open-set audio classification that integrates deep learning techniques with Schafer’s soundscape theory. The two-stage architecture consists of: (1) a variational autoencoder (VAE) that first learns compressed representations of acoustic features and identifies distinctive sounds through reconstruction-error analysis, and (2) a convolutional neural network (CNN) that then classifies these sounds into Schafer’s theoretical categories (keynotes, sound signals, and soundmarks) using the learned features. This sequential approach allows the system to first understand what makes sounds unique before categorizing their role in the soundscape. The VAE effectively learns the latent representation of acoustic features from mel-spectrograms, while the CNN leverages these representations for classification. We evaluated the system on standard datasets including UrbanSound8K, ESC-50, URBAN-SED, and TUT Urban Acoustic Scenes, achieving an average accuracy of 80.7% across all datasets. Additionally, we tested our approach on a dataset of binaural recordings collected in the university neighborhood of Pescara, Italy. Our research provides empirical validation of Schafer’s theoretical framework through quantitative metrics, demonstrating strong alignment between computational classifications and theoretical descriptions. The proposed methodology advances soundscape analysis by objectively quantifying previously qualitative categories, enabling automated classification while maintaining theoretical fidelity, and creating a foundation for soundscape-preservation policies that protect acoustic identity as a form of intangible cultural heritage.
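
To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a VAE that compresses mel-spectrogram patches and scores distinctiveness by per-sample reconstruction error, followed by a small CNN that assigns one of Schafer’s three categories. All layer sizes, the 32-dimensional latent space, the 128×128 input patches, and the late fusion of the latent code into the classifier are illustrative assumptions, not the authors’ published configuration.

```python
# Illustrative sketch only; assumes 128x128 mel-spectrogram patches in [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    """Stage 1: compress mel-spectrograms; high reconstruction error
    flags distinctive sounds (assumed design, not the paper's)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 32 * 32, latent_dim)
        self.fc_logvar = nn.Linear(32 * 32 * 32, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 32 * 32 * 32)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        recon = self.dec(self.fc_dec(z).view(-1, 32, 32, 32))
        return recon, mu, logvar

class SchaferCNN(nn.Module):
    """Stage 2: CNN over the spectrogram, fused with the VAE latent,
    predicting keynote / sound signal / soundmark (3 classes)."""
    def __init__(self, latent_dim=32, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),                 # -> (N, 32)
        )
        self.head = nn.Linear(32 + latent_dim, n_classes)

    def forward(self, x, z):
        return self.head(torch.cat([self.conv(x), z], dim=1))

# Usage: score distinctiveness via per-sample reconstruction error,
# then classify flagged sounds into Schafer's categories.
vae, clf = SpectrogramVAE(), SchaferCNN()
x = torch.rand(4, 1, 128, 128)                          # batch of mel patches
recon, mu, logvar = vae(x)
err = F.mse_loss(recon, x, reduction="none").mean(dim=(1, 2, 3))  # (N,)
logits = clf(x, mu)                                     # category scores
```

In this sketch the classifier sees both the raw spectrogram and the VAE latent; the threshold on `err` separating ordinary from distinctive sounds would be tuned on held-out data.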


