Authors:
Sang Hoon Chung, London Glenn, Keyanu Maloney, Hao Zhu
Department of Electrical & Computer Engineering, Texas A&M University
Acoustic scene classification (ASC) aims to assign semantic labels—such as park, metro, or shopping mall—to audio recordings that represent real-world urban environments. These classifications support a variety of applications, including context-aware mobile systems, environmental monitoring, and intelligent sensing in smart-city infrastructures.
This project evaluates three distinct modeling paradigms for ASC under a unified preprocessing and evaluation framework. The study investigates:
Support Vector Machines (SVMs) using pooled log-mel statistics
Random Forests (RFs) leveraging decision-tree ensembles
Convolutional Recurrent Neural Networks (CRNNs) that learn hierarchical time–frequency patterns directly from spectrograms
A comprehensive data preparation pipeline was designed to standardize inputs across models, incorporating resampling, amplitude normalization, log-mel spectrogram computation, and multiple forms of data augmentation. In addition, grouped train/validation/test splitting based on recording locations ensures realistic generalization by preventing location-level leakage.
The project aims not only to compare performance across classical and deep learning approaches but also to understand the trade-offs in representational capacity, interpretability, computational requirements, and robustness. Together, these analyses provide insight into how different modeling strategies respond to the structure of the TAU dataset and what implications this has for future work in urban audio recognition.
This study utilizes the TAU Urban Acoustic Scenes 2019 (Development) dataset, which contains 10-second real-world audio recordings across 10 acoustic scene classes:
Airport
Bus
Metro
Metro station
Park
Public square
Shopping mall
Street, pedestrian
Street, traffic
Tram
The recordings were captured in multiple European cities (e.g., Barcelona, Helsinki, Lisbon, London, Lyon, Milan, Paris, Prague, Stockholm, Vienna) and across numerous distinct recording locations within each city. This multi-city, multi-location structure introduces rich environmental variability—making the classification task more realistic and challenging.
Each audio clip is provided at 44.1 kHz stereo, later transformed in this project to 22.05 kHz mono for consistency and computational efficiency. For classical models, pooled statistics (mean, standard deviation, percentiles) of log-mel spectrograms create fixed-length feature vectors. For deep learning models, full-resolution log-mel spectrograms are used directly.
| Scene class | Segments | Barcelona | Helsinki | Lisbon | London | Lyon | Milan | Paris | Prague | Stockholm | Vienna |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Airport | 1440 | 128 | 149 | 144 | 145 | 144 | 144 | 156 | 144 | 158 | 128 |
| Bus | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Metro | 1440 | 141 | 144 | 144 | 146 | 144 | 144 | 144 | 144 | 145 | 144 |
| Metro station | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Park | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Public square | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Shopping mall | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Street, pedestrian | 1440 | 145 | 145 | 144 | 145 | 144 | 144 | 144 | 144 | 145 | 140 |
| Street, traffic | 1440 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Tram | 1440 | 143 | 145 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| Total | 14400 | 1421 | 1447 | 1440 | 1444 | 1440 | 1440 | 1452 | 1440 | 1456 | 1420 |
| Scene class | Locations | Barcelona | Helsinki | Lisbon | London | Lyon | Milan | Paris | Prague | Stockholm | Vienna |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Airport | 40 | 4 | 3 | 4 | 3 | 4 | 4 | 4 | 6 | 5 | 3 |
| Bus | 71 | 4 | 4 | 11 | 7 | 7 | 7 | 11 | 10 | 6 | 4 |
| Metro | 67 | 3 | 5 | 11 | 4 | 9 | 8 | 9 | 10 | 4 | 4 |
| Metro station | 57 | 5 | 6 | 4 | 12 | 5 | 4 | 9 | 4 | 4 | 4 |
| Park | 41 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 4 |
| Public square | 43 | 4 | 4 | 4 | 4 | 5 | 4 | 4 | 6 | 4 | 4 |
| Shopping mall | 36 | 4 | 4 | 4 | 2 | 3 | 3 | 4 | 4 | 4 | 4 |
| Street, pedestrian | 46 | 7 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 4 | 4 |
| Street, traffic | 43 | 4 | 4 | 4 | 5 | 4 | 6 | 4 | 4 | 4 | 4 |
| Tram | 70 | 4 | 4 | 6 | 9 | 7 | 11 | 9 | 11 | 5 | 4 |
| Total | 514 | 43 | 42 | 56 | 54 | 52 | 56 | 63 | 65 | 45 | 39 |
The TAU Urban Acoustic Scenes 2019 dataset consists of 10-second stereo audio clips recorded across multiple cities and locations. To ensure consistent and robust inputs, a unified data preparation pipeline was implemented as follows.
Resampling & Channel Reduction
Amplitude Normalization
Log-Mel Spectrograms
Pooled Feature Statistics (SVM & Random Forest)
Full Spectrogram Input (CRNN)
To improve robustness and limit overfitting, augmentation is applied during training.
Waveform-Level Augmentations
Feature-Space Augmentation (Classical models)
To avoid location-level leakage and ensure realistic evaluation:
Waveform & Spectrogram Plots
PCA / t-SNE of Feature Space

Figure 1. Waveforms (left) and log-mel spectrograms (right) for various urban sound scenes.

Figure 2. PCA of MFCC pooled features for urban sound samples.
This project evaluates three different modeling paradigms for acoustic scene classification: a Support Vector Machine (SVM), a Random Forest (RF), and a Convolutional Recurrent Neural Network (CRNN). Each model reflects a distinct approach to learning from audio features, ranging from classical machine learning to deep neural architectures.
The SVM operates on pooled log-mel statistical features extracted from each audio clip. These feature vectors combine the mean, standard deviation, and selected percentiles of the spectrogram to form a compact representation. An RBF kernel is used to capture non-linear class boundaries in this high-dimensional space.
Key Characteristics
SVMs provide stable optimization and strong performance when paired with well-engineered audio features.
The Random Forest classifier builds an ensemble of decision trees trained on bootstrap samples of the pooled log-mel feature vectors. Each tree explores different feature subsets, making the ensemble robust to noise and feature correlations.
Key Characteristics
Random Forests trade off a small amount of accuracy for interpretability and robustness.
The CRNN processes the full log-mel spectrogram as a time–frequency image and learns hierarchical features directly from the data. It combines convolutional layers for local pattern extraction with a recurrent layer that models temporal dependencies.
Architecture Overview
CRNNs are expressive and powerful but require more computation and careful tuning. They can underperform when training data is limited or when hyperparameters are not extensively optimized.
To explore the representational quality of the CRNN, the final softmax layer was removed and the penultimate LSTM output was used as a fixed-dimensional embedding. A k-Nearest Neighbors (k-NN) classifier was then trained on these embeddings.
Key Insights
This extension provides a promising direction for deeper research despite the CRNN underperforming in its baseline configuration.
The metric-learning extension evaluates how well a simple distance-based classifier (k-NN) performs when applied to the embeddings extracted from the CRNN’s penultimate layer. This approach substantially improves the baseline CRNN softmax classifier, increasing the test accuracy from 48.25% → 52.08% and the macro-F1 score from 0.458 → 0.510.
Performance gains are consistent across several difficult scene classes. For example, the k-NN variant improves F1-scores to 0.45 on airport, 0.54 on bus, 0.51 on metro, and 0.78 on park, while maintaining strong results on street_traffic (F1 = 0.63). The macro-averaged precision, recall, and F1 for the k-NN system are approximately 0.51, 0.53, and 0.51, respectively.
These results indicate that although the CRNN’s softmax classifier underperforms relative to classical models, the embedding representation learned by the CRNN is structurally meaningful and useful for downstream classification. By replacing the final linear softmax layer with a simple k-NN classifier, the system recovers several percentage points of accuracy and macro-F1, narrowing the performance gap to the Random Forest model.
Overall, the study shows that SVM remains the strongest model, but RF and CRNN also provide competitive performance, with the CRNN’s embeddings in particular offering promise for metric-learning or few-shot extensions.