ECEN758project

Acoustic Scene Classification: Comparing SVM, Random Forest, and CRNN Models

Authors:
Sang Hoon Chung, London Glenn, Keyanu Maloney, Hao Zhu

Department of Electrical & Computer Engineering, Texas A&M University


Abstract

This project explores the task of acoustic scene classification using the TAU Urban Acoustic Scenes 2019 dataset, which encompasses ten diverse urban environments recorded across multiple cities and recording locations. A unified preprocessing pipeline was implemented, including resampling, amplitude normalization, log-mel spectrogram generation, and data augmentation to enhance robustness under real-world variability. Three modeling paradigms were examined—Support Vector Machines, Random Forests, and a Convolutional Recurrent Neural Network—each representing distinct approaches to audio feature representation and classification. Classical models operate on pooled log-mel statistics, whereas the CRNN processes full spectrograms to learn hierarchical time–frequency patterns. This comparative study highlights the representational trade-offs between feature-engineered methods and deep sequential architectures and provides a foundation for understanding how model complexity, data structure, and preprocessing choices influence performance in urban audio recognition tasks.

Project Overview

Acoustic scene classification (ASC) aims to assign semantic labels—such as park, metro, or shopping mall—to audio recordings that represent real-world urban environments. These classifications support a variety of applications, including context-aware mobile systems, environmental monitoring, and intelligent sensing in smart-city infrastructures.

This project evaluates three distinct modeling paradigms for ASC under a unified preprocessing and evaluation framework. The study investigates:

Support Vector Machines (SVMs) using pooled log-mel statistics

Random Forests (RFs) leveraging decision-tree ensembles

Convolutional Recurrent Neural Networks (CRNNs) that learn hierarchical time–frequency patterns directly from spectrograms

A comprehensive data preparation pipeline was designed to standardize inputs across models, incorporating resampling, amplitude normalization, log-mel spectrogram computation, and multiple forms of data augmentation. In addition, grouped train/validation/test splitting based on recording locations ensures realistic generalization by preventing location-level leakage.

The project aims not only to compare performance across classical and deep learning approaches but also to understand the trade-offs in representational capacity, interpretability, computational requirements, and robustness. Together, these analyses provide insight into how different modeling strategies respond to the structure of the TAU dataset and what implications this has for future work in urban audio recognition.


Dataset Summary

This study utilizes the TAU Urban Acoustic Scenes 2019 (Development) dataset, which contains 10-second real-world audio recordings across 10 acoustic scene classes:

Airport

Bus

Metro

Metro station

Park

Public square

Shopping mall

Street, pedestrian

Street, traffic

Tram

The recordings were captured in ten European cities (Barcelona, Helsinki, Lisbon, London, Lyon, Milan, Paris, Prague, Stockholm, Vienna) and at numerous distinct recording locations within each city. This multi-city, multi-location structure introduces substantial environmental variability, which makes the classification task more realistic and challenging.

Each audio clip is distributed as 48 kHz stereo audio and is converted in this project to 22.05 kHz mono for consistency and computational efficiency. For the classical models, pooled statistics (mean, standard deviation, percentiles) of the log-mel spectrogram form one fixed-length feature vector per clip. For the deep learning model, the full-resolution log-mel spectrogram is used directly.

The dataset also includes metadata identifying recording locations, enabling grouped splitting strategies that preserve class balance while preventing overlaps between training and test environments—an essential requirement for stable and fair evaluation.

Audio Segments (Development Dataset)

Scene class         Segments  Barcelona  Helsinki  Lisbon  London  Lyon  Milan  Paris  Prague  Stockholm  Vienna
Airport                 1440        128       149     144     145   144    144    156     144        158     128
Bus                     1440        144       144     144     144   144    144    144     144        144     144
Metro                   1440        141       144     144     146   144    144    144     144        145     144
Metro station           1440        144       144     144     144   144    144    144     144        144     144
Park                    1440        144       144     144     144   144    144    144     144        144     144
Public square           1440        144       144     144     144   144    144    144     144        144     144
Shopping mall           1440        144       144     144     144   144    144    144     144        144     144
Street, pedestrian      1440        145       145     144     145   144    144    144     144        145     140
Street, traffic         1440        144       144     144     144   144    144    144     144        144     144
Tram                    1440        143       145     144     144   144    144    144     144        144     144
Total                  14400       1421      1447    1440    1444  1440   1440   1452    1440       1456    1420

Recording Locations (Development Dataset)

Scene class         Locations  Barcelona  Helsinki  Lisbon  London  Lyon  Milan  Paris  Prague  Stockholm  Vienna
Airport                    40          4         3       4       3     4      4      4       6          5       3
Bus                        71          4         4      11       7     7      7     11      10          6       4
Metro                      67          3         5      11       4     9      8      9      10          4       4
Metro station              57          5         6       4      12     5      4      9       4          4       4
Park                       41          4         4       4       4     4      4      4       4          5       4
Public square              43          4         4       4       4     5      4      4       6          4       4
Shopping mall              36          4         4       4       2     3      3      4       4          4       4
Street, pedestrian         46          7         4       4       4     4      5      5       5          4       4
Street, traffic            43          4         4       4       5     4      6      4       4          4       4
Tram                       70          4         4       6       9     7     11      9      11          5       4
Total                     514         43        42      56      54    52     56     63      65         45      39

Data Processing

The TAU Urban Acoustic Scenes 2019 dataset consists of 10-second stereo audio clips recorded across multiple cities and locations. To ensure consistent and robust inputs, a unified data preparation pipeline was implemented as follows.


1. Audio Standardization

Resampling & Channel Reduction

Amplitude Normalization
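
The two standardization steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `standardize_audio`, the use of SciPy's polyphase resampler, and peak normalization to 0.99 are all assumptions, since the text only states that clips are resampled to 22.05 kHz mono and amplitude-normalized.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def standardize_audio(stereo, orig_sr, target_sr=22050, peak=0.99):
    """Downmix to mono, resample, and peak-normalize one clip.

    stereo: float array of shape (n_samples, n_channels).
    """
    # Channel reduction: average the channels into a single mono signal.
    mono = stereo.mean(axis=1)

    # Polyphase resampling by the reduced ratio target_sr / orig_sr.
    g = gcd(orig_sr, target_sr)
    mono = resample_poly(mono, target_sr // g, orig_sr // g)

    # Peak normalization (assumed scheme): scale so that max |x| equals `peak`.
    max_abs = np.max(np.abs(mono))
    if max_abs > 0:  # leave silent clips untouched
        mono = mono * (peak / max_abs)
    return mono.astype(np.float32)
```

Other normalization schemes, such as RMS normalization, would fit the same slot; peak scaling is used here only because it is the simplest to verify.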


2. Time–Frequency Feature Construction

Log-Mel Spectrograms
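
As a rough sketch of how a log-mel spectrogram is computed, the NumPy-only example below frames the signal, takes a power spectrum, applies triangular mel filters, and converts to decibels. The parameter values (`n_fft=2048`, `hop=512`, `n_mels=64`) and all function names are assumptions chosen for illustration; the project's actual settings are not stated here, and a library routine such as librosa's `melspectrogram` would normally be used instead.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(y, sr=22050, n_fft=2048, hop=512, n_mels=64):
    """Return a log-mel spectrogram in dB with shape (n_mels, n_frames).

    Assumes len(y) >= n_fft; no padding is applied at the clip edges.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[t * hop:t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T    # floor avoids log(0)
```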

Pooled Feature Statistics (SVM & Random Forest)
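
A minimal sketch of the pooling step, assuming the statistics are taken along the time axis for each mel band. The function name `pool_log_mel` and the specific percentiles (10, 50, 90) are illustrative assumptions; the text only says that mean, standard deviation, and percentiles are used.

```python
import numpy as np


def pool_log_mel(log_mel, percentiles=(10, 50, 90)):
    """Collapse a (n_mels, n_frames) spectrogram into one fixed-length vector."""
    # Each statistic is computed over time, giving one value per mel band.
    stats = [log_mel.mean(axis=1), log_mel.std(axis=1)]
    stats += [np.percentile(log_mel, q, axis=1) for q in percentiles]
    # Output length is n_mels * (2 + len(percentiles)), independent of n_frames.
    return np.concatenate(stats)
```

Because the output length does not depend on the number of frames, clips of any duration map to vectors of the same size, which is what the SVM and Random Forest require.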

Full Spectrogram Input (CRNN)


3. Data Augmentation

To improve robustness and limit overfitting, augmentation is applied during training.

Waveform-Level Augmentations
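
The specific waveform augmentations are not listed here, so the sketch below shows three common choices purely as an assumption: a circular time shift, a random gain, and additive white noise at a fixed signal-to-noise ratio. The function name and all default values are hypothetical.

```python
import numpy as np


def augment_waveform(y, rng, max_shift=0.1, gain_db=6.0, snr_db=30.0):
    """Apply a random time shift, gain change, and noise to a mono waveform."""
    # Circular time shift by up to max_shift of the clip length.
    limit = int(max_shift * len(y))
    y = np.roll(y, rng.integers(-limit, limit + 1))

    # Random gain drawn uniformly from [-gain_db, +gain_db] decibels.
    y = y * 10.0 ** (rng.uniform(-gain_db, gain_db) / 20.0)

    # Additive white Gaussian noise at the requested signal-to-noise ratio.
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
```

A typical call would pass `rng = np.random.default_rng(seed)` and apply the function only to training clips, so that validation and test data stay unmodified.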

Feature-Space Augmentation (Classical models)


4. Grouped Dataset Splitting

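To avoid location-level leakage and ensure realistic evaluation, the train/validation/test split is grouped by recording location, so that all clips from a given location fall into the same subset.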
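
A grouped split can be sketched in plain Python as below. The function name `grouped_split`, the 15% validation and test fractions, and the location key format are assumptions for illustration; the project's actual proportions are not stated.

```python
import random
from collections import defaultdict


def grouped_split(locations, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign clip indices to train/val/test so no location spans two subsets.

    locations: one location identifier per clip, in clip order.
    """
    by_location = defaultdict(list)
    for idx, loc in enumerate(locations):
        by_location[loc].append(idx)

    groups = sorted(by_location)           # fixed order before shuffling
    random.Random(seed).shuffle(groups)

    n_test = max(1, round(test_frac * len(groups)))
    n_val = max(1, round(val_frac * len(groups)))
    assignment = {
        "test": groups[:n_test],
        "val": groups[n_test:n_test + n_val],
        "train": groups[n_test + n_val:],
    }
    # Expand each subset from location identifiers back to clip indices.
    return {name: [i for g in gs for i in by_location[g]]
            for name, gs in assignment.items()}
```

This sketch does not balance classes by itself. Running it once per scene class and merging the results is one way to keep the class proportions similar across subsets.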


5. Visualization and Diagnostic Analysis

Waveform & Spectrogram Plots

PCA / t-SNE of Feature Space

Figure 1. Waveforms (left) and log-mel spectrograms (right) for various urban sound scenes.

Figure 2. PCA of MFCC pooled features for urban sound samples.
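
A projection like the one in Figure 2 can be produced with a few lines of NumPy. The sketch below is an assumed implementation using the singular value decomposition; the function name `pca_project` is hypothetical, and in practice the features would usually be standardized per dimension before projection.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project (n_samples, n_features) data onto its top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per component
    return centered @ vt[:n_components].T, explained[:n_components]
```

The returned variance fractions indicate how much structure a 2-D scatter plot can show. If the first two components explain only a small share, class overlap in the plot does not necessarily mean the classes overlap in the full feature space.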


Model Description

This project evaluates three different modeling paradigms for acoustic scene classification: a Support Vector Machine (SVM), a Random Forest (RF), and a Convolutional Recurrent Neural Network (CRNN). Each model reflects a distinct approach to learning from audio features, ranging from classical machine learning to deep neural architectures.


1. Support Vector Machine (SVM)

The SVM operates on pooled log-mel statistics extracted from each audio clip. Each feature vector concatenates the mean, standard deviation, and selected percentiles of the spectrogram into a compact, fixed-length representation. An RBF kernel is used to capture non-linear class boundaries in this feature space.

Key Characteristics

SVMs provide stable optimization and strong performance when paired with well-engineered audio features.
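
To make the role of the RBF kernel concrete, the sketch below computes the kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2) directly. The function name and the choice of `gamma` are illustrative assumptions; a library SVM computes this internally, and the project's hyperparameters are not given here.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Return K with K[i, j] = exp(-gamma * ||a[i] - b[j]||^2)."""
    # Expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid explicit loops.
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```

Because the kernel depends only on Euclidean distance, features with large numeric ranges dominate it. Standardizing the pooled statistics before training is therefore a common precaution.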


2. Random Forest (RF)

The Random Forest classifier builds an ensemble of decision trees, each trained on a bootstrap sample of the pooled log-mel feature vectors. The trees also draw on different random feature subsets, which decorrelates them and makes the ensemble robust to noise and correlated features.

Key Characteristics

Random Forests trade off a small amount of accuracy for interpretability and robustness.
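
The two sources of randomness in a Random Forest, bootstrap sampling and feature subsampling, can be illustrated with a short standard-library sketch. The helper names are hypothetical, and the square-root rule is a common default for classification rather than a confirmed setting of this project.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement, as is done for each tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def features_per_split(n_features):
    """Square-root rule for the number of candidate features (assumed default)."""
    return max(1, int(math.sqrt(n_features)))


rng = random.Random(0)
n = 10_000
in_bag = set(bootstrap_indices(n, rng))
# Fraction of samples a single tree never sees; close to 1/e for large n.
oob_fraction = 1 - len(in_bag) / n
```

Since each tree leaves out roughly a third of the training clips, those out-of-bag clips can serve as a built-in validation set for that tree.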


3. Convolutional Recurrent Neural Network (CRNN)

The CRNN processes the full log-mel spectrogram as a time–frequency image and learns hierarchical features directly from the data. It combines convolutional layers for local pattern extraction with a recurrent layer that models temporal dependencies.

Architecture Overview

CRNNs are expressive and powerful but require more computation and careful tuning. They can underperform when training data is limited or when hyperparameters are not extensively optimized.


4. Metric-Learning Extension (CRNN Embeddings)

To explore the representational quality of the CRNN, the final softmax layer was removed and the penultimate LSTM output was used as a fixed-dimensional embedding. A k-Nearest Neighbors (k-NN) classifier was then trained on these embeddings.

Key Insights

This extension provides a promising direction for deeper research despite the CRNN underperforming in its baseline configuration.
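
The k-NN step on top of fixed embeddings can be sketched in a few lines of standard-library Python. The function name, the Euclidean distance, and `k=5` are assumptions for illustration; the value of k and the distance metric used in the project are not stated.

```python
import math
from collections import Counter


def knn_predict(train_embeddings, train_labels, query, k=5):
    """Majority vote among the k training embeddings closest to `query`."""
    # Rank training points by Euclidean distance to the query embedding.
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: math.dist(train_embeddings[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, the label encountered first (the nearer neighbor) wins.
    return votes.most_common(1)[0][0]
```

In this setup, `train_embeddings` would hold the penultimate-layer outputs for the training clips, and each test clip would be classified by passing its embedding as `query`.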


5. The Research Extension

The metric-learning extension evaluates how well a simple distance-based classifier (k-NN) performs when applied to the embeddings extracted from the CRNN’s penultimate layer. This approach improves on the baseline CRNN softmax classifier, raising test accuracy from 48.25% to 52.08% (+3.83 percentage points) and the macro-F1 score from 0.458 to 0.510 (+0.052).

Performance gains are consistent across several difficult scene classes. For example, the k-NN variant improves F1-scores to 0.45 on airport, 0.54 on bus, 0.51 on metro, and 0.78 on park, while maintaining strong results on street_traffic (F1 = 0.63). The macro-averaged precision, recall, and F1 for the k-NN system are approximately 0.51, 0.53, and 0.51, respectively.
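
For reference, macro-averaged scores weight every class equally, regardless of how many clips it contains. The sketch below shows one way to compute them from label lists; the function name is hypothetical, and the sketch is not the evaluation code used to produce the numbers above.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro-F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, so the three reported macro values need not satisfy the F1 formula among themselves.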

These results indicate that although the CRNN’s softmax classifier underperforms relative to the classical models, the embedding representation learned by the CRNN is structurally meaningful and useful for downstream classification. Replacing the final softmax layer with a simple k-NN classifier gains about 3.8 percentage points of accuracy and about 0.05 in macro-F1, narrowing the performance gap to the Random Forest model.

Overall, the study shows that the SVM remains the strongest model, followed by the Random Forest, while the CRNN trails both in its baseline configuration. The CRNN’s embeddings nonetheless show promise for metric-learning or few-shot extensions.