Machine Learning

Building an Optimized Deep Neural Network

Follow the journey of transforming the YearPredictionMSD dataset and systematically optimizing a PyTorch DNN to classify music by decade.

April 25, 2025 · 9 min read · Mohit Bhimrajka
PyTorch · Python · scikit-learn · Deep Learning

Step 1: Introduction & Problem Statement

Can a machine learning model hear the difference between the swinging 60s and the electronic 80s just by analyzing audio features? How does music's 'sound' evolve across decades? This project dives into that question by building and optimizing a Deep Neural Network (DNN).

My journey begins with the following:

  • Dataset: The YearPredictionMSD dataset from the UCI Machine Learning Repository, containing 515,000+ songs described by 90 audio features.
  • Goal: Transform this regression challenge into a 10-class classification problem – predicting the decade of release, from the 1920s up to the 2010s.
  • Tool: I'll use PyTorch, a powerful deep learning framework, to build, train, and meticulously optimize my DNN model.

Let me explore how I tackled this challenge, step by step.

Step 2: Understanding the Soundscape - Data Exploration & Prep

Before training any model, understanding the data is crucial. I started by transforming the raw year data and exploring the characteristics of the audio features.

Decade Binning

The first step, handled in src/data_processing.py, was converting the continuous 'Year' variable into discrete decade labels. I mapped years into 10 bins, representing the 1920s (label 0) through the 2010s (label 9).
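The exact logic lives in src/data_processing.py; here is a minimal sketch of the idea (the function name is my own, not the project's):

```python
import numpy as np

def year_to_decade_label(years):
    """Map release years to decade labels: 1920s -> 0, ..., 2010s -> 9."""
    decades = (np.asarray(years) - 1920) // 10
    return np.clip(decades, 0, 9)  # guard against out-of-range years

print(year_to_decade_label([1927, 1965, 1999, 2010]))  # -> [0 4 7 9]
```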

Class Imbalance

Visualizing the distribution of songs across these new decade labels immediately revealed a significant challenge: class imbalance.

Distribution of Songs Across Decades
The dataset is heavily dominated by songs from the 1990s (label 7) and 2000s (label 8).

This imbalance necessitated using stratified splitting for my train (70%), validation (15%), and test (15%) sets to ensure each decade is proportionally represented across all subsets.
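With scikit-learn, a stratified 70/15/15 split is a two-step operation; a minimal sketch, assuming X and y hold the features and decade labels (the random_state is my assumption):

```python
from sklearn.model_selection import train_test_split

# 70% train, then the remaining 30% halved into validation and test,
# stratifying on the decade labels at both steps.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```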

Feature Insights

I examined the distributions of the first 12 features (timbre averages) using histograms and box plots. These plots showed varying distributions – some roughly normal, others skewed – and highlighted the presence of potential outliers.

Feature Histograms
Histograms revealed diverse feature distributions.
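Given these varying scales and the outliers, standardizing the features is the natural preprocessing step (the takeaways below call out scaling). A minimal sketch, fitting the scaler on the training split only so no information leaks from validation or test:

```python
from sklearn.preprocessing import StandardScaler

# Fit scaling parameters on the training set only, then apply everywhere.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```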

Correlations

A heatmap of feature correlations indicated moderate relationships within certain feature groups (like the timbre averages or covariance features), suggesting some inherent structure related to the audio properties.
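Such a heatmap takes only a few lines; a sketch assuming X_train holds the (n_samples, 90) feature matrix:

```python
import matplotlib.pyplot as plt
import numpy as np

# Pearson correlations between the 90 features (columns as variables).
corr = np.corrcoef(X_train, rowvar=False)

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.title("Feature Correlation Heatmap")
plt.show()
```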

Correlation Heatmap
Moderate correlations exist, particularly within feature blocks.

Step 3: Setting the Stage - Baseline Model Performance

With the data prepared, the next step was to establish baseline performance using simple DNN architectures. I tested three:

  • Model_1 (128x128): Two hidden layers with 128 neurons each. (29k params)
  • Model_2 (256x256): Two hidden layers, but wider with 256 neurons each. (92k params)
  • Model_3 (256x128x64): Three hidden layers with decreasing width. (65k params)
| Architecture | Final Val Loss | Final Val Accuracy | Training Time |
|---|---|---|---|
| Model_1 (128x128) | 1.4841 | 0.5802 | 1.53s |
| Model_2 (256x256) | 1.2711 | 0.5802 | 1.40s |
| Model_3 (256x128x64) | 1.3519 | 0.5802 | 1.47s |

All three plateaued at the same validation accuracy, but Model_2 (256x256) consistently achieved the lowest validation loss, so it was selected as the baseline architecture for systematic optimization.
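The actual definitions live in models.py; here is a minimal PyTorch sketch of the Model_2 shape (class name hypothetical), which accounts for the roughly 92k parameters quoted above:

```python
import torch.nn as nn

class Model2(nn.Module):
    """90 -> 256 -> 256 -> 10: about 92k trainable parameters."""
    def __init__(self, in_features=90, hidden=256, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```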

Step 4: Tuning the Instrument - Systematic Optimization

Having chosen Model_2 as my base, I embarked on a systematic optimization process. My strategy was to tune hyperparameters and test components sequentially, carrying forward the best setting from each stage.

1. Learning Rate Search

I used the torch-lr-finder library to perform an LR range test: the model was trained for one epoch while the LR increased exponentially, and the loss was plotted against the LR.
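With torch-lr-finder, the range test is only a few lines; a sketch assuming a model, optimizer, criterion, and train_loader are already defined (the end_lr and num_iter values are assumptions):

```python
from torch_lr_finder import LRFinder

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)  # exponential LR sweep
lr_finder.plot()   # look for the steepest stable descent in the loss curve
lr_finder.reset()  # restore the model and optimizer to their initial state
```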

LR Finder Plot
LR Finder plot suggested an optimal LR around 0.001.

Selected Optimal Learning Rate: 0.001

2. Weight Decay Tuning

I employed 5-Fold Cross-Validation on the training set: for each weight decay value, Model_2 was trained 5 times (10 epochs each).
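A sketch of the sweep, reusing the Model2 class from Step 3; the weight-decay grid and the train_and_score helper are hypothetical stand-ins for the project's training code:

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

wd_grid = [0.0, 1e-5, 1e-4, 1e-3]  # hypothetical grid of candidate values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for wd in wd_grid:
    fold_accs = []
    for tr_idx, va_idx in skf.split(X_train, y_train):
        model = Model2()
        # Optimizer at this stage assumed to be Adam, with the LR from Step 4.1.
        opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=wd)
        fold_accs.append(train_and_score(model, opt, tr_idx, va_idx, epochs=10))
    print(f"WD={wd:g}: mean val acc = {np.mean(fold_accs):.4f}")
```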

Weight Decay K-Fold CV Plot
Average validation accuracy peaks at WD = 0.0001.

Selected Optimal Weight Decay: 0.0001

3. Component Sweeps

I tested variations of core neural network components, driving the sweeps with a small model factory (see the sketch after this list):

Initialization: Default (Kaiming Uniform), Xavier Uniform, and Kaiming Normal were tested. Differences were minimal, with the default performing marginally best.

Activation: ReLU, LeakyReLU, and GELU were compared. GELU achieved the highest peak validation accuracy (0.6609).

Normalization: None, BatchNorm1d, and LayerNorm were tested. Surprisingly, no normalization performed best in this setup.

Optimizer: Adam, SGD (momentum=0.9), and RMSprop were compared. RMSprop achieved the highest peak accuracy (0.6617).
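A hypothetical sketch of such a factory (build_model and xavier_init are illustrative names, not the project's actual code):

```python
import torch.nn as nn

def build_model(activation=nn.GELU, norm=None,
                in_features=90, hidden=256, n_classes=10):
    """Model_2-shaped network with swappable activation and normalization."""
    layers, d_in = [], in_features
    for _ in range(2):                      # two hidden layers
        layers.append(nn.Linear(d_in, hidden))
        if norm is not None:                # e.g. nn.BatchNorm1d or nn.LayerNorm
            layers.append(norm(hidden))
        layers.append(activation())
        d_in = hidden
    layers.append(nn.Linear(hidden, n_classes))
    return nn.Sequential(*layers)

def xavier_init(m):
    """Xavier variant from the init sweep; apply via model.apply(xavier_init)."""
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
```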

Activation Comparison
Activation function comparison - GELU slightly outperformed others.

Step 5: The Performance - Final Model Training & Results

With all components optimized, I assembled the final configuration:

Final Optimized Configuration:

  • Architecture: Model_2 (256x256)
  • Initialization: Default (Kaiming Uniform)
  • Activation: GELU
  • Normalization: None
  • Optimizer: RMSprop
  • Learning Rate: 0.001
  • Weight Decay: 0.0001

Final Training Run

This final model was trained on the entire training dataset for 15 epochs, using the validation set to monitor progress.
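Putting it all together, a minimal sketch of the final run (reusing the hypothetical build_model from above; the DataLoaders are assumed):

```python
import torch
import torch.nn as nn

model = build_model(activation=nn.GELU, norm=None)   # winning components
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(15):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        criterion(model(xb), yb).backward()
        optimizer.step()

    model.eval()                                     # monitor on validation set
    correct = total = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            correct += (model(xb).argmax(1) == yb).sum().item()
            total += yb.numel()
    print(f"epoch {epoch + 1:2d}: val acc = {correct / total:.4f}")
```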

Final Model Training History
The final model's learning curves over 15 epochs.

Test Set Evaluation

The ultimate measure of success is performance on the unseen test set:

  • Final Test Loss: 0.8993
  • Final Test Accuracy: 66.05%

Step 6: Key Takeaways - Lessons from the Tuning Process

This journey from raw data to an optimized DNN yielding 66.05% accuracy on a challenging, imbalanced decade classification task highlighted several valuable lessons:

  • Systematic > Haphazard

    Tuning hyperparameters and components sequentially, carrying forward the best, provides clearer insights than changing multiple things at once.

  • Data is Foundational

    Addressing data issues like class imbalance early (with stratified splitting) and applying appropriate preprocessing (like scaling) are critical first steps.

  • Informed Tuning Beats Guesswork

    Tools like LR Finder for learning rates and techniques like K-Fold CV for regularization parameters provide data-driven starting points and validation.

  • Context is King

    The "best" component isn't universal. While GELU and RMSprop showed slight advantages here, optimal choices depend heavily on the dataset, architecture, and training duration.

  • The Test Set Reigns Supreme

    Validation sets guide optimization, but the final, untouched test set provides the only unbiased measure of how well the model generalizes.

Project Resources & Downloads

  • data_processing.py: Data loading, preprocessing, and tensor conversion
  • models.py: DNN architectures and model components
  • Jupyter Notebooks: Interactive notebooks with code and analysis
  • PDF Reports: Rendered notebooks with outputs and visualizations

Key Takeaway

Building production ML models isn't about finding a silver bullet—it's about systematic experimentation and rigorous validation.

The journey from 58% baseline to 66% test accuracy demonstrates that small, informed decisions compound into significant improvements.

Mohit Bhimrajka

Senior Forward Deployed AI Engineer at Supervity. I partner with executives at AMD and the Adani Group while mentoring developers at Google hackathons and Code Vipassana. From boardrooms to hackathon floors, I help organizations and developers make AI work in production.

Want to learn more?

Explore the code on GitHub or get in touch to discuss AI architecture and production systems.