About This Project
This web-based Speech Emotion Recognition (SER) system is my dissertation project for a Bachelor of Science (Honours) degree in Software Development with Cyber Security at the University of Stirling. The project combines deep learning with modern web development to create a platform that classifies human emotions from speech.
What is Speech Emotion Recognition (SER)?
Speech Emotion Recognition (SER) is a field of artificial intelligence and signal processing that focuses on analyzing and interpreting the emotional states conveyed in human speech. Beyond the words spoken, speech contains non-verbal information such as tone, pitch, and rhythm, which can indicate emotions like happiness, sadness, anger, or neutrality.
SER systems have various real-world applications, including improving human-computer interaction, enhancing customer service through call center analysis, and aiding in mental health diagnostics by detecting emotional distress from voice patterns.
How This System Works
This project implements a complete Speech Emotion Recognition system that allows users to analyze recorded speech and receive real-time emotion predictions. It combines deep learning, signal processing, web development, and cloud integration to offer an interactive and intelligent experience. Here's a breakdown of the system architecture and workflow:
1. User Input (Frontend)
The system offers three methods for speech input (a sketch of the upload route follows the list):
- Uploading a prerecorded audio file (WAV, MP3, M4A)
- Recording voice directly via the browser
- Selecting from preloaded sample recordings
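As a rough illustration of the upload path, a minimal Flask endpoint might look like the sketch below. The route, form-field name, and save directory are illustrative assumptions, not the project's actual code.

```python
# Sketch of an upload endpoint; route, field name, and save directory
# are illustrative assumptions.
import os
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
ALLOWED_EXTENSIONS = {"wav", "mp3", "m4a"}
UPLOAD_DIR = "uploads"

def allowed_file(filename: str) -> bool:
    # Accept only the three formats offered in the UI.
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route("/upload", methods=["POST"])
def upload_audio():
    file = request.files.get("audio")
    if file is None or not allowed_file(file.filename):
        return "Unsupported or missing audio file", 400
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(path)
    # The saved file is then handed to the feature-extraction step.
    return f"Saved {file.filename}", 200
```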
2. Dataset Preparation
The deep learning model was trained using two well-established datasets:
- CREMA-D: Contains 7,442 acted emotional speech clips from 91 professional actors, covering six emotion categories (anger, disgust, fear, happiness, neutral, and sadness); parsing labels from its file names is sketched after this list.
- IEMOCAP: Includes approximately 12 hours of recorded dialogue, both scripted and improvised, annotated with emotion labels.
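For CREMA-D in particular, emotion labels can be derived directly from the file names, whose third underscore-separated field encodes the emotion (e.g. 1001_DFA_ANG_XX.wav). A small sketch of that parsing step:

```python
# Sketch: deriving emotion labels from CREMA-D file names. The third
# underscore-separated field is the emotion code.
from pathlib import Path

EMOTIONS = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happiness", "NEU": "neutral", "SAD": "sadness",
}

def label_from_filename(path: str) -> str:
    code = Path(path).stem.split("_")[2]
    return EMOTIONS[code]

print(label_from_filename("1001_DFA_ANG_XX.wav"))  # -> anger
```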
3. Feature Extraction (Preprocessing)
Once an audio file is submitted, it is processed with the Librosa library (Python) to extract the following features, as sketched in the code after this list:
- Mel-Spectrogram: Visual representation of sound energy over time and frequency
- MFCCs (Mel-Frequency Cepstral Coefficients): Capture the tonal characteristics of human speech
- Zero-Crossing Rate (ZCR): Measures the rate of signal sign changes
- Root Mean Square (RMS): Estimates overall loudness or energy
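A minimal sketch of this step with Librosa's feature API follows; the sample rate, n_mels, and n_mfcc values are illustrative defaults rather than the project's exact settings.

```python
# Sketch of the feature-extraction step with Librosa; parameter values
# are illustrative, not necessarily those used in the final model.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050):
    y, sr = librosa.load(path, sr=sr, mono=True)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)       # log-scaled Mel-spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # tonal characteristics
    zcr = librosa.feature.zero_crossing_rate(y)         # rate of sign changes
    rms = librosa.feature.rms(y=y)                      # loudness / energy

    return mel_db, mfcc, zcr, rms
```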
4. Deep Learning Model (Inference)
Emotion classification is powered by a hybrid deep learning model built with TensorFlow and Keras; a minimal sketch follows the list. The architecture includes:
- CNN layers: Capture spatial patterns from the spectrograms
- GRU layers: Handle temporal dependencies and voice dynamics
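The sketch below shows what such a CNN-GRU hybrid can look like in Keras. The input shape, layer sizes, and the choice of Conv1D over time frames are assumptions for illustration; the trained model's exact architecture may differ.

```python
# Sketch of a CNN + GRU hybrid in Keras; shapes and layer sizes are
# illustrative assumptions, not the trained model.
from tensorflow.keras import layers, models

NUM_CLASSES = 6                    # e.g. the six CREMA-D emotion categories
TIME_STEPS, N_FEATURES = 128, 40   # illustrative: frames x MFCC coefficients

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, N_FEATURES)),
    # Convolutions learn local spectral patterns along the time axis.
    layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # GRUs model how the voice evolves across the sequence.
    layers.GRU(128, return_sequences=True),
    layers.GRU(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The convolutional front end condenses each recording into a shorter sequence of learned features, which the recurrent layers then read in order, so the model captures both local spectral texture and longer-range voice dynamics.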
5. Result Display (UI Rendering)
The predicted emotion is rendered on a result page using Flask's templating engine (Jinja2); a sketch of this step follows the list. The page includes:
- The detected emotion label
- A matching animated GIF for visual feedback
- An option to go back and try again
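Rendering might look like the following sketch; the route, template name, and emotion-to-GIF mapping are hypothetical.

```python
# Sketch of the result-rendering step; route, template name, and the
# GIF mapping are hypothetical.
from flask import Flask, render_template

app = Flask(__name__)

EMOTION_GIFS = {"happiness": "happy.gif", "sadness": "sad.gif",
                "anger": "angry.gif", "neutral": "neutral.gif"}

@app.route("/result/<emotion>")
def show_result(emotion: str):
    # Jinja2 fills result.html with the label and its matching GIF.
    return render_template("result.html",
                           emotion=emotion,
                           gif=EMOTION_GIFS.get(emotion, "neutral.gif"))
```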
6. Cloud Integration & Deployment
The project uses Microsoft Azure to store uploaded audio and host the web app in a cloud environment. Key services include (see the sketch after this list):
- Azure Blob Storage – Stores user-uploaded and recorded audio files
- Azure App Service – Hosts and scales the web application for public access
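Uploading a recording with the azure-storage-blob SDK could look like the sketch below; the container name and the environment variable holding the connection string are assumptions.

```python
# Sketch of an upload to Azure Blob Storage; container name and the
# connection-string environment variable are assumptions.
import os
from azure.storage.blob import BlobServiceClient

def upload_recording(local_path: str, blob_name: str) -> str:
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    blob = service.get_blob_client(container="audio-uploads", blob=blob_name)
    with open(local_path, "rb") as data:
        blob.upload_blob(data, overwrite=True)
    return blob.url  # URL the app can later use to retrieve the clip
```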
In summary, this project combines multiple domains—audio processing, machine learning, web development, and cloud computing—to create a functional, end-to-end emotion recognition system. It provides a hands-on demonstration of how deep learning models can be deployed in real-world applications.
GitHub Repository
The full source code for this project is available on GitHub as a private repository. Access to the repository and the accompanying Colab notebooks can be provided upon request; please contact me.
Acknowledgments
I would like to express my sincere gratitude to my supervisor Dr Yuanlin Gu for his invaluable guidance, encouragement, and support throughout the development of this project. Special thanks to the University of Stirling for providing the resources and guidance that made this project possible.
I would like to sincerely thank the Speech and Multimodal Intelligent Information Processing Laboratory at Cheyney University of Pennsylvania for collecting and providing access to the CREMA-D dataset. This resource was instrumental in the development and training of my Speech Emotion Recognition model.
Citation:
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014).
CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
IEEE Transactions on Affective Computing, 5(4), 377–390.
https://github.com/CheyneyComputerScience/CREMA-D
I would also like to express my gratitude to the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California for providing access to the IEMOCAP dataset. This dataset has played a vital role in training and evaluating the emotion classification model used in this project.
Citation:
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008).
IEMOCAP: Interactive emotional dyadic motion capture database.
Language Resources and Evaluation, 42(4), 335–359.
https://sail.usc.edu/iemocap/