Speech Emotion Recognition

About This Project

This web-based Speech Emotion Recognition (SER) system is my dissertation project for a Bachelor of Science (Honours) degree in Software Development with Cyber Security at the University of Stirling. The project combines machine learning techniques with modern web development to create a platform capable of classifying human emotions from speech.

What is Speech Emotion Recognition (SER)?

Speech Emotion Recognition (SER) is a field of artificial intelligence and signal processing that focuses on analyzing and interpreting the emotional states conveyed in human speech. Beyond the words spoken, speech contains non-verbal information such as tone, pitch, and rhythm, which can indicate emotions like happiness, sadness, anger, or neutrality.

SER systems have various real-world applications, including improving human-computer interaction, enhancing customer service through call center analysis, and aiding in mental health diagnostics by detecting emotional distress from voice patterns.

How This System Works

This project implements a complete Speech Emotion Recognition system that allows users to analyze recorded speech and receive real-time emotion predictions. It combines deep learning, signal processing, web development, and cloud integration to offer an interactive and intelligent experience. Here's a breakdown of the system architecture and workflow:

1. User Input (Frontend)

The system offers three methods for speech input:

- Uploading a pre-recorded audio file
- Recording speech directly in the browser
- Selecting one of the provided sample clips

These features are built using Flask (Python) with frontend support from HTML, CSS, and JavaScript. Sample clips are stored locally, while uploaded or recorded files are sent to Azure Blob Storage for cloud handling. Each user-uploaded file is deleted once the session ends.
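A minimal sketch of the upload step is shown below. The route name, the form field name (`audio`), and local storage in a temporary directory are all assumptions for illustration; the real project routes differ and may forward files to Azure Blob Storage instead of saving them locally.

```python
import os
import tempfile

from flask import Flask, request
from werkzeug.utils import secure_filename

# Hypothetical constants; the real project may accept other formats.
ALLOWED_EXTENSIONS = {".wav", ".mp3"}

app = Flask(__name__)
app.config["UPLOAD_DIR"] = tempfile.mkdtemp()

@app.route("/upload", methods=["POST"])
def upload_audio():
    """Accept an audio file from the form and store it for analysis."""
    file = request.files.get("audio")
    if file is None or os.path.splitext(file.filename)[1].lower() not in ALLOWED_EXTENSIONS:
        return {"error": "please upload a .wav or .mp3 file"}, 400
    # In the deployed app this step could push to Azure Blob Storage instead.
    path = os.path.join(app.config["UPLOAD_DIR"], secure_filename(file.filename))
    file.save(path)
    return {"stored": os.path.basename(path)}, 200
```

Validating the extension before saving keeps obviously unsupported files out of the preprocessing pipeline.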

2. Dataset Preparation

The deep learning model was trained using two well-established datasets:

- CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
- IEMOCAP (Interactive Emotional Dyadic Motion Capture database)

These datasets provided a balance between clearly defined and naturally expressed emotions. Audio files were preprocessed using Librosa to ensure consistency in sampling rate, duration, and format.
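The duration-normalisation step described above can be sketched as follows. The target sampling rate and clip length are assumed values, not the project's actual configuration, and plain NumPy stands in for Librosa's equivalent utilities so the example is self-contained.

```python
import numpy as np

TARGET_SR = 16_000      # assumed sampling rate (Hz)
TARGET_SECONDS = 3.0    # assumed fixed clip length

def fix_duration(signal: np.ndarray, sr: int = TARGET_SR,
                 seconds: float = TARGET_SECONDS) -> np.ndarray:
    """Zero-pad short clips and trim long ones so every example
    has exactly sr * seconds samples."""
    target_len = int(sr * seconds)
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))
```

Enforcing a fixed length here means every downstream feature matrix has a predictable shape, which the model's input layer requires.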

3. Feature Extraction (Preprocessing)

Once an audio file is submitted, it is processed using the Librosa library (Python) to extract the following features:

Features are normalized and padded to a uniform shape before being passed to the model.
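A minimal sketch of that normalise-and-pad step, assuming feature matrices of shape `(n_features, n_frames)` and an illustrative frame budget of 128 (the project's actual shapes may differ):

```python
import numpy as np

def normalize_and_pad(features: np.ndarray, max_frames: int = 128) -> np.ndarray:
    """Z-score normalise a (n_features, n_frames) matrix, then pad or
    trim the time axis so every example ends up (n_features, max_frames)."""
    std = features.std() or 1.0          # guard against constant input
    normed = (features - features.mean()) / std
    n_frames = normed.shape[1]
    if n_frames >= max_frames:
        return normed[:, :max_frames]
    return np.pad(normed, ((0, 0), (0, max_frames - n_frames)))
```

Normalising before padding keeps the zero padding at the feature distribution's mean rather than at an arbitrary offset.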

4. Deep Learning Model (Inference)

The emotion classification is powered by a hybrid deep learning model built using TensorFlow and Keras. The architecture includes:

The model outputs probabilities for five emotion classes: Happiness, Sadness, Anger, Frustration, and Neutral.
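Turning those class probabilities into a label for the UI can be sketched like this; the label order and helper name are assumptions, but the five classes are the ones listed above.

```python
import numpy as np

# The five emotion classes predicted by the model (order assumed).
EMOTIONS = ["Happiness", "Sadness", "Anger", "Frustration", "Neutral"]

def decode_prediction(probs: np.ndarray) -> tuple[str, float]:
    """Map the model's softmax output to the most likely emotion
    label and its confidence score."""
    idx = int(np.argmax(probs))
    return EMOTIONS[idx], float(probs[idx])
```

Returning the confidence alongside the label lets the result page show how certain the model was, not just the winning class.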

5. Result Display (UI Rendering)

The predicted emotion is rendered on a result page using Flask’s templating engine (Jinja2). It includes:

All user interaction is handled within a simple, clean, and responsive web interface.
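A minimal sketch of rendering a prediction with Jinja2, using an inline template as a stand-in for the project's actual result page (route and variable names are hypothetical):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline stand-in for the project's result template.
RESULT_TEMPLATE = """
<h1>Prediction result</h1>
<p>Detected emotion: <strong>{{ emotion }}</strong>
   ({{ '%.1f' % (confidence * 100) }}% confidence)</p>
"""

@app.route("/result/<emotion>/<float:confidence>")
def result(emotion: str, confidence: float):
    """Render the predicted emotion and its confidence on the result page."""
    return render_template_string(RESULT_TEMPLATE,
                                  emotion=emotion, confidence=confidence)
```

Jinja2 escapes the interpolated values by default, so user-influenced strings are safe to render directly.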

6. Cloud Integration & Deployment

The project uses Microsoft Azure for storing uploaded audio and hosting the web app in a cloud environment. Key services include:

This setup allows for greater scalability and reliability when handling audio files and model inference.

In summary, this project combines multiple domains—audio processing, machine learning, web development, and cloud computing—to create a functional, end-to-end emotion recognition system. It provides a hands-on demonstration of how deep learning models can be deployed in real-world applications.

GitHub Repository

The full source code for this project is available on GitHub as a private repository. Access to the repository and the accompanying Colab notebooks can be provided upon request. Please contact me:

Marija Pravdivceva | 📧 m.marija@dev | linkedin.com/in/marija-pravdivceva-b2306a228

Acknowledgments

I would like to express my sincere gratitude to my supervisor Dr Yuanlin Gu for his invaluable guidance, encouragement, and support throughout the development of this project. Special thanks to the University of Stirling for providing the resources and guidance that made this project possible.

I would like to sincerely thank the Speech and Multimodal Intelligent Information Processing Laboratory at Cheyney University of Pennsylvania for collecting and providing access to the CREMA-D dataset. This resource has been instrumental in the development and training of my Speech Emotion Recognition model and is gratefully acknowledged.

Citation:
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014).
CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
IEEE Transactions on Affective Computing, 5(4), 377–390.
https://github.com/CheyneyComputerScience/CREMA-D

I would also like to express my gratitude to the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California for providing access to the IEMOCAP dataset. This dataset has played a vital role in training and evaluating the emotion classification model used in this project.

Citation:
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008).
IEMOCAP: Interactive emotional dyadic motion capture database.
Language Resources and Evaluation, 42(4), 335–359.
https://sail.usc.edu/iemocap/