About This Project
This web-based Speech Emotion Recognition (SER) system is my dissertation project for a Bachelor of Science (Honours) degree in Software Development with Cyber Security at the University of Stirling. The project combines deep learning with modern web development to create a platform that classifies human emotions from speech.
What is Speech Emotion Recognition (SER)?
Speech Emotion Recognition (SER) is a field of artificial intelligence and signal processing that focuses on analyzing and interpreting the emotional states conveyed in human speech. Beyond the words spoken, speech contains non-verbal information such as tone, pitch, and rhythm, which can indicate emotions like happiness, sadness, anger, or neutrality.
SER systems have various real-world applications, including improving human-computer interaction, enhancing customer service through call center analysis, and aiding in mental health diagnostics by detecting emotional distress from voice patterns.
How This System Works
This project implements a complete Speech Emotion Recognition system that allows users to analyze recorded speech and receive real-time emotion predictions. It combines deep learning, signal processing, web development, and cloud integration to offer an interactive and intelligent experience. Here's a breakdown of the system architecture and workflow:
1. User Input (Frontend)
The system offers three methods for speech input (a sketch of the upload route follows the list):
- Uploading a prerecorded audio file (WAV, MP3, M4A)
- Recording voice directly via the browser
- Selecting from preloaded sample recordings
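As a rough illustration of the upload path, a minimal Flask endpoint might look like the sketch below. The route, form-field name, and save directory are illustrative assumptions, not the project's actual code.

```python
# Sketch of an upload endpoint; route, field name, and save directory
# are illustrative assumptions.
import os
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
ALLOWED_EXTENSIONS = {"wav", "mp3", "m4a"}
UPLOAD_DIR = "uploads"

def allowed_file(filename: str) -> bool:
    # Accept only the three formats offered in the UI.
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route("/upload", methods=["POST"])
def upload_audio():
    file = request.files.get("audio")
    if file is None or not allowed_file(file.filename):
        return "Unsupported or missing audio file", 400
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(path)
    # The saved file is then handed to the feature-extraction step.
    return f"Saved {file.filename}", 200
```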
2. Dataset Preparation
The deep learning model was trained using two well-established datasets:
- CREMA-D: Contains 7,442 acted emotional speech clips from 91 professional actors, covering six emotion categories (anger, disgust, fear, happiness, neutral, and sadness); parsing labels from its file names is sketched after this list.
- IEMOCAP: Includes approximately 12 hours of recorded dialogue, both scripted and improvised, annotated with emotion labels.
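For CREMA-D in particular, emotion labels can be derived directly from the file names, whose third underscore-separated field encodes the emotion (e.g. 1001_DFA_ANG_XX.wav). A small sketch of that parsing step:

```python
# Sketch: deriving emotion labels from CREMA-D file names. The third
# underscore-separated field is the emotion code.
from pathlib import Path

EMOTIONS = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happiness", "NEU": "neutral", "SAD": "sadness",
}

def label_from_filename(path: str) -> str:
    code = Path(path).stem.split("_")[2]
    return EMOTIONS[code]

print(label_from_filename("1001_DFA_ANG_XX.wav"))  # -> anger
```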
3. Feature Extraction (Preprocessing)
Once an audio file is submitted, it is processed with the Librosa library (Python) to extract the following features, as sketched in the code after this list:
- Mel-Spectrogram: Visual representation of sound energy over time and frequency
- MFCCs (Mel-Frequency Cepstral Coefficients): Capture the tonal characteristics of human speech
- Zero-Crossing Rate (ZCR): Measures the rate of signal sign changes
- Root Mean Square (RMS): Estimates overall loudness or energy
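A minimal sketch of this step with Librosa's feature API follows; the sample rate, n_mels, and n_mfcc values are illustrative defaults rather than the project's exact settings.

```python
# Sketch of the feature-extraction step with Librosa; parameter values
# are illustrative, not necessarily those used in the final model.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050):
    y, sr = librosa.load(path, sr=sr, mono=True)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)       # log-scaled Mel-spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # tonal characteristics
    zcr = librosa.feature.zero_crossing_rate(y)         # rate of sign changes
    rms = librosa.feature.rms(y=y)                      # loudness / energy

    return mel_db, mfcc, zcr, rms
```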
4. Deep Learning Model (Inference)
Emotion classification is powered by a hybrid deep learning model built with TensorFlow and Keras; a minimal sketch follows the list. The architecture includes:
- CNN layers: Capture spatial patterns from the spectrograms
- GRU layers: Handle temporal dependencies and voice dynamics
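The sketch below shows what such a CNN-GRU hybrid can look like in Keras. The input shape, layer sizes, and the choice of Conv1D over time frames are assumptions for illustration; the trained model's exact architecture may differ.

```python
# Sketch of a CNN + GRU hybrid in Keras; shapes and layer sizes are
# illustrative assumptions, not the trained model.
from tensorflow.keras import layers, models

NUM_CLASSES = 6                    # e.g. the six CREMA-D emotion categories
TIME_STEPS, N_FEATURES = 128, 40   # illustrative: frames x MFCC coefficients

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, N_FEATURES)),
    # Convolutions learn local spectral patterns along the time axis.
    layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # GRUs model how the voice evolves across the sequence.
    layers.GRU(128, return_sequences=True),
    layers.GRU(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The convolutional front end condenses each recording into a shorter sequence of learned features, which the recurrent layers then read in order, so the model captures both local spectral texture and longer-range voice dynamics.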
5. Result Display (UI Rendering)
The predicted emotion is rendered on a result page using Flask's templating engine (Jinja2); a sketch of this step follows the list. The page includes:
- The detected emotion label
- A matching animated GIF for visual feedback
- An option to go back and try again
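Rendering might look like the following sketch; the route, template name, and emotion-to-GIF mapping are hypothetical.

```python
# Sketch of the result-rendering step; route, template name, and the
# GIF mapping are hypothetical.
from flask import Flask, render_template

app = Flask(__name__)

EMOTION_GIFS = {"happiness": "happy.gif", "sadness": "sad.gif",
                "anger": "angry.gif", "neutral": "neutral.gif"}

@app.route("/result/<emotion>")
def show_result(emotion: str):
    # Jinja2 fills result.html with the label and its matching GIF.
    return render_template("result.html",
                           emotion=emotion,
                           gif=EMOTION_GIFS.get(emotion, "neutral.gif"))
```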
6. Cloud Integration & Deployment
The project uses Microsoft Azure to store uploaded audio and host the web app in a cloud environment. Key services include (see the sketch after this list):
- Azure Blob Storage – Stores user-uploaded and recorded audio files
- Azure App Service – Hosts and scales the web application for public access
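Uploading a recording with the azure-storage-blob SDK could look like the sketch below; the container name and the environment variable holding the connection string are assumptions.

```python
# Sketch of an upload to Azure Blob Storage; container name and the
# connection-string environment variable are assumptions.
import os
from azure.storage.blob import BlobServiceClient

def upload_recording(local_path: str, blob_name: str) -> str:
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    blob = service.get_blob_client(container="audio-uploads", blob=blob_name)
    with open(local_path, "rb") as data:
        blob.upload_blob(data, overwrite=True)
    return blob.url  # URL the app can later use to retrieve the clip
```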
In summary, this project combines multiple domains—audio processing, machine learning, web development, and cloud computing—to create a functional, end-to-end emotion recognition system. It provides a hands-on demonstration of how deep learning models can be deployed in real-world applications.
GitHub Repository
The full source code for this project is available on GitHub as a private repository. Access to the repository and the accompanying Colab notebooks can be provided upon request; please contact me.
Acknowledgments
I would like to express my sincere gratitude to my supervisor Dr Yuanlin Gu for his invaluable guidance, encouragement, and support throughout the development of this project. Special thanks to the University of Stirling for providing the resources and guidance that made this project possible.
I would like to sincerely thank the Speech and Multimodal Intelligent Information Processing Laboratory at Cheyney University of Pennsylvania for collecting and providing access to the CREMA-D dataset. This resource was instrumental in the development and training of my Speech Emotion Recognition model.
Citation:
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014).
CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
IEEE Transactions on Affective Computing, 5(4), 377–390.
https://github.com/CheyneyComputerScience/CREMA-D
I would also like to express my gratitude to the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California for providing access to the IEMOCAP dataset. This dataset has played a vital role in training and evaluating the emotion classification model used in this project.
Citation:
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008).
IEMOCAP: Interactive emotional dyadic motion capture database.
Language Resources and Evaluation, 42(4), 335–359.
https://sail.usc.edu/iemocap/