Speech Emotion Recognition Project Using ML
By Rohit Sharma
Updated on Jul 30, 2025 | 11 min read | 1.38K+ views
One of the most effective ways for people to communicate their ideas and feelings is through speech. Our tone, pitch, and intensity convey emotions that are far more than just words, from happiness to fear. However, can machines identify these feelings based solely on our speech? Speech Emotion Recognition (SER) seeks to accomplish just that.
The goal of this project is to develop a machine learning model that can identify emotions in speech audio recordings. We'll make use of well-known datasets like RAVDESS, CREMA-D, TESS, and SAVEE. We will teach our model to categorize emotions such as fear, sadness, anger, and happiness.
Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.
It is better to have at least some background in:

- Python programming
- Basic machine learning concepts (classification, train-test splits, model evaluation)
- Audio features such as MFCCs (helpful, but not mandatory)
For this project, the following tools and libraries will be used:
Tool/Library |
Purpose |
Python |
Main programming language for coding the project |
NumPy & Pandas |
Numerical operations and data manipulation |
Librosa |
Audio processing and MFCC feature extraction |
Matplotlib & Seaborn |
Visualization of audio features and model performance |
Building and evaluating ML models like SVM and Random Forest |
Below are the models that we will be utilizing:

| Model Name | Purpose / Why Used |
| --- | --- |
| Support Vector Machine (SVM) | Effective for high-dimensional data; works well for emotion classification tasks. |
| Random Forest Classifier | Reduces overfitting and improves accuracy through ensemble decision trees. |
| Logistic Regression | Acts as a strong baseline model for multi-class classification problems. |
| K-Nearest Neighbors (KNN) | Predicts emotions based on the labels of the closest data points in the feature space. |
On average, the project will take about 3 to 5 hours to complete, though the duration may vary with your familiarity with Python, audio features, and ML concepts. It is best suited for beginners.
Let’s start building the project from scratch. We will download the dataset, import the required libraries, load and label the audio clips, extract MFCC features, and finally train and compare four models.
Without any further delay, let’s start!
We will use the kagglehub library to download the dataset directly into our Colab environment. Here is the code to do so:
import kagglehub
# Download the latest version of the speech emotion dataset
path = kagglehub.dataset_download("dmitrybabko/speech-emotion-recognition-en")
print("Path to dataset files:", path)
The dataset has been successfully downloaded and extracted.
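If kagglehub isn't already available in your environment, a quick `pip install kagglehub` takes care of it (Colab usually ships with it). Before moving on, it's worth listing the download directory to confirm the dataset folders are in place. A minimal sketch, using the `path` returned above:

import os

# Show what was downloaded; the folder names depend on the dataset's own layout
for entry in sorted(os.listdir(path)):
    print(entry)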
In this step, we will import the libraries we need: NumPy and Pandas for data handling, Librosa and SoundFile for audio processing, os and glob for file handling, scikit-learn for modeling, and Matplotlib and Seaborn for visualization.
Here is the code to do so:
# Numerical and data handling
import numpy as np
import pandas as pd
# Audio processing
import librosa
import soundfile as sf
# File handling and preprocessing
import os
import glob
import warnings
warnings.filterwarnings('ignore')
# Model training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
In this step, we will load audio files from the downloaded dataset. The dataset contains folders for RAVDESS, CREMA-D, TESS, and SAVEE. Each folder contains .wav audio clips labeled by emotion.
We will define the emotion labels to be recognized and then collect every .wav file path recursively using glob.
Here is the code to accomplish the same:
# Define emotion labels to be recognized
emotions = {
    "angry": "angry",
    "disgust": "disgust",
    "fear": "fearful",
    "happy": "happy",
    "neutral": "neutral",
    "sad": "sad",
    "surprise": "surprised"
}
# Path to your extracted dataset (adjust if needed)
DATASET_PATH = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"
# Collect all .wav files
audio_files = glob.glob(os.path.join(DATASET_PATH, "**/*.wav"), recursive=True)
print(f"Total audio files found: {len(audio_files)}")
Output:
Total audio files found: 12162
The output shows the total number of audio clips found across all four datasets.
In this step, we will extract meaningful features that our machine learning models can understand. We will use MFCCs (Mel-Frequency Cepstral Coefficients), a standard feature in speech processing that captures the timbral texture of an audio signal.
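Before running the full extraction loop, it can help to inspect a single clip. The sketch below (assuming the `audio_files` list from the previous step is non-empty) plots the waveform and shows why we average the MFCC matrix over time: each clip yields a (40, frames) matrix whose frame count varies with clip length, and taking the mean collapses it to a fixed 40-dimensional vector.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load one clip from the list collected in the previous step
sample_path = audio_files[0]
signal, sr = librosa.load(sample_path, sr=22050)

# Plot its waveform
plt.figure(figsize=(10, 3))
librosa.display.waveshow(signal, sr=sr)
plt.title("Waveform of one sample clip")
plt.tight_layout()
plt.show()

# MFCCs: 40 coefficients x a frame count that varies with clip length
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
print("MFCC matrix shape:", mfccs.shape)
# Averaging over time collapses every clip to a fixed 40-dim vector
print("Averaged vector shape:", np.mean(mfccs.T, axis=0).shape)

With that intuition in place, the full extraction loop follows.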
Use the code below to accomplish this:
import os
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm
# Path to the dataset
dataset_path = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"
# Emotion mapping from filename keywords
emotion_map = {
    'ang': 'angry',
    'hap': 'happy',
    'sad': 'sad',
    'fea': 'fearful',
    'dis': 'disgust',
    'sur': 'surprised',
    'neu': 'neutral',
    'calm': 'calm'
}

# Helper function to extract emotion label from filename
def extract_emotion(filename):
    for key in emotion_map:
        if key in filename.lower():
            return emotion_map[key]
    return 'unknown'
# Prepare lists to hold features and labels
features = []
labels = []
print(" Extracting MFCC features...")
# Loop over all audio files
for root, dirs, files in os.walk(dataset_path):
for file in tqdm(files):
if file.endswith('.wav'):
try:
file_path = os.path.join(root, file)
# Load the audio
signal, sr = librosa.load(file_path, sr=22050)
# Extract MFCCs
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
mfccs_mean = np.mean(mfccs.T, axis=0) # Mean across time
# Extract label
label = extract_emotion(file)
# Append to lists
features.append(mfccs_mean)
labels.append(label)
except Exception as e:
print(f" Error with {file_path}: {e}")
# Convert to DataFrame
X = pd.DataFrame(features)
y = pd.Series(labels)
print("\n Feature extraction complete!")
print("Shape of features (X):", X.shape)
print("Unique emotion labels:", y.unique())
Output:
Extracting MFCC features...
(tqdm progress bars for each dataset subfolder omitted)
Feature extraction complete!
Shape of features (X): (12162, 40)
Unique emotion labels: ['neutral' 'surprised' 'fearful' 'unknown' 'calm' 'disgust' 'sad' 'happy' 'angry']
Now that feature extraction is complete, we have a feature matrix X with 12,162 rows (one per clip) and 40 MFCC columns, plus a label Series y with nine distinct values, including the catch-all 'unknown' for files whose names match none of the emotion keywords.
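Before encoding, a quick look at the label distribution shows how unbalanced the classes are; in particular, 'calm' and 'surprised' turn out to be rare, while the 'unknown' bucket is sizeable. A small Seaborn sketch, reusing the `y` Series built above:

import seaborn as sns
import matplotlib.pyplot as plt

# Count of clips per emotion label
plt.figure(figsize=(10, 4))
sns.countplot(x=y, order=y.value_counts().index)
plt.title("Number of clips per emotion label")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()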
In this step, we will convert emotion labels (like 'happy', 'sad', etc.) into numerical form and split the dataset.
Here is the code to accomplish the same:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Encode emotion labels into numeric format
le = LabelEncoder()
y_encoded = le.fit_transform(y) # 'y' is the list of emotion labels
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)
# Check the shape of splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (9729, 40)
X_test shape: (2433, 40)
y_train shape: (9729,)
y_test shape: (2433,)
What does the output tell us? The 12,162 clips were split into 9,729 training and 2,433 test samples, each represented by a 40-dimensional MFCC vector, with stratification preserving the label proportions in both splits.
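Since the classification reports in the next step index classes numerically, it helps to print the encoder's mapping once. LabelEncoder assigns indices in alphabetical order of the labels:

# Map each numeric class index back to its emotion name
for idx, name in enumerate(le.classes_):
    print(idx, "->", name)
# Expected: 0 -> angry, 1 -> calm, 2 -> disgust, 3 -> fearful, 4 -> happy,
#           5 -> neutral, 6 -> sad, 7 -> surprised, 8 -> unknown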
In this step, we will train four supervised machine learning models: Support Vector Machine, Random Forest, Logistic Regression, and K-Nearest Neighbors.
Let’s train them all in a single script and compare their accuracy. Use the code below:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define models
models = {
    "Support Vector Machine": SVC(kernel='linear'),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Train, predict, and evaluate each model
for name, model in models.items():
    print(f"\n===== {name} =====")

    # Training
    model.fit(X_train, y_train)

    # Prediction
    y_pred = model.predict(X_test)

    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
Output:
===== Support Vector Machine =====
Accuracy: 0.6132
Classification Report:
precision recall f1-score support
0 0.65 0.68 0.66 334
1 0.00 0.00 0.00 1
2 0.46 0.45 0.45 334
3 0.45 0.48 0.46 334
4 0.53 0.40 0.45 334
5 0.50 0.50 0.50 297
6 0.57 0.67 0.62 334
7 0.00 0.00 0.00 1
8 0.97 0.98 0.97 464
accuracy 0.61 2433
macro avg 0.46 0.46 0.46 2433
weighted avg 0.61 0.61 0.61 2433
Confusion Matrix:
[[228 0 19 33 42 11 0 0 1]
[ 0 0 0 0 0 0 0 0 1]
[ 27 0 149 33 25 41 55 0 4]
[ 35 0 33 159 21 33 52 0 1]
[ 53 0 45 65 132 22 10 0 7]
[ 6 0 46 31 18 148 48 0 0]
[ 3 0 28 33 7 39 223 0 1]
[ 0 0 0 0 0 1 0 0 0]
[ 1 0 5 0 5 0 0 0 453]]
===== Random Forest Classifier =====
Accuracy: 0.6543
Classification Report:
precision recall f1-score support
0 0.68 0.77 0.72 334
1 0.00 0.00 0.00 1
2 0.52 0.48 0.50 334
3 0.64 0.43 0.52 334
4 0.50 0.50 0.50 334
5 0.50 0.59 0.54 297
6 0.61 0.68 0.64 334
7 0.00 0.00 0.00 1
8 0.99 1.00 0.99 464
accuracy 0.65 2433
macro avg 0.49 0.49 0.49 2433
weighted avg 0.65 0.65 0.65 2433
Confusion Matrix:
[[258 0 15 5 43 13 0 0 0]
[ 0 0 0 0 0 0 0 0 1]
[ 25 0 159 20 38 31 60 0 1]
[ 32 0 33 144 39 33 52 0 1]
[ 61 0 32 20 166 46 7 0 2]
[ 2 0 42 13 38 174 28 0 0]
[ 0 0 23 23 9 51 227 0 1]
[ 0 0 0 0 0 1 0 0 0]
[ 0 0 0 0 0 0 0 0 464]]
===== Logistic Regression =====
Accuracy: 0.5730
Classification Report:
precision recall f1-score support
0 0.64 0.66 0.65 334
1 0.00 0.00 0.00 1
2 0.40 0.42 0.41 334
3 0.45 0.46 0.46 334
4 0.40 0.31 0.35 334
5 0.48 0.41 0.44 297
6 0.52 0.61 0.56 334
7 0.00 0.00 0.00 1
8 0.92 0.97 0.95 464
accuracy 0.57 2433
macro avg 0.42 0.43 0.42 2433
weighted avg 0.56 0.57 0.57 2433
Confusion Matrix:
[[221 0 31 27 33 17 3 0 2]
[ 0 0 0 0 0 0 0 0 1]
[ 25 0 139 32 33 41 56 0 8]
[ 23 0 38 154 34 20 60 0 5]
[ 62 0 52 56 105 21 20 0 18]
[ 10 0 54 31 31 122 48 0 1]
[ 4 0 32 42 16 34 204 0 2]
[ 0 0 0 0 0 1 0 0 0]
[ 3 0 4 0 8 0 0 0 449]]
===== K-Nearest Neighbors =====
Accuracy: 0.6021
Classification Report:
precision recall f1-score support
0 0.60 0.77 0.68 334
1 0.00 0.00 0.00 1
2 0.40 0.47 0.43 334
3 0.45 0.42 0.44 334
4 0.49 0.37 0.42 334
5 0.52 0.47 0.49 297
6 0.59 0.55 0.57 334
7 0.00 0.00 0.00 1
8 0.99 0.99 0.99 464
accuracy 0.60 2433
macro avg 0.45 0.45 0.45 2433
weighted avg 0.60 0.60 0.60 2433
Confusion Matrix:
[[258 0 20 21 27 8 0 0 0]
[ 0 0 0 0 0 0 0 0 1]
[ 35 0 157 33 33 28 48 0 0]
[ 43 0 48 141 26 27 48 0 1]
[ 79 0 43 41 125 35 11 0 0]
[ 6 0 71 24 34 140 22 0 0]
[ 4 0 50 54 10 30 185 0 1]
[ 0 0 0 0 0 1 0 0 0]
[ 2 0 1 0 2 0 0 0 459]]
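The raw confusion matrices above are hard to scan. A heatmap makes them much easier to read; below is a minimal sketch that reuses the already-imported Seaborn and retrains the strongest model from the run above:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Retrain the best-performing model and plot its confusion matrix
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
cm = confusion_matrix(y_test, rf.predict(X_test))

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Random Forest confusion matrix")
plt.tight_layout()
plt.show()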
The Random Forest Classifier performed best, reaching 65.43% accuracy; its ensemble of decision trees captured the non-linear structure of the MFCC features while reducing overfitting.

SVM came next at 61.32%, doing reasonably well on most emotions but failing on labels with very few samples. KNN reached 60.21%; as a distance-based method it is sensitive to feature scaling, which hurt its consistency (a common remedy is sketched below). Logistic Regression gave the lowest accuracy at 57.30%, since its linear decision boundaries cannot capture the complexity of emotional speech patterns.

Two caveats stand out in every report: the 'calm' and 'surprised' classes had only one test sample each, so no model classified them correctly, while the catch-all 'unknown' class was separated almost perfectly (precision 0.97 to 0.99), likely because those keyword-unmatched files differ systematically from the rest.
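Since the 40 MFCC coefficients sit on quite different numeric scales, standardizing them typically helps distance-based models. Here is a minimal sketch of that remedy using a scikit-learn Pipeline; the accuracy gain is not guaranteed, so treat it as an experiment rather than a fix:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Standardize features before KNN's distance computation
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
print("Scaled KNN accuracy:", accuracy_score(y_test, scaled_knn.predict(X_test)))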
Colab Link:
https://colab.research.google.com/drive/1kYJRSyasdOQuXa2RN5g2RSAbvN3u4rUQ
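As a next step, the trained model can score a brand-new recording. Below is a minimal sketch; 'my_clip.wav' is a placeholder path, and it assumes the `rf` model (from the heatmap step) and the `le` encoder are still in memory:

import numpy as np
import librosa

def predict_emotion(file_path, model, encoder):
    """Extract the same 40 mean MFCCs and return the predicted emotion name."""
    signal, sr = librosa.load(file_path, sr=22050)
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    feature_vector = np.mean(mfccs.T, axis=0).reshape(1, -1)
    pred = model.predict(feature_vector)[0]
    return encoder.inverse_transform([pred])[0]

# 'my_clip.wav' is a placeholder for any .wav recording you want to test
print(predict_emotion("my_clip.wav", rf, le))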