Speech Emotion Recognition Project Using ML

By Rohit Sharma

Updated on Jul 30, 2025 | 11 min read | 1.38K+ views


One of the most effective ways for people to communicate their ideas and feelings is through speech. Our tone, pitch, and intensity convey emotions, from happiness to fear, that go far beyond the words themselves. But can machines identify these feelings from our speech alone? Speech Emotion Recognition (SER) seeks to accomplish exactly that.

The goal of this project is to develop a machine learning model that can identify emotions in speech audio recordings. We'll make use of well-known datasets like RAVDESS, CREMA-D, TESS, and SAVEE. We will teach our model to categorize emotions such as fear, sadness, anger, and happiness.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming
  • Basic machine learning concepts (classification, train/test splits, model evaluation)
  • Audio features such as MFCCs (helpful, but not mandatory)

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library          | Purpose
----------------------|---------------------------------------------------------------
Python                | Main programming language for coding the project
NumPy & Pandas        | Numerical operations and data manipulation
Librosa               | Audio processing and MFCC feature extraction
Matplotlib & Seaborn  | Visualization of audio features and model performance
Scikit-learn          | Building and evaluating ML models like SVM and Random Forest

Models Used for Learning

Below are the models we will use:

Model Name                    | Purpose / Why Used
------------------------------|------------------------------------------------------------------------------------
Support Vector Machine (SVM)  | Effective for high-dimensional data; works well for emotion classification tasks.
Random Forest Classifier      | Reduces overfitting and improves accuracy through ensemble decision trees.
Logistic Regression           | Acts as a strong baseline model for multi-class classification problems.
K-Nearest Neighbors (KNN)     | Predicts emotions based on the labels of the closest data points in the feature space.

Time Taken and Difficulty Level

On average, the project takes about 3 to 5 hours to complete. The duration may vary depending on your familiarity with Python, audio features, and ML concepts. It is best suited for beginners.

How to Build a Speech Emotion Recognition Model

Let’s start building the project from scratch. We will start by:

  • Downloading the dataset from Kaggle using KaggleHub.
  • Loading audio files from RAVDESS, CREMA-D, TESS, and SAVEE.
  • Extracting MFCC features using Librosa.
  • Preparing the data for training our machine learning models.

Without any further delay, let’s start!

Step 1: Download the Dataset

We will use the kagglehub library to download the dataset directly into our Colab environment. Here is the code to do so:

import kagglehub
# Download the latest version of the speech emotion dataset
path = kagglehub.dataset_download("dmitrybabko/speech-emotion-recognition-en")
print("Path to dataset files:", path)

The dataset has been successfully downloaded and extracted.
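
If you want to confirm what was downloaded, an optional check like the one below lists the top-level dataset folders. It assumes the path variable printed above points at the extracted files.

# Optional: list the top-level folders of the downloaded dataset
import os
print(os.listdir(path))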

Step 2: Import Required Libraries

In this step, we will import the following libraries:

  • Numerical computation (numpy, pandas)
  • Audio processing (librosa, soundfile)
  • Model building (scikit-learn)
  • Visualization (matplotlib, seaborn)
  • File handling (os, glob, warnings)

Here is the code to do so:

# Numerical and data handling
import numpy as np
import pandas as pd

# Audio processing
import librosa
import soundfile as sf

# File handling and preprocessing
import os
import glob
import warnings
warnings.filterwarnings('ignore')

# Model training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Step 3: Load and Organize Audio Files

In this step, we will load audio files from the downloaded dataset. The dataset contains folders for RAVDESS, CREMA-D, TESS, and SAVEE. Each folder contains .wav audio clips labeled by emotion.

We will:

  • Traverse each dataset folder
  • Extract audio paths
  • Map them to the correct emotion label

Here is the code to accomplish the same:

# Define emotion labels to be recognized
emotions = {
    "angry": "angry",
    "disgust": "disgust",
    "fear": "fearful",
    "happy": "happy",
    "neutral": "neutral",
    "sad": "sad",
    "surprise": "surprised"
}

# Path to your extracted dataset (adjust if needed)
DATASET_PATH = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"

# Collect all .wav files
audio_files = glob.glob(os.path.join(DATASET_PATH, "**/*.wav"), recursive=True)
print(f"Total audio files found: {len(audio_files)}")

Output:

Total audio files found: 12162

The output shows the total number of audio clips found across all four datasets.
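
Before moving on, it can help to glance at a few filenames, because each dataset encodes the emotion differently (some use text keywords, while RAVDESS uses numeric codes). A minimal, optional sketch:

# Optional: peek at a few filenames to see how emotions are encoded
for f in audio_files[:5]:
    print(os.path.basename(f))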

Step 4: Extract Features and Emotion Labels

In this step, we will extract meaningful features that our machine learning models can understand. We will use MFCC (Mel-Frequency Cepstral Coefficients). It is a powerful feature used in speech processing to capture the timbral texture of audio signals.
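
Before running the full extraction loop below, you can sanity-check MFCC extraction on a single clip. This is a minimal sketch; it simply reuses the first path from the audio_files list built in Step 3.

# Optional sanity check: MFCCs for a single audio file
sample_signal, sample_sr = librosa.load(audio_files[0], sr=22050)
sample_mfcc = librosa.feature.mfcc(y=sample_signal, sr=sample_sr, n_mfcc=40)
print(sample_mfcc.shape)                      # (40, number_of_frames)
print(np.mean(sample_mfcc.T, axis=0).shape)   # (40,) after averaging over time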

Use the code below to accomplish this:

import os
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# Path to the dataset
dataset_path = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"

# Emotion mapping from filename keywords
emotion_map = {
    'ang': 'angry',
    'hap': 'happy',
    'sad': 'sad',
    'fea': 'fearful',
    'dis': 'disgust',
    'sur': 'surprised',
    'neu': 'neutral',
    'calm': 'calm'
}

# Helper function to extract emotion label from filename
def extract_emotion(filename):
    for key in emotion_map:
        if key in filename.lower():
            return emotion_map[key]
    return 'unknown'

# Prepare lists to hold features and labels
features = []
labels = []

print(" Extracting MFCC features...")

# Loop over all audio files
for root, dirs, files in os.walk(dataset_path):
    for file in tqdm(files):
        if file.endswith('.wav'):
            try:
                file_path = os.path.join(root, file)

                # Load the audio
                signal, sr = librosa.load(file_path, sr=22050)

                # Extract MFCCs
                mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
                mfccs_mean = np.mean(mfccs.T, axis=0)  # Mean across time

                # Extract label
                label = extract_emotion(file)

                # Append to lists
                features.append(mfccs_mean)
                labels.append(label)
            except Exception as e:
                print(f" Error with {file_path}: {e}")

# Convert to DataFrame
X = pd.DataFrame(features)
y = pd.Series(labels)

print("\n Feature extraction complete!")
print("Shape of features (X):", X.shape)
print("Unique emotion labels:", y.unique())

Output:

Extracting MFCC features...

(Per-folder tqdm progress bars omitted: 14 TESS folders of 200 clips each, SAVEE with 480 clips, CREMA-D with 7,442 clips, and 24 RAVDESS actor folders of 60 clips each.)

Feature extraction complete!

Shape of features (X): (12162, 40)

Unique emotion labels: ['neutral' 'surprised' 'fearful' 'unknown' 'calm' 'disgust' 'sad' 'happy' 'angry']

Feature extraction is now complete, and we have:

  • Total samples: 12,162
  • Feature shape: each audio file is represented by 40 MFCC features
  • Emotion labels: 9 unique labels, including 'neutral', 'happy', 'sad', 'angry', 'fearful', 'calm', 'disgust', 'surprised', and 'unknown'
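
Before encoding the labels, it is also worth checking how the samples are distributed across emotions. The classes are uneven, and the 'unknown' label (files whose names matched no keyword) could optionally be dropped; this walkthrough keeps all labels so the shapes in the next step match. A quick sketch:

# Inspect the class distribution
print(y.value_counts())

# Optional (not applied here): drop the 'unknown' catch-all label
# mask = y != 'unknown'
# X, y = X[mask].reset_index(drop=True), y[mask].reset_index(drop=True)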

Step 5: Encode Labels and Split the Data

In this step, we will convert emotion labels (like 'happy', 'sad', etc.) into numerical form and split the dataset. 

Here is the code to accomplish the same:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode emotion labels into numeric format
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # 'y' is the list of emotion labels

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Check the shape of splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

X_train shape: (9729, 40)
X_test shape: (2433, 40)
y_train shape: (9729,)
y_test shape: (2433,)

What does the output tell us?

  • X_train shape: (9729, 40): there are 9,729 training samples, each with 40 features.
  • X_test shape: (2433, 40): there are 2,433 test samples, each with 40 features.
  • y_train shape: (9729,): 9,729 emotion labels are used for training.
  • y_test shape: (2433,): 2,433 labels are held out for evaluation.
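
To interpret the numeric class labels in the reports that follow, you can map each encoded index back to its emotion name using the fitted encoder (LabelEncoder assigns indices in alphabetical order):

# Map encoded class indices back to emotion names
for idx, emotion in enumerate(le.classes_):
    print(idx, "->", emotion)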

Step 6: Train and Evaluate Models

In this step, we will train the following supervised machine learning models:

  • Support Vector Machine (SVM)
  • Random Forest Classifier
  • Logistic Regression
  • K-Nearest Neighbors (KNN)

Let’s train them all in a single script and compare their accuracy. Use the code below:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define models
models = {
    "Support Vector Machine": SVC(kernel='linear'),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Train, predict, and evaluate each model
for name, model in models.items():
    print(f"\n===== {name} =====")    
    # Training
    model.fit(X_train, y_train)    
    # Prediction
    y_pred = model.predict(X_test)    
    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")   
    print("Classification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

Output:

===== Support Vector Machine =====
Accuracy: 0.6132
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.68      0.66       334
           1       0.00      0.00      0.00         1
           2       0.46      0.45      0.45       334
           3       0.45      0.48      0.46       334
           4       0.53      0.40      0.45       334
           5       0.50      0.50      0.50       297
           6       0.57      0.67      0.62       334
           7       0.00      0.00      0.00         1
           8       0.97      0.98      0.97       464

    accuracy                           0.61      2433
   macro avg       0.46      0.46      0.46      2433
weighted avg       0.61      0.61      0.61      2433

Confusion Matrix:
[[228   0  19  33  42  11   0   0   1]
 [  0   0   0   0   0   0   0   0   1]
 [ 27   0 149  33  25  41  55   0   4]
 [ 35   0  33 159  21  33  52   0   1]
 [ 53   0  45  65 132  22  10   0   7]
 [  6   0  46  31  18 148  48   0   0]
 [  3   0  28  33   7  39 223   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  1   0   5   0   5   0   0   0 453]]

===== Random Forest Classifier =====
Accuracy: 0.6543
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.77      0.72       334
           1       0.00      0.00      0.00         1
           2       0.52      0.48      0.50       334
           3       0.64      0.43      0.52       334
           4       0.50      0.50      0.50       334
           5       0.50      0.59      0.54       297
           6       0.61      0.68      0.64       334
           7       0.00      0.00      0.00         1
           8       0.99      1.00      0.99       464

    accuracy                           0.65      2433
   macro avg       0.49      0.49      0.49      2433
weighted avg       0.65      0.65      0.65      2433

Confusion Matrix:
[[258   0  15   5  43  13   0   0   0]
 [  0   0   0   0   0   0   0   0   1]
 [ 25   0 159  20  38  31  60   0   1]
 [ 32   0  33 144  39  33  52   0   1]
 [ 61   0  32  20 166  46   7   0   2]
 [  2   0  42  13  38 174  28   0   0]
 [  0   0  23  23   9  51 227   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  0   0   0   0   0   0   0   0 464]]

===== Logistic Regression =====
Accuracy: 0.5730
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.66      0.65       334
           1       0.00      0.00      0.00         1
           2       0.40      0.42      0.41       334
           3       0.45      0.46      0.46       334
           4       0.40      0.31      0.35       334
           5       0.48      0.41      0.44       297
           6       0.52      0.61      0.56       334
           7       0.00      0.00      0.00         1
           8       0.92      0.97      0.95       464

    accuracy                           0.57      2433
   macro avg       0.42      0.43      0.42      2433
weighted avg       0.56      0.57      0.57      2433

Confusion Matrix:
[[221   0  31  27  33  17   3   0   2]
 [  0   0   0   0   0   0   0   0   1]
 [ 25   0 139  32  33  41  56   0   8]
 [ 23   0  38 154  34  20  60   0   5]
 [ 62   0  52  56 105  21  20   0  18]
 [ 10   0  54  31  31 122  48   0   1]
 [  4   0  32  42  16  34 204   0   2]
 [  0   0   0   0   0   1   0   0   0]
 [  3   0   4   0   8   0   0   0 449]]

===== K-Nearest Neighbors =====
Accuracy: 0.6021
Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.77      0.68       334
           1       0.00      0.00      0.00         1
           2       0.40      0.47      0.43       334
           3       0.45      0.42      0.44       334
           4       0.49      0.37      0.42       334
           5       0.52      0.47      0.49       297
           6       0.59      0.55      0.57       334
           7       0.00      0.00      0.00         1
           8       0.99      0.99      0.99       464

    accuracy                           0.60      2433
   macro avg       0.45      0.45      0.45      2433
weighted avg       0.60      0.60      0.60      2433

Confusion Matrix:
[[258   0  20  21  27   8   0   0   0]
 [  0   0   0   0   0   0   0   0   1]
 [ 35   0 157  33  33  28  48   0   0]
 [ 43   0  48 141  26  27  48   0   1]
 [ 79   0  43  41 125  35  11   0   0]
 [  6   0  71  24  34 140  22   0   0]
 [  4   0  50  54  10  30 185   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  2   0   1   0   2   0   0   0 459]]

The Random Forest Classifier achieved the highest accuracy of 65.43%.
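
Since Matplotlib and Seaborn were imported for visualization, a confusion-matrix heatmap makes the best model's per-emotion errors easier to read. The sketch below re-fits the Random Forest so it does not depend on which model ran last in the loop.

# Plot the Random Forest confusion matrix as a heatmap
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred_rf = best_model.predict(X_test)

cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel("Predicted emotion")
plt.ylabel("True emotion")
plt.title("Random Forest - Confusion Matrix")
plt.show()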


Conclusion

The Random Forest Classifier performed best, with an accuracy of 65.43%. It handled the emotional speech patterns most effectively; the catch-all 'unknown' class (files whose names matched no emotion keyword) was predicted almost perfectly, while most true emotions landed in the 50-70% precision range.

SVM came next with 61.32% accuracy. It performed reasonably for most emotions but struggled with labels that had very few samples. KNN reached 60.21% accuracy; because it predicts from distances to nearby points, it is sensitive to feature scaling, which slightly affected its consistency.

Logistic Regression gave the lowest accuracy, 57.30%. Its linear decision boundaries could not capture the complexity of the emotional patterns. The 'calm' and 'surprised' classes had only one test sample each, and no model classified them correctly.
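
One possible follow-up experiment, since KNN and SVM are sensitive to feature scale, is to standardize the MFCC features before training. The snippet below is only a sketch of that idea using a scikit-learn pipeline; its results are not part of the numbers reported above.

# Possible improvement: scale features before a distance-based model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
print("Scaled KNN accuracy:", accuracy_score(y_test, scaled_knn.predict(X_test)))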


Colab Link:
https://colab.research.google.com/drive/1kYJRSyasdOQuXa2RN5g2RSAbvN3u4rUQ

Frequently Asked Questions (FAQs)

1. What is Speech Emotion Recognition?

2. Which algorithms are best for Speech Emotion Recognition?

3. What datasets are used for SER projects?

4. What tools and libraries are used in this project?

5. What are the key challenges in Speech Emotion Recognition?

Rohit Sharma

804 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
