Speech Emotion Recognition Project Using ML

By Rohit Sharma

Updated on Jul 30, 2025 | 11 min read | 1.38K+ views


One of the most effective ways for people to communicate their ideas and feelings is through speech. Our tone, pitch, and intensity convey emotions, from happiness to fear, that go far beyond the words themselves. But can machines identify these feelings from our speech alone? Speech Emotion Recognition (SER) seeks to accomplish exactly that.

The goal of this project is to develop a machine learning model that can identify emotions in speech audio recordings. We'll make use of well-known datasets like RAVDESS, CREMA-D, TESS, and SAVEE. We will teach our model to categorize emotions such as fear, sadness, anger, and happiness.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python programming
  • Basic machine learning concepts (classification, train/test splits, model evaluation)
  • Audio features such as MFCCs (helpful, but not mandatory)

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library          | Purpose
----------------------|---------------------------------------------------------------
Python                | Main programming language for coding the project
NumPy & Pandas        | Numerical operations and data manipulation
Librosa               | Audio processing and MFCC feature extraction
Matplotlib & Seaborn  | Visualization of audio features and model performance
Scikit-learn          | Building and evaluating ML models like SVM and Random Forest

Models Used for Learning

Below are the models we will use:

Model Name                    | Purpose / Why Used
------------------------------|------------------------------------------------------------------------------------
Support Vector Machine (SVM)  | Effective for high-dimensional data; works well for emotion classification tasks.
Random Forest Classifier      | Reduces overfitting and improves accuracy through ensemble decision trees.
Logistic Regression           | Acts as a strong baseline model for multi-class classification problems.
K-Nearest Neighbors (KNN)     | Predicts emotions based on the labels of the closest data points in the feature space.

Time Taken and Difficulty Level

On average, the project takes about 3 to 5 hours to complete. The duration may vary depending on your familiarity with Python, audio features, and ML concepts. It is best suited for beginners.

How to Build a Speech Emotion Recognition Model

Let’s start building the project from scratch. We will start by:

  • Downloading the dataset from Kaggle using KaggleHub.
  • Loading audio files from RAVDESS, CREMA-D, TESS, and SAVEE.
  • Extracting MFCC features using Librosa.
  • Preparing the data for training our machine learning models.

Without any further delay, let’s start!

Step 1: Download the Dataset

We will use the kagglehub library to download the dataset directly into our Colab environment. Here is the code to do so:

import kagglehub
# Download the latest version of the speech emotion dataset
path = kagglehub.dataset_download("dmitrybabko/speech-emotion-recognition-en")
print("Path to dataset files:", path)

The dataset has been successfully downloaded and extracted.
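
If you want to confirm what was downloaded, an optional check like the one below lists the top-level dataset folders. It assumes the path variable printed above points at the extracted files.

# Optional: list the top-level folders of the downloaded dataset
import os
print(os.listdir(path))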

Step 2: Import Required Libraries

In this step, we will import the following libraries:

  • Numerical computation (numpy, pandas)
  • Audio processing (librosa, soundfile)
  • Model building (scikit-learn)
  • Visualization (matplotlib, seaborn)
  • File handling (os, glob, warnings)

Here is the code to do so:

# Numerical and data handling
import numpy as np
import pandas as pd

# Audio processing
import librosa
import soundfile as sf

# File handling and preprocessing
import os
import glob
import warnings
warnings.filterwarnings('ignore')

# Model training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Step 3: Load and Organize Audio Files

In this step, we will load audio files from the downloaded dataset. The dataset contains folders for RAVDESS, CREMA-D, TESS, and SAVEE. Each folder contains .wav audio clips labeled by emotion.

We will:

  • Traverse each dataset folder
  • Extract audio paths
  • Map them to the correct emotion label

Here is the code to accomplish the same:

# Define emotion labels to be recognized
emotions = {
    "angry": "angry",
    "disgust": "disgust",
    "fear": "fearful",
    "happy": "happy",
    "neutral": "neutral",
    "sad": "sad",
    "surprise": "surprised"
}

# Path to your extracted dataset (adjust if needed)
DATASET_PATH = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"

# Collect all .wav files
audio_files = glob.glob(os.path.join(DATASET_PATH, "**/*.wav"), recursive=True)
print(f"Total audio files found: {len(audio_files)}")

Output:

Total audio files found: 12162

The output shows the total number of audio clips found across all four datasets.
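
Before moving on, it can help to glance at a few filenames, because each dataset encodes the emotion differently (some use text keywords, while RAVDESS uses numeric codes). A minimal, optional sketch:

# Optional: peek at a few filenames to see how emotions are encoded
for f in audio_files[:5]:
    print(os.path.basename(f))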

Step 4: Extract Features and Emotion Labels

In this step, we will extract meaningful features that our machine learning models can understand. We will use MFCC (Mel-Frequency Cepstral Coefficients). It is a powerful feature used in speech processing to capture the timbral texture of audio signals.
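
Before running the full extraction loop below, you can sanity-check MFCC extraction on a single clip. This is a minimal sketch; it simply reuses the first path from the audio_files list built in Step 3.

# Optional sanity check: MFCCs for a single audio file
sample_signal, sample_sr = librosa.load(audio_files[0], sr=22050)
sample_mfcc = librosa.feature.mfcc(y=sample_signal, sr=sample_sr, n_mfcc=40)
print(sample_mfcc.shape)                      # (40, number_of_frames)
print(np.mean(sample_mfcc.T, axis=0).shape)   # (40,) after averaging over time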

Use the code below to accomplish this:

import os
import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# Path to the dataset
dataset_path = "/root/.cache/kagglehub/datasets/dmitrybabko/speech-emotion-recognition-en/versions/1/"

# Emotion mapping from filename keywords
emotion_map = {
    'ang': 'angry',
    'hap': 'happy',
    'sad': 'sad',
    'fea': 'fearful',
    'dis': 'disgust',
    'sur': 'surprised',
    'neu': 'neutral',
    'calm': 'calm'
}

# Helper function to extract emotion label from filename
def extract_emotion(filename):
    for key in emotion_map:
        if key in filename.lower():
            return emotion_map[key]
    return 'unknown'

# Prepare lists to hold features and labels
features = []
labels = []

print(" Extracting MFCC features...")

# Loop over all audio files
for root, dirs, files in os.walk(dataset_path):
    for file in tqdm(files):
        if file.endswith('.wav'):
            try:
                file_path = os.path.join(root, file)

                # Load the audio
                signal, sr = librosa.load(file_path, sr=22050)

                # Extract MFCCs
                mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
                mfccs_mean = np.mean(mfccs.T, axis=0)  # Mean across time

                # Extract label
                label = extract_emotion(file)

                # Append to lists
                features.append(mfccs_mean)
                labels.append(label)
            except Exception as e:
                print(f" Error with {file_path}: {e}")

# Convert to DataFrame
X = pd.DataFrame(features)
y = pd.Series(labels)

print("\n Feature extraction complete!")
print("Shape of features (X):", X.shape)
print("Unique emotion labels:", y.unique())

Output:

Extracting MFCC features...

(Per-folder tqdm progress bars omitted: 14 TESS folders of 200 clips each, SAVEE with 480 clips, CREMA-D with 7,442 clips, and 24 RAVDESS actor folders of 60 clips each.)

Feature extraction complete!

Shape of features (X): (12162, 40)

Unique emotion labels: ['neutral' 'surprised' 'fearful' 'unknown' 'calm' 'disgust' 'sad' 'happy' 'angry']

Feature extraction is now complete, and we have:

  • Total samples: 12,162
  • Feature shape: each audio file is represented by 40 MFCC features
  • Emotion labels: 9 unique labels, including 'neutral', 'happy', 'sad', 'angry', 'fearful', 'calm', 'disgust', 'surprised', and 'unknown'
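
Before encoding the labels, it is also worth checking how the samples are distributed across emotions. The classes are uneven, and the 'unknown' label (files whose names matched no keyword) could optionally be dropped; this walkthrough keeps all labels so the shapes in the next step match. A quick sketch:

# Inspect the class distribution
print(y.value_counts())

# Optional (not applied here): drop the 'unknown' catch-all label
# mask = y != 'unknown'
# X, y = X[mask].reset_index(drop=True), y[mask].reset_index(drop=True)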

Step 5: Encode Labels and Split the Data

In this step, we will convert emotion labels (like 'happy', 'sad', etc.) into numerical form and split the dataset. 

Here is the code to accomplish the same:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode emotion labels into numeric format
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # 'y' is the list of emotion labels

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Check the shape of splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

X_train shape: (9729, 40)
X_test shape: (2433, 40)
y_train shape: (9729,)
y_test shape: (2433,)

What does the output tell us?

  • X_train shape: (9729, 40): there are 9,729 training samples, each with 40 features.
  • X_test shape: (2433, 40): there are 2,433 test samples, each with 40 features.
  • y_train shape: (9729,): 9,729 emotion labels are used for training.
  • y_test shape: (2433,): 2,433 labels are held out for evaluation.
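
To interpret the numeric class labels in the reports that follow, you can map each encoded index back to its emotion name using the fitted encoder (LabelEncoder assigns indices in alphabetical order):

# Map encoded class indices back to emotion names
for idx, emotion in enumerate(le.classes_):
    print(idx, "->", emotion)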

Step 6: Train and Evaluate Models

In this step, we will train the following supervised machine learning models:

  • Support Vector Machine (SVM)
  • Random Forest Classifier
  • Logistic Regression
  • K-Nearest Neighbors (KNN)

Let’s train them all in a single script and compare their accuracy. Use the code below:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define models
models = {
    "Support Vector Machine": SVC(kernel='linear'),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Train, predict, and evaluate each model
for name, model in models.items():
    print(f"\n===== {name} =====")    
    # Training
    model.fit(X_train, y_train)    
    # Prediction
    y_pred = model.predict(X_test)    
    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")   
    print("Classification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

Output:

===== Support Vector Machine =====
Accuracy: 0.6132
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.68      0.66       334
           1       0.00      0.00      0.00         1
           2       0.46      0.45      0.45       334
           3       0.45      0.48      0.46       334
           4       0.53      0.40      0.45       334
           5       0.50      0.50      0.50       297
           6       0.57      0.67      0.62       334
           7       0.00      0.00      0.00         1
           8       0.97      0.98      0.97       464

    accuracy                           0.61      2433
   macro avg       0.46      0.46      0.46      2433
weighted avg       0.61      0.61      0.61      2433

Confusion Matrix:
[[228   0  19  33  42  11   0   0   1]
 [  0   0   0   0   0   0   0   0   1]
 [ 27   0 149  33  25  41  55   0   4]
 [ 35   0  33 159  21  33  52   0   1]
 [ 53   0  45  65 132  22  10   0   7]
 [  6   0  46  31  18 148  48   0   0]
 [  3   0  28  33   7  39 223   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  1   0   5   0   5   0   0   0 453]]

===== Random Forest Classifier =====
Accuracy: 0.6543
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.77      0.72       334
           1       0.00      0.00      0.00         1
           2       0.52      0.48      0.50       334
           3       0.64      0.43      0.52       334
           4       0.50      0.50      0.50       334
           5       0.50      0.59      0.54       297
           6       0.61      0.68      0.64       334
           7       0.00      0.00      0.00         1
           8       0.99      1.00      0.99       464

    accuracy                           0.65      2433
   macro avg       0.49      0.49      0.49      2433
weighted avg       0.65      0.65      0.65      2433

Confusion Matrix:
[[258   0  15   5  43  13   0   0   0]
 [  0   0   0   0   0   0   0   0   1]
 [ 25   0 159  20  38  31  60   0   1]
 [ 32   0  33 144  39  33  52   0   1]
 [ 61   0  32  20 166  46   7   0   2]
 [  2   0  42  13  38 174  28   0   0]
 [  0   0  23  23   9  51 227   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  0   0   0   0   0   0   0   0 464]]

===== Logistic Regression =====
Accuracy: 0.5730
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.66      0.65       334
           1       0.00      0.00      0.00         1
           2       0.40      0.42      0.41       334
           3       0.45      0.46      0.46       334
           4       0.40      0.31      0.35       334
           5       0.48      0.41      0.44       297
           6       0.52      0.61      0.56       334
           7       0.00      0.00      0.00         1
           8       0.92      0.97      0.95       464

    accuracy                           0.57      2433
   macro avg       0.42      0.43      0.42      2433
weighted avg       0.56      0.57      0.57      2433

Confusion Matrix:
[[221   0  31  27  33  17   3   0   2]
 [  0   0   0   0   0   0   0   0   1]
 [ 25   0 139  32  33  41  56   0   8]
 [ 23   0  38 154  34  20  60   0   5]
 [ 62   0  52  56 105  21  20   0  18]
 [ 10   0  54  31  31 122  48   0   1]
 [  4   0  32  42  16  34 204   0   2]
 [  0   0   0   0   0   1   0   0   0]
 [  3   0   4   0   8   0   0   0 449]]

===== K-Nearest Neighbors =====
Accuracy: 0.6021
Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.77      0.68       334
           1       0.00      0.00      0.00         1
           2       0.40      0.47      0.43       334
           3       0.45      0.42      0.44       334
           4       0.49      0.37      0.42       334
           5       0.52      0.47      0.49       297
           6       0.59      0.55      0.57       334
           7       0.00      0.00      0.00         1
           8       0.99      0.99      0.99       464

    accuracy                           0.60      2433
   macro avg       0.45      0.45      0.45      2433
weighted avg       0.60      0.60      0.60      2433

Confusion Matrix:
[[258   0  20  21  27   8   0   0   0]
 [  0   0   0   0   0   0   0   0   1]
 [ 35   0 157  33  33  28  48   0   0]
 [ 43   0  48 141  26  27  48   0   1]
 [ 79   0  43  41 125  35  11   0   0]
 [  6   0  71  24  34 140  22   0   0]
 [  4   0  50  54  10  30 185   0   1]
 [  0   0   0   0   0   1   0   0   0]
 [  2   0   1   0   2   0   0   0 459]]

The Random Forest Classifier achieved the highest accuracy of 65.43%.
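
Since Matplotlib and Seaborn were imported for visualization, a confusion-matrix heatmap makes the best model's per-emotion errors easier to read. The sketch below re-fits the Random Forest so it does not depend on which model ran last in the loop.

# Plot the Random Forest confusion matrix as a heatmap
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred_rf = best_model.predict(X_test)

cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel("Predicted emotion")
plt.ylabel("True emotion")
plt.title("Random Forest - Confusion Matrix")
plt.show()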


Conclusion

The Random Forest Classifier performed best, with an accuracy of 65.43%. It handled the emotional speech patterns most effectively; the catch-all 'unknown' class (files whose names matched no emotion keyword) was predicted almost perfectly, while most true emotions landed in the 50-70% precision range.

SVM came next with 61.32% accuracy. It performed reasonably for most emotions but struggled with labels that had very few samples. KNN reached 60.21% accuracy; because it predicts from distances to nearby points, it is sensitive to feature scaling, which slightly affected its consistency.

Logistic Regression gave the lowest accuracy, 57.30%. Its linear decision boundaries could not capture the complexity of the emotional patterns. The 'calm' and 'surprised' classes had only one test sample each, and no model classified them correctly.
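
One possible follow-up experiment, since KNN and SVM are sensitive to feature scale, is to standardize the MFCC features before training. The snippet below is only a sketch of that idea using a scikit-learn pipeline; its results are not part of the numbers reported above.

# Possible improvement: scale features before a distance-based model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
print("Scaled KNN accuracy:", accuracy_score(y_test, scaled_knn.predict(X_test)))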


Colab Link:
https://colab.research.google.com/drive/1kYJRSyasdOQuXa2RN5g2RSAbvN3u4rUQ

Frequently Asked Questions (FAQs)

1. What is Speech Emotion Recognition?

2. Which algorithms are best for Speech Emotion Recognition?

3. What datasets are used for SER projects?

4. What tools and libraries are used in this project?

5. What are the key challenges in Speech Emotion Recognition?

Rohit Sharma

804 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
