Social Media Sentiment Analysis with Machine Learning Techniques

By Rohit Sharma

Updated on Jul 30, 2025 | 8 min read | 1.53K+ views

Understanding how people feel about a topic in real time can shape products, politics, and public opinion. 

In this project, you’ll perform social media sentiment analysis using real-world posts. You’ll clean raw text data, extract meaningful features, and train powerful models like Naïve Bayes and SVM to classify sentiments as positive, negative, or neutral.

What Should You Know to Build This Project Successfully?

Before starting your social media sentiment analysis project, it’s important to be familiar with these key concepts and tools:

  • Python programming (You’ll use Python throughout for data processing, visualization, and modeling.)
  • Pandas and NumPy (These libraries help you load tabular text data, perform calculations, and structure your dataset for modeling.)
  • Matplotlib or Seaborn (visualizing sentiment distributions and trends)
  • Scikit‑learn basics (training classifiers like Naïve Bayes or SVM, making predictions, and evaluating models using accuracy, precision, recall, and F1 score)
  • Intro to NLP concepts like tokenization, stopword removal, stemming, and vectorization (TF-IDF or word embeddings)
  • Optional: Familiarity with deep learning basics (if you want to implement LSTM using TensorFlow or Keras)

Behind the Scenes: Tools That Power Social Media Sentiment Analysis

To build this social media sentiment analysis project, you’ll use a solid mix of Python libraries focused on natural language processing, machine learning, and data visualization:

Tool / Library       | Purpose
Python               | Core language for scripting and automation
Google Colab         | Cloud-based platform to run notebooks without setup
Pandas               | Loads, cleans, and processes text datasets efficiently
NumPy                | Supports numerical operations during preprocessing and modeling
Matplotlib / Seaborn | Visualizes sentiment distributions, word frequencies, and trends
Scikit-learn         | Trains and evaluates models like Naïve Bayes and SVM
NLTK / spaCy         | Performs tokenization, stopword removal, and lemmatization
VADER                | Quickly classifies sentiment using a rule-based lexicon

How Long Will It Take and What Can You Expect?

You can complete this social media sentiment analysis project in 4 to 5 hours. It’s ideal for beginners who have some hands-on experience with Python and want to dive into real-world natural language processing tasks.

Smart Insights: Techniques That Power Social Media Sentiment Analysis

To build an effective sentiment analysis model for social media, you’ll apply essential techniques that help convert raw text into meaningful insights:

  • Text Cleaning & Preprocessing: Remove noise like URLs, emojis, stopwords, and special characters to clean up user posts.
  • TF-IDF Vectorization: Transform text data into numerical features that machine learning models can understand.
  • Machine Learning Models (Naïve Bayes, SVM): Train multiple models to classify posts as positive, negative, or neutral and compare their performance.

How to Build a Social Media Sentiment Analysis Model

Let’s build this project from scratch with clear, step-by-step guidance:

  1. Download the Dataset
  2. Upload and Read the Dataset in Google Colab
  3. Clean and Lemmatize the Text
  4. Extract Features with TF-IDF
  5. Split Data into Train and Test Sets
  6. Train Sentiment Classifiers
  7. Visualize the Sentiment Distribution
  8. Evaluate the Model with a Confusion Matrix

Without any further delay, let’s get started!

Step 1: Download the Dataset

Download the dataset from Kaggle, extract the ZIP file, and use the downloaded dataset file for the project.

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the dataset file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data and import the required libraries:

# Install necessary libraries
!pip install pandas scikit-learn nltk spacy vaderSentiment


# Import libraries
import pandas as pd
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Load dataset
df = pd.read_csv('social.csv')

# Basic overview
print(df.head())

# Check sentiment distribution
print(df['Sentiment Label'].value_counts())

Output:

                                Post ID  \
0  aa391375-7355-44b7-bcbf-97fb4e5a2ba3
1  1c9ec98d-437a-48d9-9cba-bd5ad853c59a
2  170e5b5b-1d9a-4d02-a957-93c4dbb18908
3  aec53496-60ee-4a06-8821-093a04dc8770
4  4eacddb7-990d-4056-8784-7e1d5c4d1404

                                        Post Content Sentiment Label  \
0  Word who nor center everything better politica...         Neutral
1  Begin administration population good president...        Positive
2  Thousand total sign. Agree product relationshi...        Positive
3  Individual from news third. Oil forget them di...         Neutral
4  Time adult letter see reduce. Attention sudden...        Negative

   Number of Likes  Number of Shares  Number of Comments  User Follower Count  \
0              157               243                  64                 4921
1              166                49                 121                  612
2              185               224                 179                 9441
3              851               369                  39                 6251
4              709               356                  52                 1285

    Post Date and Time Post Type Language
0  2024-01-10 00:14:21     video       fr
1  2024-02-03 00:20:11     image       es
2  2024-07-25 14:20:23     video       de
3  2024-02-20 09:15:09      text       de
4  2024-03-01 04:17:35     image       de

Sentiment Label
Neutral     682
Negative    675
Positive    643
Name: count, dtype: int64

Step 3: Text Cleaning and Lemmatization for Social Media Posts

To prepare social media posts for sentiment analysis, we clean the text by removing links, mentions, hashtags, special characters, and stopwords. We also apply lemmatization using spaCy to reduce words to their base forms.

Here is the code for this step:

import nltk
import spacy
import re

from nltk.corpus import stopwords
nltk.download('stopwords')

# Load stopwords and spaCy model
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')

# Text cleaning function
def clean_text(text):
    # Remove URLs, mentions, hashtags, non-alphabetic characters
    text = re.sub(r"http\S+|@\w+|#\w+|[^A-Za-z\s]", '', text.lower())
    doc = nlp(text)
    # Lemmatize and remove stopwords
    return ' '.join([token.lemma_ for token in doc if token.text not in stop_words and token.is_alpha])

# Apply cleaning function
df['clean_text'] = df['Post Content'].astype(str).apply(clean_text)

This step results in a new column, clean_text, that contains cleaned and lemmatized versions of the original posts, ready for vectorization and modeling.
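To see what the regular expression in clean_text removes, here is a minimal sketch that applies the same pattern to a single post using only the standard library. The sample post and the helper name strip_noise are made up for illustration, and the spaCy lemmatization and stopword filtering are left out, so this shows only the noise-removal part of the step.

```python
import re

# Same pattern as in clean_text: URLs, @mentions, #hashtags, then any
# remaining character that is not an ASCII letter or whitespace.
NOISE = re.compile(r"http\S+|@\w+|#\w+|[^A-Za-z\s]")

def strip_noise(text):
    """Lowercase the text, delete noise, and collapse leftover whitespace."""
    without_noise = NOISE.sub("", text.lower())
    return " ".join(without_noise.split())

# Hypothetical post, written for this illustration only.
sample = "Loving the NEW update!! \U0001F60D https://t.co/abc123 @upGrad #NLP2025"
cleaned = strip_noise(sample)
print(cleaned)  # loving the new update
```

Two side effects are worth keeping in mind. Hashtags and emojis are deleted entirely, so any sentiment they carry is lost. The character class [^A-Za-z\s] also drops accented letters, so "café" becomes "caf", which matters if posts are not written in English.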

Step 4: Feature Extraction Using TF-IDF

To convert cleaned text into numerical features for machine learning models, we use TF-IDF (Term Frequency–Inverse Document Frequency). It helps our sentiment classifier focus on the most meaningful terms.

Here is the code for this step:

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer with top 5000 features
tfidf = TfidfVectorizer(max_features=5000)

# Transform the cleaned text into TF-IDF vectors
X = tfidf.fit_transform(df['clean_text'])

# Define the target variable
y = df['Sentiment Label']

This step transforms each post into a feature vector over at most the 5,000 most frequent terms in the corpus, preparing the data for model training.
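To make the weighting concrete, here is a minimal standard-library sketch of the smoothed TF-IDF formula that scikit-learn documents as the TfidfVectorizer default: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The three toy documents and the whitespace tokenizer are assumptions for illustration. TfidfVectorizer tokenizes differently, so treat this as an approximation, not a replacement for the code above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]  # simplified tokenizer (assumption)
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        rows.append([value / norm for value in raw])
    return vocab, rows

# Toy corpus, invented for this example.
docs = ["great phone great camera", "slow phone", "great service"]
vocab, rows = tfidf(docs)
for word, weight in zip(vocab, rows[1]):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
```

In the second document, "slow" appears in only one post while "phone" appears in two, so "slow" receives the higher weight. Note also that Step 4 fits the vectorizer on every post before the split in Step 5, so the IDF weights reflect test posts as well; fitting on the training rows only is the stricter setup.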

Step 5: Splitting Data into Training and Testing Sets

To evaluate how well our sentiment analysis model performs, we split the dataset into a training set (used to train the model) and a testing set (used to evaluate it). We use stratified sampling to maintain the proportion of sentiment labels in both sets.

Here is the code for this step:

from sklearn.model_selection import train_test_split

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

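# X_train and X_test are SciPy sparse matrices, which do not support len(),
# so read the number of rows from .shape instead
print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])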

Output: 

Training set size: 1600
Testing set size: 400
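The effect of stratify=y can be illustrated with a small hand-written split. The sketch below is a simplified stand-in that uses only the standard library; it is not scikit-learn's implementation, and the function name stratified_split is hypothetical. It uses the label counts printed in Step 2 and samples 20% from each class separately.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Label counts taken from the value_counts() output in Step 2.
labels = ["Neutral"] * 682 + ["Negative"] * 675 + ["Positive"] * 643
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

print(len(train_idx), len(test_idx))  # 1600 400
print(dict(test_counts))  # {'Neutral': 136, 'Negative': 135, 'Positive': 129}
```

Each class contributes about one fifth of its posts to the test set, so the test set keeps nearly the same class proportions as the full dataset.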

Step 6: Training Sentiment Classifiers (Naïve Bayes & SVM)

Now that the data is ready, we’ll train two popular machine learning classifiers to predict sentiment: Naïve Bayes and Support Vector Machine (SVM). After training, we’ll evaluate both using classification metrics.

Here is the code for this step:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Naïve Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# SVM
svm = LinearSVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

# Evaluation
print("NB Results:\n", classification_report(y_test, y_pred_nb))
print("SVM Results:\n", classification_report(y_test, y_pred_svm))

Output:

NB Results:
               precision    recall  f1-score   support

    Negative       0.39      0.40      0.39       135
     Neutral       0.36      0.43      0.39       136
    Positive       0.30      0.23      0.26       129

    accuracy                           0.35       400
   macro avg       0.35      0.35      0.35       400
weighted avg       0.35      0.35      0.35       400

SVM Results:
               precision    recall  f1-score   support

    Negative       0.37      0.39      0.38       135
     Neutral       0.34      0.35      0.34       136
    Positive       0.34      0.32      0.33       129

    accuracy                           0.35       400
   macro avg       0.35      0.35      0.35       400
weighted avg       0.35      0.35      0.35       400


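Both models are now trained and evaluated. The classification report lists precision, recall, and F1-score for each class, plus overall accuracy. Both models reach about 0.35 accuracy, only slightly above the roughly 0.33 expected from guessing among three nearly balanced classes.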
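For intuition about what a multinomial Naïve Bayes classifier learns, here is a minimal from-scratch sketch with Laplace smoothing, written with the standard library only. The four toy posts, the function names, and the use of raw word counts are assumptions for illustration. The MultinomialNB model above is fitted on TF-IDF weights instead, so its numbers will differ.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {w: math.log((word_counts[label][w] + alpha) / denom)
                        for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in doc.split():
            # Words never seen in training are skipped.
            score += params["log_lik"].get(word, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training posts, invented for this example.
docs = ["love great phone", "great service love",
        "terrible slow phone", "hate terrible service"]
labels = ["Positive", "Positive", "Negative", "Negative"]
model = train_nb(docs, labels)

print(predict_nb(model, "love great service"))     # Positive
print(predict_nb(model, "terrible slow service"))  # Negative
```

The classifier can only separate classes when some words occur more often in one class than in another. If the word distributions look the same across labels, the scores for all classes end up close together and accuracy stays near chance.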

Step 7: Visualizing Sentiment Distribution

Before diving deeper, it's useful to understand the balance of sentiment classes in the dataset. Here's a quick plot showing the distribution of sentiment labels.

Here is the code:

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=df, x='Sentiment Label')
plt.title("Sentiment Distribution")
plt.show()

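Output: a bar chart with one bar per sentiment label (Neutral 682, Negative 675, Positive 643, matching the counts printed in Step 2).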

This plot helps you check whether the dataset is balanced or skewed toward certain sentiments, which can affect model performance.
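The same balance check can be done numerically. This short sketch uses the label counts printed in Step 2 to compute each class's share and the ratio between the largest and smallest class. The variable names are illustrative only.

```python
# Label counts taken from the value_counts() output in Step 2.
counts = {"Neutral": 682, "Negative": 675, "Positive": 643}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
# A ratio close to 1.0 means the classes are nearly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

With every class close to one third of the data and a ratio of about 1.06, class imbalance is unlikely to be the main factor behind the model results.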

Step 8: Evaluating Model with a Confusion Matrix

To better understand how well the SVM classifier performed, you can visualize its predictions using a confusion matrix. 

It shows the number of correct and incorrect classifications for each sentiment class.

Here is the code for this step:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred_svm, labels=svm.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svm.classes_)
disp.plot(cmap='Blues')
plt.title("SVM Confusion Matrix")
plt.show()

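Output: a 3×3 heatmap with true labels on the vertical axis and predicted labels on the horizontal axis.

The SVM model performs only slightly better than chance and frequently confuses the three classes:

  • Correctly predicts: 52 Negative, 47 Neutral, 41 Positive (140 of 400 test posts)
  • Misclassifies more than half of the posts in every class
  • Needs improvement in distinguishing close sentiment classes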
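The correct counts above tie back to the classification report in Step 6. The sketch below recomputes per-class recall and overall accuracy from those counts and the per-class support, then compares accuracy with a simple baseline that always predicts the most common test label. The variable names are illustrative.

```python
# Correct SVM predictions per class (from this step) and per-class
# support (from the classification report in Step 6).
correct = {"Negative": 52, "Neutral": 47, "Positive": 41}
support = {"Negative": 135, "Neutral": 136, "Positive": 129}

# Recall = correctly predicted posts of a class / all posts of that class.
recall = {label: correct[label] / support[label] for label in correct}
accuracy = sum(correct.values()) / sum(support.values())
# Accuracy of always predicting the most common test label.
majority_baseline = max(support.values()) / sum(support.values())

for label, value in recall.items():
    print(f"{label:<9}recall = {value:.2f}")
print(f"accuracy = {accuracy:.2f}, majority baseline = {majority_baseline:.2f}")
```

The recalls come out as 0.39, 0.35, and 0.32, matching the report, and the 0.35 accuracy is only one point above the 0.34 majority baseline.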

To improve these results and build your sentiment analysis skills further, you can:

  • Use more training data
  • Try deep learning models like LSTM
  • Add sentiment-rich features like emojis, hashtags, or sentiment lexicons

This analysis gives you a clear direction for enhancing your model’s accuracy.

Final Conclusion

In this project, you built a complete social media sentiment analysis pipeline using text preprocessing, TF-IDF, and classifiers like Naïve Bayes and SVM. Both models reached about 0.35 accuracy on the test set, close to the level of random guessing for three classes, and posts from every class were frequently misclassified. This project gave you practical experience in NLP and classification that you can now build on.

Colab Link:
https://colab.research.google.com/drive/1NJ9H956op_L6nyLVuEzSA44yP1uMAiiD?usp=sharing
