Social Media Sentiment Analysis with Machine Learning Techniques
By Rohit Sharma
Updated on Jul 30, 2025 | 8 min read | 1.53K+ views
Understanding how people feel about a topic in real time can shape products, politics, and public opinion.
In this project, you’ll perform social media sentiment analysis using real-world posts. You’ll clean raw text data, extract meaningful features, and train powerful models like Naïve Bayes and SVM to classify sentiments as positive, negative, or neutral.
Before starting your social media sentiment analysis project, it’s important to be familiar with these key concepts and tools:
To build this social media sentiment analysis project, you’ll use a solid mix of Python libraries focused on natural language processing, machine learning, and data visualization:
Tool / Library | Purpose |
Python | Core language for scripting and automation |
Google Colab | Cloud-based platform to run notebooks without setup |
Pandas | Loads, cleans, and processes text datasets efficiently |
NumPy | Supports numerical operations during preprocessing and modeling |
Matplotlib / Seaborn | Visualizes sentiment distributions, word frequencies, and trends |
Scikit-learn | Trains and evaluates models like Naïve Bayes and SVM |
NLTK / spaCy | Performs tokenization, stopword removal, and lemmatization |
VADER | Quickly classifies sentiment using a rule-based lexicon |
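The table lists VADER, but the project steps never actually call it. As a minimal sketch of how it could serve as a quick rule-based baseline, the snippet below maps VADER's compound score to the three labels used in this project. The ±0.05 cut-offs are the thresholds conventionally suggested in the VADER documentation; the helper names label_from_compound and vader_label are illustrative and not part of the original notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to one of three labels."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def vader_label(text, analyzer=None):
    """Score one post with VADER and return its three-way label."""
    # Imported lazily so the thresholding helper can be used on its own.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = analyzer or SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)  # dict with neg, neu, pos, compound
    return label_from_compound(scores["compound"])
```

Once the data is loaded, something like df['Post Content'].astype(str).apply(vader_label) would give a label per post that you could compare against the trained classifiers. Creating one SentimentIntensityAnalyzer and passing it in as analyzer avoids rebuilding the lexicon for every post.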
You can complete this social media sentiment analysis project in 4 to 5 hours. It’s ideal for beginners who have some hands-on experience with Python and want to dive into real-world natural language processing tasks.
To build an effective sentiment analysis model for social media, you’ll apply essential techniques that help convert raw text into meaningful insights:
Let’s build this project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
Download the dataset from Kaggle, extract the ZIP file, and use the downloaded dataset file for the project.
Now that you’ve downloaded the dataset, upload the CSV file to Google Colab using the code below. The loading code in the next step expects the file to be named social.csv.
from google.colab import files
uploaded = files.upload()
Once uploaded, use the following Python code to install and import the required libraries, then load and inspect the data:
# Install necessary libraries
!pip install pandas scikit-learn nltk spacy vaderSentiment
# Import libraries
import pandas as pd
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Load dataset
df = pd.read_csv('social.csv')
# Basic overview
print(df.head())
# Check sentiment distribution
print(df['Sentiment Label'].value_counts())
Output:
Post ID \
0 aa391375-7355-44b7-bcbf-97fb4e5a2ba3
1 1c9ec98d-437a-48d9-9cba-bd5ad853c59a
2 170e5b5b-1d9a-4d02-a957-93c4dbb18908
3 aec53496-60ee-4a06-8821-093a04dc8770
4 4eacddb7-990d-4056-8784-7e1d5c4d1404
Post Content Sentiment Label \
0 Word who nor center everything better politica... Neutral
1 Begin administration population good president... Positive
2 Thousand total sign. Agree product relationshi... Positive
3 Individual from news third. Oil forget them di... Neutral
4 Time adult letter see reduce. Attention sudden... Negative
Number of Likes Number of Shares Number of Comments User Follower Count \
0 157 243 64 4921
1 166 49 121 612
2 185 224 179 9441
3 851 369 39 6251
4 709 356 52 1285
Post Date and Time Post Type Language
0 2024-01-10 00:14:21 video fr
1 2024-02-03 00:20:11 image es
2 2024-07-25 14:20:23 video de
3 2024-02-20 09:15:09 text de
4 2024-03-01 04:17:35 image de
Sentiment Label
Neutral 682
Negative 675
Positive 643
Name: count, dtype: int64
To prepare social media posts for sentiment analysis, we clean the text by removing links, mentions, hashtags, special characters, and stopwords. We also apply lemmatization using spaCy to reduce words to their base forms.
Here is the code for this step:
import nltk
import spacy
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
# Load stopwords and spaCy model
# If the model is missing, install it first: python -m spacy download en_core_web_sm
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')
# Text cleaning function
def clean_text(text):
    # Lowercase, then remove URLs, mentions, hashtags, non-alphabetic characters
    text = re.sub(r"http\S+|@\w+|#\w+|[^A-Za-z\s]", '', text.lower())
    doc = nlp(text)
    # Lemmatize and remove stopwords
    return ' '.join([token.lemma_ for token in doc if token.text not in stop_words and token.is_alpha])
# Apply cleaning function
df['clean_text'] = df['Post Content'].astype(str).apply(clean_text)
This step adds a new column, clean_text, that contains the cleaned and lemmatized version of each post, ready for vectorization and modeling.
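To see what the cleaning pattern does to a single post, here is a small standard-library sketch. It reuses the same regular expression as clean_text but skips spaCy lemmatization, and it swaps NLTK's stopword list for a tiny hand-written one. The sample post, TOY_STOPWORDS, and the helper names are made up for illustration.

```python
import re

# Same pattern as in clean_text: URLs, @mentions, #hashtags, then any
# character that is not an ASCII letter or whitespace.
NOISE = re.compile(r"http\S+|@\w+|#\w+|[^A-Za-z\s]")

# Tiny illustrative stopword list (an assumption; the project uses NLTK's).
TOY_STOPWORDS = {"the", "a", "is", "to"}


def strip_noise(text):
    """Lowercase the text, drop noisy tokens, and collapse extra whitespace."""
    return " ".join(NOISE.sub("", text.lower()).split())


def drop_stopwords(text):
    """Remove stopwords from an already cleaned, space-separated string."""
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)


post = "Loving the new phone! https://t.co/abc @brand #happy 100%"
stripped = strip_noise(post)
print(stripped)                  # loving the new phone
print(drop_stopwords(stripped))  # loving new phone
```

Notice that the pattern deletes the whole hashtag, so a sentiment-bearing tag such as #happy disappears completely. If hashtags matter for your data, one option is to strip only the # symbol and keep the word.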
To convert cleaned text into numerical features for machine learning models, we use TF-IDF (Term Frequency–Inverse Document Frequency). It gives a term more weight when it appears often in a post but rarely across the dataset, so the classifier can focus on distinctive terms rather than common ones.
Here is the code for this step:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TF-IDF vectorizer, keeping at most the 5000 most frequent terms
tfidf = TfidfVectorizer(max_features=5000)
# Transform the cleaned text into TF-IDF vectors
X = tfidf.fit_transform(df['clean_text'])
# Define the target variable
y = df['Sentiment Label']
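This step transforms each post into a TF-IDF feature vector over a vocabulary of at most 5,000 terms (max_features keeps the terms that occur most often across the corpus), preparing the data for model training. The vectorizer here is fitted on the full dataset before the split in the next step; for a stricter evaluation, fit it on the training posts only and reuse it to transform the test posts.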
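If you want to see where the TF-IDF numbers come from, the following sketch computes them by hand for three toy posts. It assumes plain whitespace tokenization and the smoothed IDF and L2 normalization that scikit-learn applies by default; TfidfVectorizer's own tokenizer differs slightly. The posts and function names are illustrative only.

```python
import math
from collections import Counter

docs = [
    "good phone good camera",
    "bad phone",
    "good service",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of posts in which each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    """Return a term -> weight mapping, L2-normalized to unit length."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


for doc, tokens in zip(docs, tokenized):
    weights = tfidf_vector(tokens)
    print(doc, {term: round(weight, 3) for term, weight in weights.items()})
```

In the second post, "bad" receives a higher weight than "phone" because "bad" appears in only one post while "phone" appears in two.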
To evaluate how well our sentiment analysis model performs, we split the dataset into a training set (used to train the model) and a testing set (used to evaluate it). We use stratified sampling to maintain the proportion of sentiment labels in both sets.
Here is the code for this step:
from sklearn.model_selection import train_test_split
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# X_train and X_test are SciPy sparse matrices, so use shape[0] rather than len()
print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])
Output:
Training set size: 1600
Testing set size: 400
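To check what stratify=y actually does, you can split the labels on their own and count the classes in the test portion. This is a small sketch rather than part of the original notebook: it rebuilds a label list from the value_counts() output shown earlier, and the variable names are illustrative.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Label counts copied from the value_counts() output earlier in the article.
labels = ["Neutral"] * 682 + ["Negative"] * 675 + ["Positive"] * 643

# Stratifying on the labels keeps each class at roughly the same share
# of the training and testing portions as in the full dataset.
train_labels, test_labels = train_test_split(
    labels, test_size=0.2, stratify=labels, random_state=42
)

test_counts = Counter(test_labels)
print(len(train_labels), len(test_labels))
print(dict(test_counts))
```

The per-class counts in the test portion should line up with the support column of the classification reports in the next step, which is a quick way to confirm that the split preserved the label proportions.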
Now that the data is ready, we’ll train two popular machine learning classifiers to predict sentiment: Naïve Bayes and Support Vector Machine (SVM). After training, we’ll evaluate both using classification metrics.
Here is the code for this step:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
# Naïve Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
# SVM
svm = LinearSVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
# Evaluation
print("NB Results:\n", classification_report(y_test, y_pred_nb))
print("SVM Results:\n", classification_report(y_test, y_pred_svm))
Output:
NB Results:
precision recall f1-score support
Negative 0.39 0.40 0.39 135
Neutral 0.36 0.43 0.39 136
Positive 0.30 0.23 0.26 129
accuracy 0.35 400
macro avg 0.35 0.35 0.35 400
weighted avg 0.35 0.35 0.35 400
SVM Results:
precision recall f1-score support
Negative 0.37 0.39 0.38 135
Neutral 0.34 0.35 0.34 136
Positive 0.34 0.32 0.33 129
accuracy 0.35 400
macro avg 0.35 0.35 0.35 400
weighted avg 0.35 0.35 0.35 400
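Both models are now trained and evaluated. The classification report lists precision, recall, and F1-score for each class, along with overall accuracy. Here both models reach 0.35 accuracy, only slightly above the 0.33 expected from random guessing across three nearly balanced classes.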
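To put the accuracy figures in context, it helps to compare them with trivial baselines. The sketch below is not part of the original notebook; it uses only the class sizes from the support column of the reports above and the reported accuracy of 0.35.

```python
# Test-set class sizes, copied from the "support" column of the reports above.
support = {"Negative": 135, "Neutral": 136, "Positive": 129}
total = sum(support.values())

# Always predicting the most frequent class achieves this accuracy.
majority_class = max(support, key=support.get)
majority_accuracy = support[majority_class] / total

# Guessing uniformly at random is right one time in three on average.
uniform_accuracy = 1 / len(support)

reported_accuracy = 0.35  # both Naive Bayes and SVM in the reports above

print(f"majority baseline ({majority_class}): {majority_accuracy:.2f}")
print(f"uniform random guess: {uniform_accuracy:.2f}")
print(f"gain over majority baseline: {reported_accuracy - majority_accuracy:.2f}")
```

A gain of about one percentage point over always predicting Neutral suggests that neither model has learned much from the text features, which is worth keeping in mind when reading the confusion matrix later.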
Before diving deeper, it's useful to understand the balance of sentiment classes in the dataset. Here's a quick plot showing the distribution of sentiment labels.
Here is the code:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(data=df, x='Sentiment Label')
plt.title("Sentiment Distribution")
plt.show()
Output:
[Bar chart: number of posts per sentiment label]
This plot helps you check whether the dataset is balanced or skewed toward certain sentiments, which can affect model performance.
To better understand how well the SVM classifier performed, you can visualize its predictions using a confusion matrix.
It shows the number of correct and incorrect classifications for each sentiment class.
Here is the code for this step:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred_svm, labels=svm.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svm.classes_)
disp.plot(cmap='Blues')
plt.title("SVM Confusion Matrix")
plt.show()
Output:
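The SVM model reaches only 35% accuracy on three nearly balanced classes, barely above the 33% expected from random guessing. Its per-class recall (0.39 for Negative, 0.35 for Neutral, 0.32 for Positive) shows that every sentiment is misclassified more often than it is identified correctly.
To improve on this and build your sentiment analysis skills further, treat these results as the baseline to beat. This analysis gives you a clear starting point for improving your model’s accuracy.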
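In this project, you built a complete social media sentiment analysis pipeline using text preprocessing, TF-IDF, and classifiers like Naïve Bayes and SVM. Both models reached about 35% accuracy, with no clear winner between them, and all three sentiment classes were frequently misclassified. That is close to chance for three balanced classes, which suggests the post text in this dataset carries little usable sentiment signal (the sample posts shown earlier read like randomly generated words). Even so, the project gave you practical experience in NLP and classification that you can now build on.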
Colab Link:
https://colab.research.google.com/drive/1NJ9H956op_L6nyLVuEzSA44yP1uMAiiD?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...