DataFrames in Python: Why Every Data Scientist Is Obsessed!
By Rohit Sharma
Updated on Jul 08, 2025 | 21 min read | 7.99K+ views
Did you know? Python’s popularity increased by 2.2% from April to May 2025, surpassing competitors like C++, C, and Java! This growing demand highlights the increasing reliance on Python’s powerful features, like DataFrames, making data analysis faster and more intuitive than ever.
DataFrames in Python are two-dimensional, size-mutable, and labeled data structures provided by the Pandas library. They store data in rows and columns, similar to tables in databases or spreadsheets, allowing efficient data manipulation and analysis.
DataFrames can hold various data types and support operations like filtering, grouping, and aggregating, making them indispensable in data science.
In this blog, you’ll learn about DataFrames in Python, focusing on their creation, manipulation, and advanced usage for practical analysis.
Interested in learning more about DataFrames in Python? Enrol in upGrad’s Online Software Development Courses, featuring an updated curriculum on generative AI and specializations like full-stack development.
A Pandas DataFrame is a 2D labeled structure that can store data of different types (e.g., integers, floats, strings) across rows and columns. It is similar to a table in a database or an Excel spreadsheet, with labeled axes (rows and columns). The main components of a DataFrame are:
1. Data: the actual values, arranged in rows and columns; different columns can hold different types.
2. Index: the row labels, which default to integers starting at 0 but can be set to custom labels.
3. Columns: the column labels, which identify each field in the data.
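These three components can be inspected directly on any DataFrame; a minimal sketch:

```python
import pandas as pd

# A small DataFrame to illustrate the three components
df = pd.DataFrame({'Name': ['Aman', 'Bhoomi'], 'Age': [24, 27]})

print(df.values)   # the data, as a 2D NumPy array
print(df.index)    # row labels: RangeIndex(start=0, stop=2, step=1)
print(df.columns)  # column labels: Index(['Name', 'Age'], dtype='object')
```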
Understanding concepts like DataFrames is just the beginning. To advance in Python and build a successful tech career, continuous learning is essential. Here are some relevant courses that can help you in your learning journey:
Understanding DataFrames is an essential step in working with data in Python. To start, let’s ensure that Pandas is installed:
pip install pandas
Once installed, import Pandas into your Python script:
import pandas as pd
Let’s now look at how to create DataFrames in Python using Pandas, using sources such as dictionaries, lists, or external files like CSV and Excel.
Creating a DataFrame from a dictionary maps the dictionary keys to column labels while the corresponding values fill the columns. This approach is beneficial when dealing with structured data, as it allows direct mapping of data attributes (keys) to columns.
Code Example:
import pandas as pd
# Create a DataFrame from a dictionary
data = {'Name': ['Aman', 'Bhoomi', 'Chetan'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
print(df)
Explanation: The dictionary keys ('Name', 'Age') become the column labels, and each list of values fills the corresponding column. Rows receive a default integer index starting at 0.
Output:
Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
Also Read: Python Challenges for Beginners
Using a list of lists allows you to create a DataFrame where each inner list represents a row. Since the list of lists does not contain column labels, the columns parameter must be specified separately to define the DataFrame's structure.
Code Example:
import pandas as pd
# Create a DataFrame from a list of lists
data = [['Aman', 24], ['Bhoomi', 27], ['Chetan', 22]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Explanation: Each inner list becomes one row of the DataFrame, and the columns argument supplies the column labels that the raw lists lack.
Output:
Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
Get a better understanding of Python with upGrad’s Learn Python Libraries: NumPy, Matplotlib & Pandas. Learn how to manipulate data using NumPy, visualize insights with Matplotlib, and analyze datasets with Pandas.
Also Read: A Comprehensive Guide to Pandas DataFrame astype()
Pandas allows you to read data directly from external files, such as CSV or Excel, into a DataFrame. This is particularly useful when working with large datasets stored externally.
Code Example:
import pandas as pd
# Create a DataFrame from a CSV file
df = pd.read_csv('data.csv')
print(df)
Explanation: pd.read_csv() reads the file at the given path, infers the column labels from the header row, and returns the contents as a DataFrame.
Output (Assuming the CSV contains columns Name, Age, Country):
Name Age Country
0 Aman 24 USA
1 Bhoomi 27 UK
2 Chetan 22 Canada
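If you want to try read_csv() without a file on disk, it also accepts any file-like object; a self-contained sketch using io.StringIO, mirroring the data above. Excel files load the same way via pd.read_excel('data.xlsx'), which additionally requires an engine such as openpyxl to be installed:

```python
import io
import pandas as pd

# Simulate 'data.csv' in memory; in practice you would pass a file path
csv_text = "Name,Age,Country\nAman,24,USA\nBhoomi,27,UK\nChetan,22,Canada\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # (3, 3)
print(list(df.columns))   # ['Name', 'Age', 'Country']
```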
Also Read: Career Opportunities in Python: Everything You Need To Know [2025]
You can create a DataFrame from a NumPy array, where each row in the array becomes a row in the DataFrame. Columns must be defined explicitly, and this method is beneficial when working with numerical data.
Code Example:
import pandas as pd
import numpy as np
# Create a DataFrame from a NumPy array
data = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(data, columns=['Column1', 'Column2'])
print(df)
Explanation: Each row of the NumPy array becomes a row of the DataFrame. Because arrays carry no labels, the columns argument names the columns explicitly.
Output:
Column1 Column2
0 1 2
1 3 4
2 5 6
Also Read: Top 7 Data Types in Python: Examples, Differences, and Best Practices (2025)
When data is structured as a list of dictionaries, each dictionary represents a row in the DataFrame, with the dictionary keys becoming the column labels.
Code Example:
import pandas as pd
# Create a DataFrame from a list of dictionaries
data = [{'Name': 'Aman', 'Age': 24}, {'Name': 'Bhoomi', 'Age': 27}]
df = pd.DataFrame(data)
print(df)
Explanation: Each dictionary becomes one row, and the union of the dictionary keys becomes the column labels; if a row were missing a key, that cell would be filled with NaN.
Output:
Name Age
0 Aman 24
1 Bhoomi 27
Start your Python learning journey with upGrad’s Learn Basic Python Programming course! Build expertise in Python and Matplotlib through hands-on exercises. Ideal for beginners, plus earn a certification to advance your career upon completion!
Also Read: Inheritance in Python | Python Inheritance [With Example]
You can create a DataFrame from a Pandas Series by passing the series into the DataFrame() constructor. The Series’ name attribute becomes the column label in the resulting DataFrame.
Code Example:
import pandas as pd
# Create a DataFrame from a Series
series = pd.Series([1, 2, 3], name='Numbers')
df = pd.DataFrame(series)
print(df)
Explanation: The Series values form a single column, and the Series’ name attribute ('Numbers') becomes that column’s label.
Output:
Numbers
0 1
1 2
2 3
In fields like ML, AI, and data analytics, DataFrames in Python play a crucial role in structuring data for model training and generating insights. With Python libraries like Pandas, data manipulation becomes efficient, simplifying tasks such as cleaning, transforming, and visualizing data.
Are you a full-stack developer wanting to integrate AI into your Python Coding? upGrad’s AI-Driven Full-Stack Development can help you. You’ll learn how to build AI-powered software using OpenAI, GitHub Copilot, Bolt AI & more.
Also Read: Top 36+ Python Projects for Beginners and Students to Explore in 2025
Let's explore the key ways to inspect and access data in Pandas, enabling efficient exploration and extraction of insights from your DataFrames in Python.
DataFrame basics involve understanding how to view and access data within a Pandas DataFrame. This includes viewing the structure, selecting specific rows and columns, and filtering data to extract meaningful insights.
Below are the key techniques for viewing and accessing data in DataFrames:
1. Viewing Data
Once you load a DataFrame, inspecting the data is the first step. You can use several methods to view and understand the structure of the DataFrame.
(a) head(): Displays the first 5 rows by default, allowing you to get a quick look at the top of the data. You can specify a number to display a custom number of rows.
Code Example:
import pandas as pd
data = {'Name': ['Aman', 'Bhoomi', 'Chetan'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
# Display the first five rows (default behavior)
print(df.head())
Explanation: df.head() displays the first five rows of the DataFrame. You can specify the number of rows to display by passing an argument (e.g., df.head(2) to show the first two rows).
Output: This will show the first five rows of the DataFrame. If there are fewer than five rows, all of them will be displayed.
Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
(b) tail(): Displays the last five rows of the DataFrame, offering a look at the bottom of the data.
Code Example:
print(df.tail()) # Displays the last 5 rows
Explanation: df.tail() returns the last five rows. This is useful for inspecting the bottom of your dataset.
Output: Since the DataFrame has only three rows, the output will display all of them.
Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
(c) info(): Provides summary information about the DataFrame, including the number of non-null entries and the data type of each column.
Code Example:
df.info()
Explanation: The df.info() function provides a concise summary of the DataFrame, including the number of non-null entries and the data types of each column.
Output: This method helps you understand the structure of the DataFrame, including the number of missing values and the type of data in each column.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Age 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 143.0+ bytes
(d) describe(): Generates descriptive statistics for numerical columns, such as the mean, standard deviation, minimum, and maximum values.
Code Example:
print(df.describe())
Explanation: df.describe() returns statistical details like mean, standard deviation, and min/max values for numeric columns.
Output: It provides a quick summary of numeric data, including the count of non-null entries, the mean value, and the standard deviation.
             Age
count   3.000000
mean   24.333333
std     2.516611
min    22.000000
25%    23.000000
50%    24.000000
75%    25.500000
max    27.000000
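By default, describe() covers only numeric columns. Passing include='all' also summarizes object columns with count, unique, top, and freq statistics; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Chetan'], 'Age': [24, 27, 22]})

# Summarize every column, not just the numeric ones
summary = df.describe(include='all')
print(summary)
# 'Name' gets unique/top/freq stats; 'Age' gets mean/std/quartiles
```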
2. Accessing Data
You can access data from a DataFrame using the column name or by index, making it easy to retrieve specific parts of the dataset for further analysis.
(a) By Column Name: You can access a column in a DataFrame directly by specifying its name in square brackets. This allows you to isolate specific features of the data.
Code Example:
print(df['Name']) # Access the 'Name' column
Explanation: Accessing a DataFrame column by its name returns a Series with data from that column. This is useful when you need to work with specific variables in your dataset.
Output: This will return the data from the Name column.
0 Aman
1 Bhoomi
2 Chetan
Name: Name, dtype: object
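Passing a list of column names instead of a single name returns a DataFrame containing just those columns; a small sketch (the City column is sample data added for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Chetan'],
                   'Age': [24, 27, 22],
                   'City': ['Delhi', 'Mumbai', 'Pune']})

# A single name returns a Series; a list of names returns a DataFrame
subset = df[['Name', 'Age']]
print(type(df['Name']).__name__)  # Series
print(type(subset).__name__)      # DataFrame
```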
(b) By Row Index (Using iloc and loc): You can access rows by their index position using iloc (integer location-based indexing) or loc (label-based indexing). This helps to isolate specific rows based on their position or label.
Code Example:
# Using iloc to access a specific row by index
print(df.iloc[0]) # Accesses the first row (index 0)
Explanation: iloc[0] accesses the first row (index 0) of the DataFrame, which is ideal when you need to retrieve data by row index.
Output: This function returns all data from the first row in a Series format.
Name Aman
Age 24
Name: 0, dtype: object
Code Example:
print(df.loc[0]) # Accesses the row with index label 0
Explanation: loc[0] accesses the row with the index label 0, which is the default integer index.
Output: Similar to iloc, but loc can also be used with custom index labels.
Name Aman
Age 24
Name: 0, dtype: object
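To see where loc and iloc actually diverge, give the DataFrame custom index labels; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Chetan'],
                   'Age': [24, 27, 22]},
                  index=['a', 'b', 'c'])

print(df.loc['b'])   # label-based: the row labeled 'b' (Bhoomi)
print(df.iloc[1])    # position-based: the second row (also Bhoomi here)
```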
3. Filtering Data
Pandas allows you to filter data based on specific conditions, enabling you to isolate rows that meet particular criteria. This functionality is essential for analyzing subsets of your data.
Code Example:
# Filter rows where the age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
Explanation: The expression df['Age'] > 30 produces a boolean Series, and indexing the DataFrame with it keeps only the rows where the condition is True.
Output: Since none of the rows in this DataFrame have an age greater than 30, the output will be an empty DataFrame.
Empty DataFrame
Columns: [Name, Age]
Index: []
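Conditions can also be combined with & (and), | (or), and ~ (not), with each condition wrapped in parentheses; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Chetan'],
                   'Age': [24, 27, 22]})

# Rows where Age falls between 23 and 26 (inclusive)
mask = (df['Age'] >= 23) & (df['Age'] <= 26)
print(df[mask])                     # only Aman (24) matches
print(df[~mask]['Name'].tolist())   # the rows outside the range
```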
The basic DataFrame operations are essential for efficient data exploration and analysis. Gaining proficiency in these techniques allows you to interact with large datasets and make informed decisions based on the data.
Ready to advance your Python skills? Gain expertise in Linux, Python foundation, AWS, Azure, and Google Cloud to create scalable solutions with upGrad’s Expert Cloud Engineer Bootcamp. Start building your job-ready portfolio today!
Also Read: Mastering Python Variables: Complete Guide with Examples
Let's explore how Pandas makes it easy to manipulate and merge data using a variety of built-in DataFrame operations.
DataFrame operations in Pandas enable powerful data manipulation, including data transformation, merging, and aggregation. These operations help to clean, reshape, and combine datasets for deeper analysis and more insightful results.
Below are some common DataFrame operations that help efficiently manipulate and merge datasets for better analysis.
1. Adding and Dropping Columns
Adding or dropping columns is a common operation for modifying the structure of a DataFrame. This is useful when you need to either expand your dataset with new data or remove unnecessary columns for focused analysis.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', 'Chetan'],
'Age': [24, 27, 22]
})
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
print(df)
# Dropping the 'Salary' column
df = df.drop('Salary', axis=1)
print(df)
Explanation: Assigning a list to df['Salary'] appends it as a new column, while drop('Salary', axis=1) removes it; axis=1 tells Pandas to drop a column rather than a row.
Output:
Name Age Salary
0 Aman 24 50000
1 Bhoomi 27 60000
2 Chetan 22 70000
Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
2. Renaming Columns
Renaming columns allows you to update column names to be more descriptive or standardized. This operation is crucial when you are cleaning data or preparing it for analysis.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', 'Chetan'],
'Age': [24, 27, 22]
})
# Renaming the column 'Name' to 'Full Name'
df = df.rename(columns={'Name': 'Full Name'})
print(df)
Explanation: The rename() function is used to change column names by passing a dictionary where the keys are the old names and the values are the new names.
Output: The Name column is renamed to Full Name, and the updated DataFrame is printed.
Full Name Age
0 Aman 24
1 Bhoomi 27
2 Chetan 22
3. Sorting Data
Sorting data helps organize the DataFrame based on specific criteria. Sorting by one or more columns enables easier data analysis, particularly when searching for trends or arranging data in a logical order.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', 'Chetan'],
'Age': [24, 27, 22]
})
# Sorting the DataFrame by the 'Age' column in descending order
df = df.sort_values(by='Age', ascending=False)
print(df)
Explanation: The sort_values() function sorts the DataFrame by a specified column, with the ascending=False parameter used to sort in descending order.
Output: The rows are sorted by the Age column in descending order.
Name Age
1 Bhoomi 27
0 Aman 24
2 Chetan 22
4. Handling Missing Data
Handling missing data is essential to ensure data integrity and avoid errors during analysis. Pandas provides multiple methods, such as fillna() and dropna(), to manage missing values effectively.
Code Example:
import pandas as pd
# Create a DataFrame with missing values
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', None],
'Age': [24, None, 22]
})
# Identify missing values
print(df.isnull())
# Fill missing values in the 'Age' column with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
# Drop rows with any missing values
df = df.dropna()
print(df)
Explanation: isnull() flags each missing cell as True. fillna(df['Age'].mean()) replaces the missing age with the mean of the available ages ((24 + 22) / 2 = 23.0). dropna() then removes any row that still contains a missing value, which here is the row with the missing name.
Output:
    Name    Age
0  False  False
1  False   True
2   True  False
     Name   Age
0    Aman  24.0
1  Bhoomi  23.0
2    None  22.0
     Name   Age
0    Aman  24.0
1  Bhoomi  23.0
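dropna() also accepts a subset parameter when you only want certain columns to trigger a drop; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', None],
                   'Age': [24, None, 22]})

# Drop a row only when 'Name' is missing; a missing Age is tolerated
kept = df.dropna(subset=['Name'])
print(kept)
```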
5. Grouping Data
Grouping data is valuable when you want to aggregate information based on a specific feature. The groupby() method allows you to group data by a column and apply aggregation functions such as mean(), sum(), or count().
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'City': ['New York', 'London', 'New York', 'London'],
'Age': [24, 27, 22, 30]
})
# Group by 'City' and calculate the mean of 'Age'
grouped_df = df.groupby('City').mean()
print(grouped_df)
Explanation: The groupby() function groups the data by the City column, and the mean() function calculates the average value of the Age column within each group.
Output: The DataFrame is grouped by City, and the mean age for each city is calculated.
Age
City
London 28.500000
New York 23.000000
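If you only need one column, selecting it before aggregating returns a Series instead of a DataFrame; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'New York', 'London'],
                   'Age': [24, 27, 22, 30]})

# Select the 'Age' column first, then aggregate per city
mean_age = df.groupby('City')['Age'].mean()
print(mean_age)            # a Series indexed by City
print(mean_age['London'])  # 28.5
```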
6. Merging DataFrames
Merging DataFrames is a common operation when you need to combine two datasets based on a common column or index. This operation is similar to SQL joins and is used to combine related data from different sources.
Code Example:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Aman', 'Bhoomi', 'Chetan']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Salary': [50000, 60000, 70000]
})
# Merge the DataFrames on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
Explanation: pd.merge() combines df1 and df2 on the shared ID column. With how='inner', only IDs present in both DataFrames survive, so ID 3 (only in df1) and ID 4 (only in df2) are dropped.
Output: The DataFrames are merged on the ID column, and the result contains only rows that have matching ID values in both DataFrames.
ID Name Salary
0 1 Aman 50000
1 2 Bhoomi 60000
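Changing the how parameter mirrors the other SQL join types. A left join, for instance, keeps every row of the left DataFrame and fills unmatched rows with NaN; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Aman', 'Bhoomi', 'Chetan']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})

# Keep all rows of df1; Chetan (ID 3) has no match, so his Salary is NaN
left = pd.merge(df1, df2, on='ID', how='left')
print(left)
```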
These operations are essential for performing common data manipulations and merging tasks in Pandas. By using these techniques, you can easily clean, transform, and combine data to prepare it for analysis and interpretation.
Take the next step in your career with Python and Data Science! Enroll in upGrad's Professional Certificate Program in Data Science and AI. Gain expertise in Python, Excel, SQL, GitHub, and Power BI through 110+ hours of live sessions!
Also Read: Top 70 Python Interview Questions & Answers: Ultimate Guide 2025
Let's now explore advanced techniques and see how they can help streamline data processing and reveal more meaningful insights from complex datasets.
Advanced DataFrame techniques in Python enable complex data manipulation, transformation, and analysis, allowing you to handle large datasets and optimize performance. These methods are vital for effectively managing and analyzing large-scale data.
Below are some methods for tackling intricate data tasks and gaining deeper insights:
1. Pivot Tables
Pivot tables are used to summarize and aggregate data based on specific criteria. This is particularly helpful when you want to group data by one or more columns and calculate aggregated values, such as sum, mean, or count.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'City': ['New York', 'London', 'New York', 'London'],
'Age': [24, 27, 22, 30]
})
# Creating a pivot table to calculate the mean of 'Age' by 'City'
pivot_table = df.pivot_table(values='Age', index='City', aggfunc='mean')
print(pivot_table)
Explanation: pivot_table() groups the rows by the index column ('City') and applies the aggregation function ('mean') to the values column ('Age').
Output: The pivot table groups the data by 'City' and calculates the average age for each city.
Age
City
London 28.500000
New York 23.000000
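pivot_table() can also apply several aggregation functions at once by passing a list to aggfunc, which produces one column per function; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'New York', 'London'],
                   'Age': [24, 27, 22, 30]})

# Mean and max of Age per city, as a MultiIndex column layout
pt = df.pivot_table(values='Age', index='City', aggfunc=['mean', 'max'])
print(pt)
```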
2. Reshaping DataFrames
Reshaping data with functions like melt() and pivot() is used when you need to transform the data between wide and long formats. These functions help make data easier to work with when applying aggregation or analysis.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'City': ['New York', 'London', 'New York', 'London'],
'Age': [24, 27, 22, 30],
'Salary': [50000, 60000, 70000, 80000]
})
# Reshaping data using melt
melted_df = df.melt(id_vars=['City'], value_vars=['Age', 'Salary'])
print(melted_df)
Explanation: melt() keeps the id_vars column ('City') fixed and unpivots the value_vars columns ('Age', 'Salary') into two new columns: variable (the original column name) and value (the corresponding value).
Output: The melt() function converts the 'Age' and 'Salary' columns into a single column of values, with each row now representing a different combination of 'City' and the corresponding values.
City variable value
0 New York Age 24
1 London Age 27
2 New York Age 22
3 London Age 30
4 New York Salary 50000
5 London Salary 60000
6 New York Salary 70000
7 London Salary 80000
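pivot() is the rough inverse of melt(): it spreads a variable column back out into separate columns. Note that pivot() requires unique index/column pairs, so it would raise an error on the melted frame above (each city appears twice per variable); a sketch on data with unique keys instead:

```python
import pandas as pd

# Long-format data with a unique (Name, variable) pair per row
long_df = pd.DataFrame({
    'Name': ['Aman', 'Aman', 'Bhoomi', 'Bhoomi'],
    'variable': ['Age', 'Salary', 'Age', 'Salary'],
    'value': [24, 50000, 27, 60000]
})

# Spread 'variable' back into columns
wide = long_df.pivot(index='Name', columns='variable', values='value')
print(wide)
```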
3. Handling Duplicates
In many datasets, you may encounter duplicate rows. Removing duplicates is essential to ensure the quality of the data before performing any analysis.
Code Example:
import pandas as pd
# Create a DataFrame with duplicate rows
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', 'Aman', 'Chetan'],
'Age': [24, 27, 24, 22]
})
# Drop duplicates based on all columns
df_unique = df.drop_duplicates()
print(df_unique)
Explanation: drop_duplicates() removes rows that are identical across all columns, keeping the first occurrence by default; here, the second ('Aman', 24) row is dropped.
Output: The DataFrame now contains unique rows, removing the duplicate entry for 'Aman'.
Name Age
0 Aman 24
1 Bhoomi 27
3 Chetan 22
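drop_duplicates() also takes a subset parameter (which columns to compare) and a keep parameter ('first', 'last', or False to drop every duplicate); a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Aman', 'Chetan'],
                   'Age': [24, 27, 25, 22]})

# Compare only 'Name', and keep the last occurrence of each duplicate
deduped = df.drop_duplicates(subset=['Name'], keep='last')
print(deduped)
```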
4. Applying Functions to DataFrames
You can apply custom functions to columns or rows in a DataFrame using the apply() method. This is particularly useful when you need to perform more complex operations or transformations on your data.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Aman', 'Bhoomi', 'Chetan'],
'Age': [24, 27, 22]
})
# Apply a function to increase age by 5 years
df['Age'] = df['Age'].apply(lambda x: x + 5)
print(df)
Explanation: The apply() function is used to apply a lambda function to the Age column, increasing each value by 5.
Output: The Age values are updated by adding 5 to each value.
Name Age
0 Aman 29
1 Bhoomi 32
2 Chetan 27
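apply() also works row-wise with axis=1, which is handy when a computation needs several columns at once; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aman', 'Bhoomi', 'Chetan'],
                   'Age': [24, 27, 22]})

# Build a label from two columns for each row
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)
print(df['Label'].tolist())  # ['Aman (24)', 'Bhoomi (27)', 'Chetan (22)']
```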
5. DataFrame Aggregation with Multiple Functions
You can aggregate data using multiple functions simultaneously. This is helpful when you need to compute several statistics on your data, such as the mean, sum, and count, at once.
Code Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'City': ['New York', 'London', 'New York', 'London'],
'Age': [24, 27, 22, 30],
'Salary': [50000, 60000, 70000, 80000]
})
# Aggregate data using multiple functions
agg_df = df.groupby('City').agg({
'Age': ['mean', 'max', 'min'],
'Salary': ['sum', 'mean']
})
print(agg_df)
Explanation: groupby('City').agg() applies several functions per column in a single pass: mean, max, and min for Age, and sum and mean for Salary, producing a MultiIndex column layout in the result.
Output: The output displays the aggregated statistics for Age (mean, max, min) and Salary (sum, mean) for each city.
Age Salary
mean max min sum mean
City
London 28.500000 30 27 140000 70000.0
New York 23.000000 24 22 120000 60000.0
These advanced techniques allow you to manipulate and reshape data for more detailed analysis and reporting. Pivot tables provide powerful aggregation, while reshaping methods, such as melt(), allow for easier handling of long-format data.
Also Read: Python Cheat Sheet: From Fundamentals to Advanced Concepts for 2025
A DataFrame in Python is a two-dimensional structure that organizes data into rows and columns. It’s widely used for managing and analyzing datasets, providing a simple and effective way to manipulate data. Yet, many individuals struggle with efficiently handling complex or large datasets due to the challenges of data cleaning and processing.
To address these challenges, upGrad offers programs designed to improve your proficiency in Python and data manipulation. These programs equip you with the tools needed to work confidently with data and refine your technical expertise.
Here are some additional upGrad courses to help enhance your coding skills:
Curious about which Python software development course best fits your goals in 2025? Contact upGrad for personalized counseling and valuable insights, or visit your nearest upGrad offline center for more details.
Reference:
https://content.techgig.com/technology/python-dominates-2025-programming-landscape-with-unprecedented-popularity/articleshow/121134781.cms