SQL for Data Science: Functions, Queries, and Best Practices
By Rohit Sharma
Updated on Oct 28, 2025 | 21 min read | 8.15K+ views
SQL for data science is the foundation of modern analytics. It helps you extract, clean, and transform data stored in relational databases. With SQL, you can query millions of rows, join multiple tables, and generate insights that power machine learning models and business dashboards. Every data scientist relies on SQL to turn raw data into meaningful information.
In this guide, you'll learn SQL basics for data science, work through practical query examples, and review the most-used SQL functions. You'll also explore query optimization tips, best practices for efficient workflows, and real-world use cases that show how SQL drives data-driven decisions.
SQL (Structured Query Language) is the standard language for communicating with relational databases. These databases store data in a structured format, organized into tables, columns, and rows. Think of it as a collection of highly organized spreadsheets, all linked together.
For a data scientist, SQL is indispensable. While Python and R are used for modeling, SQL is used for the critical first steps of the workflow.
SQL lets you:
- Pull exactly the rows and columns you need from large tables
- Filter, sort, join, and aggregate millions of records efficiently
- Summarize raw records into metrics and reports

You'll use SQL in data science for:
- Exploratory data analysis and data profiling
- Data cleaning and transformation
- Feature engineering for machine learning models
- Building datasets for dashboards and reporting
Mastering SQL for data science means you can work independently, without having to rely on an engineer to provide you with a CSV file for every new question you want to ask.
Also Read: Data Science for Beginners: Prerequisites, Learning Path, Career Opportunities and More
A common question is how SQL fits in with other data science tools. The key is to understand they serve different, complementary purposes. It’s not about SQL or Python; it's about SQL then Python.
Also Read: Python Pandas Tutorial: Everything Beginners Need to Know about Python Pandas
| Tool | Primary Purpose |
| SQL | Query and manipulate large-scale structured data |
| Python (Pandas) | In-depth analysis and machine learning modeling |
| Excel | Small-scale manual exploration and reporting |
| R | Statistical modeling and academic analysis |
The SQL language is broadly divided into three main sub-languages. While data scientists primarily use DML, understanding all three is beneficial.
DDL (Data Definition Language): Defines and manages the database structure (the "blueprint").
DML (Data Manipulation Language): Works with the data inside the tables (the "furniture and people"). This is the core of SQL for data science.
Also Read: Difference Between DDL and DML
DCL (Data Control Language): Manages permissions and access (the "security guard").
Example:
This DDL command creates the blueprint for a table to hold sales data.
SQL
CREATE TABLE sales_data (
order_id INT,
product_name VARCHAR(50),
sales_amount FLOAT,
region VARCHAR(30)
);
Output:
Table 'sales_data' created successfully.
To practice SQL for data science, you need a database and a tool to talk to it.
Databases: PostgreSQL, MySQL, and SQLite are popular free options for practice; in industry you'll also meet cloud warehouses such as Google BigQuery, Snowflake, and Amazon Redshift.
Tools (Clients): DBeaver, pgAdmin (for PostgreSQL), MySQL Workbench, or an editor such as VS Code with a database extension all let you write and run queries.
Workflow Tip: When working with large production databases, always start small. Use the LIMIT clause to query a subset of data (e.g., LIMIT 100) to check your query's logic and syntax before running it on the full, multi-billion-row dataset. This is a crucial best practice for any SQL for data science practitioner.
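For example, a quick logic check on the illustrative orders table used throughout this guide might look like this:
SQL
SELECT order_id, customer_id, total_sales
FROM orders
WHERE region = 'North'
LIMIT 100; -- validate the logic on a small sample first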
This section covers the core DML commands you will use daily. These queries are the heart of SQL for data science.
The SELECT, FROM, and WHERE clauses are the fundamental building blocks of every SQL query.
SQL
SELECT column1, column2
FROM table_name
WHERE condition;
Example:
Get the name, region, and sales amount for all orders over $1000.
SQL
SELECT customer_name, region, total_sales
FROM orders
WHERE total_sales > 1000;
Output:
| customer_name | region | total_sales |
| John Smith | North | 1200 |
| Priya Patel | South | 1500 |
| Kenji Tanaka | West | 2100 |
Key Points:
- SELECT picks the columns you want; SELECT * returns every column (use it sparingly on large tables)
- FROM names the table to query
- WHERE filters rows before any results are returned
To enhance your filtering, you can use other operators in the WHERE clause:
- AND / OR / NOT to combine conditions
- IN to match against a list of values
- BETWEEN to match a range
- LIKE for pattern matching on text
- IS NULL / IS NOT NULL to test for missing values
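For instance, here is a short sketch combining several of these operators on the same illustrative orders table:
SQL
SELECT customer_name, region, total_sales
FROM orders
WHERE region IN ('North', 'South') -- match a list of values
  AND total_sales BETWEEN 500 AND 2000 -- match a numeric range
  AND customer_name LIKE 'P%'; -- names starting with 'P'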
You will rarely want to see just raw data. More often, you'll want to summarize it. This is where aggregations come in.
Example:
Get the total revenue for each region, but only show regions with more than $50,000 in revenue, and sort them from highest to lowest.
SQL
SELECT region, SUM(total_sales) AS total_revenue
FROM orders
GROUP BY region
HAVING SUM(total_sales) > 50000
ORDER BY total_revenue DESC;
Output:
| Region | Total Revenue |
| North | 230000 |
| South | 175000 |
| West | 95000 |
Use Cases:
- Total revenue per region or product line
- Average order value per customer segment
- Counting orders or active users per day
This form of aggregation is fundamental to SQL for data science, as it turns raw transactional data into meaningful KPIs.
Also Read: Mastering SQL Aggregate Functions: A Comprehensive Guide
No dataset is an island. Data is almost always split across multiple tables to reduce redundancy (a concept called normalization). A JOIN clause is used to combine rows from two or more tables based on a related column between them. This is a critical skill for any SQL for data science task.
Types of joins:
- INNER JOIN: returns only rows with a match in both tables
- LEFT JOIN: returns all rows from the left table, plus matches from the right (NULLs where there is no match)
- RIGHT JOIN: the mirror image of LEFT JOIN
- FULL OUTER JOIN: returns all rows from both tables, matched where possible
Example:
You have a customers table and an orders table. You want to see the name of the customer who placed each order.
SQL
SELECT c.customer_name, o.order_id, o.total_sales
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
Output:
| customer_name | order_id | total_sales |
| Shubam | 1001 | 1200 |
| Priya Patel | 1002 | 1500 |
| Sonal Singh | 1003 | 2100 |
| Ashish | 1004 | 800 |
(Here, c and o are aliases, or short nicknames, for the table names to make the query more readable.)
If you wanted to see all customers, even those who haven't placed an order yet, you would use a LEFT JOIN instead.
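A minimal sketch of that LEFT JOIN, reusing the same tables and aliases:
SQL
SELECT c.customer_name, o.order_id, o.total_sales
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
Customers with no orders still appear in the result, with NULL in the order_id and total_sales columns.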
A sub-query is a query nested inside another query. They are often used when one part of your question depends on the answer to another part.
Example:
Find the names of all customers who have placed an order over $5,000.
SQL
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE total_sales > 5000
);
Output:
| customer_name |
| Mega Corp |
| Global Industries |
| Alpha Tech |
The "inner query" runs first, creating a temporary list of customer IDs. The "outer query" then retrieves the names for that list.
Set Operations:
These operators combine the results of two SELECT statements with compatible columns:
- UNION: stacks the results and removes duplicate rows
- UNION ALL: stacks the results and keeps duplicates
- INTERSECT: returns only rows present in both results
- EXCEPT (MINUS in Oracle): returns rows in the first result but not the second
Example:
Get a single list of all IDs that are either customers or leads.
SQL
SELECT customer_id FROM customers
UNION
SELECT customer_id FROM leads;
Output:
| customer_id |
| C1001 |
| C1002 |
| L2003 |
| C1003 |
| L2004 |
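Note that UNION removes duplicate rows. If the same ID could appear in both tables and you want to keep every occurrence, use UNION ALL instead, which is also typically faster because it skips the deduplication step:
SQL
SELECT customer_id FROM customers
UNION ALL
SELECT customer_id FROM leads;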
Also Read: Self Join SQL: A Comprehensive Overview
This is one of the most powerful concepts in modern SQL for data science. A regular GROUP BY query collapses rows. A window function performs a calculation across a set of rows (a "window") but preserves all the original rows.
They are defined by the OVER() clause.
Example:
Calculate the cumulative sales (running total) for each customer over time.
SQL
SELECT
customer_id,
order_date,
total_sales,
SUM(total_sales) OVER (PARTITION BY customer_id ORDER BY order_date) AS cumulative_sales
FROM orders;
Output:
| customer_id | order_date | total_sales | cumulative_sales |
| C1001 | 2025-01-15 | 500 | 500 |
| C1001 | 2025-02-10 | 800 | 1300 |
| C1001 | 2025-03-05 | 300 | 1600 |
| C1002 | 2025-01-20 | 1000 | 1000 |
| C1002 | 2025-03-10 | 1500 | 2500 |
Applications:
- Running totals and cumulative metrics
- Moving (rolling) averages
- Ranking rows within groups
- Comparing each row to the previous or next row
These advanced queries are what separate basic SQL use from true SQL for data science analysis.
Also Read: SQL Commands - A Comprehensive Guide
SQL comes with a rich library of functions to perform calculations and manipulate data. Using them effectively is key to cleaning data and engineering features.
These are your primary tools for data wrangling.
Numeric Functions: ROUND(), ABS(), CEIL(), and FLOOR() for rounding and sign handling.
String Functions: UPPER(), LOWER(), TRIM(), CONCAT(), SUBSTRING(), and REPLACE() for cleaning and combining text.
Date Functions: YEAR(), MONTH(), DATEDIFF(), DATEADD() (or DATE_ADD(), depending on the dialect), and CURRENT_DATE for extracting and shifting dates.
Example:
Get the full name of users, the year they placed an order, and their total sales for that year.
SQL
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
YEAR(order_date) AS order_year,
SUM(total_sales) AS yearly_sales
FROM orders
GROUP BY full_name, order_year;
Output:
| full_name | order_year | yearly_sales |
| Ashish | 2024 | 12000 |
| Ashish | 2025 | 8500 |
| Priya Patel | 2024 | 9000 |
| Priya Patel | 2025 | 11500 |
| Function | Purpose | Example |
| COUNT() | Count rows | COUNT(*) |
| AVG() | Average value | AVG(sales_amount) |
| CONCAT() | Join text | CONCAT(first_name, last_name) |
| DATEADD() | Add interval | DATEADD(DAY, 7, order_date) |
| SUBSTRING() | Extract text | SUBSTRING(product_code, 1, 3) |
Also Read: SQL HAVING with Examples
Data is rarely clean. You'll need to handle NULL values and apply conditional logic. This is a common task in SQL for data science.
CASE Expression:
The CASE statement is SQL's version of an IF-THEN-ELSE block. It lets you create new columns based on conditional logic.
Example:
Categorize sales into 'High', 'Medium', or 'Low' buckets.
SQL
SELECT
product_name,
total_sales,
CASE
WHEN total_sales > 500 THEN 'High'
WHEN total_sales BETWEEN 100 AND 500 THEN 'Medium'
ELSE 'Low'
END AS sales_category
FROM sales;
Output:
| product_name | total_sales | sales_category |
| Laptop | 1200 | High |
| Mouse | 80 | Low |
| Keyboard | 150 | Medium |
| Monitor | 450 | Medium |
| USB Cable | 25 | Low |
NULL Handling:
A NULL value represents missing or unknown data. It is not the same as 0 or an empty string, and it can break calculations (e.g., 10 + NULL is NULL).
| Input | Result After COALESCE(Input, 0) |
| NULL | 0 |
| 25 | 25 |
| 100 | 100 |
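A minimal sketch of COALESCE in a query, assuming a nullable discount column on the illustrative orders table:
SQL
SELECT
  order_id,
  total_sales,
  COALESCE(discount, 0) AS discount_clean -- replace NULL with 0
FROM orders;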
Also Read: Top 27 SQL Projects in 2025 With Source Code: For All Levels
This is where SQL for data science really shines, especially in feature engineering. These functions go beyond simple GROUP BY aggregates.
We already discussed window functions, but let's look at specific analytical functions used with the OVER() clause:
RANK() vs. DENSE_RANK() vs. ROW_NUMBER():
- ROW_NUMBER() assigns a unique sequential number to every row, even ties
- RANK() gives tied rows the same rank and skips the ranks that follow (1, 2, 2, 4)
- DENSE_RANK() gives tied rows the same rank without gaps (1, 2, 2, 3)
LAG() and LEAD():
- LAG() pulls a value from a previous row in the window, and LEAD() from a following row, which makes them ideal for period-over-period comparisons
Example:
Rank employees by salary within their department.
SQL
SELECT
employee_name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;
Output:
| employee_name | department | salary | rank |
| Rahul | Engineering | 120000 | 1 |
| Sumit | Engineering | 110000 | 2 |
| Ankur | Sales | 100000 | 1 |
| Sandeep | Engineering | 95000 | 3 |
| Gaurabh | Sales | 95000 | 2 |
Used For:
- Finding the top-N rows per group (e.g., the top 3 earners per department)
- De-duplicating data with ROW_NUMBER()
- Period-over-period comparisons with LAG() and LEAD(), as in the sketch below
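Here is a short sketch of LAG() for a day-over-day comparison, assuming a hypothetical daily_sales table with sales_date and sales columns:
SQL
SELECT
  sales_date,
  sales,
  LAG(sales, 1) OVER (ORDER BY sales_date) AS previous_day_sales,
  sales - LAG(sales, 1) OVER (ORDER BY sales_date) AS day_over_day_change
FROM daily_sales;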
Also Read: Top 25+ SQL Projects on GitHub You Should Explore in 2025
When you find yourself writing the same complex logic repeatedly, you can automate it.
Why Use Them?
- Reusability: write the logic once, call it anywhere
- Consistency: everyone applies the same business rule
- Readability: queries stay short and self-documenting
Example (Syntax varies):
Create a function to calculate a standard 10% bonus.
SQL
CREATE FUNCTION calc_bonus(salary FLOAT)
RETURNS FLOAT
AS
BEGIN
RETURN salary * 0.1;
END;
Output:
Function 'calc_bonus' created successfully.
Now you can just SELECT employee_name, calc_bonus(salary) FROM employees;.
Writing a query that works is easy. Writing a query that is good, meaning it's readable, maintainable, and fast, is a hallmark of a senior data professional. These SQL basics for data science extend beyond just syntax.
Your code will be read by your future self and your teammates. Make it easy for them.
Bad Example:
SQL
select name,sum(sales) from sales_data group by name;
Good Example (using formatting):
SQL
-- Calculate total sales for each person
SELECT
name,
SUM(sales) AS total_sales
FROM
sales_data
GROUP BY
name;
In SQL for data science, you'll query tables with billions of rows. A bad query can take hours or cost thousands of dollars in cloud computing fees.
Also Read: Top 25 DBMS Projects [With Source Code] for Students in 2025
Example (Optimized):
This query filters for a specific date range before grouping, which is much more efficient.
SQL
SELECT customer_id, SUM(total_sales)
FROM orders
WHERE order_date >= '2025-01-01' -- Filter early
GROUP BY customer_id
LIMIT 100; -- Limit results
Output:
| customer_id | SUM(total_sales) |
| C1001 | 8800 |
| C1002 | 11500 |
| C1003 | 4200 |
| ... (97 more rows) | ... |
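Most databases can also show you how a query will run before you execute it. In MySQL and PostgreSQL, prefixing a query with EXPLAIN returns the execution plan, which reveals whether your indexes are being used:
SQL
EXPLAIN
SELECT customer_id, SUM(total_sales)
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_id;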
While data scientists don't always design databases, you must understand how they are designed to query them effectively.
| Model Type | Description | Use Case |
| Normalized | Data is split into many separate tables to reduce redundancy. | Transactional systems (e.g., an e-commerce checkout). |
| Denormalized | Data is merged into fewer, wider tables with redundancy. | Analytics systems (e.g., a data warehouse). |
| Star Schema | The most common analytics design. It has one central Fact table (containing metrics) connected to multiple Dimension tables (containing context). | Data warehouses, Business Intelligence. |
Example of a Star Schema:
This design is very fast for analytics because your queries are simple: you join the sales_facts table to the dimension tables you need, then aggregate. This structure is a key component of the SQL for data science ecosystem.
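A typical star-schema query might look like the following sketch, assuming the sales_facts table mentioned above plus two hypothetical dimension tables, dim_products and dim_dates:
SQL
SELECT
  p.category,
  d.year,
  SUM(f.sales_amount) AS total_sales
FROM sales_facts f
INNER JOIN dim_products p ON f.product_id = p.product_id
INNER JOIN dim_dates d ON f.date_id = d.date_id
GROUP BY p.category, d.year;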
Also Read: 30 Data Science Project Ideas for Beginners in 2025
The standard SQL for data science workflow involves pulling data from SQL and moving it into a programming language like Python or R for modeling.
You use a library like sqlalchemy (Python) or RPostgreSQL (R) to create a connection to the database. Then, you execute your SQL query and load the results directly into a data structure like a Pandas DataFrame.
Python Example:
Python
import pandas as pd
import sqlalchemy
# Create a connection engine to the database
# (Replace with your database credentials)
engine = sqlalchemy.create_engine("mysql+pymysql://user:password@localhost/my_database")
# Write your SQL query
sql_query = """
SELECT * FROM sales_data
WHERE region = 'North';
"""
# Execute the query and load results into a Pandas DataFrame
df = pd.read_sql(sql_query, engine)
# Now you can use df for modeling, analysis, or visualization
print(df.head())
Output (from the print(df.head()) command):
order_id product_name sales_amount region
0 1001 Laptop 1200.0 North
1 1004 Mouse 80.0 North
2 1005 Keyboard 150.0 North
3 1008 Monitor 450.0 North
4 1010 USB Cable 25.0 North
Treat your SQL code like any other code: keep queries in version control (e.g., Git), review changes, and document complex logic with comments.
Also Read: How to Use GitHub: A Beginner's Guide to Getting Started and Exploring Its Benefits in 2025
Data scientists often have access to sensitive Personally Identifiable Information (PII). It's your responsibility to handle it securely.
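One common pattern is to expose a view that hides the sensitive columns and grant analysts access to the view rather than the raw table. A sketch, assuming a customers table with PII columns and an existing analyst_role (exact syntax varies by database):
SQL
-- The view exposes only non-sensitive columns
CREATE VIEW customers_safe AS
SELECT customer_id, region, signup_date
FROM customers;

-- Analysts query the view, never the raw table
GRANT SELECT ON customers_safe TO analyst_role;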
Here is how these concepts come together to solve real business problems.
Marketing Analytics – Customer Segmentation
Goal: Identify high-value customers for a targeted marketing campaign. A common method is RFM (Recency, Frequency, Monetary value).
Query:
This query groups customers, finds their average and total orders, and filters for those who are high-spenders.
SQL
SELECT
customer_id,
AVG(order_value) AS avg_spent,
COUNT(order_id) AS total_orders
FROM orders
GROUP BY customer_id
HAVING AVG(order_value) > 500;
Output:
| customer_id | avg_spent | total_orders |
| C1001 | 550.75 | 4 |
| C1003 | 720.50 | 2 |
| C1007 | 610.00 | 5 |
You can use these segments for personalized email offers, targeting high-value customers to retain them. This is a classic SQL for data science application.
Also Read: Top 20+ Data Science Techniques To Learn in 2025
Finance – Anomaly and Fraud Detection
Goal: Detect abnormal transactions that could be fraudulent.
Query:
This query uses a sub-query and statistical functions to find any transaction that is more than 3 standard deviations above the overall average.
SQL
SELECT user_id, transaction_id, amount
FROM transactions
WHERE amount > (
SELECT AVG(amount) + 3 * STDDEV(amount)
FROM transactions
);
Output:
| user_id | transaction_id | amount |
| U458 | T987234 | 5000.00 |
| U123 | T987456 | 7200.00 |
| U992 | T987789 | 5150.00 |
This identifies statistical outliers that can then be flagged for manual review. A more advanced version would calculate the average per user.
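A sketch of that per-user version, using window functions to compute each user's own average and standard deviation (supported in PostgreSQL and MySQL 8+):
SQL
SELECT user_id, transaction_id, amount
FROM (
  SELECT
    user_id,
    transaction_id,
    amount,
    AVG(amount) OVER (PARTITION BY user_id) AS user_avg,
    STDDEV(amount) OVER (PARTITION BY user_id) AS user_stddev
  FROM transactions
) t
WHERE amount > user_avg + 3 * user_stddev;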
Also Read: Data Science in Finance: Career, Tools, and Trends
Operations – Inventory Trend Analysis
Goal: Monitor product stock movement to avoid stockouts.
Query:
This query uses a window function to calculate the rolling 7-day sum of quantity shipped for each product.
SQL
SELECT
product_id,
order_date,
quantity_shipped,
SUM(quantity_shipped) OVER (
PARTITION BY product_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7_day_shipped
FROM inventory;
Output:
| product_id | order_date | quantity_shipped | rolling_7_day_shipped |
| P100 | 2025-10-01 | 10 | 10 |
| P100 | 2025-10-02 | 12 | 22 |
| P100 | 2025-10-03 | 8 | 30 |
| ... | ... | ... | ... |
| P100 | 2025-10-07 | 15 | 70 |
| P100 | 2025-10-08 | 11 | 71 |
This helps smooth out daily noise and identify the true trend in product demand.
Machine Learning – Feature Engineering
Goal: Prepare a "feature set" table that can be exported to Python for model training (e.g., to predict customer churn).
Query:
SQL can create powerful features directly in-database. This query builds a single row for each customer, summarizing their entire history.
SQL
SELECT
customer_id,
COUNT(order_id) AS feature_order_count,
AVG(order_value) AS feature_avg_spent,
MAX(order_date) AS feature_last_order
FROM orders
GROUP BY customer_id;
Output:
| customer_id | feature_order_count | feature_avg_spent | feature_last_order |
| C1001 | 4 | 550.75 | 2025-10-15 |
| C1002 | 1 | 200.00 | 2025-08-01 |
| C1003 | 2 | 720.50 | 2025-10-20 |
This new table (with one row per customer) is now perfectly formatted to be used as the "X" (features) in a machine learning model. This is a core workflow in SQL for data science.
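The summary table below also lists DATEDIFF, which is useful for a recency feature such as days since the last order. Date-difference syntax varies by dialect; here is a MySQL-style sketch:
SQL
SELECT
  customer_id,
  COUNT(order_id) AS feature_order_count,
  AVG(order_value) AS feature_avg_spent,
  DATEDIFF(CURRENT_DATE, MAX(order_date)) AS feature_days_since_last_order
FROM orders
GROUP BY customer_id;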
Also Read: Customer Churn Prediction Project: From Data to Decisions
| Domain | Query Goal | Key Functions Used |
| Marketing | Customer Segmentation (RFM) | AVG, COUNT, MAX, GROUP BY, HAVING |
| Finance | Anomaly/Fraud Detection | AVG, STDDEV, Sub-query |
| Operations | Inventory Trend Analysis | SUM(), OVER(), ROWS BETWEEN (Window Function) |
| ML | Feature Engineering | COUNT, MAX, AVG, DATEDIFF, GROUP BY |
SQL for data science gives you complete control over data exploration, preparation, and analysis. You can query large datasets, apply analytical functions, and integrate SQL with Python or R for advanced modeling. When you follow structured query design, use efficient joins, and apply indexing wisely, your data workflows become faster and cleaner. Mastering SQL basics for data science helps you move from raw data to actionable insights with precision and speed.
Frequently Asked Questions (FAQs)
1. What is SQL for data science, and why is it important?
SQL for data science is the practice of using SQL to extract, manipulate, aggregate, and prepare data from relational databases specifically for data science tasks like analysis, visualization, and machine learning. It's important because most of the world's business data lives in SQL databases, and SQL is the most efficient tool to access it.
2. How is a data scientist's use of SQL different from a software engineer's?
While a software engineer might use SQL to build an application (using INSERT, UPDATE), a data scientist uses SQL primarily for analysis. This means a heavy focus on SELECT statements, complex JOINs, GROUP BY aggregations, and advanced analytical functions like window functions to understand and shape the data. The SQL for data science approach is investigative.
3. What are the core SQL basics for data science?
The core SQL basics for data science are:
- SELECT, FROM, and WHERE for retrieving and filtering data
- GROUP BY with aggregate functions (COUNT, SUM, AVG) for summarizing
- JOINs for combining related tables
- Window functions for rankings, running totals, and comparisons
4. Can you work with data using only SQL, without Python or R?
Yes. SQL is a declarative query language, not a procedural programming language like Python. You declare what data you want, and the database figures out how to get it. Many data analysts and business intelligence professionals use SQL as their primary tool without ever writing Python or R.
5. Why are JOINs so important in SQL for data science?
Databases are designed to be "normalized," meaning data is split into related tables to avoid redundancy. JOINs are the mechanism to put that data back together. You can combine a users table with an orders table to see who bought what, or an orders table with a products table to analyze sales by category. This is essential for SQL for data science.
6. What is the difference between GROUP BY and a window function?
GROUP BY collapses many rows into a single summary row. For example, GROUP BY region would turn 1000 sales rows into just a few rows (one for each region). A window function performs a calculation across many rows but preserves all the original rows, adding the calculation as a new column.
7. When should I use a JOIN versus a sub-query?
Often, they can achieve the same result. A JOIN is generally faster and more readable. Use a JOIN when you need to add columns from another table. Use a sub-query when you need to filter based on a result from another table (e.g., WHERE user_id IN (SELECT ...)).
8. How do I optimize SQL queries for big data?
Follow these steps to optimize SQL queries for big data:
- Filter early with WHERE so the database scans fewer rows
- Select only the columns you need instead of SELECT *
- Use LIMIT while testing query logic on a sample
- Make sure the columns used in joins and filters are indexed
- Aggregate inside the database instead of exporting raw rows
9. What is a Star Schema?
The Star Schema is the standard for analytics and data warehousing. It features a central Fact table (e.g., sales) containing quantitative metrics, surrounded by Dimension tables (e.g., products, users, dates) that provide descriptive context.
10. How do you connect SQL to Python or R?
You use libraries like sqlalchemy or psycopg2 in Python, or RPostgreSQL in R. You establish a "connection" to the database using your credentials, then pass your SQL query as a string to the library. The library executes the query and returns the results, typically as a Pandas DataFrame (in Python) or a data.frame (in R).
11. What are the SQL best practices for data science?
SQL best practices for data science:
- Format and comment queries so teammates (and your future self) can read them
- Filter early and use LIMIT when exploring large tables
- Keep queries in version control like any other code
- Use views and minimal permissions when working with sensitive data
12. How do I handle NULL values in SQL?
Use COALESCE(column_name, default_value) to replace NULLs with a sensible default (like 0 or 'N/A') before performing calculations. Be aware that COUNT(column_name) ignores NULLs, while COUNT(*) does not.
13. Why is my SQL query running slowly?
The most common causes are:
- Missing indexes on the columns used in joins or WHERE filters
- Scanning more data than needed (no filters, SELECT *)
- Joins that unintentionally multiply rows, such as a missing join condition
14. How do I calculate running totals or rolling averages?
Use SUM() OVER (ORDER BY date) for running totals. Use AVG() OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for a 7-day rolling average. Use LAG(sales, 1) OVER (ORDER BY date) to get the previous day's sales to calculate day-over-day growth. This is a very common SQL for data science pattern.
15. How can SQL be used for feature engineering?
SQL is excellent for creating features. You can use CASE statements for binning (e.g., 'Low', 'Medium', 'High'), date functions to extract features like 'day_of_week' or 'is_weekend', and GROUP BY to create aggregate features (e.g., avg_order_value_per_user). This is a core part of SQL for data science.
16. How do I keep sensitive data secure in a database?
Use the GRANT and REVOKE commands (DCL) to give users the minimum permissions they need. For sensitive data, use database views that only show non-sensitive columns, or apply data masking functions.
17. How important are SQL skills in data science interviews?
They are critical. Almost every data science interview includes a technical SQL screen. You will be expected to write JOINs, GROUP BYs, and often a window function on a whiteboard or in a shared editor. They are testing your foundational ability to even get the data.