SQL for Data Science: Functions, Queries, and Best Practices

By Rohit Sharma

Updated on Oct 28, 2025 | 21 min read

SQL for data science is the foundation of modern analytics. It helps you extract, clean, and transform data stored in relational databases. With SQL, you can query millions of rows, join multiple tables, and generate insights that power machine learning models and business dashboards. Every data scientist relies on SQL to turn raw data into meaningful information. 

In this guide, you'll learn SQL basics for data science, work through practical query examples, and review the most-used SQL functions. You'll also explore query optimization tips, best practices for efficient workflows, and real-world use cases that show how SQL drives data-driven decisions. 

Understanding the Foundation: SQL Basics for Data Science 

SQL (Structured Query Language) is the standard language for communicating with relational databases. These databases store data in a structured format, organized into tables, columns, and rows. Think of it as a collection of highly organized spreadsheets, all linked together. 

For a data scientist, SQL is indispensable. While Python and R are used for modeling, SQL is used for the critical first steps of the workflow. 

SQL lets you: 

  • Query massive datasets to retrieve only the specific information you need. 
  • Update and modify data, often used in data cleaning or feature engineering. 
  • Analyze datasets by aggregating, filtering, and sorting information. 
  • Combine data from multiple tables to create a rich, unified view. 
  • Prepare data for downstream modeling or reporting. 

You’ll use SQL in data science for: 

  • Data Exploration: Performing initial hypothesis testing and understanding data distributions before loading data into a model. 
  • Data Extraction: Pulling specific training or test sets from a large production database. 
  • Aggregating Metrics: Calculating key performance indicators (KPIs) like daily active users, monthly revenue, or conversion rates from transactional databases. 
  • Cleaning and Transforming: Handling missing values, standardizing formats, and creating new features (e.g., binning users by age) directly within the database. 
  • Building Reproducible Analytics Pipelines: Writing SQL scripts that can be run repeatedly to refresh dashboards or feed data into models. 

Mastering SQL for data science means you can work independently, without having to rely on an engineer to provide you with a CSV file for every new question you want to ask. 

Also Read: Data Science for Beginners: Prerequisites, Learning Path, Career Opportunities and More 

SQL vs Other Tools 

A common question is how SQL fits in with other data science tools. The key is to understand they serve different, complementary purposes. It’s not about SQL or Python; it's about SQL then Python. 

  • SQL: Best for querying, filtering, joining, and aggregating data inside the database. It is designed to operate on large volumes of structured data efficiently. 
  • Python (with Pandas): Best for in-depth analysis, complex transformations, statistical modeling, and machine learning after data is extracted. Pandas is powerful but operates in-memory, so you use SQL to pull a manageable subset of data first. 
  • Excel: Best for small-scale manual exploration, quick visualizations, and sharing simple reports. It is not suitable for large datasets or reproducible analysis. 
  • R: A powerful language primarily focused on statistical modeling, analysis, and visualization. Like Python, you would typically use SQL for data science to extract the data before analyzing it in R. 

Also Read: Python Pandas Tutorial: Everything Beginners Need to Know about Python Pandas 

Tool  Primary Purpose 
SQL  Query and manipulate large-scale structured data 
Python (Pandas)  In-depth analysis and machine learning modeling 
Excel  Small-scale manual exploration and reporting 
R  Statistical modeling and academic analysis 

Core SQL Components 

The SQL language is broadly divided into three main sub-languages. While data scientists primarily use DML, understanding all three is beneficial. 

DDL (Data Definition Language): Defines and manages the database structure (the "blueprint"). 

  • CREATE TABLE: Makes a new table. 
  • ALTER TABLE: Modifies an existing table (e.g., adds a column). 
  • DROP TABLE: Deletes a table. 

DML (Data Manipulation Language): Works with the data inside the tables (the "furniture and people"). This is the core of SQL for data science; a quick sketch of all four commands follows the list below. 

  • SELECT: Retrieves data from tables. 
  • INSERT: Adds new rows to a table. 
  • UPDATE: Modifies existing rows. 
  • DELETE: Removes rows. 
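
A quick sketch of all four DML statements, using the sales_data table created in the DDL example below (the values are illustrative):

SQL 
-- Add a new row 
INSERT INTO sales_data (order_id, product_name, sales_amount, region) 
VALUES (1001, 'Laptop', 1200.00, 'North'); 

-- Correct an existing row 
UPDATE sales_data SET region = 'North-East' WHERE order_id = 1001; 

-- Remove unwanted rows 
DELETE FROM sales_data WHERE sales_amount = 0; 

-- Read the data back 
SELECT order_id, product_name FROM sales_data WHERE region = 'North-East'; 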

Also Read: Difference Between DDL and DML 

DCL (Data Control Language): Manages permissions and access (the "security guard"). 

  • GRANT: Gives a user permission to perform actions. 
  • REVOKE: Takes away a user's permission. 

Example: 

This DDL command creates the blueprint for a table to hold sales data. 

SQL 
CREATE TABLE sales_data ( 
  order_id INT, 
  product_name VARCHAR(50), 
  sales_amount FLOAT, 
  region VARCHAR(30) 
); 
 

Output: 

Table 'sales_data' created successfully. 
 

Setting Up Your SQL Environment 

To practice SQL for data science, you need a database and a tool to talk to it. 

Databases: 

  • PostgreSQL: A powerful, open-source, and feature-rich database popular in data-heavy applications. 
  • MySQL: The world's most popular open-source database, widely used in web applications. 
  • SQLite: A server-less, file-based database. It's excellent for learning, mobile apps, and small projects. 
  • Cloud Data Warehouses: For large-scale SQL for data science, you'll encounter cloud platforms like Google BigQuery, Amazon Redshift, or Snowflake, which are designed to query petabytes of data. 

Tools (Clients): 

  • DBeaver: A free, universal database tool that can connect to almost any database. 
  • MySQL Workbench: The official tool for MySQL. 
  • Jupyter Notebooks: Data scientists' favorite tool. You can run SQL queries directly in a notebook using libraries like sqlalchemy or "magic commands" like %sql. 

Workflow Tip: When working with large production databases, always start small. Use the LIMIT clause to query a subset of data (e.g., LIMIT 100) to check your query's logic and syntax before running it on the full, multi-billion-row dataset. This is a crucial best practice for any SQL for data science practitioner. 


Essential SQL Queries for Data Science 

This section covers the core DML commands you will use daily. These queries are the heart of SQL for data science. 

SELECT, FROM, WHERE – Retrieving Data 

The SELECT, FROM, and WHERE clauses are the fundamental building blocks of every SQL query. 

  • SELECT chooses the columns you want to see. 
  • FROM defines the table you are pulling data from. 
  • WHERE filters the rows based on a specific condition. 

SQL 
SELECT column1, column2 
FROM table_name 
WHERE condition; 
 

Example: 

Get the name, region, and sales amount for all orders over $1000. 

SQL 
SELECT customer_name, region, total_sales 
FROM orders 
WHERE total_sales > 1000; 
 

Output: 

customer_name  region  total_sales 
John Smith  North  1200 
Priya Patel  South  1500 
Kenji Tanaka  West  2100 

Key Points: 

  • SELECT chooses columns. 
  • FROM defines the table. 
  • WHERE filters rows. 

To enhance your filtering, you can use other operators in the WHERE clause (a combined example follows this list): 

  • = (equal to) 
  • != or <> (not equal to) 
  • >, < (greater/less than) 
  • >=, <= (greater/less than or equal to) 
  • BETWEEN (filters within a range) 
  • IN (filters against a list of values) 
  • LIKE (filters using pattern matching, e.g., WHERE name LIKE 'A%') 
  • AND, OR (combines multiple conditions) 
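
Here's a short sketch combining several of these operators on the same orders table (the filter values are illustrative):

SQL 
SELECT customer_name, region, total_sales 
FROM orders 
WHERE region IN ('North', 'South') 
  AND total_sales BETWEEN 500 AND 5000 
  AND customer_name LIKE 'A%'; 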

GROUP BY, HAVING, ORDER BY – Aggregations and Sorting 

You will rarely want to see just raw data. More often, you'll want to summarize it. This is where aggregations come in. 

  • GROUP BY collapses rows with the same value in a specified column into a single row. It's used with aggregate functions like SUM(), COUNT(), AVG()
  • HAVING filters the groups after they have been aggregated. WHERE filters rows before aggregation. 
  • ORDER BY sorts the final result set. 

Example: 

Get the total revenue for each region, but only show regions with more than $50,000 in revenue, and sort them from highest to lowest. 

SQL 
SELECT region, SUM(total_sales) AS total_revenue 
FROM orders 
GROUP BY region 
HAVING SUM(total_sales) > 50000 
ORDER BY total_revenue DESC; 
 

Output: 

region  total_revenue 
North  230000 
South  175000 
West  95000 

Use Cases: 

  • Summarizing total revenue by product category or region. 
  • Calculating daily or monthly active users. 
  • Identifying top-performing salespeople or marketing channels. 

This form of aggregation is fundamental to SQL for data science, as it turns raw transactional data into meaningful KPIs. 

Also Read: Mastering SQL Aggregate Functions: A Comprehensive Guide 

JOINs and Combining Tables 

No dataset is an island. Data is almost always split across multiple tables to reduce redundancy (a concept called normalization). A JOIN clause is used to combine rows from two or more tables based on a related column between them. This is a critical skill for any SQL for data science task. 

Types of joins: 

  • INNER JOIN: Returns only the rows that have matching values in both tables. 
  • LEFT JOIN: Returns all rows from the left table, and the matched rows from the right table. If there's no match, the columns from the right table will be NULL
  • RIGHT JOIN: Returns all rows from the right table, and the matched rows from the left table. (Less common, as you can just use a LEFT JOIN and swap the tables). 
  • FULL JOIN (or FULL OUTER JOIN): Returns all rows when there is a match in either the left or right table. 

Example: 

You have a customers table and an orders table. You want to see the name of the customer who placed each order. 

SQL 
SELECT c.customer_name, o.order_id, o.total_sales 
FROM customers c 
INNER JOIN orders o 
ON c.customer_id = o.customer_id; 
 

Output: 

customer_name  order_id  total_sales 
Shubam   1001  1200 
Priya Patel  1002  1500 
Sonal Singh  1003  2100 
Ashish  1004  800 

(Here, c and o are aliases, or short nicknames, for the table names to make the query more readable.) 

If you wanted to see all customers, even those who haven't placed an order yet, you would use a LEFT JOIN instead. 
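
A sketch of that LEFT JOIN, using the same tables as above:

SQL 
SELECT c.customer_name, o.order_id, o.total_sales 
FROM customers c 
LEFT JOIN orders o 
ON c.customer_id = o.customer_id; 

Customers without any orders would still appear in the result, with NULL in the order_id and total_sales columns.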

Sub-queries and Set Operations 

A sub-query is a query nested inside another query. They are often used when one part of your question depends on the answer to another part. 

Example: 

Find the names of all customers who have placed an order over $5,000. 

SQL 
SELECT customer_name 
FROM customers 
WHERE customer_id IN ( 
  SELECT customer_id 
  FROM orders 
  WHERE total_sales > 5000 
); 
 

Output: 

customer_name 
Mega Corp 
Global Industries 
Alpha Tech 

The "inner query" runs first, creating a temporary list of customer IDs. The "outer query" then retrieves the names for that list. 

Set Operations: 

These operators combine the results of two different SELECT statements. 

  • UNION: Combines the results of two queries and removes duplicate rows. 
  • UNION ALL: Combines the results of two queries and keeps all duplicate rows. (Faster, because it skips the duplicate-removal step.) 
  • INTERSECT: Finds only the rows that are common to both queries. 
  • EXCEPT: Finds the rows that are in the first query but not in the second (see the sketch after the UNION example below). 

Example: 

Get a single list of all IDs that are either customers or leads. 

SQL 
SELECT customer_id FROM customers 
UNION 
SELECT customer_id FROM leads; 
 

Output: 

customer_id 
C1001 
C1002 
L2003 
C1003 
L2004 
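
EXCEPT works in the other direction. A minimal sketch that finds leads who have not yet become customers (assuming a dialect that supports EXCEPT, such as PostgreSQL; older MySQL versions do not):

SQL 
SELECT customer_id FROM leads 
EXCEPT 
SELECT customer_id FROM customers; 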

Also Read: Self Join SQL: A Comprehensive Overview 

Window Functions and Advanced Queries 

This is one of the most powerful concepts in modern SQL for data science. A regular GROUP BY query collapses rows. A window function performs a calculation across a set of rows (a "window") but preserves all the original rows. 

They are defined by the OVER() clause. 

Example: 

Calculate the cumulative sales (running total) for each customer over time. 

SQL 
SELECT 
  customer_id, 
  order_date, 
  total_sales, 
  SUM(total_sales) OVER (PARTITION BY customer_id ORDER BY order_date) AS cumulative_sales 
FROM orders; 
 

Output: 

customer_id  order_date  total_sales  cumulative_sales 
C1001  2025-01-15  500  500 
C1001  2025-02-10  800  1300 
C1001  2025-03-05  300  1600 
C1002  2025-01-20  1000  1000 
C1002  2025-03-10  1500  2500 

  • PARTITION BY customer_id: This "resets" the sum for each new customer. 
  • ORDER BY order_date: This tells the SUM to operate sequentially by date. 

Applications: 

  • Ranking: RANK(), DENSE_RANK(), and ROW_NUMBER() to find the top 10 products per category (sketched after this list). 
  • Time-based Metrics: Calculating rolling 7-day or 30-day averages. 
  • Period-over-Period: Using LAG() or LEAD() to compare a row's value to the previous or next row's value (e.g., month-over-month growth). 
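
As a sketch of the ranking pattern, here is the classic "top N per group" query: it numbers products by revenue within each category in a sub-query, then keeps the top three (the product_sales table and its columns are assumptions for illustration):

SQL 
SELECT category, product_name, revenue 
FROM ( 
  SELECT 
    category, 
    product_name, 
    revenue, 
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn 
  FROM product_sales 
) AS ranked 
WHERE rn <= 3; 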

These advanced queries are what separate basic SQL use from true SQL for data science analysis. 

Also Read: SQL Commands - A Comprehensive Guide 

Functions in SQL for Data Science 

SQL comes with a rich library of functions to perform calculations and manipulate data. Using them effectively is key to cleaning data and engineering features. 

Built-in Numeric, String and Date Functions 

These are your primary tools for data wrangling. 

Numeric Functions: 

  • SUM(): Total sum of a column. 
  • AVG(): Average value of a column. 
  • MAX(): Maximum value. 
  • MIN(): Minimum value. 
  • COUNT(): Counts the number of rows. COUNT(*) counts all rows, while COUNT(column_name) counts non-NULL rows. 
  • ROUND(): Rounds a number to a specified decimal place. 

String Functions: 

  • CONCAT(): Joins two or more strings together. 
  • SUBSTRING(): Extracts a part of a string. 
  • UPPER(), LOWER(): Converts a string to uppercase or lowercase (great for standardizing data). 
  • TRIM(): Removes whitespace from the beginning and end of a string. 
  • LENGTH(): Returns the length of a string. 
  • REPLACE(): Replaces occurrences of a substring with another. 

Date Functions: 

  • DATEADD() / INTERVAL: Adds a time interval (e.g., 7 days) to a date. (Syntax varies by SQL dialect). 
  • DATEDIFF(): Finds the difference between two dates. 
  • YEAR(), MONTH(), DAY(): Extracts parts of a date. 
  • NOW() / GETDATE(): Returns the current date and time. 
  • DATE_TRUNC(): (Common in PostgreSQL/BigQuery) Truncates a date to a specific part, like the first of the month or week. This is extremely useful for grouping by time periods, as sketched below. 
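
For instance, a minimal DATE_TRUNC() sketch for monthly revenue (PostgreSQL syntax; other dialects use different date functions):

SQL 
SELECT 
  DATE_TRUNC('month', order_date) AS order_month, 
  SUM(total_sales) AS monthly_revenue 
FROM orders 
GROUP BY order_month 
ORDER BY order_month; 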

Example: 

Get the full name of users, the year they placed an order, and their total sales for that year. 

SQL 
SELECT 
  CONCAT(first_name, ' ', last_name) AS full_name, 
  YEAR(order_date) AS order_year, 
  SUM(total_sales) AS yearly_sales 
FROM orders 
GROUP BY full_name, order_year; 
 

Output: 

full_name  order_year  yearly_sales 
Ashish  2024  12000 
Ashish  2025  8500 
Priya Patel  2024  9000 
Priya Patel  2025  11500 

Function  Purpose  Example 
COUNT()  Count rows  COUNT(*) 
AVG()  Average value  AVG(sales_amount) 
CONCAT()  Join text  CONCAT(first_name, last_name) 
DATEADD()  Add interval  DATEADD(DAY, 7, order_date) 
SUBSTRING()  Extract text  SUBSTRING(product_code, 1, 3) 

Also Read: SQL HAVING with Examples 

Conditional Logic and Handling NULLs 

Data is rarely clean. You'll need to handle NULL values and apply conditional logic. This is a common task in SQL for data science. 

CASE Expression: 

The CASE statement is SQL's version of an IF-THEN-ELSE block. It lets you create new columns based on conditional logic. 

Example: 

Categorize sales into 'High', 'Medium', or 'Low' buckets. 

SQL 
SELECT 
  product_name, 
  total_sales, 
  CASE 
    WHEN total_sales > 500 THEN 'High' 
    WHEN total_sales BETWEEN 100 AND 500 THEN 'Medium' 
    ELSE 'Low' 
  END AS sales_category 
FROM sales; 
 

Output: 

product_name  total_sales  sales_category 
Laptop  1200  High 
Mouse  80  Low 
Keyboard  150  Medium 
Monitor  450  Medium 
USB Cable  25  Low 

NULL Handling: 

A NULL value represents missing or unknown data. It is not the same as 0 or an empty string, and it can break calculations (e.g., 10 + NULL is NULL). 

  • COALESCE(column, 0): This is the most important function for handling NULLs. It returns the first non-NULL value in a list. The example COALESCE(sales_amount, 0) replaces any missing sales amounts with 0. 
  • NULLIF(value1, value2): Returns NULL if the two values are equal, otherwise returns value1. This is useful for avoiding divide-by-zero errors (e.g., sales / NULLIF(quantity, 0)); see the sketch after the table below. 

Input  Result After COALESCE(Input, 0) 
NULL  0 
25  25 
100  100 
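
Here is the divide-by-zero guard from the NULLIF bullet above as a runnable sketch (column names are illustrative):

SQL 
SELECT 
  order_id, 
  total_sales / NULLIF(quantity, 0) AS unit_price 
FROM orders; 

When quantity is 0, NULLIF returns NULL, so the division yields NULL instead of an error. Wrap the expression in COALESCE(..., 0) if you'd rather see a zero.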

Also Read: Top 27 SQL Projects in 2025 With Source Code: For All Levels 

Aggregation and Analytical Functions 

This is where SQL for data science really shines, especially in feature engineering. These functions go beyond simple GROUP BY aggregates. 

We already discussed window functions, but let's look at specific analytical functions used with the OVER() clause: 

RANK() vs. DENSE_RANK() vs. ROW_NUMBER(): 

  • ROW_NUMBER(): Assigns a unique, sequential number to each row (1, 2, 3, 4). 
  • RANK(): Assigns a rank with gaps for ties (1, 2, 2, 4). 
  • DENSE_RANK(): Assigns a rank with no gaps for ties (1, 2, 2, 3). 

LAG() and LEAD(): 

  • LAG(salary, 1): Gets the salary from the previous row (defined by ORDER BY). 
  • LEAD(salary, 1): Gets the salary from the next row. 
  • This is perfect for calculating period-over-period change. 
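
A sketch of that period-over-period pattern, assuming a hypothetical monthly_revenue table with one row per month:

SQL 
SELECT 
  revenue_month, 
  revenue, 
  revenue - LAG(revenue, 1) OVER (ORDER BY revenue_month) AS mom_change 
FROM monthly_revenue; 

The first month has no previous row, so LAG() returns NULL there.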

Example: 

Rank employees by salary within their department. 

SQL 
SELECT 
  employee_name, 
  department, 
  salary, 
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank 
FROM employees; 
 

Output: 

employee_name  department  salary  salary_rank 
Rahul  Engineering  120000  1 
Sumit  Engineering  110000  2 
Ankur  Sales  100000  1 
Sandeep  Engineering  95000  3 
Gaurabh  Sales  95000  2 

Used For: 

  • Trend tracking. 
  • Ranking data. 
  • Calculating cumulative metrics. 

Also Read: Top 25+ SQL Projects on GitHub You Should Explore in 2025 

User-Defined Functions (UDFs) and Stored Procedures 

When you find yourself writing the same complex logic repeatedly, you can automate it. 

  • User-Defined Function (UDF): A reusable function that takes inputs, performs logic, and returns a value or a table. 
  • Stored Procedure: A pre-compiled set of one or more SQL statements saved in the database. It can take inputs, run complex logic (like loops or transactions), and perform actions (like INSERT or UPDATE), typically without returning a value. 

Why Use Them? 

  • Automation: Automate repetitive logic. 
  • Consistency: Ensures everyone on the team calculates a metric (like "active user") the same way. 
  • Performance: Keeps data cleaning and transformation logic inside the database, which is often faster than pulling data into Python just to clean it. 

Example (Syntax varies): 

Create a function to calculate a standard 10% bonus. 

SQL 
CREATE FUNCTION calc_bonus(salary FLOAT) 
RETURNS FLOAT 
AS 
BEGIN 
  RETURN salary * 0.1; 
END; 
 

Output: 

Function 'calc_bonus' created successfully. 
 

Now you can just SELECT employee_name, calc_bonus(salary) FROM employees;
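
A stored procedure works similarly but performs actions. A hedged sketch in MySQL-style syntax (the procedure name is illustrative, and in the mysql client you would wrap this in DELIMITER commands):

SQL 
CREATE PROCEDURE apply_annual_bonus() 
BEGIN 
  -- Apply the reusable bonus logic to every employee 
  UPDATE employees 
  SET salary = salary + calc_bonus(salary); 
END; 

You would then run it with CALL apply_annual_bonus();.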

Best Practices: SQL in Data Science Projects 

Writing a query that works is easy. Writing a query that is good, meaning it's readable, maintainable, and fast, is a hallmark of a senior data professional. These SQL basics for data science extend beyond just syntax. 

Writing Maintainable and Readable Queries 

Your code will be read by your future self and your teammates. Make it easy for them. 

  • Use Proper Indentation and Formatting: Indent clauses and align functions. 
  • Comment Your Logic: Use -- for single-line comments or /* ... */ for multi-line comments to explain why you are doing something complex. 
  • Use Meaningful Aliases: Use AS to give columns and tables readable names (e.g., FROM customers AS c). 
  • Format Keywords: Use UPPERCASE for SQL keywords (SELECT, FROM, WHERE) and lowercase for table/column names. 
  • Use Common Table Expressions (CTEs): For long, complex queries, use CTEs (WITH...AS) to break the logic into named, sequential steps. This is far more readable than a mess of nested sub-queries; see the sketch after the examples below. 

Bad Example: 

SQL 
select name,sum(sales) from sales_data group by name; 
 

Good Example (using formatting): 

SQL 
-- Calculate total sales for each person 
SELECT 
  name, 
  SUM(sales) AS total_sales 
FROM 
  sales_data 
GROUP BY 
  name; 
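
And a sketch of the CTE pattern from the list above, breaking one question into two named steps (table and column names are illustrative):

SQL 
-- Step 1: total sales per customer 
WITH customer_totals AS ( 
  SELECT customer_id, SUM(total_sales) AS lifetime_sales 
  FROM orders 
  GROUP BY customer_id 
) 
-- Step 2: keep only high-value customers 
SELECT customer_id, lifetime_sales 
FROM customer_totals 
WHERE lifetime_sales > 10000; 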
 

Performance Optimization 

In SQL for data science, you'll query tables with billions of rows. A bad query can take hours or cost thousands of dollars in cloud computing fees. 

  • Avoid SELECT *: Only select the columns you actually need. This reduces the amount of data the database has to read and send over the network. 
  • Use Indexes: An index is like a book's index. It helps the database find rows with a specific value in a column much faster. Work with database administrators to ensure indexes exist on columns you frequently filter (e.g., user_id, order_date). 
  • Filter Before Joins: Apply WHERE clauses to your tables before joining them if possible. This reduces the number of rows that need to be compared in the join, which is often the most expensive part of a query. 
  • Limit Rows: When exploring, always use LIMIT 100 or TOP 100 to get a sample of the data. 

Also Read: Top 25 DBMS Projects [With Source Code] for Students in 2025 

Example (Optimized): 

This query filters for a specific date range before grouping, which is much more efficient. 

SQL 
SELECT customer_id, SUM(total_sales) 
FROM orders 
WHERE order_date >= '2025-01-01' -- Filter early 
GROUP BY customer_id 
LIMIT 100; -- Limit results 
 

Output: 

customer_id  SUM(total_sales) 
C1001  8800 
C1002  11500 
C1003  4200 
... (97 more rows)  ... 

Data Modeling and Schema Design 

While data scientists don't always design databases, you must understand how they are designed to query them effectively. 

Model Type  Description  Use Case 
Normalized  Data is split into many separate tables to reduce redundancy.  Transactional systems (e.g., an e-commerce checkout). 
Denormalized  Data is merged into fewer, wider tables with redundancy.  Analytics systems (e.g., a data warehouse). 
Star Schema  The most common analytics design. It has one central Fact table (containing metrics) connected to multiple Dimension tables (containing context).  Data warehouses, Business Intelligence. 

Example of a Star Schema: 

  • Fact Table: sales_facts (sales_id, date_id, product_id, region_id, revenue, quantity) 
  • Dimension Tables: dim_date, dim_product, dim_region 

This design is very fast for analytics because your queries are simple: you join the sales_facts table to the dimension tables you need, then aggregate. This structure is a key component of the SQL for data science ecosystem. 
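
For example, a hedged sketch of a typical star-schema query (assuming dim_product carries a category column and dim_date a month column):

SQL 
SELECT 
  p.category, 
  d.month, 
  SUM(f.revenue) AS total_revenue 
FROM sales_facts f 
JOIN dim_product p ON f.product_id = p.product_id 
JOIN dim_date d ON f.date_id = d.date_id 
GROUP BY p.category, d.month; 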

Also Read: 30 Data Science Project Ideas for Beginners in 2025 

Integrating SQL and Other Tools 

The standard SQL for data science workflow involves pulling data from SQL and moving it into a programming language like Python or R for modeling. 

You use a library like sqlalchemy (Python) or RPostgreSQL (R) to create a connection to the database. Then, you execute your SQL query and load the results directly into a data structure like a Pandas DataFrame. 

Python Example: 

Python 
import pandas as pd 
import sqlalchemy 
 
# Create a connection engine to the database 
# (Replace with your database credentials) 
engine = sqlalchemy.create_engine("mysql+pymysql://user:password@localhost/my_database") 
 
# Write your SQL query 
sql_query = """ 
    SELECT * FROM sales_data 
    WHERE region = 'North'; 
""" 
 
# Execute the query and load results into a Pandas DataFrame 
df = pd.read_sql(sql_query, engine) 
 
# Now you can use df for modeling, analysis, or visualization 
print(df.head()) 
 

Output (from the print(df.head()) command): 

  order_id product_name  sales_amount region 
0      1001       Laptop        1200.0  North 
1      1004        Mouse          80.0  North 
2      1005     Keyboard         150.0  North 
3      1008      Monitor         450.0  North 
4      1010    USB Cable          25.0  North 
 

Version Control, Testing, and Collaboration 

Treat your SQL code like any other code. 

  • Version Control: Save your important SQL queries and scripts in Git. This allows you to track changes, collaborate with teammates, and revert to old versions if something breaks. 
  • Testing: Test your queries. This means validating the results. If you calculate total revenue, does it match the dashboard? If you join two tables, do the row counts make sense? Check for duplicates, NULLs, and extreme outliers (two quick checks are sketched after this list). 
  • Collaboration: Use a shared query notebook or tool (like DBeaver, Databricks, or Jupyter) to review logic with peers. 
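
Two quick validation sketches of the kind described above (table and column names are illustrative):

SQL 
-- Does any customer_id appear more than once? 
SELECT customer_id, COUNT(*) AS n 
FROM customers 
GROUP BY customer_id 
HAVING COUNT(*) > 1; 

-- How many orders are missing a sales amount? 
SELECT COUNT(*) AS missing_sales 
FROM orders 
WHERE total_sales IS NULL; 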

Also Read: How to Use GitHub: A Beginner's Guide to Getting Started and Exploring Its Benefits in 2025 

Security and Governance 

Data scientists often have access to sensitive Personally Identifiable Information (PII). It's your responsibility to handle it securely. 

  • Apply Access Controls: Use GRANT and REVOKE to ensure users only have permission to access the data they need (Principle of Least Privilege); a minimal sketch follows this list. 
  • Use Masking: For sensitive data (like credit cards or social security numbers), don't SELECT the raw column. Use functions to mask it or select a non-sensitive version. 
  • Keep Audit Logs: Be aware that in most companies, every query you run is logged. Always act professionally and ethically with data access. 
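
A minimal sketch of the access-control bullet (standard GRANT/REVOKE syntax; the role name is illustrative):

SQL 
-- Allow an analyst role to read, but not modify, the orders table 
GRANT SELECT ON orders TO analyst_role; 

-- Remove a permission that is no longer needed 
REVOKE SELECT ON customers FROM analyst_role; 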

Real-World Use Cases of SQL for Data Science 

Here is how these concepts come together to solve real business problems. 

Marketing Analytics – Customer Segmentation 

Goal: Identify high-value customers for a targeted marketing campaign. A common method is RFM (Recency, Frequency, Monetary value). 

Query: 

This query groups customers, finds their average and total orders, and filters for those who are high-spenders. 

SQL 
SELECT 
  customer_id, 
  AVG(order_value) AS avg_spent, 
  COUNT(order_id) AS total_orders 
FROM orders 
GROUP BY customer_id 
HAVING AVG(order_value) > 500; 
 

Output: 

customer_id  avg_spent  total_orders 
C1001  550.75  12 
C1003  720.50  8 
C1007  610.00  15 

You can use these segments for personalized email offers, targeting high-value customers to retain them. This is a classic SQL for data science application. 

Also Read: Top 20+ Data Science Techniques To Learn in 2025 

Financial Analytics – Fraud Detection 

Goal: Detect abnormal transactions that could be fraudulent. 

Query: 

This query uses a sub-query and statistical functions to find any transaction that is more than 3 standard deviations above the overall average. 

SQL 
SELECT user_id, transaction_id, amount 
FROM transactions 
WHERE amount > ( 
  SELECT AVG(amount) + 3 * STDDEV(amount) 
  FROM transactions 
); 
 

Output: 

user_id  transaction_id  amount 
U458  T987234  5000.00 
U123  T987456  7200.00 
U992  T987789  5150.00 

This identifies statistical outliers that can then be flagged for manual review. A more advanced version would calculate the average per user. 

Also Read: Data Science in Finance: Career, Tools, and Trends 

Operations & Supply Chain – Inventory Trend Analysis 

Goal: Monitor product stock movement to avoid stocking out. 

Query: 

This query uses a window function to calculate the rolling 7-day sum of quantity shipped for each product. 

SQL 
SELECT 
  product_id, 
  order_date, 
  quantity_shipped, 
  SUM(quantity_shipped) OVER ( 
    PARTITION BY product_id 
    ORDER BY order_date 
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW 
  ) AS rolling_7_day_shipped 
FROM inventory; 
 

Output: 

product_id  order_date  quantity_shipped  rolling_7_day_shipped 
P100  2025-10-01  10  10 
P100  2025-10-02  12  22 
P100  2025-10-03  8  30 
...  ...  ...  ... 
P100  2025-10-07  15  70 
P100  2025-10-08  11  71 

This helps smooth out daily noise and identify the true trend in product demand. 

Machine Learning Pipelines – Feature Engineering 

Goal: Prepare a "feature set" table that can be exported to Python for model training (e.g., to predict customer churn). 

Query: 

SQL can create powerful features directly in-database. This query builds a single row for each customer, summarizing their entire history. 

SQL 
SELECT 
  customer_id, 
  COUNT(order_id) AS feature_order_count, 
  AVG(order_value) AS feature_avg_spent, 
  MAX(order_date) AS feature_last_order 
FROM orders 
GROUP BY customer_id; 
 

Output: 

customer_id  feature_order_count  feature_avg_spent  feature_last_order 
C1001  12  550.75  2025-10-15 
C1002  5  200.00  2025-08-01 
C1003  8  720.50  2025-10-20 

This new table (with one row per customer) is now perfectly formatted to be used as the "X" (features) in a machine learning model. This is a core workflow in SQL for data science

Also Read: Customer Churn Prediction Project: From Data to Decisions 

Domain  Query Goal  Key Functions Used 
Marketing  Customer Segmentation (RFM)  AVG, COUNT, MAX, GROUP BY, HAVING 
Finance  Anomaly/Fraud Detection  AVG, STDDEV, Sub-query 
Operations  Inventory Trend Analysis  SUM(), OVER(), ROWS BETWEEN (Window Function) 
ML  Feature Engineering  COUNT, MAX, AVG, DATEDIFF, GROUP BY 

Conclusion 

SQL for data science gives you complete control over data exploration, preparation, and analysis. You can query large datasets, apply analytical functions, and integrate SQL with Python or R for advanced modeling. When you follow structured query design, use efficient joins, and apply indexing wisely, your data workflows become faster and cleaner. Mastering SQL basics for data science helps you move from raw data to actionable insights with precision and speed. 


Frequently Asked Questions

1. What is SQL for data science and why is it important?

SQL for data science is the practice of using SQL to extract, manipulate, aggregate, and prepare data from relational databases specifically for data science tasks like analysis, visualization, and machine learning. It's important because most of the world's business data lives in SQL databases, and SQL is the most efficient tool to access it. 

2. How does SQL differ when used for data science?

While a software engineer might use SQL to build an application (using INSERT, UPDATE), a data scientist uses SQL primarily for analysis. This means a heavy focus on SELECT statements, complex JOINs, GROUP BY aggregations, and advanced analytical functions like window functions to understand and shape the data. The SQL for data science approach is investigative. 

3. What are SQL basics for data science a beginner should start with?

The core SQL basics for data science are: 

  • SELECT, FROM, WHERE to retrieve and filter data. 
  • GROUP BY, HAVING, and aggregate functions (COUNT, SUM, AVG) to summarize data. 
  • JOIN (especially INNER and LEFT) to combine data from multiple tables. 
  • ORDER BY and LIMIT to sort and sample results. 

4. Can SQL be used without programming?

Yes. SQL is a declarative query language, not a procedural programming language like Python. You declare what data you want, and the database figures out how to get it. Many data analysts and business intelligence professionals use SQL as their primary tool without ever writing Python or R. 

5. Which SQL functions are most useful for data scientists?

  • Aggregates: COUNT(), SUM(), AVG(), MAX(), MIN()
  • Window Functions: ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER()
  • Conditional: CASE...WHEN...END
  • Cleaning: COALESCE(), TRIM(), LOWER()
  • Date: DATE_TRUNC(), DATEDIFF(), EXTRACT()

6. How do JOINs help in combining datasets?

Databases are designed to be "normalized," meaning data is split into related tables to avoid redundancy. JOINs are the mechanism to put that data back together. You can combine a users table with an orders table to see who bought what, or an orders table with a products table to analyze sales by category. This is essential for SQL for data science. 

7. What’s the difference between GROUP BY and window functions?

GROUP BY collapses many rows into a single summary row. For example, GROUP BY region would turn 1000 sales rows into just a few rows (one for each region). A window function performs a calculation across many rows but preserves all the original rows, adding the calculation as a new column. 

8. When should you use sub-queries or joins?

Often, they can achieve the same result. A JOIN is generally faster and more readable. Use a JOIN when you need to add columns from another table. Use a sub-query when you need to filter based on a result from another table (e.g., WHERE user_id IN (SELECT ...)). 

9. How to optimize SQL queries for big data?

Follow these steps to optimize SQL queries for big data: 

  • SELECT only the columns you need. 
  • Filter with WHERE as early as possible. 
  • Ensure the columns you filter or join on are indexed. 
  • Use EXPLAIN PLAN to see how the database is executing your query and find bottlenecks. 
  • Avoid JOINs on large, un-indexed string columns. This is a key performance skill in SQL for data science. 

10. What schema design supports analytics best?

The Star Schema is the standard for analytics and data warehousing. It features a central Fact table (e.g., sales) containing quantitative metrics, surrounded by Dimension tables (e.g., products, users, dates) that provide descriptive context. 

11. How can SQL connect with Python or R?

You use libraries like sqlalchemy or psycopg2 in Python, or RPostgreSQL in R. You establish a "connection" to the database using your credentials, then pass your SQL query as a string to the library. The library executes the query and returns the results, typically as a Pandas DataFrame (in Python) or a data.frame (in R). 

12. What are the main SQL best practices for data science?

Key SQL best practices for data science: 

  • Write readable, formatted, and commented code. 
  • Use Common Table Expressions (CTEs) for complex logic. 
  • Never use SELECT * in production code. 
  • Test and validate your query results. 
  • Store your queries in Git for version control. 
  • Understand the underlying data model (like the Star Schema). 

13. How to handle NULLs effectively?

Use COALESCE(column_name, default_value) to replace NULLs with a sensible default (like 0 or 'N/A') before performing calculations. Be aware that COUNT(column_name) ignores NULLs, while COUNT(*) does not. 

14. What causes SQL performance issues?

The most common causes are: 

  • Querying tables without proper indexes on the filter (WHERE) or join (ON) columns. 
  • Joining multiple, very large tables. 
  • Using SELECT * on wide tables. 
  • Poorly written sub-queries or correlated sub-queries that run for every single row. 

15. How to apply SQL window functions for time analysis?

Use SUM() OVER (ORDER BY date) for running totals. Use AVG() OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for a 7-day rolling average. Use LAG(sales, 1) OVER (ORDER BY date) to get the previous day's sales to calculate day-over-day growth. This is a very common SQL for data science pattern. 

16. How does SQL support feature engineering?

SQL is excellent for creating features. You can use CASE statements for binning (e.g., 'Low', 'Medium', 'High'), date functions to extract features like 'day_of_week' or 'is_weekend', and GROUP BY to create aggregate features (e.g., avg_order_value_per_user). This is a core part of SQL for data science. 

17. How to secure and govern SQL data access?

Use the GRANT and REVOKE commands (DCL) to give users the minimum permissions they need. For sensitive data, use database views that only show non-sensitive columns, or apply data masking functions. 

18. How do SQL basics for data science help in job interviews?

They are critical. Almost every data science interview includes a technical SQL screen. You will be expected to write JOINs, GROUP BYs, and often a window function on a whiteboard or in a shared editor. They are testing your foundational ability to even get the data. 

19. What are top tools to practice SQL for data science?

  • Online Platforms: LeetCode, HackerRank, and StrataScratch have excellent SQL for data science problems. 
  • Local Setup: Install PostgreSQL and DBeaver on your machine. Load some public datasets (e.g., from Kaggle) and practice. 
  • In-Browser: SQL Fiddle or db-fiddle. 

20. How do you measure query effectiveness?

  • Accuracy: Does it return the correct, validated data? 
  • Performance: How fast does it run? (e.g., 2 seconds vs. 2 hours). Use EXPLAIN ANALYZE to see the query plan and cost. 
  • Readability: Can a teammate understand and modify your query? 
