Common Table Expressions (CTEs) in SQL#

In this tutorial, you will learn about Common Table Expressions (CTEs) in SQL and how they can simplify your code. CTEs are temporary result sets that you can reference within other SELECT, INSERT, UPDATE, or DELETE statements. They are a powerful tool for constructing complex queries in a readable and user-friendly way. By the end of this tutorial, you will understand how to create and use CTEs in your own SQL queries.

Set up and data access#

Important

Note: The –save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below.

This code installs JupySQL, and DuckDB in your environment. We will be using these moving forward.

%pip install jupysql jupysql-plugin --quiet

We continue to work with the Bank and Marketing data set.

Important

Source: UCI Machine Learning Repository

URL: https://archive-beta.ics.uci.edu/dataset/222/bank+marketing

Data Citation

Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

We can use the following function to extract the downloaded data from the UCI repository.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

Initialize a DuckDB Instance

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Load the data

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count

We confirm the table was loaded

%sqlcmd tables
Name
bank

We can use JupySQL’s Table Explorer to take a quick look at the table.

%sqlcmd explore --table bank

Simple CTE#

Let’s create a CTE that finds the average balance for each job type.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance;
job AverageBalance
unemployed 1089.421875
services 1103.9568345323742
management 1766.9287925696594
blue-collar 1085.161733615222
self-employed 1392.4098360655737
technician 1330.99609375
entrepreneur 1645.125
admin. 1226.73640167364
student 1543.8214285714287
housemaid 2083.8035714285716
Truncated to displaylimit of 10.

In this example, the WITH keyword indicates the start of the CTE, which we call Job_Avg_Balance.

Within the Job_Avg_Balance CTE, we form a query that returns the average balance for each job by computing the average of balance and grouping by job.

After closing the WITH statement, we then select all columns from the Job_Avg_Balance CTE. This returns two columns: job and AverageBalance. The final SELECT statement then retrieves the data from the CTE.

With the JupySQL magics %sql, %%sql and the --save option, you can furthermore save your CTE for later use on a different code cell:

%%sql --save avg_balance_by_job
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance;
job AverageBalance
unemployed 1089.421875
services 1103.9568345323742
management 1766.9287925696594
blue-collar 1085.161733615222
self-employed 1392.4098360655737
technician 1330.99609375
entrepreneur 1645.125
admin. 1226.73640167364
student 1543.8214285714287
housemaid 2083.8035714285716
Truncated to displaylimit of 10.
%%sql
SELECT * FROM avg_balance_by_job
Generating CTE with stored snippets: 'avg_balance_by_job'
job AverageBalance
unemployed 1089.421875
services 1103.9568345323742
management 1766.9287925696594
blue-collar 1085.161733615222
self-employed 1392.4098360655737
technician 1330.99609375
entrepreneur 1645.125
admin. 1226.73640167364
student 1543.8214285714287
housemaid 2083.8035714285716
Truncated to displaylimit of 10.

Multiple CTEs#

You can use multiple CTEs in a single query. Let’s find the average balance per job type and average campaign per job type.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
),
Job_Avg_Campaign AS (
    SELECT job, AVG(campaign) AS AverageCampaign
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance, Job_Avg_Campaign 
WHERE Job_Avg_Balance.job = Job_Avg_Campaign.job;
job AverageBalance job_1 AverageCampaign
unemployed 1089.421875 unemployed 2.6796875
services 1103.9568345323742 services 2.8225419664268587
management 1766.9287925696594 management 2.9731682146542826
blue-collar 1085.161733615222 blue-collar 2.846723044397463
self-employed 1392.4098360655737 self-employed 3.278688524590164
technician 1330.99609375 technician 2.7317708333333335
entrepreneur 1645.125 entrepreneur 2.5892857142857144
admin. 1226.73640167364 admin. 2.6317991631799162
student 1543.8214285714287 student 2.392857142857143
housemaid 2083.8035714285716 housemaid 2.5
Truncated to displaylimit of 10.

In this example, the first CTE is the same as in the previous example. The second CTE Job_Avg_Campaign returns two columns: job_1 and AverageCampaign. The final SELECT statement retrieves data from both CTEs.

You will notice job and job_1 in the final result. This is because we are doing a Cartesian product (cross join) between two CTEs and both have a column named job. We can avoid this by explicitly specifying the columns you want to select in our final SELECT statement instead of using SELECT *.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
),
Job_Avg_Campaign AS (
    SELECT job, AVG(campaign) AS AverageCampaign
    FROM bank
    GROUP BY job
)
SELECT Job_Avg_Balance.job, AverageBalance, AverageCampaign
FROM Job_Avg_Balance
JOIN Job_Avg_Campaign 
ON Job_Avg_Balance.job = Job_Avg_Campaign.job;
job AverageBalance AverageCampaign
unemployed 1089.421875 2.6796875
services 1103.9568345323742 2.8225419664268587
management 1766.9287925696594 2.9731682146542826
blue-collar 1085.161733615222 2.846723044397463
self-employed 1392.4098360655737 3.278688524590164
technician 1330.99609375 2.7317708333333335
entrepreneur 1645.125 2.5892857142857144
admin. 1226.73640167364 2.6317991631799162
student 1543.8214285714287 2.392857142857143
housemaid 2083.8035714285716 2.5
Truncated to displaylimit of 10.

Recursive CTEs#

A recursive CTE is one in which an initial CTE is repeatedly executed to return subsets of data until the complete result set is obtained.

The given dataset doesn’t lend itself to a recursive CTE, as these are generally used for hierarchical or recursive data problems, which the bank dataset does not present.

Here is an example of a recursive CTE

%%sql
WITH RECURSIVE numbers AS (
    SELECT 1 AS value
    UNION ALL
    SELECT value + 1 FROM numbers WHERE value < 10
)
SELECT * FROM numbers;
value
1
2
3
4
5
6
7
8
9
10
Truncated to displaylimit of 10.

This will output a list of numbers from 1 to 10. The CTE works as follows:

  • The WITH RECURSIVE clause marks the start of the recursive CTE.

  • The SELECT 1 AS value is the “anchor member” of the CTE and provides the base result set for the recursion to start.

  • The UNION ALL clause is used to combine the results of the anchor member with the results of the “recursive member”, which is SELECT value + 1 FROM numbers WHERE value < 10.

  • The recursion continues until value < 10 returns false, at which point the CTE stops executing.

Using CTEs to modify information in a table#

Important

Without a unique identifier for each row in your table, performing UPDATE or DELETE operations using a CTE would be risky because they could affect more rows than you intend.

It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. We’re going to load data from the Joining Data in SQL tutorial as this data has tables with unique ID’s.

_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

Let’s work on a separate DuckDB instance

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True
Connecting and switching to connection duckdb:///bank_data.duck.db
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
Count

Let’s take a look at the entries

%sqlcmd explore --table s1.district

Let’s say district_id is a unique identifier for each district.

CTE with UPDATE#

When working with UPDATE or DELETE operations, it is highly recommended to check your data before and after the operations. This helps to verify that the operations were successful and only affected the intended data.

Suppose we want to update the average_salary for districts that had more than 5000 committed crimes in 1996 to be increased by 10%.

Check data before the operation:

%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;
district_id average_salary
1 12541
4 9753
14 10045
24 8991
26 10787
34 9675
36 9198
40 9317
41 9832
47 9538
Truncated to displaylimit of 10.

This is returning all entries that satisfy number of crimes committed in ‘96 exceeds 5000.

Perform the update:

%%sql
WITH High_Crime_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);
Count
15

In this example, we are writing a CTE that selects the district_id entries that satisfy the condition no_of_committed_crimes_96 > 5000. We can all this CTE High_Crime_Districts.

We then use High_Crime_Districts to change entries in average salary by increasing it by 10% (average_salary = average_salary * 1.10), and ensure this is done in only those entries in the district_id found in the CTE High_Crime_Districts.

Check data after the operation:

%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;
district_id average_salary
1 13795
4 10728
14 11050
24 9890
26 11866
34 10643
36 10118
40 10249
41 10815
47 10492
Truncated to displaylimit of 10.

CTE with DELETE#

Suppose we want to delete records for all districts with unemployment rate in ‘96 greater than 4.

The query below selects only those entries for which unemployment_rate_96 is at least 4.

Check data before the operation:

%%sql
SELECT * 
FROM s1.district
WHERE unemployment_rate_96 > 4;
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
4 Kladno central Bohemia 149893 63 29 6 2 6 67.4 10728 4.64 5.05 109 5244 5892
5 Kolin central Bohemia 95616 65 30 4 1 6 51.4 9307 3.85 4.43 118 2616 3040
6 Kutna Hora central Bohemia 77963 60 23 4 2 4 51.5 8546 2.95 4.02 126 2640 3120
12 Pribram central Bohemia 107870 84 29 6 1 6 58.0 8754 3.83 4.31 137 3804 3868
31 Tachov west Bohemia 51313 34 12 3 1 4 59.2 8930 3.52 4.2 108 1328 1452
33 Decin north Bohemia 133777 24 17 7 3 11 84.7 8705 5.75 7.61 116 4650 4859
34 Chomutov north Bohemia 125236 28 11 1 4 5 87.7 10643 6.43 7.68 100 5323 5190
36 Liberec north Bohemia 159617 29 19 8 1 9 85.2 10118 3.33 4.28 131 5796 6132
37 Litomerice north Bohemia 114006 71 26 6 2 9 62.3 9065 4.46 5.39 123 4147 4166
38 Louny north Bohemia 85852 41 23 4 2 4 59.8 8965 7.08 8.23 104 2653 2822
Truncated to displaylimit of 10.

Perform the update.

%%sql
WITH High_Unemployment_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE unemployment_rate_96 > 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);
Count
31

We create a CTE called High_Unemployment_Districts that selects only those district_id whose unemployment rate in ‘96 exceeds 4.

We then use the DELETE operation to remove all entries from the s1.district for only those district_id in the CTE High_Unemployment_Districts.

Check data after the operation:

%%sql
SELECT * 
FROM s1.district
WHERE unemployment_rate_96 > 4;
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96

In both these examples, the operation will only affect the rows that match the conditions specified in the CTEs. The UPDATE operation will increase the average_salary of high crime districts by 10%, and the DELETE operation will remove all districts with high unemployment rate.

Exercise 1 (Easy)#

What is a Common Table Expression (CTE), and what is its primary use in SQL queries?

Answers

A Common Table Expression (CTE) is a temporary result set that you can reference within another SELECT, INSERT, UPDATE, or DELETE statement. The main use of CTEs is to simplify complex SQL queries, particularly those involving multiple levels of subqueries. They make your SQL code more readable and maintainable.

Exercise 2 (Medium)#

Write a SQL query using a CTE that returns the total number of inhabitants for each region in the provided dataset. Save the CTE into a variable called region_inhabitants

Answers
%%sql --save region_inhabitants
WITH Region_Inhabitants AS (
    SELECT region, SUM(no_of_inhabitants) AS TotalInhabitants
    FROM s1.district
    GROUP BY region
)
SELECT * 
FROM Region_Inhabitants;
region TotalInhabitants
Prague 1204953
central Bohemia 673892
south Bohemia 700595
west Bohemia 807993
north Bohemia 193826
east Bohemia 1132172
south Moravia 1213345
north Moravia 182027

This CTE, Region_Inhabitants, groups the dataset by region and calculates the total number of inhabitants for each region using the SUM() function. The final SELECT statement retrieves all records from the CTE.

Exercise 3 (hard)#

Suppose we want to increase the average_salary by 10% for districts that had more than 5000 committed crimes in ‘96, and then delete districts with an unemployment rate in ‘96 less than 4. Write a SQL query using CTEs to accomplish this, and explain the importance of checking the data before and after these operations.

Answers

First, let’s check the data before the operation:

%%sql 
SELECT * 
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
1 Hl.m. Praha Prague 1204953 0 0 0 1 1 100.0 13795 0.29 0.43 167 85677 99107
2 Benesov central Bohemia 88884 80 26 6 2 5 46.7 8507 1.67 1.85 132 2159 2674
3 Beroun central Bohemia 75232 55 26 4 1 5 41.7 8980 1.95 2.21 111 2824 2813
7 Melnik central Bohemia 94725 38 28 1 3 6 63.4 9920 2.26 2.87 130 4289 4846
8 Mlada Boleslav central Bohemia 112065 95 19 7 1 8 69.4 11277 1.25 1.44 127 5179 4987
9 Nymburk central Bohemia 81344 61 23 4 2 6 55.3 8899 3.39 3.97 149 2987 2487
10 Praha - vychod central Bohemia 92084 55 29 4 3 5 46.7 10124 0.56 0.54 141 3810 4316
11 Praha - zapad central Bohemia 75637 35 36 9 0 7 36.5 9622 0.45 0.59 154 3475 3529
13 Rakovnik central Bohemia 53921 61 22 1 1 2 41.3 8598 2.77 3.26 123 1597 1875
14 Ceske Budejovice south Bohemia 177686 69 27 10 1 9 74.8 11050 1.42 1.71 135 6604 6295
Truncated to displaylimit of 10.

Next, we perform the update and delete operations:

%%sql
WITH High_Crime_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);

WITH High_Unemployment_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE unemployment_rate_96 < 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);
Count
46

Finally, let’s check the data after the operation:

%%sql
SELECT * 
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96

The importance of checking the data before and after the operations is to verify that the operations were successful and only affected the intended data. It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. Checking the data before and after helps to prevent or identify potential mistakes or unexpected results in the data modification process.

Delete tables

%%sql
DROP TABLE IF EXISTS s1.district;
DROP TABLE IF EXISTS bank;
DROP SCHEMA s1;
Success

Summary#

In this tutorial we learned:

  1. Examples of simple and multiple CTEs, as well as how to join multiple CTEs.

  2. An introduction to Recursive CTEs with an example of generating a sequence of numbers.

  3. A demonstration of how to use CTEs in conjunction with UPDATE and DELETE commands, as well as the importance of checking data before and after these operations.

  4. A caution against performing UPDATE or DELETE operations without a unique identifier or precise condition.

In the next chapter, we will show how you can visualize the results of your SQL queries.