Common Table Expressions (CTEs) in SQL#
In this tutorial, you will learn about Common Table Expressions (CTEs) in SQL and how they can simplify your code. CTEs are temporary result sets that you can reference within other SELECT
, INSERT
, UPDATE
, or DELETE
statements. They are a powerful tool for constructing complex queries in a readable and user-friendly way. By the end of this tutorial, you will understand how to create and use CTEs in your own SQL queries.
Set up and data access#
Important
Note: The –save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below.
This code installs JupySQL, and DuckDB in your environment. We will be using these moving forward.
%pip install jupysql jupysql-plugin --quiet
We continue to work with the Bank and Marketing data set.
Important
Source: UCI Machine Learning Repository
URL: https://archive-beta.ics.uci.edu/dataset/222/bank+marketing
Data Citation
Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
We can use the following function to extract the downloaded data from the UCI repository.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()
Initialize a DuckDB Instance
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Load the data
%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count |
---|
We confirm the table was loaded
%sqlcmd tables
Name |
---|
bank |
We can use JupySQL’s Table Explorer to take a quick look at the table.
%sqlcmd explore --table bank
Simple CTE#
Let’s create a CTE that finds the average balance for each job type.
%%sql
WITH Job_Avg_Balance AS (
SELECT job, AVG(balance) AS AverageBalance
FROM bank
GROUP BY job
)
SELECT *
FROM Job_Avg_Balance;
job | AverageBalance |
---|---|
unemployed | 1089.421875 |
services | 1103.9568345323742 |
management | 1766.9287925696594 |
blue-collar | 1085.161733615222 |
self-employed | 1392.4098360655737 |
technician | 1330.99609375 |
entrepreneur | 1645.125 |
admin. | 1226.73640167364 |
student | 1543.8214285714287 |
housemaid | 2083.8035714285716 |
In this example, the WITH
keyword indicates the start of the CTE, which we call Job_Avg_Balance
.
Within the Job_Avg_Balance
CTE, we form a query that returns the average balance for each job by computing the average of balance
and grouping by job
.
After closing the WITH
statement, we then select all columns from the Job_Avg_Balance
CTE. This returns two columns: job
and AverageBalance
. The final SELECT
statement then retrieves the data from the CTE.
With the JupySQL
magics %sql, %%sql
and the --save
option, you can furthermore save your CTE for later use on a different code cell:
%%sql --save avg_balance_by_job
WITH Job_Avg_Balance AS (
SELECT job, AVG(balance) AS AverageBalance
FROM bank
GROUP BY job
)
SELECT *
FROM Job_Avg_Balance;
job | AverageBalance |
---|---|
unemployed | 1089.421875 |
services | 1103.9568345323742 |
management | 1766.9287925696594 |
blue-collar | 1085.161733615222 |
self-employed | 1392.4098360655737 |
technician | 1330.99609375 |
entrepreneur | 1645.125 |
admin. | 1226.73640167364 |
student | 1543.8214285714287 |
housemaid | 2083.8035714285716 |
%%sql
SELECT * FROM avg_balance_by_job
job | AverageBalance |
---|---|
unemployed | 1089.421875 |
services | 1103.9568345323742 |
management | 1766.9287925696594 |
blue-collar | 1085.161733615222 |
self-employed | 1392.4098360655737 |
technician | 1330.99609375 |
entrepreneur | 1645.125 |
admin. | 1226.73640167364 |
student | 1543.8214285714287 |
housemaid | 2083.8035714285716 |
Multiple CTEs#
You can use multiple CTEs in a single query. Let’s find the average balance per job type and average campaign per job type.
%%sql
WITH Job_Avg_Balance AS (
SELECT job, AVG(balance) AS AverageBalance
FROM bank
GROUP BY job
),
Job_Avg_Campaign AS (
SELECT job, AVG(campaign) AS AverageCampaign
FROM bank
GROUP BY job
)
SELECT *
FROM Job_Avg_Balance, Job_Avg_Campaign
WHERE Job_Avg_Balance.job = Job_Avg_Campaign.job;
job | AverageBalance | job_1 | AverageCampaign |
---|---|---|---|
unemployed | 1089.421875 | unemployed | 2.6796875 |
services | 1103.9568345323742 | services | 2.8225419664268587 |
management | 1766.9287925696594 | management | 2.9731682146542826 |
blue-collar | 1085.161733615222 | blue-collar | 2.846723044397463 |
self-employed | 1392.4098360655737 | self-employed | 3.278688524590164 |
technician | 1330.99609375 | technician | 2.7317708333333335 |
entrepreneur | 1645.125 | entrepreneur | 2.5892857142857144 |
admin. | 1226.73640167364 | admin. | 2.6317991631799162 |
student | 1543.8214285714287 | student | 2.392857142857143 |
housemaid | 2083.8035714285716 | housemaid | 2.5 |
In this example, the first CTE is the same as in the previous example. The second CTE Job_Avg_Campaign
returns two columns: job_1
and AverageCampaign
. The final SELECT statement retrieves data from both CTEs.
You will notice job
and job_1
in the final result. This is because we are doing a Cartesian product (cross join) between two CTEs and both have a column named job
. We can avoid this by explicitly specifying the columns you want to select in our final SELECT
statement instead of using SELECT *
.
%%sql
WITH Job_Avg_Balance AS (
SELECT job, AVG(balance) AS AverageBalance
FROM bank
GROUP BY job
),
Job_Avg_Campaign AS (
SELECT job, AVG(campaign) AS AverageCampaign
FROM bank
GROUP BY job
)
SELECT Job_Avg_Balance.job, AverageBalance, AverageCampaign
FROM Job_Avg_Balance
JOIN Job_Avg_Campaign
ON Job_Avg_Balance.job = Job_Avg_Campaign.job;
job | AverageBalance | AverageCampaign |
---|---|---|
unemployed | 1089.421875 | 2.6796875 |
services | 1103.9568345323742 | 2.8225419664268587 |
management | 1766.9287925696594 | 2.9731682146542826 |
blue-collar | 1085.161733615222 | 2.846723044397463 |
self-employed | 1392.4098360655737 | 3.278688524590164 |
technician | 1330.99609375 | 2.7317708333333335 |
entrepreneur | 1645.125 | 2.5892857142857144 |
admin. | 1226.73640167364 | 2.6317991631799162 |
student | 1543.8214285714287 | 2.392857142857143 |
housemaid | 2083.8035714285716 | 2.5 |
Recursive CTEs#
A recursive CTE is one in which an initial CTE is repeatedly executed to return subsets of data until the complete result set is obtained.
The given dataset doesn’t lend itself to a recursive CTE, as these are generally used for hierarchical or recursive data problems, which the bank dataset does not present.
Here is an example of a recursive CTE
%%sql
WITH RECURSIVE numbers AS (
SELECT 1 AS value
UNION ALL
SELECT value + 1 FROM numbers WHERE value < 10
)
SELECT * FROM numbers;
value |
---|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
This will output a list of numbers from 1 to 10. The CTE works as follows:
The
WITH RECURSIVE
clause marks the start of the recursive CTE.The
SELECT 1 AS value
is the “anchor member” of the CTE and provides the base result set for the recursion to start.The
UNION ALL
clause is used to combine the results of the anchor member with the results of the “recursive member”, which isSELECT value + 1 FROM numbers WHERE value < 10
.The recursion continues until
value < 10
returns false, at which point the CTE stops executing.
Using CTEs to modify information in a table#
Important
Without a unique identifier for each row in your table, performing UPDATE or DELETE operations using a CTE would be risky because they could affect more rows than you intend.
It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. We’re going to load data from the Joining Data in SQL tutorial as this data has tables with unique ID’s.
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'
Let’s work on a separate DuckDB instance
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
Count |
---|
Let’s take a look at the entries
%sqlcmd explore --table s1.district
Let’s say district_id
is a unique identifier for each district.
CTE with UPDATE#
When working with UPDATE
or DELETE
operations, it is highly recommended to check your data before and after the operations. This helps to verify that the operations were successful and only affected the intended data.
Suppose we want to update the average_salary
for districts that had more than 5000 committed crimes in 1996 to be increased by 10%.
Check data before the operation:
%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;
district_id | average_salary |
---|---|
1 | 12541 |
4 | 9753 |
14 | 10045 |
24 | 8991 |
26 | 10787 |
34 | 9675 |
36 | 9198 |
40 | 9317 |
41 | 9832 |
47 | 9538 |
This is returning all entries that satisfy number of crimes committed in ‘96 exceeds 5000.
Perform the update:
%%sql
WITH High_Crime_Districts AS (
SELECT district_id
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);
Count |
---|
15 |
In this example, we are writing a CTE that selects the district_id
entries that satisfy the condition no_of_committed_crimes_96 > 5000
. We can all this CTE High_Crime_Districts
.
We then use High_Crime_Districts
to change entries in average salary by increasing it by 10% (average_salary = average_salary * 1.10
), and ensure this is done in only those entries in the district_id
found in the CTE High_Crime_Districts
.
Check data after the operation:
%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;
district_id | average_salary |
---|---|
1 | 13795 |
4 | 10728 |
14 | 11050 |
24 | 9890 |
26 | 11866 |
34 | 10643 |
36 | 10118 |
40 | 10249 |
41 | 10815 |
47 | 10492 |
CTE with DELETE#
Suppose we want to delete records for all districts with unemployment rate in ‘96 greater than 4.
The query below selects only those entries for which unemployment_rate_96
is at least 4.
Check data before the operation:
%%sql
SELECT *
FROM s1.district
WHERE unemployment_rate_96 > 4;
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Kladno | central Bohemia | 149893 | 63 | 29 | 6 | 2 | 6 | 67.4 | 10728 | 4.64 | 5.05 | 109 | 5244 | 5892 |
5 | Kolin | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.85 | 4.43 | 118 | 2616 | 3040 |
6 | Kutna Hora | central Bohemia | 77963 | 60 | 23 | 4 | 2 | 4 | 51.5 | 8546 | 2.95 | 4.02 | 126 | 2640 | 3120 |
12 | Pribram | central Bohemia | 107870 | 84 | 29 | 6 | 1 | 6 | 58.0 | 8754 | 3.83 | 4.31 | 137 | 3804 | 3868 |
31 | Tachov | west Bohemia | 51313 | 34 | 12 | 3 | 1 | 4 | 59.2 | 8930 | 3.52 | 4.2 | 108 | 1328 | 1452 |
33 | Decin | north Bohemia | 133777 | 24 | 17 | 7 | 3 | 11 | 84.7 | 8705 | 5.75 | 7.61 | 116 | 4650 | 4859 |
34 | Chomutov | north Bohemia | 125236 | 28 | 11 | 1 | 4 | 5 | 87.7 | 10643 | 6.43 | 7.68 | 100 | 5323 | 5190 |
36 | Liberec | north Bohemia | 159617 | 29 | 19 | 8 | 1 | 9 | 85.2 | 10118 | 3.33 | 4.28 | 131 | 5796 | 6132 |
37 | Litomerice | north Bohemia | 114006 | 71 | 26 | 6 | 2 | 9 | 62.3 | 9065 | 4.46 | 5.39 | 123 | 4147 | 4166 |
38 | Louny | north Bohemia | 85852 | 41 | 23 | 4 | 2 | 4 | 59.8 | 8965 | 7.08 | 8.23 | 104 | 2653 | 2822 |
Perform the update.
%%sql
WITH High_Unemployment_Districts AS (
SELECT district_id
FROM s1.district
WHERE unemployment_rate_96 > 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);
Count |
---|
31 |
We create a CTE called High_Unemployment_Districts
that selects only those district_id
whose unemployment rate in ‘96 exceeds 4.
We then use the DELETE
operation to remove all entries from the s1.district
for only those district_id
in the CTE High_Unemployment_Districts
.
Check data after the operation:
%%sql
SELECT *
FROM s1.district
WHERE unemployment_rate_96 > 4;
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|
In both these examples, the operation will only affect the rows that match the conditions specified in the CTEs. The UPDATE
operation will increase the average_salary
of high crime districts by 10%, and the DELETE
operation will remove all districts with high unemployment rate.
Exercise 1 (Easy)#
What is a Common Table Expression (CTE), and what is its primary use in SQL queries?
Answers
A Common Table Expression (CTE) is a temporary result set that you can reference within another SELECT, INSERT, UPDATE, or DELETE statement. The main use of CTEs is to simplify complex SQL queries, particularly those involving multiple levels of subqueries. They make your SQL code more readable and maintainable.
Exercise 2 (Medium)#
Write a SQL query using a CTE that returns the total number of inhabitants for each region in the provided dataset. Save the CTE into a variable called region_inhabitants
Answers
%%sql --save region_inhabitants
WITH Region_Inhabitants AS (
SELECT region, SUM(no_of_inhabitants) AS TotalInhabitants
FROM s1.district
GROUP BY region
)
SELECT *
FROM Region_Inhabitants;
region | TotalInhabitants |
---|---|
Prague | 1204953 |
central Bohemia | 673892 |
south Bohemia | 700595 |
west Bohemia | 807993 |
north Bohemia | 193826 |
east Bohemia | 1132172 |
south Moravia | 1213345 |
north Moravia | 182027 |
This CTE, Region_Inhabitants
, groups the dataset by region and calculates the total number of inhabitants for each region using the SUM()
function. The final SELECT statement retrieves all records from the CTE.
Exercise 3 (hard)#
Suppose we want to increase the average_salary by 10% for districts that had more than 5000 committed crimes in ‘96, and then delete districts with an unemployment rate in ‘96 less than 4. Write a SQL query using CTEs to accomplish this, and explain the importance of checking the data before and after these operations.
Answers
First, let’s check the data before the operation:
%%sql
SELECT *
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 13795 | 0.29 | 0.43 | 167 | 85677 | 99107 |
2 | Benesov | central Bohemia | 88884 | 80 | 26 | 6 | 2 | 5 | 46.7 | 8507 | 1.67 | 1.85 | 132 | 2159 | 2674 |
3 | Beroun | central Bohemia | 75232 | 55 | 26 | 4 | 1 | 5 | 41.7 | 8980 | 1.95 | 2.21 | 111 | 2824 | 2813 |
7 | Melnik | central Bohemia | 94725 | 38 | 28 | 1 | 3 | 6 | 63.4 | 9920 | 2.26 | 2.87 | 130 | 4289 | 4846 |
8 | Mlada Boleslav | central Bohemia | 112065 | 95 | 19 | 7 | 1 | 8 | 69.4 | 11277 | 1.25 | 1.44 | 127 | 5179 | 4987 |
9 | Nymburk | central Bohemia | 81344 | 61 | 23 | 4 | 2 | 6 | 55.3 | 8899 | 3.39 | 3.97 | 149 | 2987 | 2487 |
10 | Praha - vychod | central Bohemia | 92084 | 55 | 29 | 4 | 3 | 5 | 46.7 | 10124 | 0.56 | 0.54 | 141 | 3810 | 4316 |
11 | Praha - zapad | central Bohemia | 75637 | 35 | 36 | 9 | 0 | 7 | 36.5 | 9622 | 0.45 | 0.59 | 154 | 3475 | 3529 |
13 | Rakovnik | central Bohemia | 53921 | 61 | 22 | 1 | 1 | 2 | 41.3 | 8598 | 2.77 | 3.26 | 123 | 1597 | 1875 |
14 | Ceske Budejovice | south Bohemia | 177686 | 69 | 27 | 10 | 1 | 9 | 74.8 | 11050 | 1.42 | 1.71 | 135 | 6604 | 6295 |
Next, we perform the update and delete operations:
%%sql
WITH High_Crime_Districts AS (
SELECT district_id
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);
WITH High_Unemployment_Districts AS (
SELECT district_id
FROM s1.district
WHERE unemployment_rate_96 < 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);
Count |
---|
46 |
Finally, let’s check the data after the operation:
%%sql
SELECT *
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|
The importance of checking the data before and after the operations is to verify that the operations were successful and only affected the intended data. It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. Checking the data before and after helps to prevent or identify potential mistakes or unexpected results in the data modification process.
Delete tables
%%sql
DROP TABLE IF EXISTS s1.district;
DROP TABLE IF EXISTS bank;
DROP SCHEMA s1;
Success |
---|
Summary#
In this tutorial we learned:
Examples of simple and multiple CTEs, as well as how to join multiple CTEs.
An introduction to Recursive CTEs with an example of generating a sequence of numbers.
A demonstration of how to use CTEs in conjunction with UPDATE and DELETE commands, as well as the importance of checking data before and after these operations.
A caution against performing
UPDATE
orDELETE
operations without a unique identifier or precise condition.
In the next chapter, we will show how you can visualize the results of your SQL queries.