Common Table Expressions (CTEs) in SQL

Common Table Expressions (CTEs) in SQL#

In this tutorial, you will learn about Common Table Expressions (CTEs) in SQL and how they can simplify your code. CTEs are temporary result sets that you can reference within other SELECT, INSERT, UPDATE, or DELETE statements. They are a powerful tool for constructing complex queries in a readable and user-friendly way. By the end of this tutorial, you will understand how to create and use CTEs in your own SQL queries.

Set up and data access#

Important

Note: The –save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below.

This code installs JupySQL, and DuckDB in your environment. We will be using these moving forward.

%pip install jupysql jupysql-plugin --quiet

We continue to work with the Bank and Marketing data set.

Important

Source: UCI Machine Learning Repository

URL: https://archive-beta.ics.uci.edu/dataset/222/bank+marketing

Data Citation

Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

We can use the following function to extract the downloaded data from the UCI repository.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

Initialize a DuckDB Instance

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Load the data

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')

Count

We confirm the table was loaded

%sqlcmd tables

Name
bank

We can use JupySQL’s Table Explorer to take a quick look at the table.

%sqlcmd explore --table bank

Simple CTE#

Let’s create a CTE that finds the average balance for each job type.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance;

job	AverageBalance
unemployed	1089.421875
services	1103.9568345323742
management	1766.9287925696594
blue-collar	1085.161733615222
self-employed	1392.4098360655737
technician	1330.99609375
entrepreneur	1645.125
admin.	1226.73640167364
student	1543.8214285714287
housemaid	2083.8035714285716

Truncated to displaylimit of 10.

In this example, the WITH keyword indicates the start of the CTE, which we call Job_Avg_Balance.

Within the Job_Avg_Balance CTE, we form a query that returns the average balance for each job by computing the average of balance and grouping by job.

After closing the WITH statement, we then select all columns from the Job_Avg_Balance CTE. This returns two columns: job and AverageBalance. The final SELECT statement then retrieves the data from the CTE.

With the JupySQL magics %sql, %%sql and the --save option, you can furthermore save your CTE for later use on a different code cell:

%%sql --save avg_balance_by_job
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance;

job	AverageBalance
unemployed	1089.421875
services	1103.9568345323742
management	1766.9287925696594
blue-collar	1085.161733615222
self-employed	1392.4098360655737
technician	1330.99609375
entrepreneur	1645.125
admin.	1226.73640167364
student	1543.8214285714287
housemaid	2083.8035714285716

Truncated to displaylimit of 10.

%%sql
SELECT * FROM avg_balance_by_job

Generating CTE with stored snippets: 'avg_balance_by_job'

job	AverageBalance
unemployed	1089.421875
services	1103.9568345323742
management	1766.9287925696594
blue-collar	1085.161733615222
self-employed	1392.4098360655737
technician	1330.99609375
entrepreneur	1645.125
admin.	1226.73640167364
student	1543.8214285714287
housemaid	2083.8035714285716

Truncated to displaylimit of 10.

Multiple CTEs#

You can use multiple CTEs in a single query. Let’s find the average balance per job type and average campaign per job type.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
),
Job_Avg_Campaign AS (
    SELECT job, AVG(campaign) AS AverageCampaign
    FROM bank
    GROUP BY job
)
SELECT * 
FROM Job_Avg_Balance, Job_Avg_Campaign 
WHERE Job_Avg_Balance.job = Job_Avg_Campaign.job;

job	AverageBalance	job_1	AverageCampaign
unemployed	1089.421875	unemployed	2.6796875
services	1103.9568345323742	services	2.8225419664268587
management	1766.9287925696594	management	2.9731682146542826
blue-collar	1085.161733615222	blue-collar	2.846723044397463
self-employed	1392.4098360655737	self-employed	3.278688524590164
technician	1330.99609375	technician	2.7317708333333335
entrepreneur	1645.125	entrepreneur	2.5892857142857144
admin.	1226.73640167364	admin.	2.6317991631799162
student	1543.8214285714287	student	2.392857142857143
housemaid	2083.8035714285716	housemaid	2.5

Truncated to displaylimit of 10.

In this example, the first CTE is the same as in the previous example. The second CTE Job_Avg_Campaign returns two columns: job_1 and AverageCampaign. The final SELECT statement retrieves data from both CTEs.

You will notice job and job_1 in the final result. This is because we are doing a Cartesian product (cross join) between two CTEs and both have a column named job. We can avoid this by explicitly specifying the columns you want to select in our final SELECT statement instead of using SELECT *.

%%sql
WITH Job_Avg_Balance AS (
    SELECT job, AVG(balance) AS AverageBalance
    FROM bank
    GROUP BY job
),
Job_Avg_Campaign AS (
    SELECT job, AVG(campaign) AS AverageCampaign
    FROM bank
    GROUP BY job
)
SELECT Job_Avg_Balance.job, AverageBalance, AverageCampaign
FROM Job_Avg_Balance
JOIN Job_Avg_Campaign 
ON Job_Avg_Balance.job = Job_Avg_Campaign.job;

job	AverageBalance	AverageCampaign
unemployed	1089.421875	2.6796875
services	1103.9568345323742	2.8225419664268587
management	1766.9287925696594	2.9731682146542826
blue-collar	1085.161733615222	2.846723044397463
self-employed	1392.4098360655737	3.278688524590164
technician	1330.99609375	2.7317708333333335
entrepreneur	1645.125	2.5892857142857144
admin.	1226.73640167364	2.6317991631799162
student	1543.8214285714287	2.392857142857143
housemaid	2083.8035714285716	2.5

Truncated to displaylimit of 10.

Recursive CTEs#

A recursive CTE is one in which an initial CTE is repeatedly executed to return subsets of data until the complete result set is obtained.

The given dataset doesn’t lend itself to a recursive CTE, as these are generally used for hierarchical or recursive data problems, which the bank dataset does not present.

Here is an example of a recursive CTE

%%sql
WITH RECURSIVE numbers AS (
    SELECT 1 AS value
    UNION ALL
    SELECT value + 1 FROM numbers WHERE value < 10
)
SELECT * FROM numbers;

value
1
2
3
4
5
6
7
8
9
10

Truncated to displaylimit of 10.

This will output a list of numbers from 1 to 10. The CTE works as follows:

The WITH RECURSIVE clause marks the start of the recursive CTE.
The SELECT 1 AS value is the “anchor member” of the CTE and provides the base result set for the recursion to start.
The UNION ALL clause is used to combine the results of the anchor member with the results of the “recursive member”, which is SELECT value + 1 FROM numbers WHERE value < 10.
The recursion continues until value < 10 returns false, at which point the CTE stops executing.

Using CTEs to modify information in a table#

Important

Without a unique identifier for each row in your table, performing UPDATE or DELETE operations using a CTE would be risky because they could affect more rows than you intend.

It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. We’re going to load data from the Joining Data in SQL tutorial as this data has tables with unique ID’s.

_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)

Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

Let’s work on a separate DuckDB instance

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Connecting and switching to connection duckdb:///bank_data.duck.db

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');

Count

Let’s take a look at the entries

%sqlcmd explore --table s1.district

Let’s say district_id is a unique identifier for each district.

CTE with UPDATE#

When working with UPDATE or DELETE operations, it is highly recommended to check your data before and after the operations. This helps to verify that the operations were successful and only affected the intended data.

Suppose we want to update the average_salary for districts that had more than 5000 committed crimes in 1996 to be increased by 10%.

Check data before the operation:

%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;

district_id	average_salary
1	12541
4	9753
14	10045
24	8991
26	10787
34	9675
36	9198
40	9317
41	9832
47	9538

Truncated to displaylimit of 10.

This is returning all entries that satisfy number of crimes committed in ‘96 exceeds 5000.

Perform the update:

%%sql
WITH High_Crime_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);

Count
15

In this example, we are writing a CTE that selects the district_id entries that satisfy the condition no_of_committed_crimes_96 > 5000. We can all this CTE High_Crime_Districts.

We then use High_Crime_Districts to change entries in average salary by increasing it by 10% (average_salary = average_salary * 1.10), and ensure this is done in only those entries in the district_id found in the CTE High_Crime_Districts.

Check data after the operation:

%%sql
SELECT district_id, average_salary
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000;

district_id	average_salary
1	13795
4	10728
14	11050
24	9890
26	11866
34	10643
36	10118
40	10249
41	10815
47	10492

Truncated to displaylimit of 10.

CTE with DELETE#

Suppose we want to delete records for all districts with unemployment rate in ‘96 greater than 4.

The query below selects only those entries for which unemployment_rate_96 is at least 4.

Check data before the operation:

%%sql
SELECT * 
FROM s1.district
WHERE unemployment_rate_96 > 4;

district_id	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
4	Kladno	central Bohemia	149893	63	29	6	2	6	67.4	10728	4.64	5.05	109	5244	5892
5	Kolin	central Bohemia	95616	65	30	4	1	6	51.4	9307	3.85	4.43	118	2616	3040
6	Kutna Hora	central Bohemia	77963	60	23	4	2	4	51.5	8546	2.95	4.02	126	2640	3120
12	Pribram	central Bohemia	107870	84	29	6	1	6	58.0	8754	3.83	4.31	137	3804	3868
31	Tachov	west Bohemia	51313	34	12	3	1	4	59.2	8930	3.52	4.2	108	1328	1452
33	Decin	north Bohemia	133777	24	17	7	3	11	84.7	8705	5.75	7.61	116	4650	4859
34	Chomutov	north Bohemia	125236	28	11	1	4	5	87.7	10643	6.43	7.68	100	5323	5190
36	Liberec	north Bohemia	159617	29	19	8	1	9	85.2	10118	3.33	4.28	131	5796	6132
37	Litomerice	north Bohemia	114006	71	26	6	2	9	62.3	9065	4.46	5.39	123	4147	4166
38	Louny	north Bohemia	85852	41	23	4	2	4	59.8	8965	7.08	8.23	104	2653	2822

Truncated to displaylimit of 10.

Perform the update.

%%sql
WITH High_Unemployment_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE unemployment_rate_96 > 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);

Count
31

We create a CTE called High_Unemployment_Districts that selects only those district_id whose unemployment rate in ‘96 exceeds 4.

We then use the DELETE operation to remove all entries from the s1.district for only those district_id in the CTE High_Unemployment_Districts.

Check data after the operation:

%%sql
SELECT * 
FROM s1.district
WHERE unemployment_rate_96 > 4;

district_id	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96

In both these examples, the operation will only affect the rows that match the conditions specified in the CTEs. The UPDATE operation will increase the average_salary of high crime districts by 10%, and the DELETE operation will remove all districts with high unemployment rate.

Exercise 1 (Easy)#

What is a Common Table Expression (CTE), and what is its primary use in SQL queries?

Answers

A Common Table Expression (CTE) is a temporary result set that you can reference within another SELECT, INSERT, UPDATE, or DELETE statement. The main use of CTEs is to simplify complex SQL queries, particularly those involving multiple levels of subqueries. They make your SQL code more readable and maintainable.

Exercise 2 (Medium)#

Write a SQL query using a CTE that returns the total number of inhabitants for each region in the provided dataset. Save the CTE into a variable called region_inhabitants

Answers

%%sql --save region_inhabitants
WITH Region_Inhabitants AS (
    SELECT region, SUM(no_of_inhabitants) AS TotalInhabitants
    FROM s1.district
    GROUP BY region
)
SELECT * 
FROM Region_Inhabitants;

region	TotalInhabitants
Prague	1204953
central Bohemia	673892
south Bohemia	700595
west Bohemia	807993
north Bohemia	193826
east Bohemia	1132172
south Moravia	1213345
north Moravia	182027

This CTE, Region_Inhabitants, groups the dataset by region and calculates the total number of inhabitants for each region using the SUM() function. The final SELECT statement retrieves all records from the CTE.

Exercise 3 (hard)#

Suppose we want to increase the average_salary by 10% for districts that had more than 5000 committed crimes in ‘96, and then delete districts with an unemployment rate in ‘96 less than 4. Write a SQL query using CTEs to accomplish this, and explain the importance of checking the data before and after these operations.

Answers

First, let’s check the data before the operation:

%%sql 
SELECT * 
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;

district_id	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
1	Hl.m. Praha	Prague	1204953	0	0	0	1	1	100.0	13795	0.29	0.43	167	85677	99107
2	Benesov	central Bohemia	88884	80	26	6	2	5	46.7	8507	1.67	1.85	132	2159	2674
3	Beroun	central Bohemia	75232	55	26	4	1	5	41.7	8980	1.95	2.21	111	2824	2813
7	Melnik	central Bohemia	94725	38	28	1	3	6	63.4	9920	2.26	2.87	130	4289	4846
8	Mlada Boleslav	central Bohemia	112065	95	19	7	1	8	69.4	11277	1.25	1.44	127	5179	4987
9	Nymburk	central Bohemia	81344	61	23	4	2	6	55.3	8899	3.39	3.97	149	2987	2487
10	Praha - vychod	central Bohemia	92084	55	29	4	3	5	46.7	10124	0.56	0.54	141	3810	4316
11	Praha - zapad	central Bohemia	75637	35	36	9	0	7	36.5	9622	0.45	0.59	154	3475	3529
13	Rakovnik	central Bohemia	53921	61	22	1	1	2	41.3	8598	2.77	3.26	123	1597	1875
14	Ceske Budejovice	south Bohemia	177686	69	27	10	1	9	74.8	11050	1.42	1.71	135	6604	6295

Truncated to displaylimit of 10.

Next, we perform the update and delete operations:

%%sql
WITH High_Crime_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE no_of_committed_crimes_96 > 5000
)
UPDATE s1.district
SET average_salary = average_salary * 1.10
WHERE district_id IN (SELECT district_id FROM High_Crime_Districts);

WITH High_Unemployment_Districts AS (
    SELECT district_id
    FROM s1.district
    WHERE unemployment_rate_96 < 4
)
DELETE FROM s1.district
WHERE district_id IN (SELECT district_id FROM High_Unemployment_Districts);

Count
46

Finally, let’s check the data after the operation:

%%sql
SELECT * 
FROM s1.district
WHERE no_of_committed_crimes_96 > 5000 OR unemployment_rate_96 < 4;

district_id	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96

The importance of checking the data before and after the operations is to verify that the operations were successful and only affected the intended data. It’s generally not recommended to use UPDATE or DELETE without a unique identifier or precise condition to pinpoint exactly which rows you want to affect. Checking the data before and after helps to prevent or identify potential mistakes or unexpected results in the data modification process.

Delete tables

%%sql
DROP TABLE IF EXISTS s1.district;
DROP TABLE IF EXISTS bank;
DROP SCHEMA s1;

Success

Summary#

In this tutorial we learned:

Examples of simple and multiple CTEs, as well as how to join multiple CTEs.
An introduction to Recursive CTEs with an example of generating a sequence of numbers.
A demonstration of how to use CTEs in conjunction with UPDATE and DELETE commands, as well as the importance of checking data before and after these operations.
A caution against performing UPDATE or DELETE operations without a unique identifier or precise condition.

In the next chapter, we will show how you can visualize the results of your SQL queries.

Common Table Expressions (CTEs) in SQL

Contents

Common Table Expressions (CTEs) in SQL#

Set up and data access#

Simple CTE#

Multiple CTEs#

Recursive CTEs#

Using CTEs to modify information in a table#

CTE with UPDATE#

CTE with DELETE#

Exercise 1 (Easy)#

Exercise 2 (Medium)#

Exercise 3 (hard)#

Summary#