Using subqueries#
A subquery is a technique that essentially embeds a query within another query. It enables users to run a query that has some kind of relation to another query in the same SQL statement.
Let’s demonstrate how powerful subqueries are by continuing with our banking data.
Install - execute this once.#
Important
Note: The --save
and %sqlcmd
features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas --quiet
Load the data#
Important
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'
If you ran the above cell, you should have a folder expanded_data
in your current directory that contains the .csv
files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv
, account.csv
, district.csv
.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Important
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Creating Tables#
Let’s start off with loading three of the eight .csv
files from the expanded_data
folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1
in which we will store the tables. Here we use the CREATE TABLE
syntax in DuckDB to ingest four of the eight .csv
files. The read_csv_auto
is a function that helps SQL understand our local .csv
file for creation into our database.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count |
---|
Queries#
Let’s first load each table for reference and to better understand their contents.
%%sql
SELECT *
FROM s1.account
LIMIT 5
account_id | district_id | frequency | date |
---|---|---|---|
576 | 55 | POPLATEK MESICNE | 930101 |
3818 | 74 | POPLATEK MESICNE | 930101 |
704 | 55 | POPLATEK MESICNE | 930101 |
2378 | 16 | POPLATEK MESICNE | 930101 |
2632 | 24 | POPLATEK MESICNE | 930102 |
%%sql
SELECT *
FROM s1.district
LIMIT 5
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 12541 | 0.29 | 0.43 | 167 | 85677 | 99107 |
2 | Benesov | central Bohemia | 88884 | 80 | 26 | 6 | 2 | 5 | 46.7 | 8507 | 1.67 | 1.85 | 132 | 2159 | 2674 |
3 | Beroun | central Bohemia | 75232 | 55 | 26 | 4 | 1 | 5 | 41.7 | 8980 | 1.95 | 2.21 | 111 | 2824 | 2813 |
4 | Kladno | central Bohemia | 149893 | 63 | 29 | 6 | 2 | 6 | 67.4 | 9753 | 4.64 | 5.05 | 109 | 5244 | 5892 |
5 | Kolin | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.85 | 4.43 | 118 | 2616 | 3040 |
%%sql
SELECT *
FROM s1.loan
LIMIT 5
loan_id | account_id | date | amount | duration | payments | status |
---|---|---|---|---|---|---|
5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
Subquery in SELECT#
The query belows results in the average loan amount for each account_id
by having a subquery in the SELECT
clause.
%%sql
SELECT s1.account.account_id, s1.account.date,
(SELECT AVG(s1.loan.amount)
FROM s1.loan
WHERE s1.loan.account_id = s1.account.account_id
) AS average_loan_amount
FROM s1.account
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
LIMIT 5
account_id | date | average_loan_amount |
---|---|---|
3354 | 931105 | 4980.0 |
6061 | 970308 | 5148.0 |
8330 | 930328 | 7656.0 |
8268 | 960919 | 8616.0 |
6453 | 960727 | 10944.0 |
The above SQL statement is quite verbose. Let’s instead utilize aliases to make the query more readable.
%%sql
SELECT a.account_id,
(SELECT AVG(l.amount)
FROM s1.loan AS l
WHERE l.account_id = a.account_id
) AS average_loan_amount
FROM s1.account AS a
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
account_id | average_loan_amount |
---|---|
3354 | 4980.0 |
6061 | 5148.0 |
8330 | 7656.0 |
8268 | 8616.0 |
6453 | 10944.0 |
5033 | 11400.0 |
6040 | 11736.0 |
666 | 12540.0 |
7559 | 12792.0 |
3637 | 14028.0 |
Let’s first focus on the “outer query” of this statement. The outer query is everything in the SQL statement besides the second argument in the SELECT
clause. This outer query results in just the account_id
from the s1.account
table. The “inner query” is the second argument in the SELECT
statement. This query is called for each row from the outer query, which finds the average loan amount for each account by relation in the inner query’s WHERE
clause.
In these datasets, each account actually only has one loan. We solely specify the AVG()
function for demonstration purposes. We also specify IS NOT NULL
because only 682 accounts actually have loans. The remaining accounts would have a NULL
value.
Subquery in FROM#
In the query below, we find the average loan amount by district name while utilizing a subquery in a FROM
clause.
%%sql
SELECT d.district_name,
ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM
(SELECT l.loan_id,
l.amount,
a.district_id
FROM s1.loan AS l
INNER JOIN s1.account AS a
ON l.account_id = a.account_id
) AS inner_query
INNER JOIN s1.district AS d
ON inner_query.district_id = d.district_id
GROUP BY d.district_name
district_name | average_loan_amount |
---|---|
Chrudim | 178960.5 |
Cesky Krumlov | 83475.43 |
Brno - venkov | 123223.2 |
Brno - mesto | 168725.0 |
Frydek - Mistek | 128359.5 |
Pribram | 162399.0 |
Hl.m. Praha | 153957.29 |
Karvina | 127492.5 |
Ostrava - mesto | 163011.0 |
Jicin | 121623.43 |
This query demonstrates how to have a subquery act as another table when performing an INNER JOIN
in the outer query. There is also an INNER JOIN
in the inner query that creates a joined table between the s1.account
and s1.loan
tables. This subquery then gives us the necessary information to join with our s1.district
table.
The query shown above can be easily implemented using JupySQL’s --save
feature. This feature essentially allows us to save a query and use it for future use. We implement the --save
feature below by recreating the previous query.
We first --save
the subquery as loan_amount_district_id
.
%%sql --save loan_amount_district_id
SELECT l.loan_id,
l.amount,
a.district_id
FROM s1.loan AS l
INNER JOIN s1.account AS a
ON l.account_id = a.account_id
loan_id | amount | district_id |
---|---|---|
6077 | 79608 | 44 |
7284 | 52788 | 15 |
7121 | 21924 | 55 |
5754 | 23052 | 54 |
6895 | 41904 | 68 |
6202 | 65184 | 54 |
6316 | 76908 | 12 |
5325 | 105804 | 12 |
6888 | 39576 | 1 |
6647 | 208320 | 1 |
Then call the saved subquery and use it to complete our overall query.
%%sql
SELECT d.district_name,
ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM (SELECT *
FROM loan_amount_district_id) AS inner_query
INNER JOIN s1.district AS d
ON inner_query.district_id = d.district_id
GROUP BY d.district_name
district_name | average_loan_amount |
---|---|
Chrudim | 178960.5 |
Cesky Krumlov | 83475.43 |
Brno - venkov | 123223.2 |
Brno - mesto | 168725.0 |
Frydek - Mistek | 128359.5 |
Pribram | 162399.0 |
Hl.m. Praha | 153957.29 |
Karvina | 127492.5 |
Ostrava - mesto | 163011.0 |
Jicin | 121623.43 |
Subquery in WHERE#
This example calls a subquery from a statement’s WHERE
clause.
%%sql
SELECT *
FROM s1.loan
WHERE amount > (
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
)
loan_id | account_id | date | amount | duration | payments | status |
---|---|---|---|---|---|---|
5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
6111 | 5428 | 930924 | 174744 | 24 | 7281.0 | B |
7235 | 10973 | 931013 | 154416 | 48 | 3217.0 | A |
5997 | 4894 | 931104 | 117024 | 24 | 4876.0 | A |
6228 | 6034 | 931201 | 464520 | 60 | 7742.0 | B |
6356 | 6701 | 931208 | 95400 | 36 | 2650.0 | A |
With this query, we are able to see all loans that are greater than the average loan amount for ‘A’ status loans.
Let’s again demonstrate how we can use --save
to recreate this query.
%%sql --save avg_A_amount
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
avg(amount) |
---|
91641.45812807881 |
%%sql
SELECT *
FROM s1.loan
WHERE amount > (
SELECT *
FROM avg_A_amount)
loan_id | account_id | date | amount | duration | payments | status |
---|---|---|---|---|---|---|
5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
6111 | 5428 | 930924 | 174744 | 24 | 7281.0 | B |
7235 | 10973 | 931013 | 154416 | 48 | 3217.0 | A |
5997 | 4894 | 931104 | 117024 | 24 | 4876.0 | A |
6228 | 6034 | 931201 | 464520 | 60 | 7742.0 | B |
6356 | 6701 | 931208 | 95400 | 36 | 2650.0 | A |
Note that we do not have to assign an alias to our saved query because none of its variables are in the outer query’s SELECT
clause.
Examining Our Snippets with JupySQL#
To examine all of our saved queries with --save
, we can use %sqlcmd snippets
.
%sqlcmd snippets
Stored snippets |
---|
loan_amount_district_id |
avg_A_amount |
Our saved queries (also known as snippets) can be deleted with the --delete
feature.
For more information on snippets, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-snippets.html)
Viewing Our Tables with JupySQL#
JupySQL also has features that allow for quick displays of tables and their columns. We can examine them with %sqlcmd tables
and %sqlcmd columns
.
%sqlcmd tables -s s1
Name |
---|
loan |
district |
account |
The above code cell shows all tables under the s1
schema. Recall that we assigned a schema to these tables at the beginning of this section.
%sqlcmd columns
allows us to examine each column in a table with added additional information.
%sqlcmd columns -s s1 -t account
name | type | nullable | default | autoincrement | comment |
---|---|---|---|---|---|
account_id | BIGINT | True | None | False | None |
district_id | BIGINT | True | None | False | None |
frequency | VARCHAR | True | None | False | None |
date | BIGINT | True | None | False | None |
In the above code-cell, we specify the schema and the table under the schema to examine.
Note: for now, JupySQL does not support the use of these magic commands with queries saved with --save
.
For more information on these magic commands, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-tables-columns.html)
You try: Use JupySQL to perform the queries and answer the questions.#
Question 1 (Easy):#
Query loans with an amount
greater than or equal to the AVG()
loan amount. Utilize the --save
feature to help you solve this question.
Answers
First, create a snippet with --save
containing the AVG()
loan amount.
%%sql --save average_loan_amount
SELECT AVG(amount)
FROM s1.loan
avg(amount) |
---|
151410.1759530792 |
Then we reference average_loan_amount
as a subquery to filter for loan id’s that are greater than or equal to this average.
%%sql
SELECT loan_id
FROM s1.loan
WHERE amount >= (SELECT *
FROM average_loan_amount)
loan_id |
---|
5316 |
7240 |
6111 |
7235 |
6228 |
7104 |
5170 |
7226 |
6087 |
7262 |
Question 2 (Medium):#
Find all account_id
’s that have a loan alongside with their loan status, their loan amount, and the average loan amount for their loan’s loan status. Order by account_id
.
Answers
The difficult part of this question lies in the second INNER JOIN
. This second INNER JOIN
uses a subquery to have the average amount of each ‘status’. This information is used when calling the last INNER JOIN
on the outer query’s ‘status’ variable.
%%sql
SELECT a.account_id,
l.status AS loan_status,
l.amount AS loan_amount,
ROUND(avg_status_loan.amount,2) AS average_loan_amount
FROM
s1.account AS a
INNER JOIN
s1.loan AS l ON a.account_id = l.account_id
INNER JOIN (SELECT status,
AVG(amount) AS amount
FROM s1.loan
GROUP BY status
) AS avg_status_loan ON l.status = avg_status_loan.status
ORDER BY a.account_id
account_id | loan_status | loan_amount | average_loan_amount |
---|---|---|---|
2 | A | 80952 | 91641.46 |
19 | B | 30276 | 140720.9 |
25 | A | 30276 | 91641.46 |
37 | D | 318480 | 249284.53 |
38 | C | 110736 | 171410.35 |
67 | A | 165960 | 91641.46 |
97 | A | 102876 | 91641.46 |
103 | D | 265320 | 249284.53 |
105 | C | 352704 | 171410.35 |
110 | C | 162576 | 171410.35 |
Question 3 (Hard):#
Query the district_id
and district_name
that have the highest amount of loans for each loan status type.
To answer this question, let’s quickly learn about UNION ALL
. UNION ALL
is a clause that concatenates rows on top of each other. An example is shown below.
Using UNION ALL
, we can achieve the correct results. Try to incorporate the --save
feature to make your query easier to read.
Answers
We find the district with the highest loan status type by first joining the three necessary tables s1.account
, s1.district
, and s1.loan
. These three tables are used in an inner query for each loan status type. Then, in each of these inner queries, we query the district id, district name, and the count of the first row after ordering by count. We finalize the query by “stacking” these results with UNION ALL
.
%%sql
SELECT district_id, district_name, 'A' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as A_max
UNION ALL
SELECT district_id, district_name, 'B' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as B_max
UNION ALL
SELECT district_id, district_name, 'C' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as C_max
UNION ALL
SELECT district_id, district_name, 'D' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as D_max
district_id | district_name | status | count |
---|---|---|---|
1 | Hl.m. Praha | A | 34 |
1 | Hl.m. Praha | D | 4 |
1 | Hl.m. Praha | C | 43 |
1 | Hl.m. Praha | B | 3 |
We can make the above query easier to understand by using JupySQL’s --save
feature.
%%sql --save A
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id | district_name | count |
---|---|---|
1 | Hl.m. Praha | 34 |
%%sql --save B
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id | district_name | count |
---|---|---|
1 | Hl.m. Praha | 3 |
%%sql --save C
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id | district_name | count |
---|---|---|
1 | Hl.m. Praha | 43 |
%%sql --save D
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id | district_name | count |
---|---|---|
1 | Hl.m. Praha | 4 |
We can then use these saved snippets and concatenate their results using UNION ALL
.
%%sql
SELECT *
FROM A
UNION ALL
SELECT *
FROM B
UNION ALL
SELECT *
FROM C
UNION ALL
SELECT *
FROM D
district_id | district_name | count |
---|---|---|
1 | Hl.m. Praha | 34 |
1 | Hl.m. Praha | 4 |
1 | Hl.m. Praha | 43 |
1 | Hl.m. Praha | 3 |
Question 4 (Bonus):#
Output the COUNT()
of of each unique ‘status’ variable under s1.loan
that are greater than the average of ‘A’ type loans. Have the outputs be only five columns for each ‘status’ type with a single value each alongside with the total number of loans. You must use one or more subqueries.
Hint CASE WHEN
is a clause that acts as a conditional statement when performing other SQL actions. Try to see how you can incorporate CASE WHEN
with COUNT()
to answer this question.
Answers
For each SELECT
argument, we are finding the count of each status using CASE WHEN
to have a “1” count when aggregating the ‘status’ values. This allows our query to have the correct counts for each value given that the loan amount is less than the average ‘A’ loan amount.
%%sql
SELECT COUNT(CASE WHEN status = 'A' THEN 1 END) AS A,
COUNT(CASE WHEN status = 'B' THEN 1 END) AS B,
COUNT(CASE WHEN status = 'C' THEN 1 END) AS C,
COUNT(CASE WHEN status = 'D' THEN 1 END) AS D,
COUNT(*) AS Total
FROM s1.loan
WHERE amount > (
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
)
A | B | C | D | Total |
---|---|---|---|---|
78 | 17 | 273 | 39 | 407 |
Delete tables
%%sql
DROP TABLE IF EXISTS s1.loan;
DROP TABLE IF EXISTS s1.account;
DROP TABLE IF EXISTS s1.district;
DROP SCHEMA s1;
Success |
---|
Wrapping Up#
In this section, we introduced subqueries and how they can be implemented with common clauses. Subqueries are an incredibly powerful and intuitive technique that can serve as an alternative to joins, as well as enable users to utilize the results of one query within another query.
We also explored how we can make subqueries more readable using JupySQL’s --save
feature. Alongside this, we introduced some additional JupySQL magic commands such as %sqlcmd tables
and %sqlcmd columns
to view current tables in our schema.
In the next section, you will learn how to implement more advanced join techniques to your queries.