Using subqueries#
A subquery is a technique that essentially embeds a query within another query. It enables users to run a query that has some kind of relation to another query in the same SQL statement.
Let’s demonstrate how powerful subqueries are by continuing with our banking data.
Install - execute this once.#
Important
Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas --quiet
Load the data#
Important
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'
If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Important
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
| Config | value |
|---|---|
| displaycon | False |
| feedback | True |
| autopandas | False |
| named_parameters | True |
Creating Tables#
Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
| Count |
|---|
Queries#
Let’s first load each table for reference and to better understand their contents.
%%sql
SELECT *
FROM s1.account
LIMIT 5
| account_id | district_id | frequency | date |
|---|---|---|---|
| 576 | 55 | POPLATEK MESICNE | 930101 |
| 3818 | 74 | POPLATEK MESICNE | 930101 |
| 704 | 55 | POPLATEK MESICNE | 930101 |
| 2378 | 16 | POPLATEK MESICNE | 930101 |
| 2632 | 24 | POPLATEK MESICNE | 930102 |
%%sql
SELECT *
FROM s1.district
LIMIT 5
| district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 12541 | 0.29 | 0.43 | 167 | 85677 | 99107 |
| 2 | Benesov | central Bohemia | 88884 | 80 | 26 | 6 | 2 | 5 | 46.7 | 8507 | 1.67 | 1.85 | 132 | 2159 | 2674 |
| 3 | Beroun | central Bohemia | 75232 | 55 | 26 | 4 | 1 | 5 | 41.7 | 8980 | 1.95 | 2.21 | 111 | 2824 | 2813 |
| 4 | Kladno | central Bohemia | 149893 | 63 | 29 | 6 | 2 | 6 | 67.4 | 9753 | 4.64 | 5.05 | 109 | 5244 | 5892 |
| 5 | Kolin | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.85 | 4.43 | 118 | 2616 | 3040 |
%%sql
SELECT *
FROM s1.loan
LIMIT 5
| loan_id | account_id | date | amount | duration | payments | status |
|---|---|---|---|---|---|---|
| 5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
| 5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
| 6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
| 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
| 7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
Subquery in SELECT#
The query belows results in the average loan amount for each account_id by having a subquery in the SELECT clause.
%%sql
SELECT s1.account.account_id, s1.account.date,
(SELECT AVG(s1.loan.amount)
FROM s1.loan
WHERE s1.loan.account_id = s1.account.account_id
) AS average_loan_amount
FROM s1.account
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
LIMIT 5
| account_id | date | average_loan_amount |
|---|---|---|
| 3354 | 931105 | 4980.0 |
| 6061 | 970308 | 5148.0 |
| 8330 | 930328 | 7656.0 |
| 8268 | 960919 | 8616.0 |
| 6453 | 960727 | 10944.0 |
The above SQL statement is quite verbose. Let’s instead utilize aliases to make the query more readable.
%%sql
SELECT a.account_id,
(SELECT AVG(l.amount)
FROM s1.loan AS l
WHERE l.account_id = a.account_id
) AS average_loan_amount
FROM s1.account AS a
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
| account_id | average_loan_amount |
|---|---|
| 3354 | 4980.0 |
| 6061 | 5148.0 |
| 8330 | 7656.0 |
| 8268 | 8616.0 |
| 6453 | 10944.0 |
| 5033 | 11400.0 |
| 6040 | 11736.0 |
| 666 | 12540.0 |
| 7559 | 12792.0 |
| 3637 | 14028.0 |
Let’s first focus on the “outer query” of this statement. The outer query is everything in the SQL statement besides the second argument in the SELECT clause. This outer query results in just the account_id from the s1.account table. The “inner query” is the second argument in the SELECT statement. This query is called for each row from the outer query, which finds the average loan amount for each account by relation in the inner query’s WHERE clause.
In these datasets, each account actually only has one loan. We solely specify the AVG() function for demonstration purposes. We also specify IS NOT NULL because only 682 accounts actually have loans. The remaining accounts would have a NULL value.
Subquery in FROM#
In the query below, we find the average loan amount by district name while utilizing a subquery in a FROM clause.
%%sql
SELECT d.district_name,
ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM
(SELECT l.loan_id,
l.amount,
a.district_id
FROM s1.loan AS l
INNER JOIN s1.account AS a
ON l.account_id = a.account_id
) AS inner_query
INNER JOIN s1.district AS d
ON inner_query.district_id = d.district_id
GROUP BY d.district_name
| district_name | average_loan_amount |
|---|---|
| Chrudim | 178960.5 |
| Cesky Krumlov | 83475.43 |
| Brno - venkov | 123223.2 |
| Brno - mesto | 168725.0 |
| Frydek - Mistek | 128359.5 |
| Pribram | 162399.0 |
| Hl.m. Praha | 153957.29 |
| Karvina | 127492.5 |
| Ostrava - mesto | 163011.0 |
| Jicin | 121623.43 |
This query demonstrates how to have a subquery act as another table when performing an INNER JOIN in the outer query. There is also an INNER JOIN in the inner query that creates a joined table between the s1.account and s1.loan tables. This subquery then gives us the necessary information to join with our s1.district table.
The query shown above can be easily implemented using JupySQL’s --save feature. This feature essentially allows us to save a query and use it for future use. We implement the --save feature below by recreating the previous query.
We first --save the subquery as loan_amount_district_id.
%%sql --save loan_amount_district_id
SELECT l.loan_id,
l.amount,
a.district_id
FROM s1.loan AS l
INNER JOIN s1.account AS a
ON l.account_id = a.account_id
| loan_id | amount | district_id |
|---|---|---|
| 6077 | 79608 | 44 |
| 7284 | 52788 | 15 |
| 7121 | 21924 | 55 |
| 5754 | 23052 | 54 |
| 6895 | 41904 | 68 |
| 6202 | 65184 | 54 |
| 6316 | 76908 | 12 |
| 5325 | 105804 | 12 |
| 6888 | 39576 | 1 |
| 6647 | 208320 | 1 |
Then call the saved subquery and use it to complete our overall query.
%%sql
SELECT d.district_name,
ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM (SELECT *
FROM loan_amount_district_id) AS inner_query
INNER JOIN s1.district AS d
ON inner_query.district_id = d.district_id
GROUP BY d.district_name
| district_name | average_loan_amount |
|---|---|
| Chrudim | 178960.5 |
| Cesky Krumlov | 83475.43 |
| Brno - venkov | 123223.2 |
| Brno - mesto | 168725.0 |
| Frydek - Mistek | 128359.5 |
| Pribram | 162399.0 |
| Hl.m. Praha | 153957.29 |
| Karvina | 127492.5 |
| Ostrava - mesto | 163011.0 |
| Jicin | 121623.43 |
Subquery in WHERE#
This example calls a subquery from a statement’s WHERE clause.
%%sql
SELECT *
FROM s1.loan
WHERE amount > (
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
)
| loan_id | account_id | date | amount | duration | payments | status |
|---|---|---|---|---|---|---|
| 5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
| 5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
| 6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
| 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
| 7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
| 6111 | 5428 | 930924 | 174744 | 24 | 7281.0 | B |
| 7235 | 10973 | 931013 | 154416 | 48 | 3217.0 | A |
| 5997 | 4894 | 931104 | 117024 | 24 | 4876.0 | A |
| 6228 | 6034 | 931201 | 464520 | 60 | 7742.0 | B |
| 6356 | 6701 | 931208 | 95400 | 36 | 2650.0 | A |
With this query, we are able to see all loans that are greater than the average loan amount for ‘A’ status loans.
Let’s again demonstrate how we can use --save to recreate this query.
%%sql --save avg_A_amount
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
| avg(amount) |
|---|
| 91641.45812807881 |
%%sql
SELECT *
FROM s1.loan
WHERE amount > (
SELECT *
FROM avg_A_amount)
| loan_id | account_id | date | amount | duration | payments | status |
|---|---|---|---|---|---|---|
| 5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
| 5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
| 6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
| 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
| 7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
| 6111 | 5428 | 930924 | 174744 | 24 | 7281.0 | B |
| 7235 | 10973 | 931013 | 154416 | 48 | 3217.0 | A |
| 5997 | 4894 | 931104 | 117024 | 24 | 4876.0 | A |
| 6228 | 6034 | 931201 | 464520 | 60 | 7742.0 | B |
| 6356 | 6701 | 931208 | 95400 | 36 | 2650.0 | A |
Note that we do not have to assign an alias to our saved query because none of its variables are in the outer query’s SELECT clause.
Examining Our Snippets with JupySQL#
To examine all of our saved queries with --save, we can use %sqlcmd snippets.
%sqlcmd snippets
| Stored snippets |
|---|
| loan_amount_district_id |
| avg_A_amount |
Our saved queries (also known as snippets) can be deleted with the --delete feature.
For more information on snippets, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-snippets.html)
Viewing Our Tables with JupySQL#
JupySQL also has features that allow for quick displays of tables and their columns. We can examine them with %sqlcmd tables and %sqlcmd columns.
%sqlcmd tables -s s1
| Name |
|---|
| loan |
| district |
| account |
The above code cell shows all tables under the s1 schema. Recall that we assigned a schema to these tables at the beginning of this section.
%sqlcmd columns allows us to examine each column in a table with added additional information.
%sqlcmd columns -s s1 -t account
| name | type | nullable | default | autoincrement | comment |
|---|---|---|---|---|---|
| account_id | BIGINT | True | None | False | None |
| district_id | BIGINT | True | None | False | None |
| frequency | VARCHAR | True | None | False | None |
| date | BIGINT | True | None | False | None |
In the above code-cell, we specify the schema and the table under the schema to examine.
Note: for now, JupySQL does not support the use of these magic commands with queries saved with --save.
For more information on these magic commands, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-tables-columns.html)
You try: Use JupySQL to perform the queries and answer the questions.#
Question 1 (Easy):#
Query loans with an amount greater than or equal to the AVG() loan amount. Utilize the --save feature to help you solve this question.
Answers
First, create a snippet with --save containing the AVG() loan amount.
%%sql --save average_loan_amount
SELECT AVG(amount)
FROM s1.loan
| avg(amount) |
|---|
| 151410.1759530792 |
Then we reference average_loan_amount as a subquery to filter for loan id’s that are greater than or equal to this average.
%%sql
SELECT loan_id
FROM s1.loan
WHERE amount >= (SELECT *
FROM average_loan_amount)
| loan_id |
|---|
| 5316 |
| 7240 |
| 6111 |
| 7235 |
| 6228 |
| 7104 |
| 5170 |
| 7226 |
| 6087 |
| 7262 |
Question 2 (Medium):#
Find all account_id’s that have a loan alongside with their loan status, their loan amount, and the average loan amount for their loan’s loan status. Order by account_id.
Answers
The difficult part of this question lies in the second INNER JOIN. This second INNER JOIN uses a subquery to have the average amount of each ‘status’. This information is used when calling the last INNER JOIN on the outer query’s ‘status’ variable.
%%sql
SELECT a.account_id,
l.status AS loan_status,
l.amount AS loan_amount,
ROUND(avg_status_loan.amount,2) AS average_loan_amount
FROM
s1.account AS a
INNER JOIN
s1.loan AS l ON a.account_id = l.account_id
INNER JOIN (SELECT status,
AVG(amount) AS amount
FROM s1.loan
GROUP BY status
) AS avg_status_loan ON l.status = avg_status_loan.status
ORDER BY a.account_id
| account_id | loan_status | loan_amount | average_loan_amount |
|---|---|---|---|
| 2 | A | 80952 | 91641.46 |
| 19 | B | 30276 | 140720.9 |
| 25 | A | 30276 | 91641.46 |
| 37 | D | 318480 | 249284.53 |
| 38 | C | 110736 | 171410.35 |
| 67 | A | 165960 | 91641.46 |
| 97 | A | 102876 | 91641.46 |
| 103 | D | 265320 | 249284.53 |
| 105 | C | 352704 | 171410.35 |
| 110 | C | 162576 | 171410.35 |
Question 3 (Hard):#
Query the district_id and district_name that have the highest amount of loans for each loan status type.
To answer this question, let’s quickly learn about UNION ALL. UNION ALL is a clause that concatenates rows on top of each other. An example is shown below.
Using UNION ALL, we can achieve the correct results. Try to incorporate the --save feature to make your query easier to read.
Answers
We find the district with the highest loan status type by first joining the three necessary tables s1.account, s1.district, and s1.loan. These three tables are used in an inner query for each loan status type. Then, in each of these inner queries, we query the district id, district name, and the count of the first row after ordering by count. We finalize the query by “stacking” these results with UNION ALL.
%%sql
SELECT district_id, district_name, 'A' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as A_max
UNION ALL
SELECT district_id, district_name, 'B' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as B_max
UNION ALL
SELECT district_id, district_name, 'C' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as C_max
UNION ALL
SELECT district_id, district_name, 'D' as status, count
FROM
(SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1) as D_max
| district_id | district_name | status | count |
|---|---|---|---|
| 1 | Hl.m. Praha | A | 34 |
| 1 | Hl.m. Praha | D | 4 |
| 1 | Hl.m. Praha | C | 43 |
| 1 | Hl.m. Praha | B | 3 |
We can make the above query easier to understand by using JupySQL’s --save feature.
%%sql --save A
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
| district_id | district_name | count |
|---|---|---|
| 1 | Hl.m. Praha | 34 |
%%sql --save B
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
| district_id | district_name | count |
|---|---|---|
| 1 | Hl.m. Praha | 3 |
%%sql --save C
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
| district_id | district_name | count |
|---|---|---|
| 1 | Hl.m. Praha | 43 |
%%sql --save D
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
| district_id | district_name | count |
|---|---|---|
| 1 | Hl.m. Praha | 4 |
We can then use these saved snippets and concatenate their results using UNION ALL.
%%sql
SELECT *
FROM A
UNION ALL
SELECT *
FROM B
UNION ALL
SELECT *
FROM C
UNION ALL
SELECT *
FROM D
| district_id | district_name | count |
|---|---|---|
| 1 | Hl.m. Praha | 34 |
| 1 | Hl.m. Praha | 4 |
| 1 | Hl.m. Praha | 43 |
| 1 | Hl.m. Praha | 3 |
Question 4 (Bonus):#
Output the COUNT() of of each unique ‘status’ variable under s1.loan that are greater than the average of ‘A’ type loans. Have the outputs be only five columns for each ‘status’ type with a single value each alongside with the total number of loans. You must use one or more subqueries.
Hint CASE WHEN is a clause that acts as a conditional statement when performing other SQL actions. Try to see how you can incorporate CASE WHEN with COUNT() to answer this question.
Answers
For each SELECT argument, we are finding the count of each status using CASE WHEN to have a “1” count when aggregating the ‘status’ values. This allows our query to have the correct counts for each value given that the loan amount is less than the average ‘A’ loan amount.
%%sql
SELECT COUNT(CASE WHEN status = 'A' THEN 1 END) AS A,
COUNT(CASE WHEN status = 'B' THEN 1 END) AS B,
COUNT(CASE WHEN status = 'C' THEN 1 END) AS C,
COUNT(CASE WHEN status = 'D' THEN 1 END) AS D,
COUNT(*) AS Total
FROM s1.loan
WHERE amount > (
SELECT AVG(amount)
FROM s1.loan
WHERE status = 'A'
)
| A | B | C | D | Total |
|---|---|---|---|---|
| 78 | 17 | 273 | 39 | 407 |
Delete tables
%%sql
DROP TABLE IF EXISTS s1.loan;
DROP TABLE IF EXISTS s1.account;
DROP TABLE IF EXISTS s1.district;
DROP SCHEMA s1;
| Success |
|---|
Wrapping Up#
In this section, we introduced subqueries and how they can be implemented with common clauses. Subqueries are an incredibly powerful and intuitive technique that can serve as an alternative to joins, as well as enable users to utilize the results of one query within another query.
We also explored how we can make subqueries more readable using JupySQL’s --save feature. Alongside this, we introduced some additional JupySQL magic commands such as %sqlcmd tables and %sqlcmd columns to view current tables in our schema.
In the next section, you will learn how to implement more advanced join techniques to your queries.