Using subqueries

Using subqueries#

A subquery is a technique that essentially embeds a query within another query. It enables users to run a query that has some kind of relation to another query in the same SQL statement.

Let’s demonstrate how powerful subqueries are by continuing with our banking data.

Install - execute this once.#

Important

Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql pandas --quiet

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)

Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Creating Tables#

Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');

Count

Queries#

Let’s first load each table for reference and to better understand their contents.

%%sql
SELECT *
FROM s1.account
LIMIT 5

account_id	district_id	frequency	date
576	55	POPLATEK MESICNE	930101
3818	74	POPLATEK MESICNE	930101
704	55	POPLATEK MESICNE	930101
2378	16	POPLATEK MESICNE	930101
2632	24	POPLATEK MESICNE	930102

%%sql
SELECT *
FROM s1.district
LIMIT 5

district_id	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
1	Hl.m. Praha	Prague	1204953	0	0	0	1	1	100.0	12541	0.29	0.43	167	85677	99107
2	Benesov	central Bohemia	88884	80	26	6	2	5	46.7	8507	1.67	1.85	132	2159	2674
3	Beroun	central Bohemia	75232	55	26	4	1	5	41.7	8980	1.95	2.21	111	2824	2813
4	Kladno	central Bohemia	149893	63	29	6	2	6	67.4	9753	4.64	5.05	109	5244	5892
5	Kolin	central Bohemia	95616	65	30	4	1	6	51.4	9307	3.85	4.43	118	2616	3040

%%sql
SELECT *
FROM s1.loan
LIMIT 5

loan_id	account_id	date	amount	duration	payments	status
5314	1787	930705	96396	12	8033.0	B
5316	1801	930711	165960	36	4610.0	A
6863	9188	930728	127080	60	2118.0	A
5325	1843	930803	105804	36	2939.0	A
7240	11013	930906	274740	60	4579.0	A

Subquery in SELECT#

The query belows results in the average loan amount for each account_id by having a subquery in the SELECT clause.

%%sql
SELECT s1.account.account_id, s1.account.date, 
      (SELECT AVG(s1.loan.amount) 
      FROM s1.loan 
      WHERE s1.loan.account_id = s1.account.account_id
      ) AS average_loan_amount
FROM s1.account 
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
LIMIT 5

account_id	date	average_loan_amount
3354	931105	4980.0
6061	970308	5148.0
8330	930328	7656.0
8268	960919	8616.0
6453	960727	10944.0

The above SQL statement is quite verbose. Let’s instead utilize aliases to make the query more readable.

%%sql
SELECT a.account_id,
      (SELECT AVG(l.amount) 
      FROM s1.loan AS l 
      WHERE l.account_id = a.account_id
      ) AS average_loan_amount
FROM s1.account AS a
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount

account_id	average_loan_amount
3354	4980.0
6061	5148.0
8330	7656.0
8268	8616.0
6453	10944.0
5033	11400.0
6040	11736.0
666	12540.0
7559	12792.0
3637	14028.0

Truncated to displaylimit of 10.

Let’s first focus on the “outer query” of this statement. The outer query is everything in the SQL statement besides the second argument in the SELECT clause. This outer query results in just the account_id from the s1.account table. The “inner query” is the second argument in the SELECT statement. This query is called for each row from the outer query, which finds the average loan amount for each account by relation in the inner query’s WHERE clause.

In these datasets, each account actually only has one loan. We solely specify the AVG() function for demonstration purposes. We also specify IS NOT NULL because only 682 accounts actually have loans. The remaining accounts would have a NULL value.

Subquery in FROM#

In the query below, we find the average loan amount by district name while utilizing a subquery in a FROM clause.

%%sql
SELECT d.district_name, 
       ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM 
    (SELECT l.loan_id, 
            l.amount, 
            a.district_id 
    FROM s1.loan AS l 
    INNER JOIN s1.account AS a 
      ON l.account_id = a.account_id
    ) AS inner_query
INNER JOIN s1.district AS d
  ON inner_query.district_id = d.district_id
GROUP BY d.district_name

district_name	average_loan_amount
Chrudim	178960.5
Cesky Krumlov	83475.43
Brno - venkov	123223.2
Brno - mesto	168725.0
Frydek - Mistek	128359.5
Pribram	162399.0
Hl.m. Praha	153957.29
Karvina	127492.5
Ostrava - mesto	163011.0
Jicin	121623.43

Truncated to displaylimit of 10.

This query demonstrates how to have a subquery act as another table when performing an INNER JOIN in the outer query. There is also an INNER JOIN in the inner query that creates a joined table between the s1.account and s1.loan tables. This subquery then gives us the necessary information to join with our s1.district table.

The query shown above can be easily implemented using JupySQL’s --save feature. This feature essentially allows us to save a query and use it for future use. We implement the --save feature below by recreating the previous query.

We first --save the subquery as loan_amount_district_id.

%%sql --save loan_amount_district_id
SELECT l.loan_id, 
        l.amount, 
        a.district_id 
FROM s1.loan AS l 
INNER JOIN s1.account AS a 
    ON l.account_id = a.account_id

loan_id	amount	district_id
6077	79608	44
7284	52788	15
7121	21924	55
5754	23052	54
6895	41904	68
6202	65184	54
6316	76908	12
5325	105804	12
6888	39576	1
6647	208320	1

Truncated to displaylimit of 10.

Then call the saved subquery and use it to complete our overall query.

%%sql 
SELECT d.district_name, 
       ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM (SELECT *
      FROM loan_amount_district_id) AS inner_query
INNER JOIN s1.district AS d 
    ON inner_query.district_id = d.district_id
GROUP BY d.district_name

Generating CTE with stored snippets: 'loan_amount_district_id'

district_name	average_loan_amount
Chrudim	178960.5
Cesky Krumlov	83475.43
Brno - venkov	123223.2
Brno - mesto	168725.0
Frydek - Mistek	128359.5
Pribram	162399.0
Hl.m. Praha	153957.29
Karvina	127492.5
Ostrava - mesto	163011.0
Jicin	121623.43

Truncated to displaylimit of 10.

Subquery in WHERE#

This example calls a subquery from a statement’s WHERE clause.

%%sql
SELECT *
FROM s1.loan
WHERE amount > (
  SELECT AVG(amount) 
  FROM s1.loan 
  WHERE status = 'A'
)

loan_id	account_id	date	amount	duration	payments	status
5314	1787	930705	96396	12	8033.0	B
5316	1801	930711	165960	36	4610.0	A
6863	9188	930728	127080	60	2118.0	A
5325	1843	930803	105804	36	2939.0	A
7240	11013	930906	274740	60	4579.0	A
6111	5428	930924	174744	24	7281.0	B
7235	10973	931013	154416	48	3217.0	A
5997	4894	931104	117024	24	4876.0	A
6228	6034	931201	464520	60	7742.0	B
6356	6701	931208	95400	36	2650.0	A

Truncated to displaylimit of 10.

With this query, we are able to see all loans that are greater than the average loan amount for ‘A’ status loans.

Let’s again demonstrate how we can use --save to recreate this query.

%%sql --save avg_A_amount
SELECT AVG(amount) 
FROM s1.loan 
WHERE status = 'A'

avg(amount)
91641.45812807881

%%sql
SELECT *
FROM s1.loan
WHERE amount > (
  SELECT *
  FROM avg_A_amount) 

Generating CTE with stored snippets: 'avg_A_amount'

loan_id	account_id	date	amount	duration	payments	status
5314	1787	930705	96396	12	8033.0	B
5316	1801	930711	165960	36	4610.0	A
6863	9188	930728	127080	60	2118.0	A
5325	1843	930803	105804	36	2939.0	A
7240	11013	930906	274740	60	4579.0	A
6111	5428	930924	174744	24	7281.0	B
7235	10973	931013	154416	48	3217.0	A
5997	4894	931104	117024	24	4876.0	A
6228	6034	931201	464520	60	7742.0	B
6356	6701	931208	95400	36	2650.0	A

Truncated to displaylimit of 10.

Note that we do not have to assign an alias to our saved query because none of its variables are in the outer query’s SELECT clause.

Examining Our Snippets with JupySQL#

To examine all of our saved queries with --save, we can use %sqlcmd snippets.

%sqlcmd snippets

Stored snippets
loan_amount_district_id
avg_A_amount

Our saved queries (also known as snippets) can be deleted with the --delete feature. For more information on snippets, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-snippets.html)

Viewing Our Tables with JupySQL#

JupySQL also has features that allow for quick displays of tables and their columns. We can examine them with %sqlcmd tables and %sqlcmd columns.

%sqlcmd tables -s s1

Name
loan
district
account

The above code cell shows all tables under the s1 schema. Recall that we assigned a schema to these tables at the beginning of this section.

%sqlcmd columns allows us to examine each column in a table with added additional information.

%sqlcmd columns -s s1 -t account

name	type	nullable	default	autoincrement	comment
account_id	BIGINT	True	None	False	None
district_id	BIGINT	True	None	False	None
frequency	VARCHAR	True	None	False	None
date	BIGINT	True	None	False	None

In the above code-cell, we specify the schema and the table under the schema to examine.

Note: for now, JupySQL does not support the use of these magic commands with queries saved with --save.

For more information on these magic commands, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-tables-columns.html)

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Query loans with an amount greater than or equal to the AVG() loan amount. Utilize the --save feature to help you solve this question.

Answers

First, create a snippet with --save containing the AVG() loan amount.

%%sql --save average_loan_amount
SELECT AVG(amount)
FROM s1.loan

avg(amount)
151410.1759530792

Then we reference average_loan_amount as a subquery to filter for loan id’s that are greater than or equal to this average.

%%sql
SELECT loan_id
FROM s1.loan
WHERE amount >= (SELECT * 
                FROM average_loan_amount)

Generating CTE with stored snippets: 'average_loan_amount'

loan_id
5316
7240
6111
7235
6228
7104
5170
7226
6087
7262

Truncated to displaylimit of 10.

Question 2 (Medium):#

Find all account_id’s that have a loan alongside with their loan status, their loan amount, and the average loan amount for their loan’s loan status. Order by account_id.

Answers

The difficult part of this question lies in the second INNER JOIN. This second INNER JOIN uses a subquery to have the average amount of each ‘status’. This information is used when calling the last INNER JOIN on the outer query’s ‘status’ variable.

%%sql
SELECT a.account_id, 
       l.status AS loan_status, 
       l.amount AS loan_amount, 
       ROUND(avg_status_loan.amount,2) AS average_loan_amount
FROM 
    s1.account AS a
INNER JOIN 
    s1.loan AS l ON a.account_id = l.account_id
INNER JOIN (SELECT status, 
                   AVG(amount) AS amount
            FROM s1.loan
            GROUP BY status
            ) AS avg_status_loan ON l.status = avg_status_loan.status
ORDER BY a.account_id

account_id	loan_status	loan_amount	average_loan_amount
2	A	80952	91641.46
19	B	30276	140720.9
25	A	30276	91641.46
37	D	318480	249284.53
38	C	110736	171410.35
67	A	165960	91641.46
97	A	102876	91641.46
103	D	265320	249284.53
105	C	352704	171410.35
110	C	162576	171410.35

Truncated to displaylimit of 10.

Question 3 (Hard):#

Query the district_id and district_name that have the highest amount of loans for each loan status type.

To answer this question, let’s quickly learn about UNION ALL. UNION ALL is a clause that concatenates rows on top of each other. An example is shown below.

Using UNION ALL, we can achieve the correct results. Try to incorporate the --save feature to make your query easier to read.

Answers

We find the district with the highest loan status type by first joining the three necessary tables s1.account, s1.district, and s1.loan. These three tables are used in an inner query for each loan status type. Then, in each of these inner queries, we query the district id, district name, and the count of the first row after ordering by count. We finalize the query by “stacking” these results with UNION ALL.

%%sql
SELECT district_id, district_name, 'A' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'A'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as A_max
UNION ALL
SELECT district_id, district_name, 'B' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'B'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as B_max
UNION ALL
SELECT district_id, district_name, 'C' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'C'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as C_max
UNION ALL
SELECT district_id, district_name, 'D' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'D'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as D_max

district_id	district_name	status	count
1	Hl.m. Praha	A	34
1	Hl.m. Praha	D	4
1	Hl.m. Praha	C	43
1	Hl.m. Praha	B	3

We can make the above query easier to understand by using JupySQL’s --save feature.

%%sql --save A
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1

district_id	district_name	count
1	Hl.m. Praha	34

%%sql --save B
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1

district_id	district_name	count
1	Hl.m. Praha	3

%%sql --save C
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1

district_id	district_name	count
1	Hl.m. Praha	43

%%sql --save D
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1

district_id	district_name	count
1	Hl.m. Praha	4

We can then use these saved snippets and concatenate their results using UNION ALL.

%%sql
SELECT * 
FROM A 
UNION ALL 
SELECT *
FROM B
UNION ALL
SELECT *
FROM C
UNION ALL
SELECT *
FROM D

Generating CTE with stored snippets: 'A', 'B', 'C', and 'D'

district_id	district_name	count
1	Hl.m. Praha	34
1	Hl.m. Praha	4
1	Hl.m. Praha	43
1	Hl.m. Praha	3

Question 4 (Bonus):#

Output the COUNT() of of each unique ‘status’ variable under s1.loan that are greater than the average of ‘A’ type loans. Have the outputs be only five columns for each ‘status’ type with a single value each alongside with the total number of loans. You must use one or more subqueries.

Hint CASE WHEN is a clause that acts as a conditional statement when performing other SQL actions. Try to see how you can incorporate CASE WHEN with COUNT() to answer this question.

Answers

For each SELECT argument, we are finding the count of each status using CASE WHEN to have a “1” count when aggregating the ‘status’ values. This allows our query to have the correct counts for each value given that the loan amount is less than the average ‘A’ loan amount.

%%sql 
SELECT COUNT(CASE WHEN status = 'A' THEN 1 END) AS A,
       COUNT(CASE WHEN status = 'B' THEN 1 END) AS B,
       COUNT(CASE WHEN status = 'C' THEN 1 END) AS C,
       COUNT(CASE WHEN status = 'D' THEN 1 END) AS D,
       COUNT(*) AS Total
FROM s1.loan
WHERE amount > (
        SELECT AVG(amount)
        FROM s1.loan
        WHERE status = 'A'
    )

A	B	C	D	Total
78	17	273	39	407

Delete tables

%%sql
DROP TABLE IF EXISTS s1.loan;
DROP TABLE IF EXISTS s1.account;
DROP TABLE IF EXISTS s1.district;
DROP SCHEMA s1;

Success

Wrapping Up#

In this section, we introduced subqueries and how they can be implemented with common clauses. Subqueries are an incredibly powerful and intuitive technique that can serve as an alternative to joins, as well as enable users to utilize the results of one query within another query.

We also explored how we can make subqueries more readable using JupySQL’s --save feature. Alongside this, we introduced some additional JupySQL magic commands such as %sqlcmd tables and %sqlcmd columns to view current tables in our schema.

In the next section, you will learn how to implement more advanced join techniques to your queries.

Using subqueries

Contents

Using subqueries#

Install - execute this once.#

Load the data#

Load Engine#

Creating Tables#

Queries#

Subquery in SELECT#

Subquery in FROM#

Subquery in WHERE#

Examining Our Snippets with JupySQL#

Viewing Our Tables with JupySQL#

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Question 2 (Medium):#

Question 3 (Hard):#

Question 4 (Bonus):#

Wrapping Up#