Using subqueries#

A subquery is a technique that essentially embeds a query within another query. It enables users to run a query that has some kind of relation to another query in the same SQL statement.

Let’s demonstrate how powerful subqueries are by continuing with our banking data.

Install - execute this once.#

Important

Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql pandas --quiet

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Creating Tables#

Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count

Queries#

Let’s first load each table for reference and to better understand their contents.

%%sql
SELECT *
FROM s1.account
LIMIT 5
account_id district_id frequency date
576 55 POPLATEK MESICNE 930101
3818 74 POPLATEK MESICNE 930101
704 55 POPLATEK MESICNE 930101
2378 16 POPLATEK MESICNE 930101
2632 24 POPLATEK MESICNE 930102
%%sql
SELECT *
FROM s1.district
LIMIT 5
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
1 Hl.m. Praha Prague 1204953 0 0 0 1 1 100.0 12541 0.29 0.43 167 85677 99107
2 Benesov central Bohemia 88884 80 26 6 2 5 46.7 8507 1.67 1.85 132 2159 2674
3 Beroun central Bohemia 75232 55 26 4 1 5 41.7 8980 1.95 2.21 111 2824 2813
4 Kladno central Bohemia 149893 63 29 6 2 6 67.4 9753 4.64 5.05 109 5244 5892
5 Kolin central Bohemia 95616 65 30 4 1 6 51.4 9307 3.85 4.43 118 2616 3040
%%sql
SELECT *
FROM s1.loan
LIMIT 5
loan_id account_id date amount duration payments status
5314 1787 930705 96396 12 8033.0 B
5316 1801 930711 165960 36 4610.0 A
6863 9188 930728 127080 60 2118.0 A
5325 1843 930803 105804 36 2939.0 A
7240 11013 930906 274740 60 4579.0 A

Subquery in SELECT#

The query belows results in the average loan amount for each account_id by having a subquery in the SELECT clause.

%%sql
SELECT s1.account.account_id, s1.account.date, 
      (SELECT AVG(s1.loan.amount) 
      FROM s1.loan 
      WHERE s1.loan.account_id = s1.account.account_id
      ) AS average_loan_amount
FROM s1.account 
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
LIMIT 5
account_id date average_loan_amount
3354 931105 4980.0
6061 970308 5148.0
8330 930328 7656.0
8268 960919 8616.0
6453 960727 10944.0

The above SQL statement is quite verbose. Let’s instead utilize aliases to make the query more readable.

%%sql
SELECT a.account_id,
      (SELECT AVG(l.amount) 
      FROM s1.loan AS l 
      WHERE l.account_id = a.account_id
      ) AS average_loan_amount
FROM s1.account AS a
WHERE average_loan_amount IS NOT NULL
ORDER BY average_loan_amount
account_id average_loan_amount
3354 4980.0
6061 5148.0
8330 7656.0
8268 8616.0
6453 10944.0
5033 11400.0
6040 11736.0
666 12540.0
7559 12792.0
3637 14028.0
Truncated to displaylimit of 10.

Let’s first focus on the “outer query” of this statement. The outer query is everything in the SQL statement besides the second argument in the SELECT clause. This outer query results in just the account_id from the s1.account table. The “inner query” is the second argument in the SELECT statement. This query is called for each row from the outer query, which finds the average loan amount for each account by relation in the inner query’s WHERE clause.

In these datasets, each account actually only has one loan. We solely specify the AVG() function for demonstration purposes. We also specify IS NOT NULL because only 682 accounts actually have loans. The remaining accounts would have a NULL value.

Subquery in FROM#

In the query below, we find the average loan amount by district name while utilizing a subquery in a FROM clause.

%%sql
SELECT d.district_name, 
       ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM 
    (SELECT l.loan_id, 
            l.amount, 
            a.district_id 
    FROM s1.loan AS l 
    INNER JOIN s1.account AS a 
      ON l.account_id = a.account_id
    ) AS inner_query
INNER JOIN s1.district AS d
  ON inner_query.district_id = d.district_id
GROUP BY d.district_name
district_name average_loan_amount
Chrudim 178960.5
Cesky Krumlov 83475.43
Brno - venkov 123223.2
Brno - mesto 168725.0
Frydek - Mistek 128359.5
Pribram 162399.0
Hl.m. Praha 153957.29
Karvina 127492.5
Ostrava - mesto 163011.0
Jicin 121623.43
Truncated to displaylimit of 10.

This query demonstrates how to have a subquery act as another table when performing an INNER JOIN in the outer query. There is also an INNER JOIN in the inner query that creates a joined table between the s1.account and s1.loan tables. This subquery then gives us the necessary information to join with our s1.district table.

The query shown above can be easily implemented using JupySQL’s --save feature. This feature essentially allows us to save a query and use it for future use. We implement the --save feature below by recreating the previous query.

We first --save the subquery as loan_amount_district_id.

%%sql --save loan_amount_district_id
SELECT l.loan_id, 
        l.amount, 
        a.district_id 
FROM s1.loan AS l 
INNER JOIN s1.account AS a 
    ON l.account_id = a.account_id
loan_id amount district_id
6077 79608 44
7284 52788 15
7121 21924 55
5754 23052 54
6895 41904 68
6202 65184 54
6316 76908 12
5325 105804 12
6888 39576 1
6647 208320 1
Truncated to displaylimit of 10.

Then call the saved subquery and use it to complete our overall query.

%%sql 
SELECT d.district_name, 
       ROUND(AVG(inner_query.amount),2) AS average_loan_amount
FROM (SELECT *
      FROM loan_amount_district_id) AS inner_query
INNER JOIN s1.district AS d 
    ON inner_query.district_id = d.district_id
GROUP BY d.district_name
Generating CTE with stored snippets: 'loan_amount_district_id'
district_name average_loan_amount
Chrudim 178960.5
Cesky Krumlov 83475.43
Brno - venkov 123223.2
Brno - mesto 168725.0
Frydek - Mistek 128359.5
Pribram 162399.0
Hl.m. Praha 153957.29
Karvina 127492.5
Ostrava - mesto 163011.0
Jicin 121623.43
Truncated to displaylimit of 10.

Subquery in WHERE#

This example calls a subquery from a statement’s WHERE clause.

%%sql
SELECT *
FROM s1.loan
WHERE amount > (
  SELECT AVG(amount) 
  FROM s1.loan 
  WHERE status = 'A'
)
loan_id account_id date amount duration payments status
5314 1787 930705 96396 12 8033.0 B
5316 1801 930711 165960 36 4610.0 A
6863 9188 930728 127080 60 2118.0 A
5325 1843 930803 105804 36 2939.0 A
7240 11013 930906 274740 60 4579.0 A
6111 5428 930924 174744 24 7281.0 B
7235 10973 931013 154416 48 3217.0 A
5997 4894 931104 117024 24 4876.0 A
6228 6034 931201 464520 60 7742.0 B
6356 6701 931208 95400 36 2650.0 A
Truncated to displaylimit of 10.

With this query, we are able to see all loans that are greater than the average loan amount for ‘A’ status loans.

Let’s again demonstrate how we can use --save to recreate this query.

%%sql --save avg_A_amount
SELECT AVG(amount) 
FROM s1.loan 
WHERE status = 'A'
avg(amount)
91641.45812807881
%%sql
SELECT *
FROM s1.loan
WHERE amount > (
  SELECT *
  FROM avg_A_amount) 
Generating CTE with stored snippets: 'avg_A_amount'
loan_id account_id date amount duration payments status
5314 1787 930705 96396 12 8033.0 B
5316 1801 930711 165960 36 4610.0 A
6863 9188 930728 127080 60 2118.0 A
5325 1843 930803 105804 36 2939.0 A
7240 11013 930906 274740 60 4579.0 A
6111 5428 930924 174744 24 7281.0 B
7235 10973 931013 154416 48 3217.0 A
5997 4894 931104 117024 24 4876.0 A
6228 6034 931201 464520 60 7742.0 B
6356 6701 931208 95400 36 2650.0 A
Truncated to displaylimit of 10.

Note that we do not have to assign an alias to our saved query because none of its variables are in the outer query’s SELECT clause.

Examining Our Snippets with JupySQL#

To examine all of our saved queries with --save, we can use %sqlcmd snippets.

%sqlcmd snippets
Stored snippets
loan_amount_district_id
avg_A_amount

Our saved queries (also known as snippets) can be deleted with the --delete feature. For more information on snippets, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-snippets.html)

Viewing Our Tables with JupySQL#

JupySQL also has features that allow for quick displays of tables and their columns. We can examine them with %sqlcmd tables and %sqlcmd columns.

%sqlcmd tables -s s1
Name
loan
district
account

The above code cell shows all tables under the s1 schema. Recall that we assigned a schema to these tables at the beginning of this section.

%sqlcmd columns allows us to examine each column in a table with added additional information.

%sqlcmd columns -s s1 -t account
name type nullable default autoincrement comment
account_id BIGINT True None False None
district_id BIGINT True None False None
frequency VARCHAR True None False None
date BIGINT True None False None

In the above code-cell, we specify the schema and the table under the schema to examine.

Note: for now, JupySQL does not support the use of these magic commands with queries saved with --save.

For more information on these magic commands, please visit JupySQL’s official documentation [here].(https://jupysql.ploomber.io/en/latest/api/magic-tables-columns.html)

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Query loans with an amount greater than or equal to the AVG() loan amount. Utilize the --save feature to help you solve this question.

Answers

First, create a snippet with --save containing the AVG() loan amount.

%%sql --save average_loan_amount
SELECT AVG(amount)
FROM s1.loan
avg(amount)
151410.1759530792

Then we reference average_loan_amount as a subquery to filter for loan id’s that are greater than or equal to this average.

%%sql
SELECT loan_id
FROM s1.loan
WHERE amount >= (SELECT * 
                FROM average_loan_amount)
Generating CTE with stored snippets: 'average_loan_amount'
loan_id
5316
7240
6111
7235
6228
7104
5170
7226
6087
7262
Truncated to displaylimit of 10.

Question 2 (Medium):#

Find all account_id’s that have a loan alongside with their loan status, their loan amount, and the average loan amount for their loan’s loan status. Order by account_id.

Answers

The difficult part of this question lies in the second INNER JOIN. This second INNER JOIN uses a subquery to have the average amount of each ‘status’. This information is used when calling the last INNER JOIN on the outer query’s ‘status’ variable.

%%sql
SELECT a.account_id, 
       l.status AS loan_status, 
       l.amount AS loan_amount, 
       ROUND(avg_status_loan.amount,2) AS average_loan_amount
FROM 
    s1.account AS a
INNER JOIN 
    s1.loan AS l ON a.account_id = l.account_id
INNER JOIN (SELECT status, 
                   AVG(amount) AS amount
            FROM s1.loan
            GROUP BY status
            ) AS avg_status_loan ON l.status = avg_status_loan.status
ORDER BY a.account_id
account_id loan_status loan_amount average_loan_amount
2 A 80952 91641.46
19 B 30276 140720.9
25 A 30276 91641.46
37 D 318480 249284.53
38 C 110736 171410.35
67 A 165960 91641.46
97 A 102876 91641.46
103 D 265320 249284.53
105 C 352704 171410.35
110 C 162576 171410.35
Truncated to displaylimit of 10.

Question 3 (Hard):#

Query the district_id and district_name that have the highest amount of loans for each loan status type.

To answer this question, let’s quickly learn about UNION ALL. UNION ALL is a clause that concatenates rows on top of each other. An example is shown below.

Using UNION ALL, we can achieve the correct results. Try to incorporate the --save feature to make your query easier to read.

Answers

We find the district with the highest loan status type by first joining the three necessary tables s1.account, s1.district, and s1.loan. These three tables are used in an inner query for each loan status type. Then, in each of these inner queries, we query the district id, district name, and the count of the first row after ordering by count. We finalize the query by “stacking” these results with UNION ALL.

%%sql
SELECT district_id, district_name, 'A' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'A'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as A_max
UNION ALL
SELECT district_id, district_name, 'B' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'B'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as B_max
UNION ALL
SELECT district_id, district_name, 'C' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'C'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as C_max
UNION ALL
SELECT district_id, district_name, 'D' as status, count
FROM
    (SELECT a.district_id, d.district_name, COUNT(*) as count
    FROM s1.account a
    JOIN s1.loan l ON a.account_id = l.account_id
    JOIN s1.district d ON a.district_id = d.district_id
    WHERE l.status = 'D'
    GROUP BY a.district_id, d.district_name
    ORDER BY count DESC
    LIMIT 1) as D_max
district_id district_name status count
1 Hl.m. Praha A 34
1 Hl.m. Praha D 4
1 Hl.m. Praha C 43
1 Hl.m. Praha B 3

We can make the above query easier to understand by using JupySQL’s --save feature.

%%sql --save A
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'A'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id district_name count
1 Hl.m. Praha 34
%%sql --save B
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'B'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id district_name count
1 Hl.m. Praha 3
%%sql --save C
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'C'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id district_name count
1 Hl.m. Praha 43
%%sql --save D
SELECT a.district_id, d.district_name, COUNT(*) as count
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
JOIN s1.district d ON a.district_id = d.district_id
WHERE l.status = 'D'
GROUP BY a.district_id, d.district_name
ORDER BY count DESC
LIMIT 1
district_id district_name count
1 Hl.m. Praha 4

We can then use these saved snippets and concatenate their results using UNION ALL.

%%sql
SELECT * 
FROM A 
UNION ALL 
SELECT *
FROM B
UNION ALL
SELECT *
FROM C
UNION ALL
SELECT *
FROM D
Generating CTE with stored snippets: 'A', 'B', 'C', and 'D'
district_id district_name count
1 Hl.m. Praha 34
1 Hl.m. Praha 4
1 Hl.m. Praha 43
1 Hl.m. Praha 3

Question 4 (Bonus):#

Output the COUNT() of of each unique ‘status’ variable under s1.loan that are greater than the average of ‘A’ type loans. Have the outputs be only five columns for each ‘status’ type with a single value each alongside with the total number of loans. You must use one or more subqueries.

Hint CASE WHEN is a clause that acts as a conditional statement when performing other SQL actions. Try to see how you can incorporate CASE WHEN with COUNT() to answer this question.

Answers

For each SELECT argument, we are finding the count of each status using CASE WHEN to have a “1” count when aggregating the ‘status’ values. This allows our query to have the correct counts for each value given that the loan amount is less than the average ‘A’ loan amount.

%%sql 
SELECT COUNT(CASE WHEN status = 'A' THEN 1 END) AS A,
       COUNT(CASE WHEN status = 'B' THEN 1 END) AS B,
       COUNT(CASE WHEN status = 'C' THEN 1 END) AS C,
       COUNT(CASE WHEN status = 'D' THEN 1 END) AS D,
       COUNT(*) AS Total
FROM s1.loan
WHERE amount > (
        SELECT AVG(amount)
        FROM s1.loan
        WHERE status = 'A'
    )
A B C D Total
78 17 273 39 407

Delete tables

%%sql
DROP TABLE IF EXISTS s1.loan;
DROP TABLE IF EXISTS s1.account;
DROP TABLE IF EXISTS s1.district;
DROP SCHEMA s1;
Success

Wrapping Up#

In this section, we introduced subqueries and how they can be implemented with common clauses. Subqueries are an incredibly powerful and intuitive technique that can serve as an alternative to joins, as well as enable users to utilize the results of one query within another query.

We also explored how we can make subqueries more readable using JupySQL’s --save feature. Alongside this, we introduced some additional JupySQL magic commands such as %sqlcmd tables and %sqlcmd columns to view current tables in our schema.

In the next section, you will learn how to implement more advanced join techniques to your queries.