Advanced join operations in SQL#

Install - execute this once.#

Important

Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql pandas --quiet

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402


_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Creating Tables#

Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count

The code above will create three tables in the database schema: s1.account, s1.district, s1.loan.

Exploring the data#

Let’s take a look at its entries.

%sqlcmd explore --table s1.account
%sqlcmd explore --table s1.district
%sqlcmd explore --table s1.loan

Nested-loop joins#

A nested loop join compares each row from the first table with each row from the second table to find all pairs of rows which satisfy the join predicate.

When to use#

This strategy is generally used when one of the tables in the join is significantly smaller than the other. The small table (or sometimes just a subset of it) can be kept in memory while the larger table is scanned, allowing for efficient access to the smaller table.

%%sql
SELECT DISTINCT a.account_id, a.district_id, a.frequency, a.date, l.loan_id, l.date as date_1, l.amount, l.duration, l.payments, l.status
FROM s1.account a, s1.loan l
WHERE a.account_id = l.account_id
LIMIT 5;
account_id district_id frequency date loan_id date_1 amount duration payments status
5270 44 POPLATEK MESICNE 930113 6077 931122 79608 24 3317.0 A
11265 15 POPLATEK MESICNE 930114 7284 930915 52788 12 4399.0 A
10364 55 POPLATEK MESICNE 930117 7121 931110 21924 36 609.0 A
3834 54 POPLATEK MESICNE 930119 5754 940928 23052 12 1921.0 A
9307 68 POPLATEK MESICNE 930124 6895 940919 41904 12 3492.0 A

Merge joins#

Merge join combines two sorted lists like a zipper based on the join predicates.

When to use#

This is a very efficient join strategy when the join columns of both tables are sorted, or when the database can efficiently sort them. If you know that your tables are sorted on the join column, this strategy is likely to be chosen.

%%sql
SELECT *
FROM s1.account a
INNER JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
576 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
3818 74 POPLATEK MESICNE 930101 74 Ostrava - mesto north Moravia 323870 0 0 0 1 1 100.0 10673 4.75 5.44 100 18782 18347
704 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
2378 16 POPLATEK MESICNE 930101 16 Jindrichuv Hradec south Bohemia 93931 74 21 10 1 8 56.9 8427 1.12 1.54 107 1874 1913
2632 24 POPLATEK MESICNE 930102 24 Karlovy Vary west Bohemia 122603 25 21 6 2 8 80.0 8991 1.39 2.01 128 5198 5273

Hash joins#

A hash join uses a hash table for finding matching rows. The table is partitioned based on the hash value of the join column(s).

When to use#

This strategy is typically used when the join columns are not sorted, and neither table is much smaller than the other. The database builds a hash table from one of the tables, then scans the other table and uses the hash table to find matching rows. If your tables are large and not sorted on the join columns, the database is likely to use this strategy.

Since DuckDB doesn’t support join hints, the corresponding SQL query example for join hints (SELECT /*+ HASH_JOIN(a, l) */ *…) it’s just for illustration purposes and doesn’t actually force a hash join in DuckDB.

%%sql
SELECT *
FROM s1.account a
JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
576 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
3818 74 POPLATEK MESICNE 930101 74 Ostrava - mesto north Moravia 323870 0 0 0 1 1 100.0 10673 4.75 5.44 100 18782 18347
704 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
2378 16 POPLATEK MESICNE 930101 16 Jindrichuv Hradec south Bohemia 93931 74 21 10 1 8 56.9 8427 1.12 1.54 107 1874 1913
2632 24 POPLATEK MESICNE 930102 24 Karlovy Vary west Bohemia 122603 25 21 6 2 8 80.0 8991 1.39 2.01 128 5198 5273

Internal joins#

An internal join combines rows from different tables if the join condition is true.

When to use#

In the context of DuckDB, we can’t explicitly choose an “internal join”. But in terms of a join operation that uses indexes to expedite the join process, this would be most applicable when you have indexed your join columns, and these indexes can be effectively used by the database engine to perform the join operation faster.

%%sql
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96 loan_id account_id_1 date_1 amount duration payments status
5270 44 POPLATEK MESICNE 930113 44 Chrudim east Bohemia 105606 77 26 7 2 7 53.0 8254 2.79 3.76 97 2166 2325 6077 5270 931122 79608 24 3317.0 A
11265 15 POPLATEK MESICNE 930114 15 Cesky Krumlov south Bohemia 58796 22 16 7 1 5 51.9 9045 3.13 3.6 124 1845 1879 7284 11265 930915 52788 12 4399.0 A
10364 55 POPLATEK MESICNE 930117 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894 7121 10364 931110 21924 36 609.0 A
3834 54 POPLATEK MESICNE 930119 54 Brno - mesto south Moravia 387570 0 0 0 1 1 100.0 9897 1.60 1.96 140 18721 18696 5754 3834 940928 23052 12 1921.0 A
9307 68 POPLATEK MESICNE 930124 68 Frydek - Mistek north Moravia 228848 15 40 18 2 6 57.2 9893 4.09 4.72 96 5623 5887 6895 9307 940919 41904 12 3492.0 A

Join hints#

Join hints are a way to influence the join strategy chosen by the SQL server. For instance, you can suggest using a loop, hash, or merge join.

When to use#

DuckDB does not support join hints. The optimizer in DuckDB chooses the join method based on the table statistics and query specifics. In most cases, letting the optimizer make this decision is the best choice.

%%sql
SELECT /*+ HASH_JOIN(a, l) */ *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;
account_id district_id frequency date loan_id account_id_1 date_1 amount duration payments status
5270 44 POPLATEK MESICNE 930113 6077 5270 931122 79608 24 3317.0 A
11265 15 POPLATEK MESICNE 930114 7284 11265 930915 52788 12 4399.0 A
10364 55 POPLATEK MESICNE 930117 7121 10364 931110 21924 36 609.0 A
3834 54 POPLATEK MESICNE 930119 5754 3834 940928 23052 12 1921.0 A
9307 68 POPLATEK MESICNE 930124 6895 9307 940919 41904 12 3492.0 A

Exercise 1#

Given an account id (e.g., 1787), retrieve all loan records for that account.

Answers

We can use a Nested-loop Join on s1.account as a and s1.loan as l where the account_id matches in each table, and where the account_id is 1787.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
WHERE a.account_id = 1787;
account_id district_id frequency date loan_id account_id_1 date_1 amount duration payments status
1787 30 POPLATEK TYDNE 930322 5314 1787 930705 96396 12 8033.0 B

Exercise 2#

Retrieve all account and district information for accounts with district_id between 10 and 20.

Answers

We can use a Merge Join on s1.account as a and s1.district as d where the district_id matches in each table, and where the district_id is between 10 and 20.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
WHERE d.district_id BETWEEN 10 AND 20;
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
2378 16 POPLATEK MESICNE 930101 16 Jindrichuv Hradec south Bohemia 93931 74 21 10 1 8 56.9 8427 1.12 1.54 107 1874 1913
2357 19 POPLATEK MESICNE 930104 19 Prachatice south Bohemia 51428 50 11 3 1 4 52.7 8402 3.13 3.98 120 999 1099
2393 10 POPLATEK MESICNE 930107 10 Praha - vychod central Bohemia 92084 55 29 4 3 5 46.7 10124 0.56 0.54 141 3810 4316
1909 14 POPLATEK MESICNE 930112 14 Ceske Budejovice south Bohemia 177686 69 27 10 1 9 74.8 10045 1.42 1.71 135 6604 6295
11265 15 POPLATEK MESICNE 930114 15 Cesky Krumlov south Bohemia 58796 22 16 7 1 5 51.9 9045 3.13 3.6 124 1845 1879
1481 15 POPLATEK MESICNE 930117 15 Cesky Krumlov south Bohemia 58796 22 16 7 1 5 51.9 9045 3.13 3.6 124 1845 1879
6473 12 POPLATEK MESICNE 930126 12 Pribram central Bohemia 107870 84 29 6 1 6 58.0 8754 3.83 4.31 137 3804 3868
2670 13 POPLATEK MESICNE 930127 13 Rakovnik central Bohemia 53921 61 22 1 1 2 41.3 8598 2.77 3.26 123 1597 1875
1843 12 POPLATEK MESICNE 930130 12 Pribram central Bohemia 107870 84 29 6 1 6 58.0 8754 3.83 4.31 137 3804 3868
2434 17 POPLATEK MESICNE 930202 17 Pelhrimov south Bohemia 74062 99 15 4 2 7 61.4 8114 2.38 2.62 119 1003 1181
Truncated to displaylimit of 10.

Exercise 3#

Retrieve all account, loan and district information.

Answers

We can use a Hash Join on s1.account as a and s1.district as d where the district_id matches in each table. We can then join this to the s1.loan table as l where the account_id in a and l match.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id;
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96 loan_id account_id_1 date_1 amount duration payments status
5270 44 POPLATEK MESICNE 930113 44 Chrudim east Bohemia 105606 77 26 7 2 7 53.0 8254 2.79 3.76 97 2166 2325 6077 5270 931122 79608 24 3317.0 A
11265 15 POPLATEK MESICNE 930114 15 Cesky Krumlov south Bohemia 58796 22 16 7 1 5 51.9 9045 3.13 3.6 124 1845 1879 7284 11265 930915 52788 12 4399.0 A
10364 55 POPLATEK MESICNE 930117 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894 7121 10364 931110 21924 36 609.0 A
3834 54 POPLATEK MESICNE 930119 54 Brno - mesto south Moravia 387570 0 0 0 1 1 100.0 9897 1.60 1.96 140 18721 18696 5754 3834 940928 23052 12 1921.0 A
9307 68 POPLATEK MESICNE 930124 68 Frydek - Mistek north Moravia 228848 15 40 18 2 6 57.2 9893 4.09 4.72 96 5623 5887 6895 9307 940919 41904 12 3492.0 A
5891 54 POPLATEK MESICNE 930125 54 Brno - mesto south Moravia 387570 0 0 0 1 1 100.0 9897 1.60 1.96 140 18721 18696 6202 5891 940418 65184 12 5432.0 A
6473 12 POPLATEK MESICNE 930126 12 Pribram central Bohemia 107870 84 29 6 1 6 58.0 8754 3.83 4.31 137 3804 3868 6316 6473 940526 76908 12 6409.0 B
1843 12 POPLATEK MESICNE 930130 12 Pribram central Bohemia 107870 84 29 6 1 6 58.0 8754 3.83 4.31 137 3804 3868 5325 1843 930803 105804 36 2939.0 A
9265 1 POPLATEK MESICNE 930204 1 Hl.m. Praha Prague 1204953 0 0 0 1 1 100.0 12541 0.29 0.43 167 85677 99107 6888 9265 940711 39576 12 3298.0 A
8051 1 POPLATEK TYDNE 930207 1 Hl.m. Praha Prague 1204953 0 0 0 1 1 100.0 12541 0.29 0.43 167 85677 99107 6647 8051 940601 208320 48 4340.0 A
Truncated to displaylimit of 10.

Delete tables

%%sql
DROP TABLE s1.account;
DROP TABLE s1.loan;
DROP TABLE s1.district;
DROP SCHEMA s1;
Success

Conclusion#

In this tutorial, we have explored advanced join operations in SQL, including nested-loop joins, merge joins, hash joins, internal joins, and the concept of join hints. We learned how to utilize these different join methods in SQL queries and understood the specific scenarios in which each type of join is most efficient.

We used DuckDB as our SQL engine and the banking dataset for our exercises. DuckDB is an excellent tool for SQL queries because of its ease of use and integration with the Jupyter notebook environment. However, it’s important to note that DuckDB’s query optimizer chooses the join method based on the table statistics and query specifics. So while the SQL examples in this tutorial illustrate the syntax and usage of different types of joins, the actual join type chosen by DuckDB might differ.