Advanced join operations in SQL#
Install - execute this once.#
Important
Note: The --save
and %sqlcmd
features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas --quiet
Load the data#
Important
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'
If you ran the above cell, you should have a folder expanded_data
in your current directory that contains the .csv
files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv
, account.csv
, district.csv
.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Important
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Creating Tables#
Let’s start off with loading three of the eight .csv
files from the expanded_data
folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1
in which we will store the tables. Here we use the CREATE TABLE
syntax in DuckDB to ingest four of the eight .csv
files. The read_csv_auto
is a function that helps SQL understand our local .csv
file for creation into our database.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count |
---|
The code above will create three tables in the database schema: s1.account
, s1.district
, s1.loan
.
Exploring the data#
Let’s take a look at its entries.
%sqlcmd explore --table s1.account
%sqlcmd explore --table s1.district
%sqlcmd explore --table s1.loan
Nested-loop joins#
A nested loop join compares each row from the first table with each row from the second table to find all pairs of rows which satisfy the join predicate.
When to use#
This strategy is generally used when one of the tables in the join is significantly smaller than the other. The small table (or sometimes just a subset of it) can be kept in memory while the larger table is scanned, allowing for efficient access to the smaller table.
%%sql
SELECT DISTINCT a.account_id, a.district_id, a.frequency, a.date, l.loan_id, l.date as date_1, l.amount, l.duration, l.payments, l.status
FROM s1.account a, s1.loan l
WHERE a.account_id = l.account_id
LIMIT 5;
account_id | district_id | frequency | date | loan_id | date_1 | amount | duration | payments | status |
---|---|---|---|---|---|---|---|---|---|
5270 | 44 | POPLATEK MESICNE | 930113 | 6077 | 931122 | 79608 | 24 | 3317.0 | A |
11265 | 15 | POPLATEK MESICNE | 930114 | 7284 | 930915 | 52788 | 12 | 4399.0 | A |
10364 | 55 | POPLATEK MESICNE | 930117 | 7121 | 931110 | 21924 | 36 | 609.0 | A |
3834 | 54 | POPLATEK MESICNE | 930119 | 5754 | 940928 | 23052 | 12 | 1921.0 | A |
9307 | 68 | POPLATEK MESICNE | 930124 | 6895 | 940919 | 41904 | 12 | 3492.0 | A |
Merge joins#
Merge join combines two sorted lists like a zipper based on the join predicates.
When to use#
This is a very efficient join strategy when the join columns of both tables are sorted, or when the database can efficiently sort them. If you know that your tables are sorted on the join column, this strategy is likely to be chosen.
%%sql
SELECT *
FROM s1.account a
INNER JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
576 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
3818 | 74 | POPLATEK MESICNE | 930101 | 74 | Ostrava - mesto | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100.0 | 10673 | 4.75 | 5.44 | 100 | 18782 | 18347 |
704 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
2378 | 16 | POPLATEK MESICNE | 930101 | 16 | Jindrichuv Hradec | south Bohemia | 93931 | 74 | 21 | 10 | 1 | 8 | 56.9 | 8427 | 1.12 | 1.54 | 107 | 1874 | 1913 |
2632 | 24 | POPLATEK MESICNE | 930102 | 24 | Karlovy Vary | west Bohemia | 122603 | 25 | 21 | 6 | 2 | 8 | 80.0 | 8991 | 1.39 | 2.01 | 128 | 5198 | 5273 |
Hash joins#
A hash join uses a hash table for finding matching rows. The table is partitioned based on the hash value of the join column(s).
When to use#
This strategy is typically used when the join columns are not sorted, and neither table is much smaller than the other. The database builds a hash table from one of the tables, then scans the other table and uses the hash table to find matching rows. If your tables are large and not sorted on the join columns, the database is likely to use this strategy.
Since DuckDB
doesn’t support join hints, the corresponding SQL query example for join hints (SELECT /*+ HASH_JOIN(a, l) */ *…) it’s just for illustration purposes and doesn’t actually force a hash join in DuckDB
.
%%sql
SELECT *
FROM s1.account a
JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
576 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
3818 | 74 | POPLATEK MESICNE | 930101 | 74 | Ostrava - mesto | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100.0 | 10673 | 4.75 | 5.44 | 100 | 18782 | 18347 |
704 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
2378 | 16 | POPLATEK MESICNE | 930101 | 16 | Jindrichuv Hradec | south Bohemia | 93931 | 74 | 21 | 10 | 1 | 8 | 56.9 | 8427 | 1.12 | 1.54 | 107 | 1874 | 1913 |
2632 | 24 | POPLATEK MESICNE | 930102 | 24 | Karlovy Vary | west Bohemia | 122603 | 25 | 21 | 6 | 2 | 8 | 80.0 | 8991 | 1.39 | 2.01 | 128 | 5198 | 5273 |
Internal joins#
An internal join combines rows from different tables if the join condition is true.
When to use#
In the context of DuckDB, we can’t explicitly choose an “internal join”. But in terms of a join operation that uses indexes to expedite the join process, this would be most applicable when you have indexed your join columns, and these indexes can be effectively used by the database engine to perform the join operation faster.
%%sql
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 | loan_id | account_id_1 | date_1 | amount | duration | payments | status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5270 | 44 | POPLATEK MESICNE | 930113 | 44 | Chrudim | east Bohemia | 105606 | 77 | 26 | 7 | 2 | 7 | 53.0 | 8254 | 2.79 | 3.76 | 97 | 2166 | 2325 | 6077 | 5270 | 931122 | 79608 | 24 | 3317.0 | A |
11265 | 15 | POPLATEK MESICNE | 930114 | 15 | Cesky Krumlov | south Bohemia | 58796 | 22 | 16 | 7 | 1 | 5 | 51.9 | 9045 | 3.13 | 3.6 | 124 | 1845 | 1879 | 7284 | 11265 | 930915 | 52788 | 12 | 4399.0 | A |
10364 | 55 | POPLATEK MESICNE | 930117 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 | 7121 | 10364 | 931110 | 21924 | 36 | 609.0 | A |
3834 | 54 | POPLATEK MESICNE | 930119 | 54 | Brno - mesto | south Moravia | 387570 | 0 | 0 | 0 | 1 | 1 | 100.0 | 9897 | 1.60 | 1.96 | 140 | 18721 | 18696 | 5754 | 3834 | 940928 | 23052 | 12 | 1921.0 | A |
9307 | 68 | POPLATEK MESICNE | 930124 | 68 | Frydek - Mistek | north Moravia | 228848 | 15 | 40 | 18 | 2 | 6 | 57.2 | 9893 | 4.09 | 4.72 | 96 | 5623 | 5887 | 6895 | 9307 | 940919 | 41904 | 12 | 3492.0 | A |
Join hints#
Join hints are a way to influence the join strategy chosen by the SQL server. For instance, you can suggest using a loop, hash, or merge join.
When to use#
DuckDB does not support join hints. The optimizer in DuckDB chooses the join method based on the table statistics and query specifics. In most cases, letting the optimizer make this decision is the best choice.
%%sql
SELECT /*+ HASH_JOIN(a, l) */ *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;
account_id | district_id | frequency | date | loan_id | account_id_1 | date_1 | amount | duration | payments | status |
---|---|---|---|---|---|---|---|---|---|---|
5270 | 44 | POPLATEK MESICNE | 930113 | 6077 | 5270 | 931122 | 79608 | 24 | 3317.0 | A |
11265 | 15 | POPLATEK MESICNE | 930114 | 7284 | 11265 | 930915 | 52788 | 12 | 4399.0 | A |
10364 | 55 | POPLATEK MESICNE | 930117 | 7121 | 10364 | 931110 | 21924 | 36 | 609.0 | A |
3834 | 54 | POPLATEK MESICNE | 930119 | 5754 | 3834 | 940928 | 23052 | 12 | 1921.0 | A |
9307 | 68 | POPLATEK MESICNE | 930124 | 6895 | 9307 | 940919 | 41904 | 12 | 3492.0 | A |
Exercise 1#
Given an account id (e.g., 1787), retrieve all loan records for that account.
Answers
We can use a Nested-loop Join on s1.account
as a
and s1.loan
as l
where the account_id
matches in each table, and where the account_id
is 1787
.
%%sql
SELECT *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
WHERE a.account_id = 1787;
account_id | district_id | frequency | date | loan_id | account_id_1 | date_1 | amount | duration | payments | status |
---|---|---|---|---|---|---|---|---|---|---|
1787 | 30 | POPLATEK TYDNE | 930322 | 5314 | 1787 | 930705 | 96396 | 12 | 8033.0 | B |
Exercise 2#
Retrieve all account and district information for accounts with district_id between 10 and 20.
Answers
We can use a Merge Join on s1.account
as a
and s1.district
as d
where the district_id
matches in each table, and where the district_id
is between 10 and 20.
%%sql
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
WHERE d.district_id BETWEEN 10 AND 20;
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2378 | 16 | POPLATEK MESICNE | 930101 | 16 | Jindrichuv Hradec | south Bohemia | 93931 | 74 | 21 | 10 | 1 | 8 | 56.9 | 8427 | 1.12 | 1.54 | 107 | 1874 | 1913 |
2357 | 19 | POPLATEK MESICNE | 930104 | 19 | Prachatice | south Bohemia | 51428 | 50 | 11 | 3 | 1 | 4 | 52.7 | 8402 | 3.13 | 3.98 | 120 | 999 | 1099 |
2393 | 10 | POPLATEK MESICNE | 930107 | 10 | Praha - vychod | central Bohemia | 92084 | 55 | 29 | 4 | 3 | 5 | 46.7 | 10124 | 0.56 | 0.54 | 141 | 3810 | 4316 |
1909 | 14 | POPLATEK MESICNE | 930112 | 14 | Ceske Budejovice | south Bohemia | 177686 | 69 | 27 | 10 | 1 | 9 | 74.8 | 10045 | 1.42 | 1.71 | 135 | 6604 | 6295 |
11265 | 15 | POPLATEK MESICNE | 930114 | 15 | Cesky Krumlov | south Bohemia | 58796 | 22 | 16 | 7 | 1 | 5 | 51.9 | 9045 | 3.13 | 3.6 | 124 | 1845 | 1879 |
1481 | 15 | POPLATEK MESICNE | 930117 | 15 | Cesky Krumlov | south Bohemia | 58796 | 22 | 16 | 7 | 1 | 5 | 51.9 | 9045 | 3.13 | 3.6 | 124 | 1845 | 1879 |
6473 | 12 | POPLATEK MESICNE | 930126 | 12 | Pribram | central Bohemia | 107870 | 84 | 29 | 6 | 1 | 6 | 58.0 | 8754 | 3.83 | 4.31 | 137 | 3804 | 3868 |
2670 | 13 | POPLATEK MESICNE | 930127 | 13 | Rakovnik | central Bohemia | 53921 | 61 | 22 | 1 | 1 | 2 | 41.3 | 8598 | 2.77 | 3.26 | 123 | 1597 | 1875 |
1843 | 12 | POPLATEK MESICNE | 930130 | 12 | Pribram | central Bohemia | 107870 | 84 | 29 | 6 | 1 | 6 | 58.0 | 8754 | 3.83 | 4.31 | 137 | 3804 | 3868 |
2434 | 17 | POPLATEK MESICNE | 930202 | 17 | Pelhrimov | south Bohemia | 74062 | 99 | 15 | 4 | 2 | 7 | 61.4 | 8114 | 2.38 | 2.62 | 119 | 1003 | 1181 |
Exercise 3#
Retrieve all account, loan and district information.
Answers
We can use a Hash Join on s1.account
as a
and s1.district
as d
where the district_id
matches in each table. We can then join this to the s1.loan
table as l
where the account_id
in a
and l
match.
%%sql
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id;
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 | loan_id | account_id_1 | date_1 | amount | duration | payments | status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5270 | 44 | POPLATEK MESICNE | 930113 | 44 | Chrudim | east Bohemia | 105606 | 77 | 26 | 7 | 2 | 7 | 53.0 | 8254 | 2.79 | 3.76 | 97 | 2166 | 2325 | 6077 | 5270 | 931122 | 79608 | 24 | 3317.0 | A |
11265 | 15 | POPLATEK MESICNE | 930114 | 15 | Cesky Krumlov | south Bohemia | 58796 | 22 | 16 | 7 | 1 | 5 | 51.9 | 9045 | 3.13 | 3.6 | 124 | 1845 | 1879 | 7284 | 11265 | 930915 | 52788 | 12 | 4399.0 | A |
10364 | 55 | POPLATEK MESICNE | 930117 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 | 7121 | 10364 | 931110 | 21924 | 36 | 609.0 | A |
3834 | 54 | POPLATEK MESICNE | 930119 | 54 | Brno - mesto | south Moravia | 387570 | 0 | 0 | 0 | 1 | 1 | 100.0 | 9897 | 1.60 | 1.96 | 140 | 18721 | 18696 | 5754 | 3834 | 940928 | 23052 | 12 | 1921.0 | A |
9307 | 68 | POPLATEK MESICNE | 930124 | 68 | Frydek - Mistek | north Moravia | 228848 | 15 | 40 | 18 | 2 | 6 | 57.2 | 9893 | 4.09 | 4.72 | 96 | 5623 | 5887 | 6895 | 9307 | 940919 | 41904 | 12 | 3492.0 | A |
5891 | 54 | POPLATEK MESICNE | 930125 | 54 | Brno - mesto | south Moravia | 387570 | 0 | 0 | 0 | 1 | 1 | 100.0 | 9897 | 1.60 | 1.96 | 140 | 18721 | 18696 | 6202 | 5891 | 940418 | 65184 | 12 | 5432.0 | A |
6473 | 12 | POPLATEK MESICNE | 930126 | 12 | Pribram | central Bohemia | 107870 | 84 | 29 | 6 | 1 | 6 | 58.0 | 8754 | 3.83 | 4.31 | 137 | 3804 | 3868 | 6316 | 6473 | 940526 | 76908 | 12 | 6409.0 | B |
1843 | 12 | POPLATEK MESICNE | 930130 | 12 | Pribram | central Bohemia | 107870 | 84 | 29 | 6 | 1 | 6 | 58.0 | 8754 | 3.83 | 4.31 | 137 | 3804 | 3868 | 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
9265 | 1 | POPLATEK MESICNE | 930204 | 1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 12541 | 0.29 | 0.43 | 167 | 85677 | 99107 | 6888 | 9265 | 940711 | 39576 | 12 | 3298.0 | A |
8051 | 1 | POPLATEK TYDNE | 930207 | 1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 12541 | 0.29 | 0.43 | 167 | 85677 | 99107 | 6647 | 8051 | 940601 | 208320 | 48 | 4340.0 | A |
Delete tables
%%sql
DROP TABLE s1.account;
DROP TABLE s1.loan;
DROP TABLE s1.district;
DROP SCHEMA s1;
Success |
---|
Conclusion#
In this tutorial, we have explored advanced join operations in SQL, including nested-loop joins, merge joins, hash joins, internal joins, and the concept of join hints. We learned how to utilize these different join methods in SQL queries and understood the specific scenarios in which each type of join is most efficient.
We used DuckDB as our SQL engine and the banking dataset for our exercises. DuckDB is an excellent tool for SQL queries because of its ease of use and integration with the Jupyter notebook environment. However, it’s important to note that DuckDB’s query optimizer chooses the join method based on the table statistics and query specifics. So while the SQL examples in this tutorial illustrate the syntax and usage of different types of joins, the actual join type chosen by DuckDB might differ.