Advanced join operations in SQL

Advanced join operations in SQL#

Install - execute this once.#

Important

Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql pandas --quiet

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402


_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)

Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Creating Tables#

Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');

Count

The code above will create three tables in the database schema: s1.account, s1.district, s1.loan.

Exploring the data#

Let’s take a look at its entries.

%sqlcmd explore --table s1.account

%sqlcmd explore --table s1.district

%sqlcmd explore --table s1.loan

Nested-loop joins#

A nested loop join compares each row from the first table with each row from the second table to find all pairs of rows which satisfy the join predicate.

When to use#

This strategy is generally used when one of the tables in the join is significantly smaller than the other. The small table (or sometimes just a subset of it) can be kept in memory while the larger table is scanned, allowing for efficient access to the smaller table.

%%sql
SELECT DISTINCT a.account_id, a.district_id, a.frequency, a.date, l.loan_id, l.date as date_1, l.amount, l.duration, l.payments, l.status
FROM s1.account a, s1.loan l
WHERE a.account_id = l.account_id
LIMIT 5;

account_id	district_id	frequency	date	loan_id	date_1	amount	duration	payments	status
5270	44	POPLATEK MESICNE	930113	6077	931122	79608	24	3317.0	A
11265	15	POPLATEK MESICNE	930114	7284	930915	52788	12	4399.0	A
10364	55	POPLATEK MESICNE	930117	7121	931110	21924	36	609.0	A
3834	54	POPLATEK MESICNE	930119	5754	940928	23052	12	1921.0	A
9307	68	POPLATEK MESICNE	930124	6895	940919	41904	12	3492.0	A

Merge joins#

Merge join combines two sorted lists like a zipper based on the join predicates.

When to use#

This is a very efficient join strategy when the join columns of both tables are sorted, or when the database can efficiently sort them. If you know that your tables are sorted on the join column, this strategy is likely to be chosen.

%%sql
SELECT *
FROM s1.account a
INNER JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;

account_id	district_id	frequency	date	district_id_1	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
576	55	POPLATEK MESICNE	930101	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894
3818	74	POPLATEK MESICNE	930101	74	Ostrava - mesto	north Moravia	323870	0	0	0	1	1	100.0	10673	4.75	5.44	100	18782	18347
704	55	POPLATEK MESICNE	930101	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894
2378	16	POPLATEK MESICNE	930101	16	Jindrichuv Hradec	south Bohemia	93931	74	21	10	1	8	56.9	8427	1.12	1.54	107	1874	1913
2632	24	POPLATEK MESICNE	930102	24	Karlovy Vary	west Bohemia	122603	25	21	6	2	8	80.0	8991	1.39	2.01	128	5198	5273

Hash joins#

A hash join uses a hash table for finding matching rows. The table is partitioned based on the hash value of the join column(s).

When to use#

This strategy is typically used when the join columns are not sorted, and neither table is much smaller than the other. The database builds a hash table from one of the tables, then scans the other table and uses the hash table to find matching rows. If your tables are large and not sorted on the join columns, the database is likely to use this strategy.

Since DuckDB doesn’t support join hints, the corresponding SQL query example for join hints (SELECT /*+ HASH_JOIN(a, l) */ *…) it’s just for illustration purposes and doesn’t actually force a hash join in DuckDB.

%%sql
SELECT *
FROM s1.account a
JOIN s1.district d
ON a.district_id = d.district_id
LIMIT 5;

account_id	district_id	frequency	date	district_id_1	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
576	55	POPLATEK MESICNE	930101	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894
3818	74	POPLATEK MESICNE	930101	74	Ostrava - mesto	north Moravia	323870	0	0	0	1	1	100.0	10673	4.75	5.44	100	18782	18347
704	55	POPLATEK MESICNE	930101	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894
2378	16	POPLATEK MESICNE	930101	16	Jindrichuv Hradec	south Bohemia	93931	74	21	10	1	8	56.9	8427	1.12	1.54	107	1874	1913
2632	24	POPLATEK MESICNE	930102	24	Karlovy Vary	west Bohemia	122603	25	21	6	2	8	80.0	8991	1.39	2.01	128	5198	5273

Internal joins#

An internal join combines rows from different tables if the join condition is true.

When to use#

In the context of DuckDB, we can’t explicitly choose an “internal join”. But in terms of a join operation that uses indexes to expedite the join process, this would be most applicable when you have indexed your join columns, and these indexes can be effectively used by the database engine to perform the join operation faster.

%%sql
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;

account_id	district_id	frequency	date	district_id_1	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96	loan_id	account_id_1	date_1	amount	duration	payments	status
5270	44	POPLATEK MESICNE	930113	44	Chrudim	east Bohemia	105606	77	26	7	2	7	53.0	8254	2.79	3.76	97	2166	2325	6077	5270	931122	79608	24	3317.0	A
11265	15	POPLATEK MESICNE	930114	15	Cesky Krumlov	south Bohemia	58796	22	16	7	1	5	51.9	9045	3.13	3.6	124	1845	1879	7284	11265	930915	52788	12	4399.0	A
10364	55	POPLATEK MESICNE	930117	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894	7121	10364	931110	21924	36	609.0	A
3834	54	POPLATEK MESICNE	930119	54	Brno - mesto	south Moravia	387570	0	0	0	1	1	100.0	9897	1.60	1.96	140	18721	18696	5754	3834	940928	23052	12	1921.0	A
9307	68	POPLATEK MESICNE	930124	68	Frydek - Mistek	north Moravia	228848	15	40	18	2	6	57.2	9893	4.09	4.72	96	5623	5887	6895	9307	940919	41904	12	3492.0	A

Join hints#

Join hints are a way to influence the join strategy chosen by the SQL server. For instance, you can suggest using a loop, hash, or merge join.

When to use#

DuckDB does not support join hints. The optimizer in DuckDB chooses the join method based on the table statistics and query specifics. In most cases, letting the optimizer make this decision is the best choice.

%%sql
SELECT /*+ HASH_JOIN(a, l) */ *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
LIMIT 5;

account_id	district_id	frequency	date	loan_id	account_id_1	date_1	amount	duration	payments	status
5270	44	POPLATEK MESICNE	930113	6077	5270	931122	79608	24	3317.0	A
11265	15	POPLATEK MESICNE	930114	7284	11265	930915	52788	12	4399.0	A
10364	55	POPLATEK MESICNE	930117	7121	10364	931110	21924	36	609.0	A
3834	54	POPLATEK MESICNE	930119	5754	3834	940928	23052	12	1921.0	A
9307	68	POPLATEK MESICNE	930124	6895	9307	940919	41904	12	3492.0	A

Exercise 1#

Given an account id (e.g., 1787), retrieve all loan records for that account.

Answers

We can use a Nested-loop Join on s1.account as a and s1.loan as l where the account_id matches in each table, and where the account_id is 1787.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.loan l ON a.account_id = l.account_id
WHERE a.account_id = 1787;

account_id	district_id	frequency	date	loan_id	account_id_1	date_1	amount	duration	payments	status
1787	30	POPLATEK TYDNE	930322	5314	1787	930705	96396	12	8033.0	B

Exercise 2#

Retrieve all account and district information for accounts with district_id between 10 and 20.

Answers

We can use a Merge Join on s1.account as a and s1.district as d where the district_id matches in each table, and where the district_id is between 10 and 20.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
WHERE d.district_id BETWEEN 10 AND 20;

account_id	district_id	frequency	date	district_id_1	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96
2378	16	POPLATEK MESICNE	930101	16	Jindrichuv Hradec	south Bohemia	93931	74	21	10	1	8	56.9	8427	1.12	1.54	107	1874	1913
2357	19	POPLATEK MESICNE	930104	19	Prachatice	south Bohemia	51428	50	11	3	1	4	52.7	8402	3.13	3.98	120	999	1099
2393	10	POPLATEK MESICNE	930107	10	Praha - vychod	central Bohemia	92084	55	29	4	3	5	46.7	10124	0.56	0.54	141	3810	4316
1909	14	POPLATEK MESICNE	930112	14	Ceske Budejovice	south Bohemia	177686	69	27	10	1	9	74.8	10045	1.42	1.71	135	6604	6295
11265	15	POPLATEK MESICNE	930114	15	Cesky Krumlov	south Bohemia	58796	22	16	7	1	5	51.9	9045	3.13	3.6	124	1845	1879
1481	15	POPLATEK MESICNE	930117	15	Cesky Krumlov	south Bohemia	58796	22	16	7	1	5	51.9	9045	3.13	3.6	124	1845	1879
6473	12	POPLATEK MESICNE	930126	12	Pribram	central Bohemia	107870	84	29	6	1	6	58.0	8754	3.83	4.31	137	3804	3868
2670	13	POPLATEK MESICNE	930127	13	Rakovnik	central Bohemia	53921	61	22	1	1	2	41.3	8598	2.77	3.26	123	1597	1875
1843	12	POPLATEK MESICNE	930130	12	Pribram	central Bohemia	107870	84	29	6	1	6	58.0	8754	3.83	4.31	137	3804	3868
2434	17	POPLATEK MESICNE	930202	17	Pelhrimov	south Bohemia	74062	99	15	4	2	7	61.4	8114	2.38	2.62	119	1003	1181

Truncated to displaylimit of 10.

Exercise 3#

Retrieve all account, loan and district information.

Answers

We can use a Hash Join on s1.account as a and s1.district as d where the district_id matches in each table. We can then join this to the s1.loan table as l where the account_id in a and l match.

%%sql 
SELECT *
FROM s1.account a
JOIN s1.district d ON a.district_id = d.district_id
JOIN s1.loan l ON a.account_id = l.account_id;

account_id	district_id	frequency	date	district_id_1	district_name	region	no_of_inhabitants	no_of_municipalities_lt_499	no_of_municipalities_500_1999	no_of_municipalities_2000_9999	no_of_municipalities_gt_10000	no_of_cities	ratio_of_urban_inhabitants	average_salary	unemployment_rate_95	unemployment_rate_96	no_of_entrepreneurs_per_1000_inhabitants	no_of_committed_crimes_95	no_of_committed_crimes_96	loan_id	account_id_1	date_1	amount	duration	payments	status
5270	44	POPLATEK MESICNE	930113	44	Chrudim	east Bohemia	105606	77	26	7	2	7	53.0	8254	2.79	3.76	97	2166	2325	6077	5270	931122	79608	24	3317.0	A
11265	15	POPLATEK MESICNE	930114	15	Cesky Krumlov	south Bohemia	58796	22	16	7	1	5	51.9	9045	3.13	3.6	124	1845	1879	7284	11265	930915	52788	12	4399.0	A
10364	55	POPLATEK MESICNE	930117	55	Brno - venkov	south Moravia	157042	49	70	18	0	9	33.9	8743	1.88	2.43	111	3659	3894	7121	10364	931110	21924	36	609.0	A
3834	54	POPLATEK MESICNE	930119	54	Brno - mesto	south Moravia	387570	0	0	0	1	1	100.0	9897	1.60	1.96	140	18721	18696	5754	3834	940928	23052	12	1921.0	A
9307	68	POPLATEK MESICNE	930124	68	Frydek - Mistek	north Moravia	228848	15	40	18	2	6	57.2	9893	4.09	4.72	96	5623	5887	6895	9307	940919	41904	12	3492.0	A
5891	54	POPLATEK MESICNE	930125	54	Brno - mesto	south Moravia	387570	0	0	0	1	1	100.0	9897	1.60	1.96	140	18721	18696	6202	5891	940418	65184	12	5432.0	A
6473	12	POPLATEK MESICNE	930126	12	Pribram	central Bohemia	107870	84	29	6	1	6	58.0	8754	3.83	4.31	137	3804	3868	6316	6473	940526	76908	12	6409.0	B
1843	12	POPLATEK MESICNE	930130	12	Pribram	central Bohemia	107870	84	29	6	1	6	58.0	8754	3.83	4.31	137	3804	3868	5325	1843	930803	105804	36	2939.0	A
9265	1	POPLATEK MESICNE	930204	1	Hl.m. Praha	Prague	1204953	0	0	0	1	1	100.0	12541	0.29	0.43	167	85677	99107	6888	9265	940711	39576	12	3298.0	A
8051	1	POPLATEK TYDNE	930207	1	Hl.m. Praha	Prague	1204953	0	0	0	1	1	100.0	12541	0.29	0.43	167	85677	99107	6647	8051	940601	208320	48	4340.0	A

Truncated to displaylimit of 10.

Delete tables

%%sql
DROP TABLE s1.account;
DROP TABLE s1.loan;
DROP TABLE s1.district;
DROP SCHEMA s1;

Success

Conclusion#

In this tutorial, we have explored advanced join operations in SQL, including nested-loop joins, merge joins, hash joins, internal joins, and the concept of join hints. We learned how to utilize these different join methods in SQL queries and understood the specific scenarios in which each type of join is most efficient.

We used DuckDB as our SQL engine and the banking dataset for our exercises. DuckDB is an excellent tool for SQL queries because of its ease of use and integration with the Jupyter notebook environment. However, it’s important to note that DuckDB’s query optimizer chooses the join method based on the table statistics and query specifics. So while the SQL examples in this tutorial illustrate the syntax and usage of different types of joins, the actual join type chosen by DuckDB might differ.

Advanced join operations in SQL

Contents

Advanced join operations in SQL#

Install - execute this once.#

Load the data#

Load Engine#

Creating Tables#

Exploring the data#

Nested-loop joins#

When to use#

Merge joins#

When to use#

Hash joins#

When to use#

Internal joins#

When to use#

Join hints#

When to use#

Exercise 1#

Exercise 2#

Exercise 3#

Conclusion#