Joining data in SQL#
This section introduces a fundamental concept in SQL: joining. Joining is a powerful technique used to combine data from multiple tables based on their relationships. It allows you to supplement a dataset with additional information from another related dataset.
To show the capabilites of joining, we move away from the single bank dataset we initially used in this course. We will be incorporating several new datasets illustrated below.
Datasets#
The datasets we will be using originates from another bank’s financial data.
Source: https://web.archive.org/web/20180506061559/http://lisp.vse.cz/pkdd99/Challenge/chall.htm
We first focus on just two datasets, the account
and district
dataset. To expedite our progress, we will skip the detailed explanation of each dataset’s variables and dive straight into how joining works.
For a comprehensive understanding of the data structure and attributes, please refer to the datasets’ documentation.
Documentation: https://web.archive.org/web/20180506035658/http://lisp.vse.cz/pkdd99/Challenge/berka.htm
Below is a display of account
and district
in an Entity-Relationship Diagram (ERD).
ERDs are visual representations that help understand the relationship between two or more datasets. In an ERD, each table in the diagram represents a dataset. The variables of each dataset are represented as rows under each respective table. In our case, the first column of our table is the variable’s name while the second column is the variable’s value type. The notations of the line connecting our two tables indicate their relationship type and is defined as “Crow’s Foot Notation”. To learn more about this notation, we recommend visiting this article: https://vertabelo.com/blog/crow-s-foot-notation/
There exists only two value types in the accounts
and district
table: “INT” and “VARCHAR”. The “INT” value type indicates that the corresponding value is an integer, while “VARCHAR” represents a variable-length string that can contain various characters. These value types help SQL understand the appropriate operations that can be performed on each value. Alongside this, the second column of our tables also show if a variable is a primary key (PK) or foreign key (FK). We introduce these concepts below.
What is a primary key and a foreign key?#
In a database, a primary key is a unique identifier for each record in a table. For instance, our accounts
table has the primary key of “account_id”. This makes sense because every single row in the accounts
table corresponds to “account_id” which represents one single account. The district
table has the primary key of “District ID”. This means that each row under the district
table represents one single district (or district id). So under the accounts
table, there should not be any rows with the same “account_id” value. Similarly, the district
table should not have any rows with the same “district_id” value.
A foreign key, on the other hand, establishes a relationship between two tables. It refers to the primary key of another table and helps connect the records across multiple tables. In our example, the “district_id” in the account table is a foreign key, indicating that it references the primary key of the district table. This allows us to associate each account with its corresponding district.
By using primary and foreign keys, we establish relationships between tables, enabling us to perform joins and retrieve meaningful information by linking related data together. Let’s jump straight into demonstrating these joins and how primary keys and foreign keys work.
Install - execute this once.#
Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql duckdb-engine pandas --quiet
Load the data#
We extract the financial data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file and convert the .asc files to .csv files. Finally, we save converted data into a folder.
Helper script#
We developed a banking.py
script to help you extract the data from the URL and load it into a DuckDB database. This script is located here
This script downloads and stores the necessary data into a folder within the current directory. Please reference the script for more information.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Converted expanded_data/trans.asc to CSV.
Converted expanded_data/order.asc to CSV.
Converted expanded_data/loan.asc to CSV.
Converted expanded_data/district.asc to CSV.
Converted expanded_data/disp.asc to CSV.
Converted expanded_data/client.asc to CSV.
Converted expanded_data/card.asc to CSV.
Converted expanded_data/account.asc to CSV.
All ASC files converted to CSV.
If you ran the above cell, you should have a folder expanded_data
in your current directory that contains the .csv
files we will be using.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db'
%sql duckdb:///bank_data.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Queries#
Creating a Schema#
A schema helps define how our data is organized. Think of it as a container that holds certain datasets that relate to each other. In our bank_data.duck.db
database, we could have several schemas, each having their own datasets that relate to each other. For now we will create one schema to hold our accounts
and district
dataset.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
Count |
---|
Let’s take a brief look at both of our tables before we get started.
%%sql
SELECT *
FROM s1.account
LIMIT 5
account_id | district_id | frequency | date |
---|---|---|---|
576 | 55 | POPLATEK MESICNE | 930101 |
3818 | 74 | POPLATEK MESICNE | 930101 |
704 | 55 | POPLATEK MESICNE | 930101 |
2378 | 16 | POPLATEK MESICNE | 930101 |
2632 | 24 | POPLATEK MESICNE | 930102 |
%%sql
SELECT *
FROM s1.district
LIMIT 5
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100.0 | 12541 | 0.29 | 0.43 | 167 | 85677 | 99107 |
2 | Benesov | central Bohemia | 88884 | 80 | 26 | 6 | 2 | 5 | 46.7 | 8507 | 1.67 | 1.85 | 132 | 2159 | 2674 |
3 | Beroun | central Bohemia | 75232 | 55 | 26 | 4 | 1 | 5 | 41.7 | 8980 | 1.95 | 2.21 | 111 | 2824 | 2813 |
4 | Kladno | central Bohemia | 149893 | 63 | 29 | 6 | 2 | 6 | 67.4 | 9753 | 4.64 | 5.05 | 109 | 5244 | 5892 |
5 | Kolin | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.85 | 4.43 | 118 | 2616 | 3040 |
Before we jump into joins, I highly recommend following along with this resource.
Inner Join#
The most basic join is the inner join. Inner joins result in a query that returns rows where both tables have the specified key. For example, the query below inner joins our s1.account
and s1.district
table ON
the “accounts_id” variable.
%%sql
SELECT *,
FROM s1.account
INNER JOIN s1.district
ON s1.account.district_id = s1.district.district_id
LIMIT 5
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
576 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
3818 | 74 | POPLATEK MESICNE | 930101 | 74 | Ostrava - mesto | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100.0 | 10673 | 4.75 | 5.44 | 100 | 18782 | 18347 |
704 | 55 | POPLATEK MESICNE | 930101 | 55 | Brno - venkov | south Moravia | 157042 | 49 | 70 | 18 | 0 | 9 | 33.9 | 8743 | 1.88 | 2.43 | 111 | 3659 | 3894 |
2378 | 16 | POPLATEK MESICNE | 930101 | 16 | Jindrichuv Hradec | south Bohemia | 93931 | 74 | 21 | 10 | 1 | 8 | 56.9 | 8427 | 1.12 | 1.54 | 107 | 1874 | 1913 |
2632 | 24 | POPLATEK MESICNE | 930102 | 24 | Karlovy Vary | west Bohemia | 122603 | 25 | 21 | 6 | 2 | 8 | 80.0 | 8991 | 1.39 | 2.01 | 128 | 5198 | 5273 |
This query results in joining our s1.account
and s1.district
tables wherever the “district_id” exists in both tables. In this inner join, the “district_id” value must exist in both tables. For demonstration purposes, let’s hypothetically assume that there is a row in the s1.account
table that has a “district_id” value of 99999. We INSERT
this value INTO
our s1.account
table below.
%%sql
INSERT INTO s1.account
VALUES (9999, 99999, 'POPLATEK MESICNE', 930101)
Count |
---|
Because the value 99999 does not exist under the “district_id” column in the s1.district
table, this row from the s1.account
will not appear in our join.
%%sql
SELECT COUNT(*)
FROM s1.account
count_star() |
---|
4501 |
%%sql
SELECT COUNT(*)
FROM s1.account
INNER JOIN s1.district
ON s1.account.district_id = s1.district.district_id
count_star() |
---|
4500 |
We see that the COUNT(*)
of s1.account
is 4501 and the COUNT(*)
of our join is 4500. This is because our inner join excludes the row with an “district_id” value of 99999 from our s1.account
table since it does not appear anywhere under the “district_id” column of our s1.district
table.
Also notice that the “district_id_1” column in our original inner join query. This column does not inherit the “district_id” column name from our s1.district
table because we would then have two columns with both “district_id” due to the inclusion of s1.account
. To avoid this ambiguity, SQL automatically adds “_1” to the end of identical columns resulting from a join. If we are joining more than two tables (seen in the next section), then SQL will automatically increment the number to distinguish each identical column.
Question 1 (Medium):#
How many counts of each “district_id” appear in s1.account
? Query the district_id, the respective count, and the district name. Filter the results to only have district id’s with a count greater than 40.
Hint: Try breaking the problem down step by step. First, take a look at the results of an inner join. What can you do from there to achieve the correct results?
Answers
We first inner join s1.account
and s1.district
on “district_id” to have a query that has the information necessary in answering this question. Then, we group by “district_id” and “district_name” in order to aggregate and have them in our select statement. The last filter step is through the HAVING
clause because we filter post-aggregation.
%%sql
SELECT s1.district.district_id, COUNT(*), s1.district.district_name
FROM s1.account
INNER JOIN s1.district
ON s1.account.district_id = s1.district.district_id
GROUP BY s1.district.district_id, s1.district.district_name
HAVING COUNT(*) > 40
district_id | count_star() | district_name |
---|---|---|
55 | 53 | Brno - venkov |
74 | 135 | Ostrava - mesto |
16 | 52 | Jindrichuv Hradec |
24 | 42 | Karlovy Vary |
1 | 554 | Hl.m. Praha |
47 | 53 | Pardubice |
76 | 55 | Sumperk |
48 | 53 | Rychnov nad Kneznou |
70 | 152 | Karvina |
19 | 55 | Prachatice |
Left Join#
A left join guarantees that every row in the table before the ON
clause (the left table) appears in our query, regardless if the key from that row matches the “right” table being joined.
%%sql
SELECT COUNT(*)
FROM s1.account
LEFT JOIN s1.district
ON s1.account.district_id = s1.district.district_id
count_star() |
---|
4501 |
Recall the row we INSERT
into the s1.account
table. This row is included in our left join because s1.account
appears before the JOIN
clause in our SQL statement. This is why the COUNT(*)
matches the number of rows of our s1.account
table after we inserted the row. Let’s take a look at what this row looks like after our left join.
%%sql
SELECT *
FROM s1.account
LEFT JOIN s1.district
ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9999 | 99999 | POPLATEK MESICNE | 930101 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
The first 4 columns have values, but the resulting values from the “right table” are all “None”! Since s1.district
does not have the value 99999 under its “district_id” column, there is no information on this particular district_id to supplement the left table.
Also notice the particular syntax for our column in our WHERE
clause. Since “district_id” appears in both s1.account
and s1.district
, we have to specify which table to run our WHERE
clause on.
Right Join#
Right join is identical to the nature of the left join. A right join will guarantee the inclusion of every row from the “right table”, regardless if the key being joined on appears in the “left table.”
Here we replace the LEFT JOIN
in our last example with RIGHT JOIN
.
%%sql
SELECT COUNT(*)
FROM s1.account
RIGHT JOIN s1.district
ON s1.account.district_id = s1.district.district_id
count_star() |
---|
4500 |
The resulting COUNT(*)
of our join omits the one row we INSERT
into s1.account
previously. Let’s double check and see if our inserted row exists after this join.
%%sql
SELECT *
FROM s1.account
RIGHT JOIN s1.district
ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|
Exactly what we expect.
What would happen if we reverse the placement of our tables in our RIGHT JOIN
clause? The results of reversing the table placements are below.
%%sql
SELECT COUNT(*)
FROM s1.district
RIGHT JOIN s1.account
ON s1.account.district_id = s1.district.district_id
count_star() |
---|
4501 |
%%sql
SELECT *
FROM s1.district
RIGHT JOIN s1.account
ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
district_id | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 | account_id | district_id_1 | frequency | date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | 9999 | 99999 | POPLATEK MESICNE | 930101 |
We get near exact results from our previous LEFT JOIN
demonstration. The only difference is the order of our columns in the query output. The s1.account
table appears on the far right whil the s1.district
table appears on the left. To ensure complete visibility of the output, please utilize the scroll bar.
Question 2 (Easy):#
Show the maximum “account_id” value corresponding with that account’s district information, regardless if there is any or not. Name the resulting query colum “max_acc_id”. You must include a join.
Answers
We first left join s1.account
and s1.district
on “district_id” to have a query that guarantees completed information from the s1.account
table. Then, we find the max of “account_id” and rename it accordingly.
%%sql
SELECT MAX(s1.account.account_id) AS max_acc_id
FROM s1.account
LEFT JOIN s1.district
ON s1.account.district_id = s1.district.district_id
max_acc_id |
---|
11382 |
Full Join#
Full join (also known as outer join) results in the inclusion of all rows from both tables. To showcase the full capability of full joins, we first INSERT
another row into the s1.district
table with a “district_id” value not present in the s1.account
table. In other words, we are creating new value for s1.district
’s primary key that does not appear in s1.account
’s foreign key.
%%sql
INSERT INTO s1.district
VALUES (3333, 'Hypothetical District', 'Hypothetical Region',1,1,1,1,1,1,1,1,1,1,1,1,1)
Count |
---|
%%sql
SELECT COUNT(*)
FROM s1.account
FULL JOIN s1.district
ON s1.account.district_id = s1.district.district_id
count_star() |
---|
4502 |
%%sql
SELECT *
FROM s1.account
FULL JOIN s1.district
ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999 OR s1.district.district_id = 3333
account_id | district_id | frequency | date | district_id_1 | district_name | region | no_of_inhabitants | no_of_municipalities_lt_499 | no_of_municipalities_500_1999 | no_of_municipalities_2000_9999 | no_of_municipalities_gt_10000 | no_of_cities | ratio_of_urban_inhabitants | average_salary | unemployment_rate_95 | unemployment_rate_96 | no_of_entrepreneurs_per_1000_inhabitants | no_of_committed_crimes_95 | no_of_committed_crimes_96 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9999 | 99999 | POPLATEK MESICNE | 930101 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None |
None | None | None | None | 3333 | Hypothetical District | Hypothetical Region | 1 | 1 | 1 | 1 | 1 | 1 | 1.0 | 1 | 1 | 1.0 | 1 | 1 | 1 |
The full join on s1.district
and s1.account
results in a query that includes all rows from both tables. The OR
clause in the last SQL statement verifies this by querying the hypothetical rows we created in each table.
Question 3 (Medium):#
What is the average “account_id” value for accounts in the “Prague” region? Round the average by 3 decimal places.
It doesn’t really make sense to average by the “account_id” as it is an arbitrary number to uniquely identify each account. However, try to ignore that for the purpose of practice.
Answers
We first join s1.account
and s1.district
on “district_id” to have a query of completed information between the two tables. Then, we GROUP BY
region and SELECT
the region name and the average value of “account_id” within regions. Finally, we use HAVING
to filter where region has the value ‘Prague’ post grouping with GROUP BY
.
%%sql
SELECT s1.district.region, ROUND(AVG(s1.account.account_id), 3)
FROM s1.account
INNER JOIN s1.district
ON s1.account.district_id = s1.district.district_id
GROUP BY s1.district.region
HAVING s1.district.region = 'Prague'
region | round(avg(s1.account.account_id), 3) |
---|---|
Prague | 2963.758 |
Deleting Tables
%%sql
DROP TABLE s1.account;
DROP TABLE s1.district;
DROP SCHEMA s1;
Success |
---|
Wrapping Up#
In this section, we learned the basic join types and how to use them when given two tables. We also learned the definiton of primary and foriegn keys along with an introduction to ERDs. To recap:
FULL JOIN
- Guarantees that every row from both joined tables is included in the resulting query, regardless of whether a matching key exists in the other table or not. AFULL JOIN
essentially combines the results of aRIGHT JOIN
andLEFT JOIN
LEFT JOIN
- Guarantees that every row from the left table is included in the resulting query, regardless of whether a matching key exists in the other table or not.RIGHT JOIN
- Guarantees that every row from the right table is included in the resulting query, regardless of whether a matching key exists in the other table or not.Primary key - The column in a table that uniquely identifies each row of the table.
Foreign key - A column in a table that establishes a link or relationship to the primary key of another table.
In the next section, you will learn how to implement joins in more than two tables.
References#
Dataset citation:
“PKDD’99 Discovery Challenge Guide to the Financial Data Set.” Home page of PKDD Discovery Challenge, 1999. https://sorry.vse.cz/~berka/challenge/PAST/.