Joining data in SQL#

This section introduces a fundamental concept in SQL: joining. Joining is a powerful technique used to combine data from multiple tables based on their relationships. It allows you to supplement a dataset with additional information from another related dataset.

To show the capabilites of joining, we move away from the single bank dataset we initially used in this course. We will be incorporating several new datasets illustrated below.

Datasets#

The datasets we will be using originates from another bank’s financial data.

Source: https://web.archive.org/web/20180506061559/http://lisp.vse.cz/pkdd99/Challenge/chall.htm

We first focus on just two datasets, the account and district dataset. To expedite our progress, we will skip the detailed explanation of each dataset’s variables and dive straight into how joining works.

For a comprehensive understanding of the data structure and attributes, please refer to the datasets’ documentation.

Documentation: https://web.archive.org/web/20180506035658/http://lisp.vse.cz/pkdd99/Challenge/berka.htm

Below is a display of account and district in an Entity-Relationship Diagram (ERD).

diagram

ERDs are visual representations that help understand the relationship between two or more datasets. In an ERD, each table in the diagram represents a dataset. The variables of each dataset are represented as rows under each respective table. In our case, the first column of our table is the variable’s name while the second column is the variable’s value type. The notations of the line connecting our two tables indicate their relationship type and is defined as “Crow’s Foot Notation”. To learn more about this notation, we recommend visiting this article: https://vertabelo.com/blog/crow-s-foot-notation/

There exists only two value types in the accounts and district table: “INT” and “VARCHAR”. The “INT” value type indicates that the corresponding value is an integer, while “VARCHAR” represents a variable-length string that can contain various characters. These value types help SQL understand the appropriate operations that can be performed on each value. Alongside this, the second column of our tables also show if a variable is a primary key (PK) or foreign key (FK). We introduce these concepts below.

What is a primary key and a foreign key?#

In a database, a primary key is a unique identifier for each record in a table. For instance, our accounts table has the primary key of “account_id”. This makes sense because every single row in the accounts table corresponds to “account_id” which represents one single account. The district table has the primary key of “District ID”. This means that each row under the district table represents one single district (or district id). So under the accounts table, there should not be any rows with the same “account_id” value. Similarly, the district table should not have any rows with the same “district_id” value.

A foreign key, on the other hand, establishes a relationship between two tables. It refers to the primary key of another table and helps connect the records across multiple tables. In our example, the “district_id” in the account table is a foreign key, indicating that it references the primary key of the district table. This allows us to associate each account with its corresponding district.

By using primary and foreign keys, we establish relationships between tables, enabling us to perform joins and retrieve meaningful information by linking related data together. Let’s jump straight into demonstrating these joins and how primary keys and foreign keys work.

Install - execute this once.#

Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql duckdb-engine pandas --quiet

Load the data#

We extract the financial data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file and convert the .asc files to .csv files. Finally, we save converted data into a folder.

Helper script#

We developed a banking.py script to help you extract the data from the URL and load it into a DuckDB database. This script is located here

This script downloads and stores the necessary data into a folder within the current directory. Please reference the script for more information.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402


_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)
Converted expanded_data/trans.asc to CSV.
Converted expanded_data/order.asc to CSV.
Converted expanded_data/loan.asc to CSV.
Converted expanded_data/district.asc to CSV.
Converted expanded_data/disp.asc to CSV.
Converted expanded_data/client.asc to CSV.
Converted expanded_data/card.asc to CSV.
Converted expanded_data/account.asc to CSV.
All ASC files converted to CSV.

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db'
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Queries#

Creating a Schema#

A schema helps define how our data is organized. Think of it as a container that holds certain datasets that relate to each other. In our bank_data.duck.db database, we could have several schemas, each having their own datasets that relate to each other. For now we will create one schema to hold our accounts and district dataset.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
Count

Let’s take a brief look at both of our tables before we get started.

%%sql
SELECT *
FROM s1.account
LIMIT 5
account_id district_id frequency date
576 55 POPLATEK MESICNE 930101
3818 74 POPLATEK MESICNE 930101
704 55 POPLATEK MESICNE 930101
2378 16 POPLATEK MESICNE 930101
2632 24 POPLATEK MESICNE 930102
%%sql
SELECT *
FROM s1.district
LIMIT 5
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
1 Hl.m. Praha Prague 1204953 0 0 0 1 1 100.0 12541 0.29 0.43 167 85677 99107
2 Benesov central Bohemia 88884 80 26 6 2 5 46.7 8507 1.67 1.85 132 2159 2674
3 Beroun central Bohemia 75232 55 26 4 1 5 41.7 8980 1.95 2.21 111 2824 2813
4 Kladno central Bohemia 149893 63 29 6 2 6 67.4 9753 4.64 5.05 109 5244 5892
5 Kolin central Bohemia 95616 65 30 4 1 6 51.4 9307 3.85 4.43 118 2616 3040

Before we jump into joins, I highly recommend following along with this resource.

Inner Join#

The most basic join is the inner join. Inner joins result in a query that returns rows where both tables have the specified key. For example, the query below inner joins our s1.account and s1.district table ON the “accounts_id” variable.

%%sql
SELECT *,
FROM s1.account
INNER JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
LIMIT 5
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
576 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
3818 74 POPLATEK MESICNE 930101 74 Ostrava - mesto north Moravia 323870 0 0 0 1 1 100.0 10673 4.75 5.44 100 18782 18347
704 55 POPLATEK MESICNE 930101 55 Brno - venkov south Moravia 157042 49 70 18 0 9 33.9 8743 1.88 2.43 111 3659 3894
2378 16 POPLATEK MESICNE 930101 16 Jindrichuv Hradec south Bohemia 93931 74 21 10 1 8 56.9 8427 1.12 1.54 107 1874 1913
2632 24 POPLATEK MESICNE 930102 24 Karlovy Vary west Bohemia 122603 25 21 6 2 8 80.0 8991 1.39 2.01 128 5198 5273

This query results in joining our s1.account and s1.district tables wherever the “district_id” exists in both tables. In this inner join, the “district_id” value must exist in both tables. For demonstration purposes, let’s hypothetically assume that there is a row in the s1.account table that has a “district_id” value of 99999. We INSERT this value INTO our s1.account table below.

%%sql 
INSERT INTO s1.account
VALUES (9999, 99999, 'POPLATEK MESICNE', 930101)
Count

Because the value 99999 does not exist under the “district_id” column in the s1.district table, this row from the s1.account will not appear in our join.

%%sql 
SELECT COUNT(*)
FROM s1.account
count_star()
4501
%%sql 
SELECT COUNT(*)
FROM s1.account
INNER JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
count_star()
4500

We see that the COUNT(*) of s1.account is 4501 and the COUNT(*) of our join is 4500. This is because our inner join excludes the row with an “district_id” value of 99999 from our s1.account table since it does not appear anywhere under the “district_id” column of our s1.district table.

Also notice that the “district_id_1” column in our original inner join query. This column does not inherit the “district_id” column name from our s1.district table because we would then have two columns with both “district_id” due to the inclusion of s1.account. To avoid this ambiguity, SQL automatically adds “_1” to the end of identical columns resulting from a join. If we are joining more than two tables (seen in the next section), then SQL will automatically increment the number to distinguish each identical column.

Question 1 (Medium):#

How many counts of each “district_id” appear in s1.account? Query the district_id, the respective count, and the district name. Filter the results to only have district id’s with a count greater than 40.

Hint: Try breaking the problem down step by step. First, take a look at the results of an inner join. What can you do from there to achieve the correct results?

Answers

We first inner join s1.account and s1.district on “district_id” to have a query that has the information necessary in answering this question. Then, we group by “district_id” and “district_name” in order to aggregate and have them in our select statement. The last filter step is through the HAVING clause because we filter post-aggregation.

%%sql 
SELECT s1.district.district_id, COUNT(*), s1.district.district_name
FROM s1.account
INNER JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
GROUP BY s1.district.district_id, s1.district.district_name
HAVING COUNT(*) > 40
district_id count_star() district_name
55 53 Brno - venkov
74 135 Ostrava - mesto
16 52 Jindrichuv Hradec
24 42 Karlovy Vary
1 554 Hl.m. Praha
47 53 Pardubice
76 55 Sumperk
48 53 Rychnov nad Kneznou
70 152 Karvina
19 55 Prachatice
Truncated to displaylimit of 10.

Left Join#

A left join guarantees that every row in the table before the ON clause (the left table) appears in our query, regardless if the key from that row matches the “right” table being joined.

%%sql 
SELECT COUNT(*)
FROM s1.account
LEFT JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
count_star()
4501

Recall the row we INSERT into the s1.account table. This row is included in our left join because s1.account appears before the JOIN clause in our SQL statement. This is why the COUNT(*) matches the number of rows of our s1.account table after we inserted the row. Let’s take a look at what this row looks like after our left join.

%%sql
SELECT *
FROM s1.account
LEFT JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
9999 99999 POPLATEK MESICNE 930101 None None None None None None None None None None None None None None None None

The first 4 columns have values, but the resulting values from the “right table” are all “None”! Since s1.district does not have the value 99999 under its “district_id” column, there is no information on this particular district_id to supplement the left table.

Also notice the particular syntax for our column in our WHERE clause. Since “district_id” appears in both s1.account and s1.district, we have to specify which table to run our WHERE clause on.

Right Join#

Right join is identical to the nature of the left join. A right join will guarantee the inclusion of every row from the “right table”, regardless if the key being joined on appears in the “left table.”

Here we replace the LEFT JOIN in our last example with RIGHT JOIN.

%%sql 
SELECT COUNT(*)
FROM s1.account
RIGHT JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
count_star()
4500

The resulting COUNT(*) of our join omits the one row we INSERT into s1.account previously. Let’s double check and see if our inserted row exists after this join.

%%sql 
SELECT *
FROM s1.account
RIGHT JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96

Exactly what we expect.

What would happen if we reverse the placement of our tables in our RIGHT JOIN clause? The results of reversing the table placements are below.

%%sql 
SELECT COUNT(*)
FROM s1.district
RIGHT JOIN s1.account 
    ON s1.account.district_id = s1.district.district_id
count_star()
4501
%%sql 
SELECT *
FROM s1.district
RIGHT JOIN s1.account 
    ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999
district_id district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96 account_id district_id_1 frequency date
None None None None None None None None None None None None None None None None 9999 99999 POPLATEK MESICNE 930101

We get near exact results from our previous LEFT JOIN demonstration. The only difference is the order of our columns in the query output. The s1.account table appears on the far right whil the s1.district table appears on the left. To ensure complete visibility of the output, please utilize the scroll bar.

Question 2 (Easy):#

Show the maximum “account_id” value corresponding with that account’s district information, regardless if there is any or not. Name the resulting query colum “max_acc_id”. You must include a join.

Answers

We first left join s1.account and s1.district on “district_id” to have a query that guarantees completed information from the s1.account table. Then, we find the max of “account_id” and rename it accordingly.

%%sql 
SELECT MAX(s1.account.account_id) AS max_acc_id
FROM s1.account
LEFT JOIN s1.district
    ON s1.account.district_id = s1.district.district_id
max_acc_id
11382

Full Join#

Full join (also known as outer join) results in the inclusion of all rows from both tables. To showcase the full capability of full joins, we first INSERT another row into the s1.district table with a “district_id” value not present in the s1.account table. In other words, we are creating new value for s1.district’s primary key that does not appear in s1.account’s foreign key.

%%sql 
INSERT INTO s1.district
VALUES (3333, 'Hypothetical District', 'Hypothetical Region',1,1,1,1,1,1,1,1,1,1,1,1,1)
Count
%%sql 
SELECT COUNT(*)
FROM s1.account 
FULL JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
count_star()
4502
%%sql 
SELECT *
FROM s1.account
FULL JOIN s1.district 
    ON s1.account.district_id = s1.district.district_id
WHERE s1.account.district_id = 99999 OR s1.district.district_id = 3333
account_id district_id frequency date district_id_1 district_name region no_of_inhabitants no_of_municipalities_lt_499 no_of_municipalities_500_1999 no_of_municipalities_2000_9999 no_of_municipalities_gt_10000 no_of_cities ratio_of_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 no_of_entrepreneurs_per_1000_inhabitants no_of_committed_crimes_95 no_of_committed_crimes_96
9999 99999 POPLATEK MESICNE 930101 None None None None None None None None None None None None None None None None
None None None None 3333 Hypothetical District Hypothetical Region 1 1 1 1 1 1 1.0 1 1 1.0 1 1 1

The full join on s1.district and s1.account results in a query that includes all rows from both tables. The OR clause in the last SQL statement verifies this by querying the hypothetical rows we created in each table.

Question 3 (Medium):#

What is the average “account_id” value for accounts in the “Prague” region? Round the average by 3 decimal places.

It doesn’t really make sense to average by the “account_id” as it is an arbitrary number to uniquely identify each account. However, try to ignore that for the purpose of practice.

Answers

We first join s1.account and s1.district on “district_id” to have a query of completed information between the two tables. Then, we GROUP BY region and SELECT the region name and the average value of “account_id” within regions. Finally, we use HAVING to filter where region has the value ‘Prague’ post grouping with GROUP BY.

%%sql
SELECT s1.district.region, ROUND(AVG(s1.account.account_id), 3) 
FROM s1.account
INNER JOIN s1.district
    ON s1.account.district_id = s1.district.district_id
GROUP BY s1.district.region
HAVING s1.district.region = 'Prague'
region round(avg(s1.account.account_id), 3)
Prague 2963.758

Deleting Tables

%%sql
DROP TABLE s1.account;
DROP TABLE s1.district;
DROP SCHEMA s1;
Success

Wrapping Up#

In this section, we learned the basic join types and how to use them when given two tables. We also learned the definiton of primary and foriegn keys along with an introduction to ERDs. To recap:

  • FULL JOIN - Guarantees that every row from both joined tables is included in the resulting query, regardless of whether a matching key exists in the other table or not. A FULL JOIN essentially combines the results of a RIGHT JOIN and LEFT JOIN

  • LEFT JOIN - Guarantees that every row from the left table is included in the resulting query, regardless of whether a matching key exists in the other table or not.

  • RIGHT JOIN - Guarantees that every row from the right table is included in the resulting query, regardless of whether a matching key exists in the other table or not.

  • Primary key - The column in a table that uniquely identifies each row of the table.

  • Foreign key - A column in a table that establishes a link or relationship to the primary key of another table.

In the next section, you will learn how to implement joins in more than two tables.

References#

Dataset citation:

“PKDD’99 Discovery Challenge Guide to the Financial Data Set.” Home page of PKDD Discovery Challenge, 1999. https://sorry.vse.cz/~berka/challenge/PAST/.