Combining Data from Multiple Tables#

In the last section, we learned about joining two tables in SQL. We now build upon those basics to learn to combine multiple tables into a tidy, single table.

Let’s first run the installations and setup before running any queries, just like the previous lesson: Joining Data in SQL.

Install - execute this once.#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql duckdb-engine pandas --quiet

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

Helper script#

We developed a banking.py script to help you extract the data from the URL and load it into a DuckDB database. This script is located here

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402


_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)
Converted expanded_data/trans.asc to CSV.
Converted expanded_data/order.asc to CSV.
Converted expanded_data/loan.asc to CSV.
Converted expanded_data/district.asc to CSV.
Converted expanded_data/disp.asc to CSV.
Converted expanded_data/client.asc to CSV.
Converted expanded_data/card.asc to CSV.
Converted expanded_data/account.asc to CSV.
All ASC files converted to CSV.

If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. However, in this tutorial, we will focus on four out of the eight .csv files.

ER Diagram#

In the previous tutorial, we provided an Entity-Relationship Diagram (ERD) to help us understand the relationship between the account and district tables. In this section, we will augment that diagram to understand the relationship between multiple tables.

Each table in the diagram represents a dataset. The variables of each dataset are represented as rows. The first column is the variable name while the second column is the variable’s value type alongside if the variable is a primary key (PK) or foreign key (FK).

Important

If you are unfamiliar with the terms primary key and foreign key, please refer to the previous tutorial for more information.

In this section, we will be focusing on joining the following 4 tables: account, card, district, and disp. The schema diagram for these tables is shown below.

ERD-Combining

The account table has a foreign key column named “district_id” referencing the primary key in the district table. The card table has a foreign key column named “disp_id” referencing the primary key in the disp table, and the link table has a foreign key column named “account_id” referencing the primary key in the account table. It’s worth noting that the “type” columns in the card and disp tables are not foreign keys; they represent categorical variables indicating the type of card and disposition, respectively.

The notations of the line connecting our two tables indicate their relationship type and is defined as “Crow’s Foot Notation”. To learn more about this notation, we recommend visiting this article: https://vertabelo.com/blog/crow-s-foot-notation/

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Creating Tables#

Let’s start off with loading four of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.card AS
FROM read_csv_auto('expanded_data/card.csv', header=True, sep=',');
CREATE TABLE s1.link AS
FROM read_csv_auto('expanded_data/disp.csv', header=True, sep=',');
Count

Combining Multiple Tables#

When combining more than two tables, it is imperative to know what the final result of our joins will look like. If we are familiar with the desired output before executing our queries, we can make an informed decision on whether to use an INNER JOIN, OUTER JOIN, or FULL JOIN to achieve our desired result. In short, assessing the case at hand will help us determine the type of join to use.

To quickly refresh our memory from the previous tutorial on INNER JOIN, recall that an INNER JOIN is suitable when we want to retrieve only the matching records, effectively filtering out any unmatched records, between the tables involved in the join. Therefore, the result will contain only the rows where the join condition is met in all the joined tables.

On the other hand, an OUTER JOIN can be of three types: Left, Right, and Full. They were covered extensively in the previous tutorial, but here is a quick refresher. In situations where we need to retrieve all records from one table, regardless of whether there is a match in the other tables, an OUTER JOIN can be used. This is particularly useful when we want to include supplementary information or handle missing data. A LEFT JOIN includes all rows from the left table (the table specified before the LEFT JOIN keyword) and the matching rows from the right table(s) (the table(s) specified after the LEFT JOIN keyword). Similarly, a RIGHT JOIN includes all rows from the right table(s) and the matching rows from the left table. A FULL JOIN combines operations of both the LEFT JOIN and RIGHT JOIN, including all rows from both tables and matching rows where the join condition is met.

Therefore, understanding the desired output and the relationships between the tables will help us determine whether an INNER JOIN or an OUTER JOIN is more appropriate for obtaining the desired result. By utilizing the appropriate join type and leveraging primary and foreign keys, we can combine data from multiple tables efficiently.

INNER JOIN#

Suppose that we want to retrieve common data points between the account, card, district, and link tables because unmatched records represent churned customers who do not utilize this bank’s services anymore. Therefore, we can use the INNER JOIN to join the account, card, district, and link tables on their respective foreign keys. The query below demonstrates how we can use the INNER JOIN to join the four tables:

%%sql 
SELECT a.account_id, c.card_id, d.district_id, l.disp_id
FROM s1.account AS a
INNER JOIN s1.district AS d 
ON a.district_id = d.district_id
INNER JOIN s1.link AS l 
ON l.account_id = a.account_id
INNER JOIN s1.card AS c 
ON c.disp_id = l.disp_id; 
account_id card_id district_id disp_id
1695 321 76 2058
2099 405 3 2542
3909 725 1 4715
4034 743 1 4862
212 40 27 255
456 99 49 548
1019 201 52 1223
5891 874 54 7127
1750 329 1 2122
1649 314 46 1999
Truncated to displaylimit of 10.

The result of the query above is a table containing the primary key columns from each table. The number of rows is 892, which signifies that there are 892 matching records between the four tables. This means that there are 892 customers who have accounts, cards, and are linked to a district. Next, attempt the question below to drive home your understanding of the INNER JOIN!

Question 1 (Medium):#

Suppose we want to analyze the spending patterns of customers across different districts based on their account and card information. Unmatched records could represent customers who have recently opened accounts or received new cards and are yet to make any transactions. Write a SQL query to join the account, card, district, and link tables from the s1 schema in the DuckDB database. Include only primary key columns from each table in the result set.

Answers

We should use an INNER JOIN to exclude unmatched data points, so that we can focus our analysis on customers who have actively used their accounts and cards for transactions.

Recall that in the previous tutorial, we learned that the account table contains a column named district_id, which serves as a foreign key referencing the primary key in the district table. Similarly, the card table has a column named disp_id, which is a foreign key referencing the primary key in the link table. By utilizing these relationships, we can perform a join operation on these tables to obtain a consolidated result set.

To combine the account, card, district, and link tables, we can use the following SQL query:

%%sql 
SELECT a.account_id, c.card_id, d.district_id, l.disp_id
FROM s1.account AS a
INNER JOIN s1.district AS d 
ON a.district_id = d.district_id
INNER JOIN s1.link AS l 
ON l.account_id = a.account_id
INNER JOIN s1.card AS c 
ON c.disp_id = l.disp_id; 
account_id card_id district_id disp_id
1695 321 76 2058
2099 405 3 2542
3909 725 1 4715
4034 743 1 4862
212 40 27 255
456 99 49 548
1019 201 52 1223
5891 874 54 7127
1750 329 1 2122
1649 314 46 1999
Truncated to displaylimit of 10.

In this query, we are performing a series of INNER JOIN operations to merge the tables based on the specified join conditions. The ON clause defines the relationship between the columns that are used for joining. The SELECT statement retrieves only the primary keys from each of the joined tables in the result set.

Important

The above query can also be written without the INNER JOIN clause! Another way to write the query is as follows:

%%sql
SELECT a.account_id, c.card_id, d.district_id, l.disp_id
FROM s1.account as a, s1.card as c, s1.district as d, s1.link as l
WHERE a.district_id = d.district_id AND
      l.account_id = a.account_id AND
      c.disp_id = l.disp_id;

OUTER JOIN#

Suppose that the bank is planning to hike interest rates on accounts because the districts, where they have their branches, have gentrified over the past couple of years. We want to analyze not only the activity of customers across different districts based on their account and card information, but also district-level statistics (employment rate and crime rate). However, we also want to include customers who have recently opened accounts or received new cards and are yet to make any transactions. Therefore, we should use an OUTER JOIN, preferably the LEFT JOIN, so that we can capture records of all customers and appropriately conduct a district-wide analysis.

The query below demonstrates how we can use the LEFT JOIN to join the tables account, card, district, and link:

%%sql 
SELECT DISTINCT a.account_id, c.card_id, d.district_id, l.disp_id, d.unemployment_rate_95, d.unemployment_rate_96, d.no_of_committed_crimes_95, d.no_of_committed_crimes_96
FROM s1.district AS d
LEFT JOIN s1.account AS a
ON a.district_id = d.district_id
LEFT JOIN s1.link AS l 
ON l.account_id = a.account_id
LEFT JOIN s1.card AS c 
ON c.disp_id = l.disp_id;
account_id card_id district_id disp_id unemployment_rate_95 unemployment_rate_96 no_of_committed_crimes_95 no_of_committed_crimes_96
3909 725 1 4715 0.29 0.43 85677 99107
4034 743 1 4862 0.29 0.43 85677 99107
212 40 27 255 0.65 1.29 1029 1127
456 99 49 548 1.89 2.26 1660 2123
5891 874 54 7127 1.60 1.96 18721 18696
1750 329 1 2122 0.29 0.43 85677 99107
1649 314 46 1999 1.79 2.31 2854 2618
2670 503 13 3225 2.77 3.26 1597 1875
1127 223 27 1353 0.65 1.29 1029 1127
4296 765 63 5175 3.79 4.52 1562 1460
Truncated to displaylimit of 10.

In contrast to the INNER JOIN, the order of the tables plays an important role in the LEFT JOIN, and the results may be completely different if the order changes in your SQL query. When determining the order of tables in a LEFT JOIN, the general rule is to start with the table from which you want to keep all the records in the final result. Therefore, in the above example, instead of starting with the account table, we started with the district table, because we want to keep all the records from the district table in the final result.

Also, keep in mind that a LEFT JOIN cascades to all joins in a query. If you use a LEFT JOIN, often the subsequent tables should also be left-joined. An INNER JOIN will drop records not found on both sides of the join, and you might lose all the records you want to keep.

Furthermore, a LEFT JOIN should be used for the third table when the first table has records not found in the second (and so on) and a different common field is used to join the second and the third tables. An INNER JOIN in this situation will drop all the records from the first table not matched to the second and third tables.

Question 2 (Hard):#

Suppose only tables links and cards did not have matching rows that are related to each other, what kind of join would you use to combine these tables? If you were to use this join, and then join the resulting table with the merged table of account and district, how would the final output differ from Question 1?

Hint: We can use both INNER JOIN and OUTER JOIN in a single query! The first step would be to think about the type of JOIN to combine the links and cards tables. Then, how can this table be joined with the remaining tables if the remaining tables have matching rows that are related to each other?

Answers

A FULL OUTER JOIN would be used to combine the links and cards tables if we want to keep both the rows that can be matched and the unpaired rows. This is because, in this case, the tables links and cards do not have matching rows that are related to each other. Therefore, we need to account for the unpaired rows in the result set.

The FULL OUTER JOIN will return all rows from both tables, and if there are no matches, the columns from the other table will be filled with NULL values. This is in contrast to the INNER JOIN, which only returns rows that have matching values in both tables and, hence, no NULL values.

Code to implement the multiple joins is shown below:

%%sql
SELECT a.account_id, c.card_id, d.district_id, l.disp_id
FROM s1.link AS l
FULL OUTER JOIN s1.card AS c 
ON l.disp_id = c.disp_id 
INNER JOIN s1.account AS a 
ON a.account_id = l.account_id 
INNER JOIN s1.district AS d 
ON d.district_id = a.district_id;
account_id card_id district_id disp_id
7 1 60 9
14 2 47 19
33 3 22 41
34 4 67 42
43 5 36 51
48 7 21 56
51 8 67 60
65 9 36 76
66 10 48 77
68 11 37 79
Truncated to displaylimit of 10.

Therefore, the final output will include all rows from the links and cards tables, as well as the rows from the account and district tables that have matching values in the links and cards tables. Because the links table has the highest number of rows, 5369, all of them will be included in the final output. The rows from the account and district tables that do not have matching values in the links and cards tables will not be included in the final output. Since a FULL OUTER JOIN is used first, the final output will include all rows from the links and cards tables, even if there are no matching values in the account and district tables.

Wrapping Up#

In this section, we learnt about how to join more than two tables. To summarize:

  • When combining multiple tables, it is important to understand the desired output and the relationships between the tables to determine the appropriate join type.

  • Understanding primary and foreign keys helps in performing join operations efficiently.

  • INNER JOIN and JOIN : The JOIN function is a shorthand for INNER JOIN and is used interchangeably with INNER JOIN to achieve the same result. Both functions combine rows from multiple tables, based on specified join conditions. They return only the rows that have matching values in both tables being joined. The join conditions are specified using the ON keyword, where you define the columns from each table that should be compared for the join.

  • An OUTER JOIN, such as the LEFT JOIN, can be used to retrieve all records from one table regardless of whether there is a match in the other tables. This is useful when including supplementary information or handling missing data.

  • The order of tables in a LEFT JOIN is important, and it cascades to all joins in a query. Start with the table from which you want to keep all the records in the final result.

This brings us to a wrap of the first module: Intro to SQL! We hope you have enjoyed the content and are ready to move on to the next module: Interactive Queries and Parameterization.

Delete table

%%sql
DROP TABLE s1.account;
DROP TABLE s1.district;
DROP TABLE s1.card;
DROP TABLE s1.link;
DROP SCHEMA s1;
Success

References#

Dataset citation:

“PKDD’99 Discovery Challenge Guide to the Financial Data Set.” Home page of PKDD Discovery Challenge, 1999. https://sorry.vse.cz/~berka/challenge/PAST/.