Types of data visualizations#
Welcome to the Visualizing your SQL queries module of the course! This part of the course will introduce data visualizations and commonly used packages. After getting familiar with the types of data visualizations and visualization packages, we’ll revisit SQL and teach you JupySQL’s unique feature of utilizing ggplot
to visualize queries.
In this module, we will learn about the seaborn
and plotly
packages. Before we get into the details of each package, we first introduce the common types of data visualization with one of the most basic visualization packages: matplotlib
. The purpose of this section is to not teach you the ins and outs of matplotlib
, but more so to introduce some basic data visualizations.
Getting started#
To demonstrate these visualizations, we’re going to revisit our familiar bank data set from the Making Your First Query section.
As always, let’s first follow the steps of ensuring we have our necessary packages ready for use.
Install - execute this once.#
This code installs JupySQL, DuckDB, Pandas, and Matplotlib in your environment. We will be using these moving forward.
%pip install jupysql pandas matplotlib --quiet
Load the data#
We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv
. We also import the matplotlib
package as plt
.
import sys
import matplotlib.pyplot as plt
import pandas as pd
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()
After running this code, you should have bank_cleaned.csv
in the current directory.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Creating Table#
Now that we have our bank_cleaned.csv
file, let’s load it into our DuckDB database and take a look at the data.
%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count |
---|
%sqlcmd explore --table bank
The visualization packages we will be introducing in this module work best with the Pandas DataFrame data structure. Prior to visualizing our queries, we will always first convert them into Pandas DataFrames.
We convert the bank
table below as an example.
bank = %sql SELECT * FROM bank
bank_df = pd.DataFrame(bank)
bank_df
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown | no |
4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown | no |
4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown | no |
4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other | no |
4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249 | 7 | other | no |
4521 rows × 17 columns
Now we can jump into one of the most simple, yet essential, data visualizations: the bar plot.
Bar Plot#
Let’s use a bar plot that visualizes the count of each job type in our data. To do so, we will query the counts of each job and convert the query into a Pandas DataFrame.
%%sql --save jobs
SELECT job, COUNT(*) as count
FROM bank
GROUP BY job
job | count |
---|---|
unemployed | 128 |
services | 417 |
management | 969 |
blue-collar | 946 |
self-employed | 183 |
technician | 768 |
entrepreneur | 168 |
admin. | 478 |
student | 84 |
housemaid | 112 |
jobs = %sql SELECT * FROM jobs
jobs_df = pd.DataFrame(jobs)
Now that we have our query in a Pandas DataFrame, we can use matplotlib
to visualize a bar plot.
plt.figure(figsize=(10, 6))
plt.bar(data=jobs_df, x="job", height="count")
plt.xlabel("Job")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.title("Count of Each Job")
plt.show()

The second line in the above code cell, plt.bar(data=jobs_df, x="job", height="count")
, is really all we need to create a baseline bar plot. The remaining statements are supplemental elements that labels and customizes the y-axis, x-axis, size, and title of the plot.
We can easily see that management
and blue-collar
jobs are the most prominent job category in this data set. Box plots are a great option when you need to visualize distributions of groups in a categorical variable.
Scatter Plot#
We first query the age
and balance
of single individuals and save it as a table called age_balance
with a CTE.
%%sql --save age_balance
SELECT age, balance, marital
FROM bank
WHERE marital = 'single'
age | balance | marital |
---|---|---|
35 | 1350 | single |
35 | 747 | single |
20 | 502 | single |
37 | 2317 | single |
25 | -221 | single |
44 | 106 | single |
32 | 2536 | single |
23 | 363 | single |
38 | 11971 | single |
36 | 553 | single |
Then we again convert the table as a Pandas DataFrame.
age_balance_query = %sql SELECT * FROM age_balance
age_balance_df = pd.DataFrame(age_balance_query)
plt.figure(figsize=(10, 6))
plt.scatter(data=age_balance_df, x="age", y="balance")
plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Age by Balance of Single Clients")
Text(0.5, 1.0, 'Age by Balance of Single Clients')

Scatter plots are great when analyzing the relationship between two numerical variables. In this example, we plot the relationship between one’s age
and balance
for single individuals in our data set.
Box Plot#
%%sql --save edu_balance
SELECT education, age
FROM bank
education | age |
---|---|
primary | 30 |
secondary | 33 |
tertiary | 35 |
tertiary | 30 |
secondary | 59 |
tertiary | 35 |
tertiary | 36 |
secondary | 39 |
tertiary | 41 |
primary | 43 |
edu_balance_query = %sql SELECT * FROM edu_balance
edu_df = pd.DataFrame(edu_balance_query)
# group the data by 'education'
edu_group = edu_df.groupby("education")["age"].apply(list)
plt.figure(figsize=(10, 6))
plt.boxplot(edu_group)
# customize x-axis tick labels
plt.xticks(range(1, len(edu_group) + 1), edu_df["education"].unique())
plt.xlabel("Education Level")
plt.ylabel("Age")
plt.title("Boxplot of Education Level by Age")
plt.show()

Box plots can be used when examining the relationship between a categorical feature and a numerical feature. In this plot, our categorical feature is the education
variable. Each box represents a group within education
and their respective ages in quantiles. This allows for a quick comparison of the distribution of groups within this categorical variable.
Heat Map#
%%sql --save job_education
SELECT job, education, COUNT(*) as count
FROM bank
GROUP BY job, education
job | education | count |
---|---|---|
unemployed | primary | 26 |
services | secondary | 363 |
management | tertiary | 787 |
blue-collar | secondary | 524 |
self-employed | tertiary | 88 |
technician | secondary | 520 |
entrepreneur | tertiary | 73 |
services | primary | 25 |
admin. | secondary | 393 |
technician | tertiary | 211 |
job_edu_query = %sql SELECT * FROM job_education
job_df = pd.DataFrame(job_edu_query)
data = job_df.pivot(index="education", columns="job", values="count")
plt.figure(figsize=(10, 6))
plt.imshow(data)
plt.colorbar(label="Count")
plt.xticks(range(len(data.columns)), data.columns, rotation=45)
plt.yticks(range(len(data.index)), data.index)
plt.title("Heatmap of Job and Education")
plt.show()

The above plot displays the counts of job
and education
level of our data set. Heat maps are generally easy to understand because viewers can quickly point out extremes based on darker or lighter boxes. Here, we easily see people with management jobs have a high count of having a tertiary level education in our data set. You can think of heat maps as illustrating three dimensions: the x-axis, the y-axis, and the color gradient (which is usually a numerical feature).
Delete table
%%sql
DROP TABLE bank;
Success |
---|
Wrapping Up#
In this section, we introduced some basic data visualization plots: bar plots, scatter plots, box plots, and heat maps. The sections moving forward will teach you how to implement each of these plots using the seaborn
and plotly
libraries using the familiar banking data sets from the previous modules.