Types of data visualizations#

Welcome to the Visualizing your SQL queries module of the course! This part of the course will introduce data visualizations and commonly used packages. After getting familiar with the types of data visualizations and visualization packages, we’ll revisit SQL and teach you JupySQL’s unique feature of utilizing ggplot to visualize queries.

In this module, we will learn about the seaborn and plotly packages. Before we get into the details of each package, we first introduce the common types of data visualization with one of the most basic visualization packages: matplotlib. The purpose of this section is to not teach you the ins and outs of matplotlib, but more so to introduce some basic data visualizations.

Getting started#

To demonstrate these visualizations, we’re going to revisit our familiar bank data set from the Making Your First Query section.

As always, let’s first follow the steps of ensuring we have our necessary packages ready for use.

Install - execute this once.#

This code installs JupySQL, DuckDB, Pandas, and Matplotlib in your environment. We will be using these moving forward.

%pip install jupysql pandas matplotlib --quiet

Load the data#

We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv. We also import the matplotlib package as plt.

import sys
import matplotlib.pyplot as plt
import pandas as pd

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

After running this code, you should have bank_cleaned.csv in the current directory.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Creating Table#

Now that we have our bank_cleaned.csv file, let’s load it into our DuckDB database and take a look at the data.

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count
%sqlcmd explore --table bank

The visualization packages we will be introducing in this module work best with the Pandas DataFrame data structure. Prior to visualizing our queries, we will always first convert them into Pandas DataFrames.

We convert the bank table below as an example.

bank = %sql SELECT * FROM bank
bank_df = pd.DataFrame(bank)
bank_df
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4516 33 services married secondary no -333 yes no cellular 30 jul 329 5 -1 0 unknown no
4517 57 self-employed married tertiary yes -3313 yes yes unknown 9 may 153 1 -1 0 unknown no
4518 57 technician married secondary no 295 no no cellular 19 aug 151 11 -1 0 unknown no
4519 28 blue-collar married secondary no 1137 no no cellular 6 feb 129 4 211 3 other no
4520 44 entrepreneur single tertiary no 1136 yes yes cellular 3 apr 345 2 249 7 other no

4521 rows × 17 columns

Now we can jump into one of the most simple, yet essential, data visualizations: the bar plot.

Bar Plot#

Let’s use a bar plot that visualizes the count of each job type in our data. To do so, we will query the counts of each job and convert the query into a Pandas DataFrame.

%%sql --save jobs
SELECT job, COUNT(*) as count
FROM bank 
GROUP BY job
job count
unemployed 128
services 417
management 969
blue-collar 946
self-employed 183
technician 768
entrepreneur 168
admin. 478
student 84
housemaid 112
Truncated to displaylimit of 10.
jobs = %sql SELECT * FROM jobs
jobs_df = pd.DataFrame(jobs)
Generating CTE with stored snippets: 'jobs'

Now that we have our query in a Pandas DataFrame, we can use matplotlib to visualize a bar plot.

plt.figure(figsize=(10, 6))
plt.bar(data=jobs_df, x="job", height="count")

plt.xlabel("Job")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.title("Count of Each Job")

plt.show()
../_images/e3e8b789764ed12626612029448f4838954ec4d8c4b44fbf5eb108dab59c2ed7.png

The second line in the above code cell, plt.bar(data=jobs_df, x="job", height="count"), is really all we need to create a baseline bar plot. The remaining statements are supplemental elements that labels and customizes the y-axis, x-axis, size, and title of the plot.

We can easily see that management and blue-collar jobs are the most prominent job category in this data set. Box plots are a great option when you need to visualize distributions of groups in a categorical variable.

Scatter Plot#

We first query the age and balance of single individuals and save it as a table called age_balance with a CTE.

%%sql --save age_balance
SELECT age, balance, marital
FROM bank
WHERE marital = 'single'
age balance marital
35 1350 single
35 747 single
20 502 single
37 2317 single
25 -221 single
44 106 single
32 2536 single
23 363 single
38 11971 single
36 553 single
Truncated to displaylimit of 10.

Then we again convert the table as a Pandas DataFrame.

age_balance_query = %sql SELECT * FROM age_balance
age_balance_df = pd.DataFrame(age_balance_query)
Generating CTE with stored snippets: 'age_balance'
plt.figure(figsize=(10, 6))
plt.scatter(data=age_balance_df, x="age", y="balance")

plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Age by Balance of Single Clients")
Text(0.5, 1.0, 'Age by Balance of Single Clients')
../_images/324bb253cdc7dbdb5900b6bd58be58024dd44a13db737f9f3867ac92cb1c3e63.png

Scatter plots are great when analyzing the relationship between two numerical variables. In this example, we plot the relationship between one’s age and balance for single individuals in our data set.

Box Plot#

%%sql --save edu_balance
SELECT education, age
FROM bank
education age
primary 30
secondary 33
tertiary 35
tertiary 30
secondary 59
tertiary 35
tertiary 36
secondary 39
tertiary 41
primary 43
Truncated to displaylimit of 10.
edu_balance_query = %sql SELECT * FROM edu_balance
edu_df = pd.DataFrame(edu_balance_query)
Generating CTE with stored snippets: 'edu_balance'
# group the data by 'education'
edu_group = edu_df.groupby("education")["age"].apply(list)

plt.figure(figsize=(10, 6))
plt.boxplot(edu_group)

# customize x-axis tick labels
plt.xticks(range(1, len(edu_group) + 1), edu_df["education"].unique())

plt.xlabel("Education Level")
plt.ylabel("Age")
plt.title("Boxplot of Education Level by Age")

plt.show()
../_images/839bbf8531224830b9927b28c4a53d63db2e993c8b5457a94dc3dd854d5b5b5e.png

Box plots can be used when examining the relationship between a categorical feature and a numerical feature. In this plot, our categorical feature is the education variable. Each box represents a group within education and their respective ages in quantiles. This allows for a quick comparison of the distribution of groups within this categorical variable.

Heat Map#

%%sql --save job_education
SELECT job, education, COUNT(*) as count
FROM bank 
GROUP BY job, education
job education count
unemployed primary 26
services secondary 363
management tertiary 787
blue-collar secondary 524
self-employed tertiary 88
technician secondary 520
entrepreneur tertiary 73
services primary 25
admin. secondary 393
technician tertiary 211
Truncated to displaylimit of 10.
job_edu_query = %sql SELECT * FROM job_education
job_df = pd.DataFrame(job_edu_query)
Generating CTE with stored snippets: 'job_education'
data = job_df.pivot(index="education", columns="job", values="count")

plt.figure(figsize=(10, 6))
plt.imshow(data)

plt.colorbar(label="Count")
plt.xticks(range(len(data.columns)), data.columns, rotation=45)
plt.yticks(range(len(data.index)), data.index)
plt.title("Heatmap of Job and Education")

plt.show()
../_images/298ab4e7b308d862b32e57a2b6ec556fc73536b1030715a69c082c75dffe1fe4.png

The above plot displays the counts of job and education level of our data set. Heat maps are generally easy to understand because viewers can quickly point out extremes based on darker or lighter boxes. Here, we easily see people with management jobs have a high count of having a tertiary level education in our data set. You can think of heat maps as illustrating three dimensions: the x-axis, the y-axis, and the color gradient (which is usually a numerical feature).

Delete table

%%sql
DROP TABLE bank;
Success

Wrapping Up#

In this section, we introduced some basic data visualization plots: bar plots, scatter plots, box plots, and heat maps. The sections moving forward will teach you how to implement each of these plots using the seaborn and plotly libraries using the familiar banking data sets from the previous modules.