Types of data visualizations

Types of data visualizations#

Welcome to the Visualizing your SQL queries module of the course! This part of the course will introduce data visualizations and commonly used packages. After getting familiar with the types of data visualizations and visualization packages, we’ll revisit SQL and teach you JupySQL’s unique feature of utilizing ggplot to visualize queries.

In this module, we will learn about the seaborn and plotly packages. Before we get into the details of each package, we first introduce the common types of data visualization with one of the most basic visualization packages: matplotlib. The purpose of this section is to not teach you the ins and outs of matplotlib, but more so to introduce some basic data visualizations.

Getting started#

To demonstrate these visualizations, we’re going to revisit our familiar bank data set from the Making Your First Query section.

As always, let’s first follow the steps of ensuring we have our necessary packages ready for use.

Install - execute this once.#

This code installs JupySQL, DuckDB, Pandas, and Matplotlib in your environment. We will be using these moving forward.

%pip install jupysql pandas matplotlib --quiet

Load the data#

We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv. We also import the matplotlib package as plt.

import sys
import matplotlib.pyplot as plt
import pandas as pd

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

After running this code, you should have bank_cleaned.csv in the current directory.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Creating Table#

Now that we have our bank_cleaned.csv file, let’s load it into our DuckDB database and take a look at the data.

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')

Count

%sqlcmd explore --table bank

The visualization packages we will be introducing in this module work best with the Pandas DataFrame data structure. Prior to visualizing our queries, we will always first convert them into Pandas DataFrames.

We convert the bank table below as an example.

bank = %sql SELECT * FROM bank
bank_df = pd.DataFrame(bank)
bank_df

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
0	30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
1	33	services	married	secondary	no	4789	yes	yes	cellular	11	may	220	1	339	4	failure	no
2	35	management	single	tertiary	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure	no
3	30	management	married	tertiary	no	1476	yes	yes	unknown	3	jun	199	4	-1	0	unknown	no
4	59	blue-collar	married	secondary	no	0	yes	no	unknown	5	may	226	1	-1	0	unknown	no
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4516	33	services	married	secondary	no	-333	yes	no	cellular	30	jul	329	5	-1	0	unknown	no
4517	57	self-employed	married	tertiary	yes	-3313	yes	yes	unknown	9	may	153	1	-1	0	unknown	no
4518	57	technician	married	secondary	no	295	no	no	cellular	19	aug	151	11	-1	0	unknown	no
4519	28	blue-collar	married	secondary	no	1137	no	no	cellular	6	feb	129	4	211	3	other	no
4520	44	entrepreneur	single	tertiary	no	1136	yes	yes	cellular	3	apr	345	2	249	7	other	no

4521 rows × 17 columns

Now we can jump into one of the most simple, yet essential, data visualizations: the bar plot.

Bar Plot#

Let’s use a bar plot that visualizes the count of each job type in our data. To do so, we will query the counts of each job and convert the query into a Pandas DataFrame.

%%sql --save jobs
SELECT job, COUNT(*) as count
FROM bank 
GROUP BY job

job	count
unemployed	128
services	417
management	969
blue-collar	946
self-employed	183
technician	768
entrepreneur	168
admin.	478
student	84
housemaid	112

Truncated to displaylimit of 10.

jobs = %sql SELECT * FROM jobs
jobs_df = pd.DataFrame(jobs)

Generating CTE with stored snippets: 'jobs'

Now that we have our query in a Pandas DataFrame, we can use matplotlib to visualize a bar plot.

plt.figure(figsize=(10, 6))
plt.bar(data=jobs_df, x="job", height="count")

plt.xlabel("Job")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.title("Count of Each Job")

plt.show()

../_images/e3e8b789764ed12626612029448f4838954ec4d8c4b44fbf5eb108dab59c2ed7.png

The second line in the above code cell, plt.bar(data=jobs_df, x="job", height="count"), is really all we need to create a baseline bar plot. The remaining statements are supplemental elements that labels and customizes the y-axis, x-axis, size, and title of the plot.

We can easily see that management and blue-collar jobs are the most prominent job category in this data set. Box plots are a great option when you need to visualize distributions of groups in a categorical variable.

Scatter Plot#

We first query the age and balance of single individuals and save it as a table called age_balance with a CTE.

%%sql --save age_balance
SELECT age, balance, marital
FROM bank
WHERE marital = 'single'

age	balance	marital
35	1350	single
35	747	single
20	502	single
37	2317	single
25	-221	single
44	106	single
32	2536	single
23	363	single
38	11971	single
36	553	single

Truncated to displaylimit of 10.

Then we again convert the table as a Pandas DataFrame.

age_balance_query = %sql SELECT * FROM age_balance
age_balance_df = pd.DataFrame(age_balance_query)

Generating CTE with stored snippets: 'age_balance'

plt.figure(figsize=(10, 6))
plt.scatter(data=age_balance_df, x="age", y="balance")

plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Age by Balance of Single Clients")

Text(0.5, 1.0, 'Age by Balance of Single Clients')

../_images/324bb253cdc7dbdb5900b6bd58be58024dd44a13db737f9f3867ac92cb1c3e63.png

Scatter plots are great when analyzing the relationship between two numerical variables. In this example, we plot the relationship between one’s age and balance for single individuals in our data set.

Box Plot#

%%sql --save edu_balance
SELECT education, age
FROM bank

education	age
primary	30
secondary	33
tertiary	35
tertiary	30
secondary	59
tertiary	35
tertiary	36
secondary	39
tertiary	41
primary	43

Truncated to displaylimit of 10.

edu_balance_query = %sql SELECT * FROM edu_balance
edu_df = pd.DataFrame(edu_balance_query)

Generating CTE with stored snippets: 'edu_balance'

# group the data by 'education'
edu_group = edu_df.groupby("education")["age"].apply(list)

plt.figure(figsize=(10, 6))
plt.boxplot(edu_group)

# customize x-axis tick labels
plt.xticks(range(1, len(edu_group) + 1), edu_df["education"].unique())

plt.xlabel("Education Level")
plt.ylabel("Age")
plt.title("Boxplot of Education Level by Age")

plt.show()

../_images/839bbf8531224830b9927b28c4a53d63db2e993c8b5457a94dc3dd854d5b5b5e.png

Box plots can be used when examining the relationship between a categorical feature and a numerical feature. In this plot, our categorical feature is the education variable. Each box represents a group within education and their respective ages in quantiles. This allows for a quick comparison of the distribution of groups within this categorical variable.

Heat Map#

%%sql --save job_education
SELECT job, education, COUNT(*) as count
FROM bank 
GROUP BY job, education

job	education	count
unemployed	primary	26
services	secondary	363
management	tertiary	787
blue-collar	secondary	524
self-employed	tertiary	88
technician	secondary	520
entrepreneur	tertiary	73
services	primary	25
admin.	secondary	393
technician	tertiary	211

Truncated to displaylimit of 10.

job_edu_query = %sql SELECT * FROM job_education
job_df = pd.DataFrame(job_edu_query)

Generating CTE with stored snippets: 'job_education'

data = job_df.pivot(index="education", columns="job", values="count")

plt.figure(figsize=(10, 6))
plt.imshow(data)

plt.colorbar(label="Count")
plt.xticks(range(len(data.columns)), data.columns, rotation=45)
plt.yticks(range(len(data.index)), data.index)
plt.title("Heatmap of Job and Education")

plt.show()

../_images/298ab4e7b308d862b32e57a2b6ec556fc73536b1030715a69c082c75dffe1fe4.png

The above plot displays the counts of job and education level of our data set. Heat maps are generally easy to understand because viewers can quickly point out extremes based on darker or lighter boxes. Here, we easily see people with management jobs have a high count of having a tertiary level education in our data set. You can think of heat maps as illustrating three dimensions: the x-axis, the y-axis, and the color gradient (which is usually a numerical feature).

Delete table

%%sql
DROP TABLE bank;

Success

Wrapping Up#

In this section, we introduced some basic data visualization plots: bar plots, scatter plots, box plots, and heat maps. The sections moving forward will teach you how to implement each of these plots using the seaborn and plotly libraries using the familiar banking data sets from the previous modules.

Types of data visualizations

Contents

Types of data visualizations#

Getting started#

Install - execute this once.#

Load the data#

Load Engine#

Creating Table#

Bar Plot#

Scatter Plot#

Box Plot#

Heat Map#

Wrapping Up#