Types of data visualizations#
Welcome to the Visualizing your SQL queries module of the course! This part of the course will introduce data visualizations and commonly used packages. After getting familiar with the types of data visualizations and visualization packages, we’ll revisit SQL and teach you JupySQL’s unique feature of utilizing ggplot to visualize queries.
In this module, we will learn about the seaborn and plotly packages. Before we get into the details of each package, we first introduce the common types of data visualization with one of the most basic visualization packages: matplotlib. The purpose of this section is to not teach you the ins and outs of matplotlib, but more so to introduce some basic data visualizations.
Getting started#
To demonstrate these visualizations, we’re going to revisit our familiar bank data set from the Making Your First Query section.
As always, let’s first follow the steps of ensuring we have our necessary packages ready for use.
Install - execute this once.#
This code installs JupySQL, DuckDB, Pandas, and Matplotlib in your environment. We will be using these moving forward.
%pip install jupysql pandas matplotlib --quiet
Load the data#
We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv. We also import the matplotlib package as plt.
import sys
import matplotlib.pyplot as plt
import pandas as pd
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()
After running this code, you should have bank_cleaned.csv in the current directory.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
| Config | value |
|---|---|
| displaycon | False |
| feedback | True |
| autopandas | False |
| named_parameters | True |
Creating Table#
Now that we have our bank_cleaned.csv file, let’s load it into our DuckDB database and take a look at the data.
%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
| Count |
|---|
%sqlcmd explore --table bank
The visualization packages we will be introducing in this module work best with the Pandas DataFrame data structure. Prior to visualizing our queries, we will always first convert them into Pandas DataFrames.
We convert the bank table below as an example.
bank = %sql SELECT * FROM bank
bank_df = pd.DataFrame(bank)
bank_df
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown | no |
| 4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown | no |
| 4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown | no |
| 4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other | no |
| 4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249 | 7 | other | no |
4521 rows × 17 columns
Now we can jump into one of the most simple, yet essential, data visualizations: the bar plot.
Bar Plot#
Let’s use a bar plot that visualizes the count of each job type in our data. To do so, we will query the counts of each job and convert the query into a Pandas DataFrame.
%%sql --save jobs
SELECT job, COUNT(*) as count
FROM bank
GROUP BY job
| job | count |
|---|---|
| unemployed | 128 |
| services | 417 |
| management | 969 |
| blue-collar | 946 |
| self-employed | 183 |
| technician | 768 |
| entrepreneur | 168 |
| admin. | 478 |
| student | 84 |
| housemaid | 112 |
jobs = %sql SELECT * FROM jobs
jobs_df = pd.DataFrame(jobs)
Now that we have our query in a Pandas DataFrame, we can use matplotlib to visualize a bar plot.
plt.figure(figsize=(10, 6))
plt.bar(data=jobs_df, x="job", height="count")
plt.xlabel("Job")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.title("Count of Each Job")
plt.show()
The second line in the above code cell, plt.bar(data=jobs_df, x="job", height="count"), is really all we need to create a baseline bar plot. The remaining statements are supplemental elements that labels and customizes the y-axis, x-axis, size, and title of the plot.
We can easily see that management and blue-collar jobs are the most prominent job category in this data set. Box plots are a great option when you need to visualize distributions of groups in a categorical variable.
Scatter Plot#
We first query the age and balance of single individuals and save it as a table called age_balance with a CTE.
%%sql --save age_balance
SELECT age, balance, marital
FROM bank
WHERE marital = 'single'
| age | balance | marital |
|---|---|---|
| 35 | 1350 | single |
| 35 | 747 | single |
| 20 | 502 | single |
| 37 | 2317 | single |
| 25 | -221 | single |
| 44 | 106 | single |
| 32 | 2536 | single |
| 23 | 363 | single |
| 38 | 11971 | single |
| 36 | 553 | single |
Then we again convert the table as a Pandas DataFrame.
age_balance_query = %sql SELECT * FROM age_balance
age_balance_df = pd.DataFrame(age_balance_query)
plt.figure(figsize=(10, 6))
plt.scatter(data=age_balance_df, x="age", y="balance")
plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Age by Balance of Single Clients")
Text(0.5, 1.0, 'Age by Balance of Single Clients')
Scatter plots are great when analyzing the relationship between two numerical variables. In this example, we plot the relationship between one’s age and balance for single individuals in our data set.
Box Plot#
%%sql --save edu_balance
SELECT education, age
FROM bank
| education | age |
|---|---|
| primary | 30 |
| secondary | 33 |
| tertiary | 35 |
| tertiary | 30 |
| secondary | 59 |
| tertiary | 35 |
| tertiary | 36 |
| secondary | 39 |
| tertiary | 41 |
| primary | 43 |
edu_balance_query = %sql SELECT * FROM edu_balance
edu_df = pd.DataFrame(edu_balance_query)
# group the data by 'education'
edu_group = edu_df.groupby("education")["age"].apply(list)
plt.figure(figsize=(10, 6))
plt.boxplot(edu_group)
# customize x-axis tick labels
plt.xticks(range(1, len(edu_group) + 1), edu_df["education"].unique())
plt.xlabel("Education Level")
plt.ylabel("Age")
plt.title("Boxplot of Education Level by Age")
plt.show()
Box plots can be used when examining the relationship between a categorical feature and a numerical feature. In this plot, our categorical feature is the education variable. Each box represents a group within education and their respective ages in quantiles. This allows for a quick comparison of the distribution of groups within this categorical variable.
Heat Map#
%%sql --save job_education
SELECT job, education, COUNT(*) as count
FROM bank
GROUP BY job, education
| job | education | count |
|---|---|---|
| unemployed | primary | 26 |
| services | secondary | 363 |
| management | tertiary | 787 |
| blue-collar | secondary | 524 |
| self-employed | tertiary | 88 |
| technician | secondary | 520 |
| entrepreneur | tertiary | 73 |
| services | primary | 25 |
| admin. | secondary | 393 |
| technician | tertiary | 211 |
job_edu_query = %sql SELECT * FROM job_education
job_df = pd.DataFrame(job_edu_query)
data = job_df.pivot(index="education", columns="job", values="count")
plt.figure(figsize=(10, 6))
plt.imshow(data)
plt.colorbar(label="Count")
plt.xticks(range(len(data.columns)), data.columns, rotation=45)
plt.yticks(range(len(data.index)), data.index)
plt.title("Heatmap of Job and Education")
plt.show()
The above plot displays the counts of job and education level of our data set. Heat maps are generally easy to understand because viewers can quickly point out extremes based on darker or lighter boxes. Here, we easily see people with management jobs have a high count of having a tertiary level education in our data set. You can think of heat maps as illustrating three dimensions: the x-axis, the y-axis, and the color gradient (which is usually a numerical feature).
Delete table
%%sql
DROP TABLE bank;
| Success |
|---|
Wrapping Up#
In this section, we introduced some basic data visualization plots: bar plots, scatter plots, box plots, and heat maps. The sections moving forward will teach you how to implement each of these plots using the seaborn and plotly libraries using the familiar banking data sets from the previous modules.