Plotting with plotly#
Plotly is a visualization library specifically designed for dynamic interactive plots. The library offers several additional effects to visualizations, such as zooming, panning, and hovering effects. This library is especially known for being easily deployed with web applications.
For more on plotly, visit: https://plotly.com/python/
Let’s see how we can apply plotly to our familiar bank marketing data sets.
Install - execute this once.#
Important
Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas plotly --quiet
Load the data#
Important
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
import plotly.express as px
import pandas as pd
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Error, could not download data: HTTP Error 429: Too Many Requests
Error, could not convert ASC to CSV: 'NoneType' object has no attribute 'namelist'
If you ran the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Important
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
| Config | value |
|---|---|
| displaycon | False |
| feedback | True |
| autopandas | False |
| named_parameters | True |
Creating Tables#
Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest four of the eight .csv files. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
| Count |
|---|
The code above will create three tables in the database schema: s1.account, s1.district, s1.loan.
Exploring the data#
Let’s take a look at each table.
%sqlcmd explore --table s1.account
%sqlcmd explore --table s1.district
%sqlcmd explore --table s1.loan
Bar Plots#
Let’s create a bar plot in plotly using the s1.district table. This plot will visualize the count of each region in the data set.
First, let’s query the count of each region in SQL.
%%sql --save region_count
SELECT region, COUNT(*) as count
FROM s1.district
GROUP BY region
ORDER BY count DESC
| region | count |
|---|---|
| south Moravia | 14 |
| central Bohemia | 12 |
| east Bohemia | 11 |
| north Moravia | 11 |
| west Bohemia | 10 |
| north Bohemia | 10 |
| south Bohemia | 8 |
| Prague | 1 |
Then, let’s save the CTE as a Pandas DataFrame.
query = %sql SELECT * FROM region_count
region_count_df = pd.DataFrame(query)
region_count_df
| region | count | |
|---|---|---|
| 0 | south Moravia | 14 |
| 1 | central Bohemia | 12 |
| 2 | east Bohemia | 11 |
| 3 | north Moravia | 11 |
| 4 | west Bohemia | 10 |
| 5 | north Bohemia | 10 |
| 6 | south Bohemia | 8 |
| 7 | Prague | 1 |
We can now plot the count of each region with this Pandas DataFrame.
fig = px.bar(
region_count_df,
x="region",
y="count",
title="Region Count",
color_discrete_sequence=["#7d26cd"],
)
fig.show()
Notice how the plotly outputs a plot that looks noticeably different than matplotlib or seaborn outputs. Try interacting around with the plot by hovering, zooming, and panning! Double click the plot to autoscale the plot.
There are three core arguments that are applied above: the Pandas DataFrame, the x variable, and the y variable. We also set the title and color of the bars to purple by providing the title and color_discrete_sequence arguments with values.
Question 1 (Hard):#
Your boss wants you to create a bar plot showing the average loan amount for each region. Your boss also specified that they want to you to also incorporate the average salary of each region as well.
Hint: The color variable that you learned in the seaborn section is also usable in plotly. You will also most likely have to use a CTE to solve this question.
Answers
We first need to join the s1.district, s1.loan, and s1.account tables in order to have the necessary information to solve this problem.
%%sql --save average_loan_per_district
SELECT d.district_id,
d.region,
d.average_salary,
ROUND(AVG(l.amount),2) AS avg_loan_amount
FROM s1.district d
JOIN s1.account a
ON d.district_id = a.district_id
JOIN s1.loan l
ON a.account_id = l.account_id
GROUP BY d.district_id,
d.region,
d.average_salary
ORDER BY avg_loan_amount DESC
| district_id | region | average_salary | avg_loan_amount |
|---|---|---|---|
| 46 | east Bohemia | 8369 | 294730.0 |
| 14 | south Bohemia | 10045 | 251365.5 |
| 3 | central Bohemia | 8980 | 243466.0 |
| 69 | north Moravia | 8173 | 242304.0 |
| 6 | central Bohemia | 8546 | 232886.67 |
| 67 | north Moravia | 8110 | 212966.0 |
| 20 | south Bohemia | 8547 | 204082.0 |
| 57 | south Moravia | 8720 | 198584.0 |
| 22 | west Bohemia | 8620 | 198504.0 |
| 28 | west Bohemia | 8594 | 191283.43 |
This resulting CTE give us the average salary of each district. This is because district_id was the primary key for the s1.district table. However, we are now able to GROUP BY each region and average the averages for the desired output.
%%sql --save average_loan_per_region
SELECT region, ROUND(AVG(average_salary),2) AS average_salary,
ROUND(AVG(avg_loan_amount),2) AS average_loan_amount
FROM average_loan_per_district
GROUP BY region
| region | average_salary | average_loan_amount |
|---|---|---|
| east Bohemia | 8611.18 | 162454.3 |
| south Bohemia | 8831.5 | 153746.69 |
| central Bohemia | 9357.25 | 155597.12 |
| north Moravia | 9049.18 | 160141.91 |
| south Moravia | 8728.5 | 147277.13 |
| west Bohemia | 9015.4 | 132575.11 |
| north Bohemia | 9334.2 | 123544.03 |
| Prague | 12541.0 | 153957.29 |
Now, convert the average_loan_per_region table to a Pandas DataFrame.
average_loan_per_region = %sql SELECT * FROM average_loan_per_region
avg_loans_df = pd.DataFrame(average_loan_per_region)
avg_loans_df
| region | average_salary | average_loan_amount | |
|---|---|---|---|
| 0 | east Bohemia | 8611.18 | 162454.30 |
| 1 | south Bohemia | 8831.50 | 153746.69 |
| 2 | central Bohemia | 9357.25 | 155597.12 |
| 3 | north Moravia | 9049.18 | 160141.91 |
| 4 | south Moravia | 8728.50 | 147277.13 |
| 5 | west Bohemia | 9015.40 | 132575.11 |
| 6 | north Bohemia | 9334.20 | 123544.03 |
| 7 | Prague | 12541.00 | 153957.29 |
And finally output the bar plot with plotly.
fig = px.bar(
avg_loans_df,
x="region",
y="average_loan_amount",
color="average_salary",
labels={
"total_loan_amount": "Total Loan Amount",
"district_name": "District Name",
"average_salary": "Average Salary",
},
title="Total Loan Amount by District with Average Salary color scale",
)
fig.show()
Scatter Plots#
Let’s now demonstrate a scatter plot in plotly by also using the s1.district table. This plot will visualize the relationship between the average_salary of a district with the unemployment rate in 1996. The plot also provides another dimension of visualization by incorporating the ratio_of_urban_inhabitants variable in size and color.
First, let’s load our data into a Pandas DataFrame.
district = %sql SELECT * FROM s1.district
district_df = pd.DataFrame(district)
We can now plot the Pandas DataFrame using plotly.
fig = px.scatter(
district_df,
x="no_of_inhabitants",
y="no_of_cities",
color="ratio_of_urban_inhabitants",
size="ratio_of_urban_inhabitants",
labels={
"no_of_cities": "Number of Cities",
"no_of_inhabitants": "Number of Inhabitants",
},
title="Number of Cities by Number of Inhabitants",
color_continuous_scale="Viridis",
)
fig.show()
Here we renamed the axis titles with the labels argument and changed the color scale with the color_continuous_scale argument to make the plot more visually appealing.
Question 2 (Medium):#
Your boss didn’t quite like the bar plot you’ve made. They’ve asked you to now instead create an interactive scatter plot showing the relationship between the average_loan_amount and average_salary by district.
Hint: Is there a way to reuse our work from question 1?
Answers
We can reuse the average_loan_per_district table we created with CTEs from the last question.
df = %sql SELECT * FROM average_loan_per_district
avg_loans_df = pd.DataFrame(df)
avg_loans_df
| district_id | region | average_salary | avg_loan_amount | |
|---|---|---|---|---|
| 0 | 46 | east Bohemia | 8369 | 294730.00 |
| 1 | 14 | south Bohemia | 10045 | 251365.50 |
| 2 | 3 | central Bohemia | 8980 | 243466.00 |
| 3 | 69 | north Moravia | 8173 | 242304.00 |
| 4 | 6 | central Bohemia | 8546 | 232886.67 |
| ... | ... | ... | ... | ... |
| 72 | 25 | west Bohemia | 8554 | 90292.00 |
| 73 | 15 | south Bohemia | 9045 | 83475.43 |
| 74 | 77 | north Moravia | 8909 | 81830.00 |
| 75 | 32 | north Bohemia | 9272 | 77114.00 |
| 76 | 30 | west Bohemia | 9650 | 74262.00 |
77 rows × 4 columns
fig = px.scatter(
avg_loans_df,
x="average_salary",
y="avg_loan_amount",
title="Average Loan Amount vs Average Salary per District",
)
fig.show()
Histograms#
Histograms are similar to bar plots. The only difference is that the x-axis should be a numerical feature rather than a categorical one. We demonstrate a plotly histogram below.
loans = %sql SELECT amount FROM s1.loan
loans_df = pd.DataFrame(loans)
loans_df
| amount | |
|---|---|
| 0 | 96396 |
| 1 | 165960 |
| 2 | 127080 |
| 3 | 105804 |
| 4 | 274740 |
| ... | ... |
| 677 | 352704 |
| 678 | 52512 |
| 679 | 139488 |
| 680 | 55632 |
| 681 | 240900 |
682 rows × 1 columns
fig = px.histogram(loans_df, x="amount", nbins=20, title="Loan Amounts")
fig.show()
Above we use a histogram to visualize the distribution of loan amounts. Try adjusting the nbins argument to see how it influences the plot.
Question 3 (Easy):#
A colleague asks you for a visualization of the distribution of loan amounts for loans with an ‘A’ or ‘D’ status.
Answers
Try clicking the boxes next to “A” and “D” under the “status” legend symbol to see a useful plotly feature.
%%sql --save loan_status
SELECT *
FROM s1.loan
WHERE status = 'A' or status = 'D'
| loan_id | account_id | date | amount | duration | payments | status |
|---|---|---|---|---|---|---|
| 5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
| 6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
| 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
| 7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
| 6687 | 8261 | 930913 | 87840 | 24 | 3660.0 | A |
| 7284 | 11265 | 930915 | 52788 | 12 | 4399.0 | A |
| 7235 | 10973 | 931013 | 154416 | 48 | 3217.0 | A |
| 5997 | 4894 | 931104 | 117024 | 24 | 4876.0 | A |
| 7121 | 10364 | 931110 | 21924 | 36 | 609.0 | A |
| 6077 | 5270 | 931122 | 79608 | 24 | 3317.0 | A |
loan_status_table = %sql SELECT * FROM loan_status
loan_status_df = pd.DataFrame(loan_status_table)
loan_status_df
| loan_id | account_id | date | amount | duration | payments | status | |
|---|---|---|---|---|---|---|---|
| 0 | 5316 | 1801 | 930711 | 165960 | 36 | 4610.0 | A |
| 1 | 6863 | 9188 | 930728 | 127080 | 60 | 2118.0 | A |
| 2 | 5325 | 1843 | 930803 | 105804 | 36 | 2939.0 | A |
| 3 | 7240 | 11013 | 930906 | 274740 | 60 | 4579.0 | A |
| 4 | 6687 | 8261 | 930913 | 87840 | 24 | 3660.0 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 243 | 5976 | 4794 | 971227 | 465504 | 48 | 9698.0 | D |
| 244 | 6237 | 6075 | 980106 | 407952 | 48 | 8499.0 | D |
| 245 | 6682 | 8225 | 980330 | 36204 | 12 | 3017.0 | D |
| 246 | 5991 | 4858 | 980727 | 335184 | 48 | 6983.0 | D |
| 247 | 4967 | 37 | 981014 | 318480 | 60 | 5308.0 | D |
248 rows × 7 columns
fig = px.histogram(
loan_status_df,
x="amount",
color="status",
nbins=25,
title="'A' and 'D' Loan Amounts",
)
fig.show()
Box and Whiskers#
When you interact with Plotly box plots by hovering over them, they provide a wealth of information. For example, take a look of this box plot displaying the distribution of a loan’s amount with each distinct loan status.
loans = %sql SELECT * FROM s1.loan
loans_df = pd.DataFrame(loans)
category_order = ["A", "B", "C", "D"]
fig = px.box(
loans_df,
x="status",
y="amount",
color="status",
category_orders={"status": category_order},
title="Distribution of Loan Status By Amount",
)
fig.show()
The category_orders argument is just to have the boxes be in alphabetical order.
Hovering over each box plot displays additional information on the respective loan status. You can also check and uncheck the boxes under the “status” legend to adjust the plot.
Question 4 (Easy):#
A colleague asks you for a box plot displaying the relationship of loan duration by loan status.
Answers
%%sql --save loan_duration
SELECT
l.status,
l.duration AS loan_duration
FROM
s1.loan l
| status | loan_duration |
|---|---|
| B | 12 |
| A | 36 |
| A | 60 |
| A | 36 |
| A | 60 |
| A | 24 |
| A | 12 |
| B | 24 |
| A | 48 |
| A | 24 |
loan_duration = %sql SELECT * FROM loan_duration
loan_duration_df = pd.DataFrame(loan_duration)
category_order = ["A", "B", "C", "D"]
fig = px.box(
loan_duration_df,
x="status",
y="loan_duration",
color="status",
category_orders={"status": category_order},
labels={"status": "Loan Status", "loan_duration": "Loan Duration"},
title="Distribution of Loan Durations by Loan Status",
)
fig.show()
Delete tables
%%sql
DROP TABLE IF EXISTS s1.loan;
DROP TABLE IF EXISTS s1.account;
DROP TABLE IF EXISTS s1.district;
DROP SCHEMA s1;
| Success |
|---|
Wrapping Up#
In this section, you learned how to create interactive displays with plotly. The syntax for seaborn and plotly is quite similar, allowing for a seamless transition between the two libraries.
In the next section, you’ll learn how to visualize your SQL queries with ggplot!
References#
Plotly 5.15.0 documentation https://plotly.com/python/