Plotting with plotly
#
Plotly is a visualization library specifically designed for dynamic interactive plots. The library offers several additional effects to visualizations, such as zooming, panning, and hovering effects. This library is especially known for being easily deployed with web applications.
For more on plotly, visit: https://plotly.com/python/
Let’s see how we can apply plotly to our familiar bank marketing data sets.
Install - execute this once.#
Important
Note: The --save
and %sqlcmd
features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas plotly --quiet
Load the data#
Important
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
import plotly.express as px
import pandas as pd
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Converted expanded_data/trans.asc to CSV.
Converted expanded_data/order.asc to CSV.
Converted expanded_data/loan.asc to CSV.
Converted expanded_data/district.asc to CSV.
Converted expanded_data/disp.asc to CSV.
Converted expanded_data/client.asc to CSV.
Converted expanded_data/card.asc to CSV.
Converted expanded_data/account.asc to CSV.
All ASC files converted to CSV.
If you ran the above cell, you should have a folder expanded_data
in your current directory that contains the .csv
files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv
, account.csv
, district.csv
.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Important
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Creating Tables#
Let’s start off with loading three of the eight .csv
files from the expanded_data
folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema s1
in which we will store the tables. Here we use the CREATE TABLE
syntax in DuckDB to ingest four of the eight .csv
files. The read_csv_auto
is a function that helps SQL understand our local .csv
file for creation into our database.
%%sql
CREATE SCHEMA s1;
CREATE TABLE s1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE s1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE s1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count |
---|
The code above will create three tables in the database schema: s1.account
, s1.district
, s1.loan
.
Exploring the data#
Let’s take a look at each table.
%sqlcmd explore --table s1.account
%sqlcmd explore --table s1.district
%sqlcmd explore --table s1.loan
Bar Plots#
Let’s create a bar plot in plotly
using the s1.district
table. This plot will visualize the count of each region
in the data set.
First, let’s query the count of each region in SQL.
%%sql --save region_count
SELECT region, COUNT(*) as count
FROM s1.district
GROUP BY region
ORDER BY count DESC
region | count |
---|---|
south Moravia | 14 |
central Bohemia | 12 |
east Bohemia | 11 |
north Moravia | 11 |
west Bohemia | 10 |
north Bohemia | 10 |
south Bohemia | 8 |
Prague | 1 |
Then, let’s save the CTE as a Pandas DataFrame.
query = %sql SELECT * FROM region_count
region_count_df = pd.DataFrame(query)
region_count_df
region | count | |
---|---|---|
0 | south Moravia | 14 |
1 | central Bohemia | 12 |
2 | east Bohemia | 11 |
3 | north Moravia | 11 |
4 | west Bohemia | 10 |
5 | north Bohemia | 10 |
6 | south Bohemia | 8 |
7 | Prague | 1 |
We can now plot the count of each region
with this Pandas DataFrame.
fig = px.bar(
region_count_df,
x="region",
y="count",
title="Region Count",
color_discrete_sequence=["#7d26cd"],
)
fig.show()