Plotting with seaborn#

Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures. It provides a high-level interface for drawing attractive and informative statistical graphics.

The plotting functions operate on dataframes and arrays containing whole datasets. Internally, they perform the necessary semantic mapping and statistical aggregation to produce informative plots.

Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.

For more see: https://seaborn.pydata.org/

Install and Load Libraries#

Important

Note: The --save and %sqlcmd features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.

This code installs JupySQL, pandas, and seaborn in your environment. We will be using these moving forward.

%pip install jupysql pandas seaborn --quiet

Finally, we load in the libraries we will be using in this tutorial.

import sys
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

Load the data#

Important

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

This section was covered in detail in the previous tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.MarketData(
    "https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip",  # noqa E501
    "expanded_data",
)

_.convert_asc_to_csv(banking.district_column_names)

If you run the above cell, you should have a folder expanded_data in your current directory that contains the .csv files we will be using. In this tutorial, we will be focusing on three of these files: loan.csv, account.csv, district.csv.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Important

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Delete tables and schema if they already exist.

Creating Tables#

Let’s start off with loading three of the eight .csv files from the expanded_data folder in the current directory to our newly created DuckDB database. Like in the previous tutorial, we will create a schema b1 in which we will store the tables. Here we use the CREATE TABLE syntax in DuckDB to ingest these three .csv files. read_csv_auto is a DuckDB function that reads a local .csv file and infers its column names and types, so that the result can be stored as a table in our database.

Delete tables

%%sql
DROP TABLE IF EXISTS b1.loan;
DROP TABLE IF EXISTS b1.account;
DROP TABLE IF EXISTS b1.district;
DROP SCHEMA IF EXISTS b1;
Success
%%sql
CREATE SCHEMA b1;
CREATE TABLE b1.account AS
FROM read_csv_auto('expanded_data/account.csv', header=True, sep=',');
CREATE TABLE b1.district AS
FROM read_csv_auto('expanded_data/district.csv', header=True, sep=',');
CREATE TABLE b1.loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count

The code above will create three tables in the database schema: b1.account, b1.district, b1.loan.

Exploring the data#

Let’s take a look at its entries.

%sqlcmd explore --table b1.account
%sqlcmd explore --table b1.district
%sqlcmd explore --table b1.loan

Matplotlib inheritance#

Seaborn is built on top of Matplotlib. Therefore, depending on the seaborn plotting command, it will return either a Matplotlib Axes object or a seaborn grid object that manages a Matplotlib figure. If the plotting function is axes-level, a single matplotlib.axes.Axes object is returned. Axes-level functions accept an ax= argument, which integrates with Matplotlib’s object-oriented interface and allows composing plots with other plots. On the other hand, if the plotting function is figure-level, a grid object such as FacetGrid is returned. A figure-level plot, unlike an axes-level plot, is more standalone, but “smart” about subplot organization. To learn about these objects in greater detail, see the seaborn documentation.

A few examples denoting this distinction are shown below.

Axes-level#

Suppose we want to explore whether a higher share of urban inhabitants goes along with a higher average salary in two regions, using the data we downloaded above. We first save a CTE (Common Table Expression) named levels_example that selects the columns average_salary, ratio_of_urban_inhabitants, and region from the b1.district table and filters for two regions, ‘central Bohemia’ and ‘east Bohemia’.

%%sql --save levels_example
SELECT average_salary, ratio_of_urban_inhabitants, region
FROM b1.district
WHERE region IN ('central Bohemia', 'east Bohemia');
average_salary ratio_of_urban_inhabitants region
8507 46.7 central Bohemia
8980 41.7 central Bohemia
9753 67.4 central Bohemia
9307 51.4 central Bohemia
8546 51.5 central Bohemia
9920 63.4 central Bohemia
11277 69.4 central Bohemia
8899 55.3 central Bohemia
10124 46.7 central Bohemia
9622 36.5 central Bohemia
Truncated to displaylimit of 10.

The result of the CTE is saved as a pandas DataFrame():

result = %sql SELECT * FROM levels_example
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'levels_example'

You can determine what is returned using Python’s type() function:

plt.rcParams["figure.dpi"] = 300  # high resolution

scatter_plt = sns.scatterplot(
    data=df, x="ratio_of_urban_inhabitants", y="average_salary", hue="region"
)
print(type(scatter_plt))
plt.show()
<class 'matplotlib.axes._axes.Axes'>
../_images/e35f77be9efd4683ab0313a20619749c9e99c83f396dd85c1f3919ce3dd66e5a.png

Notice that the sns.scatterplot() function returns a Matplotlib Axes object, a single plot that is inclusive of the data from both regions. Other seaborn functions, including regplot, boxplot, kdeplot, and many others, also return Matplotlib Axes objects. Therefore, we can use various Matplotlib axes commands to modify the Seaborn figure.
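As a minimal sketch of what working with the returned Axes can look like, the snippet below passes one existing Axes to two axes-level functions through ax= and then adjusts the result with ordinary Matplotlib methods. The small toy DataFrame and its column names are made up for illustration and are not the tutorial's bank data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values, not taken from b1.district)
toy = pd.DataFrame(
    {
        "urban_ratio": [40, 45, 50, 55, 60, 65],
        "salary": [8500, 8700, 9100, 9300, 9800, 10100],
        "region": ["X", "X", "X", "Y", "Y", "Y"],
    }
)

fig, ax = plt.subplots(figsize=(5, 3))

# Both calls draw on the same Axes because we pass it in explicitly
returned = sns.scatterplot(
    data=toy, x="urban_ratio", y="salary", hue="region", ax=ax
)
sns.regplot(
    data=toy, x="urban_ratio", y="salary", scatter=False, ci=None, ax=ax
)

# Ordinary Matplotlib Axes methods work on the result
ax.set_xlabel("Urban inhabitants (%)")
ax.set_ylabel("Average salary")
ax.set_title("Two seaborn layers on one Axes")
```

Call plt.show() afterwards to display the figure. The object returned by sns.scatterplot is the same Axes that was passed in, which is what makes this kind of layering possible.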

Figure-level#

On the other hand, we can show that sns.relplot() returns a FacetGrid object, creating separate plots for the regions:

facet_plt = sns.relplot(
    data=df, x="ratio_of_urban_inhabitants", y="average_salary", col="region"
)
print(type(facet_plt))
plt.show()
<class 'seaborn.axisgrid.FacetGrid'>
../_images/6ba5243a914a1373fd60a9b69b0cc8a12e3ae3ec70983d4cc9ff86e66a8687ea.png

Other figure-level seaborn functions include catplot, displot, pairplot, and jointplot.

Important

The legends are placed outside the plot if a Figure-level plotting function is used. See scatterplot section below.
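A small sketch, using made-up toy data, of where each kind of function keeps its legend: the axes-level call attaches the legend to the Axes, while the figure-level call stores it on the returned grid object and leaves the Axes itself without one. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [2, 4, 1, 3], "group": ["a", "a", "b", "b"]}
)

# Axes-level: the legend is attached to the Axes, inside the plotting area
fig, ax = plt.subplots()
sns.scatterplot(data=toy, x="x", y="y", hue="group", ax=ax)
axes_legend = ax.get_legend()

# Figure-level: the legend belongs to the grid's figure, outside the axes
g = sns.relplot(data=toy, x="x", y="y", hue="group", height=3)
grid_legend = g.legend
facet_axes_legend = g.ax.get_legend()  # nothing is attached to the Axes itself
```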

Let’s now jump into one of the simplest, yet most essential, data visualizations: the bar plot.

Barplots#

The most basic seaborn.barplot() function takes a categorical and a numeric variable as encodings. A second layer of grouping, preferably with another categorical variable, can be added to the hue argument.

Example#

Suppose the marketing manager wants to see a visualization of the number of unique loan IDs associated with each status of paying off the loan. To tackle this question, we will first create a CTE from the b1.loan table that returns one row per status together with its count of loan IDs:

%%sql --save bar_plot_example
SELECT DISTINCT status, COUNT(loan_id) AS count_loan_id
FROM b1.loan
GROUP BY status
ORDER BY status;
status count_loan_id
A 203
B 31
C 403
D 45
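For readers who think in pandas, the same aggregation can be sketched as a groupby. The toy loans frame below is a hypothetical stand-in for b1.loan. Because grouping already yields one row per status, the DISTINCT keyword in the query above does not change the result.

```python
import pandas as pd

# Toy stand-in for b1.loan (hypothetical rows)
loans = pd.DataFrame(
    {
        "loan_id": [1, 2, 3, 4, 5, 6],
        "status": ["A", "C", "A", "B", "C", "C"],
    }
)

# GROUP BY status, COUNT(loan_id), ORDER BY status
counts = (
    loans.groupby("status", as_index=False)
    .agg(count_loan_id=("loan_id", "count"))
    .sort_values("status")
)
print(counts)
```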

Save the CTE as a pandas DataFrame():

result = %sql SELECT * FROM bar_plot_example
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'bar_plot_example'

Finally, use seaborn.barplot() to produce a bar plot:

plt.figure(figsize=(15, 5), dpi=300)  # Initialize blank canvas
sns.barplot(data=df, x="status", y="count_loan_id")
plt.xlabel("Status of Paying off Loan")
plt.ylabel("Count")
plt.title("Count of Loan ID's by Loan Status")
plt.show()
../_images/5788111ff119d3c321698a2ec5c0fcf926b2d716d946f634762796a11e168d2c.png

Question 1 (Medium)#

The marketing manager now wants you to provide the same information as the example above, but with an added grouping of the frequency of issuance of statements, which corresponds to the frequency variable in the b1.account table. Create a grouped bar plot with clear axes labels, axes tick marks, title, and legend.

Hint Since the frequency variable is not in b1.loan, think of which SQL operation you can employ to combine both tables. Moreover, the b1.loan is a subset of the b1.account so use the appropriate technique that provides all rows from b1.account.

Answers

We start off by creating a CTE from both the b1.loan and b1.account tables with the help of a LEFT JOIN on account_id. We choose this join because every account_id in b1.loan is also present in b1.account, so starting from b1.account gives us all accounts in the database. Next, because we also want counts by frequency, we add it to the GROUP BY clause and keep DISTINCT in the SELECT clause:

%%sql --save bar_plot_question
SELECT DISTINCT status, frequency, COUNT(loan_id) AS count_loan_id
FROM b1.account
LEFT JOIN b1.loan
    ON b1.account.account_id = b1.loan.account_id
GROUP BY status, frequency
ORDER BY status;
status frequency count_loan_id
A POPLATEK MESICNE 168
A POPLATEK TYDNE 27
A POPLATEK PO OBRATU 8
B POPLATEK MESICNE 22
B POPLATEK TYDNE 6
B POPLATEK PO OBRATU 3
C POPLATEK MESICNE 332
C POPLATEK TYDNE 53
C POPLATEK PO OBRATU 18
D POPLATEK MESICNE 37
Truncated to displaylimit of 10.
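To see what the LEFT JOIN does to accounts without a loan, here is a minimal pandas sketch on hypothetical toy tables (the values are invented and the frequency labels are simplified). Every account survives the join, accounts without a loan get missing values in the loan columns, and counting loan_id skips those missing values just as COUNT(loan_id) skips NULLs.

```python
import pandas as pd

# Toy stand-ins for b1.account and b1.loan (hypothetical rows)
accounts = pd.DataFrame(
    {
        "account_id": [1, 2, 3, 4],
        "frequency": ["monthly", "weekly", "monthly", "monthly"],
    }
)
loans = pd.DataFrame(
    {"loan_id": [10, 11], "account_id": [1, 3], "status": ["A", "C"]}
)

# LEFT JOIN: keep every account, matched or not
joined = accounts.merge(loans, on="account_id", how="left")

total_rows = len(joined)  # one row per account
matched_loans = joined["loan_id"].count()  # missing loan_id values are skipped

# Counts per status and frequency, as in the query above
counts = (
    joined.groupby(["status", "frequency"])["loan_id"]
    .count()
    .reset_index(name="count_loan_id")
)
print(joined)
print(counts)
```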

Save the CTE as a pandas DataFrame():

result = %sql SELECT * FROM bar_plot_question
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'bar_plot_question'

Finally, use seaborn.barplot(), this time with the hue argument, to produce a grouped bar plot:

plt.figure(figsize=(15, 5), dpi=300)  # Initialize blank canvas
sns.barplot(data=df, x="status", y="count_loan_id", hue="frequency")
plt.xlabel("Status of Paying off Loan")
plt.ylabel("Count")
plt.title("Count of Loan ID's by Loan Status and Freq. of Statement Issuance")
plt.show()
../_images/e92ab045e30d1e2e3f592832a63cd48cada40aadf71d5b0e2fa3882851cbc460.png

Scatter plots#

Scatter plots help us analyze relationships between two numeric variables. In the Matplotlib inheritance section above, we saw examples of the scatterplot function, which creates axes-level objects, to analyze the effect of ratio_of_urban_inhabitants on average_salary by region. Below, we will introduce a figure-level function relplot, along with some customizations, to create faceted scatter plots that help us easily visualize data from multiple tables and columns.

Example#

Let us travel back some decades for the purpose of this example! Suppose the local municipality wants to visually assess, using the bank data, their hypothesis that an increasing unemployment rate leads to clients opting for good-standing (status either “A” or “C”), short-term (<= 2 years) loans of lower amounts in the districts of south Moravia and north Moravia in the year 1996 by duration and status.

Rather than manually creating several sub-scatterplots and filtering the joined dataset once for each of them, we can do this in a single call to relplot! The encodings (x and y), semantics (hue, etc.), and facet positions (row and col) are the only arguments we have to worry about when creating the faceted figure. Before doing so, however, we need to get the data in order by performing a LEFT JOIN across all three tables and filtering by region, duration, and status. We shall save this table in a CTE named relplot_example:

%%sql --save relplot_example
SELECT status, duration, amount, region, unemployment_rate_96
FROM b1.account AS a
LEFT JOIN b1.loan AS l
ON a.account_id = l.account_id
LEFT JOIN b1.district AS d
ON d.district_id = a.district_id
WHERE region IN ('north Moravia', 'south Moravia') AND
duration <= 24 AND status IN ('A', 'C');
status duration amount region unemployment_rate_96
A 12 23052 south Moravia 1.96
A 12 41904 north Moravia 4.72
A 12 65184 south Moravia 1.96
A 12 83016 north Moravia 7.75
A 24 51696 north Moravia 5.44
A 24 55368 south Moravia 2.31
A 12 47016 south Moravia 2.31
A 24 121896 south Moravia 2.43
A 12 82896 south Moravia 4.48
A 24 104712 north Moravia 5.44
Truncated to displaylimit of 10.

Like before, save the CTE as a pandas DataFrame():

result = %sql SELECT * FROM relplot_example
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'relplot_example'

In this sns.relplot call, we assign the col argument to the variable region and the row argument to the variable status, which creates a faceted figure with multiple subplots arranged across both rows and columns of the grid:

sns.relplot(
    data=df,
    x="unemployment_rate_96",
    y="amount",
    hue="duration",
    col="region",
    row="status",
    height=3,
    aspect=1.5,
)
plt.show()
../_images/cd960a2b7233c4d9549423bed6c0cf670ece90f08186c2bdd34b63c8dcabc32d.png

This visualization should definitely help the local municipality obtain a first glance of their hypothesis! Upon eyeballing it, we do not see any apparent correlation of unemployment rate with the loan amount, but we can see that longer-duration loans tend to have higher amounts. Therefore, using faceted plots, we can accelerate our EDA process and focus on important relationships in the data.

However, there are still some problems with this plot. The axes labels and legend title are not descriptive enough and the plot lacks a title. We can customize FacetGrid figures by using Matplotlib figure-level functions that affect all facets to reduce duplication of labels. See the functions below and consult the FacetGrid documentation and Matplotlib documentation to know more:

g = sns.relplot(
    data=df,
    x="unemployment_rate_96",
    y="amount",
    hue="duration",
    col="region",
    row="status",
    facet_kws={"margin_titles": True},
    height=4,
    aspect=1,
)

g.set(xlabel=None, ylabel=None)  # remove duplicate x and y axis labels
g.set_titles(  # facet titles
    row_template="Status: {row_name}", col_template="{col_name}"
)

g.legend.set_title("Duration (months)")  # public accessor, as on the next line
g.legend.set_bbox_to_anchor((1.25, 0.5))  # Shift legend to the right

g.fig.suptitle(  # main title of the figure
    "Unemployment Rate vs Loan Amount by Loan", x=0.235, fontsize=12
)
g.fig.text(  # subtitle of the figure
    0.235,
    0.95,
    "Data subsetted on Region and Status",
    ha="right",
    va="top",
    fontsize=10,
)

g.fig.supylabel("Loan Amount ($)")  # y-axis label for all facets
g.fig.supxlabel("Unemployment Rate (%)")  # x-axis label for all facets
g.fig.subplots_adjust(top=0.9)  # adjust the Figure position

plt.show()
../_images/87e1933a68f52da67471ef90f0bc0f42ad54504048857551da455126ab4b86a7.png

If we wanted to access individual facets of the plot, we could use axes-level methods. For example, g.axes[0,0].set_xlabel('axes label 1') will set the x-axis label of the top-left facet, g.axes[0,1].set_xlabel('axes label 2') will set the x-axis label of the facet to its right in the same row, and so on.
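The indexing can be tried out on a small grid. This sketch uses invented toy data with two regions and two statuses, so the grid has two rows and two columns, and labels the two facets in the first row individually.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values) with two rows and two columns of facets
toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6, 7, 8],
        "y": [3, 1, 4, 1, 5, 9, 2, 6],
        "region": ["north", "north", "south", "south"] * 2,
        "status": ["A"] * 4 + ["C"] * 4,
    }
)

g = sns.relplot(data=toy, x="x", y="y", col="region", row="status", height=2)

# g.axes is a 2D array indexed as [row, column]
g.axes[0, 0].set_xlabel("axes label 1")
g.axes[0, 1].set_xlabel("axes label 2")
```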

Consult the previously linked docs and the documentation of relplot to answer the question below!

Question 2 (Hard)#

Suppose that the local municipality now comes back to ask for a similar plot, but with all loan durations included. Their feedback on the previous graph was that they would only like to see two facets or subplots at max for readability. Your job is to modify the CTE from the above example to produce a relplot figure that incorporates the same encodings but additional visual semantics, including style for region, size for duration, and subplots by status. Make sure to use a blue-red color palette for the subplots and customize axes labels for clarity.

Hint Try to find an example that does exactly what the question asks in the documentation

Answers

We first modify the relplot_example CTE by removing the filter for duration:

%%sql --save relplot_question
SELECT status, duration, amount, region, unemployment_rate_96
FROM b1.account AS a
LEFT JOIN b1.loan AS l
ON a.account_id = l.account_id
LEFT JOIN b1.district AS d
ON d.district_id = a.district_id
WHERE region IN ('north Moravia', 'south Moravia') AND
status IN ('A', 'C');
status duration amount region unemployment_rate_96
A 36 21924 south Moravia 2.43
A 12 23052 south Moravia 1.96
A 12 41904 north Moravia 4.72
A 12 65184 south Moravia 1.96
A 12 83016 north Moravia 7.75
A 24 51696 north Moravia 5.44
A 24 55368 south Moravia 2.31
C 60 89040 north Moravia 5.56
A 12 47016 south Moravia 2.31
A 24 121896 south Moravia 2.43
Truncated to displaylimit of 10.

Save the CTE as a pandas DataFrame():

result = %sql SELECT * FROM relplot_question
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'relplot_question'

Finally, sns.relplot is called and its result is stored in a variable so that the axis labels can be customized. Note that the sizes argument sets the smallest and largest point sizes that the duration values are mapped to, which keeps every point visible:

g = sns.relplot(
    data=df,
    x="unemployment_rate_96",
    y="amount",
    size="duration",
    style="region",
    hue="status",
    col="status",
    palette=["b", "r"],
    sizes=(10, 100),
)
g.set(xlabel=None, ylabel="Loan Amount ($)")  # remove duplicate x axis label
g.fig.supxlabel("Unemployment Rate (%)")  # x-axis label for whole figure
plt.show()
../_images/f660dbe8959a2e1ce9af0905834b872a51b5d253741d5c558ea9129dcc9685eb.png
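To make the effect of sizes more concrete, here is a hedged sketch on toy data (hypothetical values) that inspects the marker areas seaborn assigns. With a numeric size variable and a (min, max) tuple, the smallest value is drawn at the lower bound, the largest at the upper bound, and values in between are interpolated linearly.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (hypothetical values); duration is in months
toy = pd.DataFrame(
    {"x": [1, 2, 3], "y": [1, 2, 3], "duration": [12, 36, 60]}
)

fig, ax = plt.subplots()
sns.scatterplot(
    data=toy,
    x="x",
    y="y",
    size="duration",
    sizes=(10, 100),  # (smallest, largest) marker area
    legend=False,
    ax=ax,
)

# The scatter points are the first collection drawn on the Axes
marker_areas = ax.collections[0].get_sizes()
print(marker_areas)
```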

Density plots#

A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions. Relative to a histogram, KDE can produce a plot that is less cluttered and more interpretable, especially when drawing multiple distributions.

Important

The bandwidth, or standard deviation of the smoothing kernel, is an important parameter. Misspecification of the bandwidth can produce a distorted representation of the data.

Example#

Seaborn’s axes-level kdeplot function can help us easily visualize KDEs of multiple numeric variables. Its figure-level equivalent is the displot function, with which we can produce KDE plots by specifying kind="kde".

Suppose the finance manager wants a visual representation of two distributions, the loan amount by loan status and loan amount by loan duration. We can easily produce a kdeplot to not only draw multiple distributions in a single plot but also create axes subplots. Before this, we first produce a CTE with the two variables and save it as a pandas DataFrame():

%%sql --save kde_example
SELECT amount, status, payments, duration
FROM b1.loan
ORDER BY status;
amount status payments duration
165960 A 6915.0 24
110112 A 4588.0 24
69360 A 1445.0 48
12540 A 1045.0 12
274740 A 4579.0 60
87840 A 3660.0 24
52788 A 4399.0 12
107640 A 4485.0 24
154416 A 3217.0 48
117024 A 4876.0 24
Truncated to displaylimit of 10.
result = %sql SELECT * FROM kde_example
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'kde_example'
plt.figure(figsize=(10, 3), dpi=300)  # Initialize blank canvas

plt.subplot(1, 2, 1)  # left subplot
sns.kdeplot(data=df, x="amount", hue="status")
plt.title("KDE of Loan Amount ($) by Loan Status")  # Set title
plt.xlabel("Loan Amount ($)")  # Set x-axis label

plt.subplot(1, 2, 2)  # right subplot
sns.kdeplot(data=df, x="amount", hue="duration")
plt.title("KDE of Loan Amount ($) by Loan Duration (months)")  # Set title
plt.xlabel("Loan Amount ($)")  # Set x-axis label
plt.show()
../_images/a92bd06f624bf21765e4a64b24db07a9cc6c913e85ccf4f0f69e120e78ca22cd.png

Question 3 (Easy)#

Similar to the way we customized our figure-level plots for the previous section, we can do the same for axes-level plots too! Your task is to remove the duplicate axes labels and rename the legend titles to provide a cleaner, publication-level visualization. For loan duration, provide the units in years rather than months.

Hint Consult Matplotlib’s axes class documentation to find the right functions!

Answers

We do not need to make a new CTE and can jump straight into programming with seaborn. Since we are using the same plot as the example, copying that code and building on top of it is a good starting point. Instead of using multiple plt.subplot() calls, we initialize the whole figure with fig and the individual axes, in this case only two (ax1 and ax2), with plt.subplots(1, 2, ...). The first and second plots are customized with their respective axes objects and the functions from the matplotlib.axes class:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

sns.kdeplot(data=df, x="amount", hue="status", ax=ax1)
ax1.set_title("KDE of Loan Amount ($) by Loan Status")
ax1.set_xlabel("")  # Remove x-axis label
ax1.get_legend().set_title("Loan Status")  # keep seaborn's entries, rename the title

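# Convert months to years so the legend entries match the plotted groups
df_years = df.assign(duration_years=df["duration"] // 12)
sns.kdeplot(data=df_years, x="amount", hue="duration_years", ax=ax2)
ax2.set_title("KDE of Loan Amount ($) by Loan Duration (years)")
ax2.set_xlabel("")  # Remove x-axis label
ax2.get_legend().set_title("Loan Duration (years)")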

fig.supxlabel("Loan Amount ($)")  # x-axis label for whole figure

plt.show()
../_images/e8106a4d06dda07d4f82f121a53376b8dc4e6d9868441d6d5b7cc97eb3fbd619.png

The plot above is cleaner, with less overplotting, and has the correct units across all labels. It is worth taking the extra time to produce good quality visualizations, especially for assignment or paper/conference submissions.

Box and whisker plot#

A box and whisker plot (box plot for short) displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile (25th percentile) to the third quartile (75th percentile). A line goes through the box at the median, which is also the 50th percentile.
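The five-number summary is easy to compute by hand. The sketch below does so with NumPy on a made-up list of values, using NumPy's default linear interpolation for the quartiles.

```python
import numpy as np

# Made-up values for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range: the height of the box

print(minimum, q1, median, q3, maximum, iqr)
```

Note that seaborn's boxplot, by default, does not necessarily draw the whiskers all the way to the minimum and maximum: they reach the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an individual outlier point.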

In seaborn, boxplot is an Axes-level function and has the same object-oriented functionality as the kdeplot. There are several visual variations of boxplots in seaborn, such as the violinplot, swarmplot or stripplot, and boxenplot. All of these functions are also at the axes-level.

Example#

Suppose the finance manager wants boxplots of the moving average of loan amount, rounded to 0 decimals, computed over the three dates preceding and the two dates following the current date of a record. These amounts should be displayed with the loan’s duration. If we recall, we saw this example in the Advanced Aggregations section.
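To make the window concrete, here is a plain-Python sketch of a moving average over three preceding rows, the current row, and two following rows. The function name and the toy amounts are hypothetical. Near the edges the window simply shrinks to the rows that exist, which is also how a SQL ROWS BETWEEN frame behaves.

```python
def moving_average(amounts, preceding=3, following=2):
    """Average each value with its neighbours in date order."""
    averages = []
    for i in range(len(amounts)):
        start = max(0, i - preceding)  # clip the window at the first row
        stop = min(len(amounts), i + following + 1)  # and at the last row
        window = amounts[start:stop]
        averages.append(sum(window) / len(window))
    return averages


# Toy amounts, already sorted by date (hypothetical values)
amounts = [10, 20, 30, 40, 50, 60, 70]
averages = moving_average(amounts)
print(averages)
```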

Let us create the CTE and turn it into a pandas DataFrame() first:

%%sql --save boxplot_example
SELECT date, duration, ROUND(avg_amount, 0) AS avg_amount
FROM (SELECT date, duration, AVG(amount) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING) AS avg_amount FROM b1.loan)
ORDER BY date;
date duration avg_amount
930705 12 129812.0
930711 36 123810.0
930728 60 153996.0
930803 36 142970.0
930906 60 135702.0
930913 24 137166.0
930915 12 141722.0
930924 24 143592.0
931013 48 101456.0
931104 24 100084.0
Truncated to displaylimit of 10.
result = %sql SELECT * FROM boxplot_example
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'boxplot_example'
plt.figure(figsize=(15, 5), dpi=300)  # Initialize blank canvas
sns.boxplot(data=df, x="duration", y="avg_amount")
plt.ylabel("Moving-Average Loan Amount ($)")
plt.xlabel("Loan Duration (months)")
plt.show()
../_images/73726dd1d7e4ee80be07c539c8992f68a80e138442fd7e317b659f37f0903c82.png

Question 4 (Medium)#

The manager comes back and asks for another grouping layer, loan status, on top of the boxplot in the example. The idea should remain the same, but this time calculate the moving-average of loan amount for every five dates following the current date of a record. Also, output the loan duration in years rather than months and rename the legend title to “Loan Status”.

Hint Recall the clause used to group the data when using windowing queries. Also, use matplotlib.pyplot functions to quickly and easily customize the plot.

Answers

The additional clause in the CTE is PARTITION BY, which adds the additional grouping by status. The windowing frame is also changed to incorporate five rows ahead of the current row:

%%sql --save boxplot_question
SELECT date, duration, status, ROUND(avg_amount, 0) AS avg_amount
FROM (SELECT date, duration, status, AVG(amount) OVER (PARTITION BY status ORDER BY date ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING) AS avg_amount FROM b1.loan)
ORDER BY date;
date duration status avg_amount
930705 12 B 179332.0
930711 36 A 135702.0
930728 60 A 133778.0
930803 36 A 132102.0
930906 60 A 118122.0
930913 24 A 85600.0
930915 12 A 86860.0
930924 24 B 176084.0
931013 48 A 93710.0
931104 24 A 75810.0
Truncated to displaylimit of 10.
result = %sql SELECT * FROM boxplot_question
df = pd.DataFrame(result)
Generating CTE with stored snippets: 'boxplot_question'
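For comparison, a rough pandas equivalent of PARTITION BY status with a frame from the current row to five rows ahead might look like the sketch below. The toy loans frame and the forward_mean helper are hypothetical. The helper reverses each group so that an ordinary backward-looking rolling window becomes a forward-looking one.

```python
import pandas as pd

# Toy stand-in for b1.loan, sorted by date (hypothetical rows)
loans = pd.DataFrame(
    {
        "date": [1, 2, 3, 4],
        "status": ["A", "B", "A", "A"],
        "amount": [10, 100, 20, 30],
    }
).sort_values("date")


def forward_mean(amounts, following=5):
    """Mean of the current row and up to `following` rows after it."""
    reversed_amounts = amounts.iloc[::-1]
    rolled = reversed_amounts.rolling(following + 1, min_periods=1).mean()
    return rolled.iloc[::-1]  # restore the original order


# groupby plays the role of PARTITION BY status
loans["avg_amount"] = loans.groupby("status")["amount"].transform(forward_mean)
print(loans)
```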

Unlike the previous section in which we employed the matplotlib.axes functions to customize the plot, we use the simpler matplotlib.pyplot functions because we have not faceted the boxplot:

plt.figure(figsize=(15, 5), dpi=300)  # Initialize blank canvas
sns.boxplot(data=df, x="duration", y="avg_amount", hue="status")
plt.ylabel("Moving-Average Loan Amount ($)")
plt.xlabel("Loan Duration (years)")
plt.xticks([0, 1, 2, 3, 4], ["1", "2", "3", "4", "5"])
plt.legend(title="Loan Status")
plt.show()
../_images/490902106b554f9be5055905491c61ffbce3465cede110e290f9a6d8d9b6df8b.png

Wrapping Up#

In this section, we learned about plotting four types of visualizations with seaborn. To summarize:

  • Axes-level functions plot onto a single subplot that may or may not exist at the time the function is called

  • Figure-level functions internally create a matplotlib figure, potentially including multiple subplots

  • seaborn.barplot, an axes-level function, should be used for visualizing count data

  • seaborn.scatterplot, an axes-level function, helps visualize correlations between two numeric variables, subsetted on categorical variables if needed. seaborn.relplot is a figure-level function that combines scatterplot with a FacetGrid and can expedite the EDA process when combining multiple types of columns into a single visualization

  • seaborn.kdeplot, an axes-level function, creates a Kernel Density Estimate plot, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions. The function can also account for categorical levels. Its figure-level alternative is seaborn.displot with kind="kde"

  • seaborn.boxplot is useful for visualizing the summary distribution of numeric variables, grouped by none, one, or multiple categorical variables. Several variations of the boxplot are provided by seaborn

In the next section, you will learn how to plot similar visualizations using the plotly Python library.

References#

API reference - seaborn 0.12.2 documentation. (n.d.). https://seaborn.pydata.org/api.html

API Reference - Matplotlib 3.7.1 documentation. (n.d.). https://matplotlib.org/stable/api/index