Intro to Voilà

Intro to Voilà#

In this section, we will learn how to use the Voilà Python library to create a dashboard from our SQL queries and visualizations. We will also explore how the ETL (Extract, Transform, Load) and EDA (Exploratory Data Analysis) pipeline, introduced in the previous module, is integrated with the dashboard, examine the dashboard’s structure and deployment, and discuss interesting insights gathered from it.

What is Voilà?#

Voilà is a Python library that lets users create standalone web applications from Jupyter notebooks. Voilà takes the output of your notebook, hiding code cells by default, and renders it in a web browser so that you can share your work or use it in a production setting. With the help of ipywidgets, we can turn the rendered web application into an interactive dashboard, and this module covers that in detail!

Moreover, Voilà offers several ways to customize your dashboard, including changing themes, creating templates, and controlling cell output. This allows you to create a visually aesthetic dashboard from a Jupyter notebook with minimal additional code! Markdown cells are also displayed in the dashboard, allowing you to add text and images to your dashboard.

Install Voilà with the following command:

!pip install voila

Important

This section assumes you are comfortable working with ipywidgets. For an introduction, please review the introductory section on ipywidgets, the section on query parameterization, and the section on interactive queries with JupySQL.

Questions to Answer in the Dashboard#

The interactive dashboard contains 4 tables and 5 plots, created using JupySQL’s ggplot API, seaborn, and ipywidgets. It answers the following questions:

How do yearly manufacturing trends of fuel-only, electric, and hybrid cars compare?
How are fuel consumption and \(CO_2\) emissions distributed for all types of cars?
What is the relationship between charging time and travel range for electric vehicles by car size and model year?
How are \(CO_2\) emissions distributed by vehicle type (fuel-only, electric, and hybrid) and fuel type (gasoline, diesel, ethanol, natural gas, and electricity)?
Which US fuel-only and hybrid car manufacturers emit the least \(CO_2\) and how does this differ by transmission type?

Directory Structure#

For this section, we will assume the following directory structure:

├── environment.yml
├── pipeline
│   ├── pipeline.yaml
│   ├── data
│   │   ├── database
│   │   │   ├── car_data.duckdb
│   ├── products
│   ├── src
│   │   ├── menu.py
│   │   ├── dashboard.py
│   │   ├── voila-app.ipynb
│   │   ├── datadownload.py
└── README.md

The voila-app.ipynb file in the sql/pipeline/src directory serves as our dashboard notebook. Additionally, all necessary files and modules, including the menu.py and dashboard.py files, are in the same directory as the notebook.

Dependencies#

The environment.yml file in the root directory of this course contains the dependencies for the project. You can create a conda environment using that file and then activate the environment to run voila-app.ipynb, as shown below:

conda env create -f environment.yml

conda activate sql-course

Running the Pipeline#

If you have not already done so, you need to build, package and run the pipeline by following the steps outlined in the previous module.

After running the pipeline, you will have the car_data.duckdb database, under pipeline/data/database, ready to be queried by the dashboard.

To execute the pipeline, run the following command in the terminal from the directory that contains the pipeline folder:

cd pipeline && ploomber build

This yields:

Loading pipeline...
Executing: 100%|███████████████████████████████████████████████████████████████████████████████| 18/18 [00:10<00:00,  1.69cell/s]
Executing:  29%|██████████████████████▊                                                         | 8/28 [00:10<00:25,  1.28s/cell]
Building task 'eda-pipeline': 100%|████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.48s/it]
name          Ran?      Elapsed (s)    Percentage
------------  ------  -------------  ------------
datadownload  True          10.7184        51.168
eda-pipeline  True          10.2291        48.832
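Before launching the dashboard, it can help to confirm that the build actually produced the database file. The following is a minimal sketch, not part of the course code: the helper name `find_database` is hypothetical, and the relative path is taken from the directory tree shown earlier.

```python
from pathlib import Path

# Relative location of the database, taken from the directory tree above.
DB_RELATIVE_PATH = Path("pipeline") / "data" / "database" / "car_data.duckdb"


def find_database(project_root="."):
    """Return the path to car_data.duckdb, or raise if it does not exist yet.

    `find_database` is a hypothetical helper for illustration only.
    """
    db_path = Path(project_root) / DB_RELATIVE_PATH
    if not db_path.is_file():
        raise FileNotFoundError(
            f"{db_path} not found; run `ploomber build` in the pipeline folder first."
        )
    return db_path
```

Calling `find_database(".")` from the project root returns the path when the pipeline has run, and otherwise fails early with a readable message instead of a connection error inside the notebook.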

ETL and Voilà#

The Ploomber pipeline will store the tables into a database file under /pipeline/data/database called car_data.duckdb. We can connect our dashboard to the DuckDB instance and generate queries for our visualizations.

Note: The ../ prefix is not required if the database is in the same directory as the notebook.

The pipeline process described above can be better understood with the following diagram:

ETL Pipeline

Dashboard Structure#

You can find the Jupyter notebook with the Voilà app here. The dashboard also uses custom helper scripts:

See menu.py
See dashboard.py

Introduction and Tables#

First, the dashboard needs a relevant title and description so that the user understands what it is about. The date the fuel emissions data was last updated is also displayed: the data is updated monthly, so the tables and visualizations may show new insights since the last update. Next, we display the interactive table, built with ipywidgets from the output of our ETL pipeline, which lets the user interact with the numerical and categorical columns in each table (fuel-only, hybrid, and electric). The user can then begin the EDA process by filtering columns to find interesting patterns and relationships.

Specifically, SelectMultiple, Dropdown, and Combobox widgets are employed to filter categorical columns, including the car’s fuel type, size, model, and model year. Note that Combobox, which is a String widget, was not introduced earlier in the course and we recommend taking a look at its documentation. The \(CO_2\) ratings column (a higher rating suggesting lower \(CO_2\) emissions) can also be filtered with the IntSlider widget. The aforementioned widgets and table are shown below:

The code below is used to create the widgets and table. Press ‘Show code source’ to view the code.

The %sql magic command lets us connect to the database and query it. The duckdb:/// prefix specifies the database type, and the path that follows, pipeline/data/database/car_data.duckdb, gives the location of the database relative to the notebook. The same magic command is used in the following code cells to create the CTEs behind the visualizations.

The code below will help us initialize a DuckDB instance that the Voilà app can use to fetch data.

%load_ext sql

%sql duckdb:///pipeline/data/database/car_data.duckdb

%config SqlMagic.displaycon = True

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

The code below will help us extract unique values for each column in the all_vehicles table. These values will be used to populate the widgets. Press ‘Show code source’ to view the code.

The code below will help us initialize the widgets and the tab. Press ‘Show code source’ to view the code.
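The dashboard's own widget code is collapsed on the original page and is not reproduced here. As a rough sketch of the underlying idea only, the function below turns widget-style selections into a parameterized query. Several things are assumptions: the name `build_filter_query` is hypothetical, the in-memory `sqlite3` table stands in for the DuckDB database, and its rows are made up for illustration. The arguments are plain values of the kind a `SelectMultiple` or `Dropdown` widget exposes through `.value`.

```python
import sqlite3

# Table names that appear in this section; a table name cannot be bound as a
# query parameter, so it is checked against this list instead.
KNOWN_TABLES = {"fuel", "electric", "all_vehicles"}


def build_filter_query(table, fuel_types, model_year):
    """Build a parameterized query from widget-style selections (hypothetical helper).

    fuel_types: tuple of strings, like the .value of a SelectMultiple widget.
    model_year: a single year, like the .value of a Dropdown widget.
    """
    if table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {table}")
    clauses = ["model_year = ?"]
    params = [model_year]
    if fuel_types:  # an empty selection means no fuel-type filter
        placeholders = ", ".join("?" for _ in fuel_types)
        clauses.append(f"mapped_fuel_type IN ({placeholders})")
        params.extend(fuel_types)
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params


# Stand-in table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_vehicles (model_year INTEGER, mapped_fuel_type TEXT)")
conn.executemany(
    "INSERT INTO all_vehicles VALUES (?, ?)",
    [(2022, "regular gasoline"), (2022, "diesel"), (2023, "electricity")],
)
query, params = build_filter_query("all_vehicles", ("diesel", "electricity"), 2022)
rows = conn.execute(query, params).fetchall()
print(rows)  # [(2022, 'diesel')]
```

Binding the selected values as parameters, rather than formatting them into the SQL string, keeps user input out of the query text. In the notebook, a function like this would typically be wrapped with `interact` so that the table refreshes whenever a widget changes.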

Important

During this process, it is essential to become acquainted with the various types of data present in each table, such as numerical, categorical, boolean, and others, before proceeding with data visualization. This familiarity aids the user in developing an intuition about the most suitable visualizations for emphasizing columns individually or exploring relationships between them.

Questions Answered through Visualizations#

After the tables, the dashboard lists the questions that its visualizations answer (see above). It is recommended to begin with simpler, high-level questions about the data, such as counts of different vehicle types or distributions of emissions, before diving into relationships between columns. Doing so allows you to make informed decisions about the types of visualizations to use for answering complex questions.

Interactive Visualizations#

`Seaborn` Bar Plots#

The first visualization is a seaborn barplot(), made interactive with ipywidgets, that shows the number of unique fuel-only, electric, and hybrid car models by the year they were released. A RadioButtons widget lets the user toggle between a bar plot of fuel-only cars and a bar plot of hybrid and electric cars. The two groups are plotted separately because fuel-only models in the dataset date back to 1995, while hybrid and electric models date back only to 2012.

Notice that two CTEs, one that uses the fuel table for the fuel-only bar plot and another that uses the all_vehicles table for the electric and hybrid bar plot, were created and converted into pandas DataFrames to pass into our barplot() function that initializes the plots. The CTEs generated are shown below:

%%sql --save q_1_hybrid_electric
SELECT DISTINCT model_year, vehicle_type, COUNT(id) AS num_vehicles
FROM all_vehicles
WHERE vehicle_type = 'hybrid' OR vehicle_type = 'electric'
GROUP BY model_year, vehicle_type
ORDER BY num_vehicles DESC;

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

model_year	vehicle_type	num_vehicles
2023	electric	134
2022	electric	82
2021	electric	49
2021	hybrid	40
2022	hybrid	40
2020	electric	40
2019	electric	35
2023	hybrid	32
2020	hybrid	31
2016	electric	27

Truncated to displaylimit of 10.

%%sql --save q_1_fuel
SELECT DISTINCT model_year, vehicle_type, COUNT(id) AS num_vehicles
FROM fuel
GROUP BY model_year, vehicle_type
ORDER BY model_year;

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

model_year	vehicle_type	num_vehicles
1995	fuel-only	839
1996	fuel-only	698
1997	fuel-only	658
1998	fuel-only	634
1999	fuel-only	688
2000	fuel-only	638
2001	fuel-only	679
2002	fuel-only	740
2003	fuel-only	820
2004	fuel-only	898

Truncated to displaylimit of 10.

We will use the resulting CTEs to create pandas DataFrames that will be passed into our barplot() function. The code below shows how we can convert the CTEs into pandas DataFrames.

hybrid_electric_count = %sql SELECT * FROM q_1_hybrid_electric
fuel_count = %sql SELECT * FROM q_1_fuel

hybrid_electric_count = pd.DataFrame(hybrid_electric_count)
hybrid_electric_count = hybrid_electric_count.sort_values(by=["model_year"])
fuel_count = pd.DataFrame(fuel_count)
fuel_count = fuel_count.sort_values(by=["model_year"])

Generating CTE with stored snippets: 'q_1_hybrid_electric'

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

Generating CTE with stored snippets: 'q_1_fuel'

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'
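The "Generating CTE with stored snippets" messages indicate that a query saved with `--save` is prepended as a common table expression when a later query selects from it. The sketch below imitates that composition by hand. It is an illustration, not JupySQL's implementation: `with_snippet` is a hypothetical helper, `sqlite3` stands in for DuckDB, the saved query is simplified, and the three rows are made up.

```python
import sqlite3

# A saved query, similar in spirit to what `%%sql --save q_1_fuel` stores.
snippets = {
    "q_1_fuel": (
        "SELECT model_year, vehicle_type, COUNT(id) AS num_vehicles "
        "FROM fuel GROUP BY model_year, vehicle_type"
    )
}


def with_snippet(name, query):
    """Prepend the stored snippet `name` to `query` as a CTE (hypothetical helper)."""
    return f"WITH {name} AS ({snippets[name]}) {query}"


# Stand-in table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fuel (id INTEGER, model_year INTEGER, vehicle_type TEXT)")
conn.executemany(
    "INSERT INTO fuel VALUES (?, ?, ?)",
    [(1, 1995, "fuel-only"), (2, 1995, "fuel-only"), (3, 1996, "fuel-only")],
)

full_query = with_snippet("q_1_fuel", "SELECT * FROM q_1_fuel ORDER BY model_year")
rows = conn.execute(full_query).fetchall()
print(rows)  # [(1995, 'fuel-only', 2), (1996, 'fuel-only', 1)]
```

Because the snippet is expanded inside the final statement, `q_1_fuel` never has to exist as a table or view in the database.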

We can then use the resulting dataframes after performing queries as follows.

barplot = Seaborn_Barplot(fuel_count, hybrid_electric_count)
interact(barplot.draw_bar_year_count, data=barplot.radio_button);

Insights#

The bar plot shows the number of unique car brand models for fuel-only cars in the Canadian market. It reveals an increasing trend until 2005, followed by a plateau until 2022. Notably, there was a significant spike in 2015. The introduction of regulations requiring higher percentages of zero-emissions vehicles seems to have influenced the market\(^1\), as 2023 experienced a sharp decline, reaching levels similar to those in 2003.

Note: To visualize the bar plot of the number of unique hybrid and electric car models, we recommend deploying the dashboard and interacting with the RadioButtons.

In 2012, only two electric car models, Nissan’s Leaf and Mitsubishi’s i-MiEV, and one hybrid car model, Chevrolet’s Volt, were present in the market. Since then, this figure has grown to 134 electric car models and 32 hybrid car models in 2023 in Canada.

`ggplot` API Boxplot#

To keep users engaged with your dashboard, it helps to vary the libraries you use, not just the visualizations and widgets. This dashboard therefore uses JupySQL’s ggplot API for the second visualization, a boxplot of fuel consumption and \(CO_2\) emissions for all types of vehicles. Recall this example from the Visualizing Your SQL Queries module.

The \(CO_2\) emissions variable, co2emissions_g_km, is measured in grams per kilometer, a different scale from the liters per 100 kilometers used for the fuel consumption variables. Therefore, it is recommended to visualize co2emissions_g_km on its own to avoid confusion.

The process of building the boxplot is slightly different to the bar plot. A CTE is created from the all_vehicles table, but not converted into a pandas DataFrame because the ggplot function takes in only a SQL table name, instead of a dataset. The CTE is shown below:

%%sql --save boxplot_fuel_consum
SELECT fuelconsumption_city_l_100km::FLOAT as fuelconsumption_city_l_100km,
fuelconsumption_hwy_l_100km::FLOAT as fuelconsumption_hwy_l_100km,
fuelconsumption_comb_l_100km::FLOAT as fuelconsumption_comb_l_100km,
co2emissions_g_km::FLOAT as co2emissions_g_km
FROM all_vehicles

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

fuelconsumption_city_l_100km	fuelconsumption_hwy_l_100km	fuelconsumption_comb_l_100km	co2emissions_g_km
7.900000095367432	6.300000190734863	7.199999809265137	167.0
8.100000381469727	6.5	7.400000095367432	172.0
8.899999618530273	6.5	7.800000190734863	181.0
12.600000381469727	9.399999618530273	11.199999809265137	263.0
13.800000190734863	11.199999809265137	12.399999618530273	291.0
11.0	8.600000381469727	9.899999618530273	232.0
11.300000190734863	9.100000381469727	10.300000190734863	242.0
11.199999809265137	8.0	9.800000190734863	230.0
11.300000190734863	8.100000381469727	9.800000190734863	231.0
12.300000190734863	9.399999618530273	11.0	256.0

Truncated to displaylimit of 10.

Then, a SelectMultiple widget is stored in the variable columns and passed to the ggplot function as x=columns to choose among the four columns in the CTE. Finally, the interact function is called to render the boxplot with the SelectMultiple widget. A preview of the boxplot and its widget is displayed below:

boxplot = Boxplot_ggplot()
interact(boxplot.fuel_co2_boxplot, columns=boxplot.selection_button);

Insights#

The boxplots above illustrate fuel consumption and \(CO_2\) emissions for various types of cars. The median fuel consumption in the city, on the highway, and combined (average) for all cars is approximately 12, 10, and 11 litres per 100 kilometers, respectively. There is a positive correlation between fuel consumption and \(CO_2\) emissions, with the median \(CO_2\) emission for all cars being around 250 grams per kilometer. Electric cars have zero \(CO_2\) emissions, while fuel-only luxury sports cars exhibit the highest \(CO_2\) emissions.
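To make the quantities a boxplot summarizes concrete, the sketch below computes a five-number summary with Python's standard `statistics` module. It uses only the ten co2emissions_g_km values from the preview rows of boxplot_fuel_consum shown earlier, so its numbers describe that small preview and not the full dataset. The helper name `five_number_summary` is hypothetical, and plotting libraries may use a slightly different quartile convention.

```python
import statistics

# co2emissions_g_km values from the ten preview rows of boxplot_fuel_consum.
co2_preview = [167.0, 172.0, 181.0, 263.0, 291.0, 232.0, 242.0, 230.0, 231.0, 256.0]


def five_number_summary(values):
    """Return the minimum, quartiles, and maximum that a boxplot is built from."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


summary = five_number_summary(co2_preview)
print(summary["median"])  # 231.5
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers extend toward the minimum and maximum, with outliers drawn separately by most libraries.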

`Seaborn` Scatterplot#

After answering some simple questions about the data, we can dive into unravelling complex relationships. The third visualization, therefore, is a seaborn scatterplot() to visualize the positive correlation between the recharge time and distance range of electric vehicles with respect to their size and the year of release.

The bar plot above indicates a rising number of unique electric car models from 2021 to 2023. To better analyze recent technological advancements in car battery and range, we grouped the data into two periods: 2012-2021 and 2022-2023. To avoid overplotting, electric car size is grouped into sedans (or smaller) and SUVs (or larger). Also, since we are visualizing more than one hue, one way to make this plot interactive is to add a Dropdown widget that toggles between the size and model year hues.

A single CTE is created from the electric table and converted into a pandas DataFrame, which is passed into the scatterplot() function. The CTE is shown below:

%%sql --save electric_range_charge
SELECT range1_km, recharge_time_h, vehicleclass_, model_year
FROM electric

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

range1_km	recharge_time_h	vehicleclass_	model_year
100	7	subcompact	2012
117	7	mid-size	2012
122	4	compact	2013
100	7	subcompact	2013
117	7	mid-size	2013
109	8	two-seater	2013
109	8	two-seater	2013
224	6	full-size	2013
335	10	full-size	2013
426	12	full-size	2013

Truncated to displaylimit of 10.

Like before, the interact function is called to visualize the scatterplot with the Dropdown widget, as shown below:

electric_range = %sql SELECT * FROM electric_range_charge

electric_range = pd.DataFrame(electric_range)
# Use a new variable name so the clean_electric_range function is not overwritten
cleaned_electric_range = clean_electric_range(electric_range)

Generating CTE with stored snippets: 'electric_range_charge'

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

scatter = Seaborn_Scatter(cleaned_electric_range)
interact(scatter.draw_scatter_electric_range, hue=scatter.dropdown);
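The `clean_electric_range` helper is not shown in this section. The sketch below illustrates only the grouping described earlier (two model-year periods and two size groups) and is not the course's implementation. The function names and the keyword list that decides which vehicle classes count as "SUV or larger" are assumptions.

```python
# Assumed keywords: a vehicle class containing any of these is treated as
# "SUV or larger". The real class labels in the dataset may differ.
LARGE_CLASS_KEYWORDS = ("suv", "pickup", "van")


def period_for_year(model_year):
    """Assign a model year to one of the two periods used in the scatterplot."""
    return "2022-2023" if model_year >= 2022 else "2012-2021"


def size_group(vehicle_class):
    """Collapse a detailed vehicle class into one of two size groups."""
    label = vehicle_class.lower()
    if any(keyword in label for keyword in LARGE_CLASS_KEYWORDS):
        return "SUV or larger"
    return "sedan or smaller"


print(period_for_year(2013), "|", size_group("subcompact"))
```

With a pandas DataFrame, these functions could be applied column-wise, for example with `Series.map` on model_year and vehicleclass_, to produce the two hue columns that the Dropdown switches between.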

Insights#

The scatterplot provides insights into electric cars’ ranges and charging times based on their size and model year. Recent electric cars (manufactured from 2022 onwards) generally have a higher average range compared to those made between 2012 and 2021, likely due to advancements in battery technology and increased demand\(^2\). Interestingly, some newer electric cars with a 10-hour recharge time offer better ranges than older cars with a 12-hour recharge time.

Note: To visualize this scatterplot by vehicle size, we recommend deploying the dashboard and interacting with the Dropdown.

Regarding vehicle size, there are more electric sedans (and smaller cars) than SUVs (and larger vehicles) with recharge times between 4 and 7 hours, as expected. Sedans also tend to offer greater ranges than SUVs for recharge times over 7 hours. However, for recharge times less than 7 hours, SUVs provide greater ranges, likely due to their larger batteries. Notably, some sedans with a 10-hour recharge time offer better ranges than all SUVs with recharge times exceeding 10 hours.

Therefore, consumers have diverse options when it comes to electric cars, and making an informed choice involves considering the tradeoff between recharge time and range. This scatterplot aids in understanding and assessing these factors for different electric car models.

`ggplot` API Histogram#

The boxplot of \(CO_2\) emissions of all types of vehicles is a good starting point for exploring the distribution of \(CO_2\) emissions. Still, further insights can be drawn by visualizing the distribution of \(CO_2\) emissions by vehicle type (fuel-only, electric, and hybrid) and fuel type. Therefore, the fourth visualization is a ggplot histogram of \(CO_2\) emissions of fuel-only cars, electric cars, and hybrid cars. This histogram is similar to the example from the Visualizing Your SQL Queries module.

Three widgets are employed for this visualization: two selection widgets (RadioButtons and Dropdown) and an IntSlider. The RadioButtons widget maps fill to either vehicle type or fuel type. The Dropdown widget lets the user choose any of the five cmap options, which changes the color of the bars. The IntSlider selects the number of bins for the histogram.

The CTE, created from the all_vehicles table, is shown below:

%%sql --save hist_co2
SELECT vehicle_type, mapped_fuel_type,
co2emissions_g_km::INTEGER as co2emissions_g_km
FROM all_vehicles
WHERE co2emissions_g_km is not null

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

vehicle_type	mapped_fuel_type	co2emissions_g_km
fuel-only	premium gasoline	167
fuel-only	premium gasoline	172
fuel-only	premium gasoline	181
fuel-only	premium gasoline	263
fuel-only	premium gasoline	291
fuel-only	premium gasoline	232
fuel-only	premium gasoline	242
fuel-only	premium gasoline	230
fuel-only	premium gasoline	231
fuel-only	premium gasoline	256

Truncated to displaylimit of 10.

A preview of the histogram and its widgets is displayed below:

histogram = Histogram_ggplot()
interact(
    histogram.co2_histogram,
    b=histogram.intslider,
    cmap=histogram.dropdown,
    fill=histogram.radio_button,
);
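To see what the bin count chosen with the IntSlider changes, the sketch below counts values in equal-width bins using plain Python. It only illustrates binning; it is not how JupySQL computes histograms. The helper name `histogram_counts` is hypothetical, and the input is limited to the ten co2emissions_g_km values from the hist_co2 preview rows shown earlier.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min(values)..max(values)."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if width == 0:
            index = 0  # all values are identical, so they share the first bin
        else:
            # The maximum value is placed in the last bin, not a bin of its own.
            index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# co2emissions_g_km values from the ten preview rows of hist_co2.
co2_preview = [167, 172, 181, 263, 291, 232, 242, 230, 231, 256]
print(histogram_counts(co2_preview, 4))  # [3, 0, 5, 2]
print(histogram_counts(co2_preview, 2))  # [3, 7]
```

Fewer bins smooth the shape and can hide gaps, such as the empty second bin above, while more bins reveal detail at the cost of noisier bars.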

Insights#

The histogram illustrates the distribution of \(CO_2\) emissions (measured in grams per kilometer) for different vehicle or fuel types. Fuel-only cars emit the most \(CO_2\), approximately 6 times more than hybrid cars, which combine an electric motor and a gasoline engine. Hybrid cars emit between 10 and 80 grams per kilometer, while fuel-only cars emit between 100 and 500 grams per kilometer, with the majority emitting between 200 and 300 grams per kilometer. In contrast, electric cars have zero \(CO_2\) emissions and are fittingly known as zero-emission vehicles.

Note: To visualize this histogram by fuel type, we recommend deploying the dashboard and interacting with the RadioButtons.

Additionally, analyzing the histogram with the mapped_fuel_type attribute shows that most vehicles in Canada run on regular gasoline, and that some premium gasoline cars emit more than 450 grams per kilometer. Diesel and Ethanol (E85) vehicles emit slightly less \(CO_2\) than gasoline vehicles, with emissions ranging from 150 to 400 grams per kilometer and the majority emitting between 200 and 300 grams per kilometer.

`Seaborn` Boxplot#

The dataset contains five clean fuel (electric and hybrid) US car manufacturers. To assess which is the cleanest, \(CO_2\) emissions are visualized in a grouped boxplot by vehicle type and transmission type. Similar to the scatterplot() above, the hue attribute is integrated with a Dropdown widget to toggle between vehicle type, transmission type, or no hue.

The CTE is shown below:

%%sql --save co2_usa
SELECT vehicle_type, make_, co2emissions_g_km, transmission_type
FROM all_vehicles
WHERE co2emissions_g_km is not null AND
vehicle_type IN ('fuel-only', 'hybrid') AND
make_ IN ('cadillac', 'chevrolet', 'chrysler', 'ford', 'jeep', 'lincoln')

Running query in 'duckdb:///pipeline/data/database/car_data.duckdb'

vehicle_type	make_	co2emissions_g_km	transmission_type
fuel-only	cadillac	206	automatic with select Shift
fuel-only	cadillac	221	automatic with select Shift
fuel-only	cadillac	216	automatic with select Shift
fuel-only	cadillac	231	automatic with select Shift
fuel-only	cadillac	239	automatic with select Shift
fuel-only	cadillac	244	automatic with select Shift
fuel-only	cadillac	297	automatic with select Shift
fuel-only	cadillac	303	manual
fuel-only	cadillac	207	automatic with select Shift
fuel-only	cadillac	252	automatic with select Shift

Truncated to displaylimit of 10.

A preview of the boxplot and its widgets is displayed below:

boxplot = Seaborn_Boxplot(co2_usa)
interact(boxplot.draw_boxplot_usa, hue=boxplot.dropdown);

Insights#

The boxplots depict the distribution of \(CO_2\) emissions for hybrid and fuel-only cars manufactured in the US. Chrysler shows the lowest median \(CO_2\) emission (around 250 grams per kilometer) among all car brands, indicating consistent emissions. Chevrolet, however, has the highest median \(CO_2\) emission (around 300 grams per kilometer) among US car brands.

When considering the vehicle type, Chevrolet’s hybrid cars have the lowest median \(CO_2\) emission among all hybrid car brands. In contrast, Jeep’s hybrid cars have the highest median \(CO_2\) emission among US hybrid car brands. However, among fuel-only cars, Chrysler and Chevrolet have comparable median \(CO_2\) emissions, with Chrysler being the cleanest.

Note: To visualize this boxplot by transmission type, we recommend deploying the dashboard and interacting with the Dropdown.

Moreover, continuously variable transmission (CVT) cars generally have the lowest \(CO_2\) emissions among various transmission types. Hybrid cars from US brands, which are the cleanest among hybrids, often utilize CVT. With the exception of Chrysler, all brands have lower median \(CO_2\) emissions for manual transmission cars compared to automatic transmission cars. This aligns with the EPA’s findings \(^3\) that manual transmissions were more efficient than automatic transmissions until around 2010, but modern automatic transmissions have since improved in efficiency. Ford is the sole brand offering an automated manual transmission, which exhibits a wide distribution of \(CO_2\) emissions similar to Cadillac’s CVT cars, but its median \(CO_2\) emission is lower than that of its automatic transmission cars.

Launching the Dashboard Locally#

Whether your Jupyter Notebook is incomplete or complete, Voilà will still be able to render it locally on the web! This will help you in prototyping your dashboard and fit it exactly to your needs. Before launching, make sure you have satisfied the following requirements:

Launching the Dashboard#

To launch the dashboard, run the following command in your terminal, given that you are in the same directory as the voila-app.ipynb file:

voila voila-app.ipynb

With that, you will see your Jupyter Notebook come to life as a dashboard!

References#

\({^1}\) Canada, Service. “Government of Canada.” Service Canada, n.d. https://www.canada.ca/.

\({^2}\) Crownhart, Casey. “What’s next for Batteries.” MIT Technology Review, 5 Jan. 2023, www.technologyreview.com/2023/01/04/1066141/whats-next-for-batteries/.

\({^3}\) The 2020 EPA Automotive Trends Report: Greenhouse gas emissions, fuel …, n.d. https://www.epa.gov/sites/default/files/2021-01/documents/420r21003.pdf.

“Widget List.” Widget List - Jupyter Widgets 8.0.5 documentation, n.d. https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html.

API reference - seaborn 0.12.2 documentation. (n.d.). https://seaborn.pydata.org/api.html

API Reference - Matplotlib 3.7.1 documentation. (n.d.). https://matplotlib.org/stable/api/index
