Introduction to the dataset and problem: CO2 emissions of vehicles#

In previous sections of this course, we learned how to perform exploratory data analysis on an existing database. This time, we will learn how to set up an automated process that fetches data from a regularly updated API, cleans it, and stores it in a database for further analysis. In this section of the course, you will learn about Python scripting: how to structure a .py script and how it differs from working with Jupyter notebooks. You will also learn about the Extract, Transform, Load (ETL) process and how to combine Python scripting with exploratory data analysis in Jupyter notebooks to populate a database.
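To make the contrast with notebooks concrete, here is a minimal sketch of how a .py script is typically structured: small importable functions plus a `main()` entry point guarded by `if __name__ == "__main__":`. The function names are illustrative placeholders, not part of the pipeline we will build.

```python
def describe_step(step: str) -> str:
    """Example of a function a script exposes for import and reuse."""
    return f"Running pipeline step: {step}"


def main() -> None:
    # Unlike a notebook cell, this runs top to bottom in one pass
    print(describe_step("extract"))
    print(describe_step("transform"))
    print(describe_step("load"))


if __name__ == "__main__":
    # This guard lets other modules import the functions above
    # without triggering the pipeline run.
    main()
```

Because the functions live at module level, a notebook (or a test) can `import` and call them individually, while running the file directly executes the whole pipeline.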

About the data#

To help consumers in Canada find fuel-efficient vehicles, the Government of Canada released a fuel consumption ratings search tool.

In it, they provide users the ability to search vehicles by model, class, and make and obtain information on the fuel consumption of various vehicles in three settings: city, highway, and combined. Vehicles undergo 2-cycle and 5-cycle fuel consumption testing in each of these settings, and a CO2 emissions score is assigned to the vehicle (for more information see here and here).

Additionally, they provide access to the data through their open data portal under the Open Government License Canada.

While this tool allows consumers to obtain information via the website, an automated approach makes it easier to extract information on the most recent vehicle models and analyze the data to answer important questions:

  1. What are the average CO2 emissions of vehicles by class, make, and model?

  2. Which vehicles are better suited for different driving scenarios (city, highway, and combined) with a focus on lower CO2 emissions?

  3. Can we identify trends in car manufacturing based on the number of vehicles produced per model year or by vehicle class (size) for different fuel types (gas, hybrid, and electric)?

  4. How can fuel efficiency be analyzed for different fuel types (gas, hybrid, and electric)?

  5. What are the benefits of using hybrid vehicles over non-hybrid vehicles?

In the next sections, we will explore how we can combine Python and SQL to answer these questions. We will learn how to automate the process of extracting the data, cleaning it, and setting up an Extract, Transform, Load (ETL) pipeline to load the data into a database. We will then learn how to use SQL to answer questions like the ones above.

The dataset#

Important

Source: Open Canada Portal

Title: Fuel consumption ratings

Link: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

Format: metadata (API), CSV files

The metadata contains information in English and French, and points to CSV files containing information on fuel-based, electric, and hybrid vehicles by year, make, and model, along with their results for city, highway, and combined fuel consumption tests (2-cycle and 5-cycle).
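As a sketch of what working with this metadata looks like, the function below filters a metadata dictionary for English-language CSV links. The dictionary mimics a CKAN-style response; the exact keys and URLs are assumptions for illustration and should be checked against the portal's real response.

```python
# Hypothetical metadata mimicking a CKAN-style API response
sample_metadata = {
    "result": {
        "resources": [
            {"url": "https://example.com/fuel_ratings_2023.csv",
             "format": "CSV", "language": ["en"]},
            {"url": "https://example.com/cotes_carburant_2023.csv",
             "format": "CSV", "language": ["fr"]},
            {"url": "https://example.com/data_dictionary.xlsx",
             "format": "XLSX", "language": ["en"]},
        ]
    }
}


def english_csv_urls(metadata: dict) -> list:
    """Return URLs of English-language CSV resources from the metadata."""
    return [
        resource["url"]
        for resource in metadata["result"]["resources"]
        if resource["format"] == "CSV" and "en" in resource.get("language", [])
    ]


print(english_csv_urls(sample_metadata))
```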

The problem#

As the data is refreshed on a regular basis, the goal is to develop a workflow that eases the process of extracting the data, cleaning it and loading it into a database for exploration and analysis.

Approach#

We will use the requests library to extract the metadata. We will then use pandas to process the content of each file, perform data cleaning, and load the data into a DuckDB database with duckdb.

For the analysis, we will then use SQL to answer the questions above and JupySQL’s functionality to perform exploratory data analysis, save interesting queries into the database, and later reuse these tables in the form of a dashboard.

In the next section, we will dive into the process of extracting the data and cleaning it through Python scripting. We will explore how to use requests to extract the metadata, and how to use pandas to process the content of each file, perform data cleaning, and load the data into a DuckDB database with its Python API, duckdb. Furthermore, we will explore how to write clean and reusable functions to ensure our process can be maintained.