Data wrangling and exploratory data analysis in Jupyter#
In the previous tutorial, we used a Python script to call a movie database API and populate a DuckDB instance with the extracted movie data. Furthermore, we packaged the script within a Ploomber pipeline. In this tutorial, we will expand our pipeline to incorporate exploratory data analysis and data wrangling with SQL using a Jupyter notebook. This step will involve extracting data, data wrangling, and creating new tables.
Important
This tutorial assumes you have completed the previous tutorial and have a DuckDB instance populated with movie data.
Ensure you have completed all steps up to this point and execute
cd mini-projects/movie-rec-system
poetry run ploomber build
Setup#
Within our folder structure, we will create a Jupyter notebook called eda.ipynb
in the etl
folder. From last time, our folder structure should look like this:
mini-projects
├──movie-rec-system
├──├── movies_data.duckdb
├──├── movies_data.duckdb.wal
├──├── pipeline.yaml
├──├── pyproject.toml
├──├── README.md
├──├── etl
├──│ └── extract.py
├──│ └── eda.ipynb
├──├── products
├── └── extract-pipeline.ipynb
├──└── tests
├──│ └── __init__.py
├──└── .env
Important
If running the notebook in VSCode, you will need to install the Jupyter extension to run the notebook.
To load the data from our DuckDB instance within Jupyter, we will use JupySQL. JupySQL allows you to run SQL and plot large datasets in Jupyter via a %sql, %%sql, and %sqlplot magics.
Loading %sql
extension and DuckDB
instance#
To set up access to the database, we will use the %sql
extension from jupysql
.
%reload_ext sql
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Important
Did you know you can configure the default behavior of the %sql
magic? The configuration above will transform SQL query results into pandas
format automatically. Learn more in the video below ⬇️⬇️⬇️ .
We can then form a connection string to our DuckDB instance. We will use the %%sql
magic to run SQL queries in our notebook. Notice that when creating and executing this notebook, we are using the same environment as our pipeline. This ensures that we are using the same dependencies and versions as our pipeline.
We need to give our notebook access to the DuckDB instance. To do this, we will use the duckdb:///
prefix. This prefix tells JupySQL to use the DuckDB engine to connect to the database. We will then specify the path to our DuckDB instance.
%sql duckdb:///movies_data.duckdb
⚡⚡ Challenge ⚡⚡#
Explore the tables in the movies_data.duckdb
instance. Use the %sql
magic to run SQL queries to explore the tables. How many tables are there? What are the columns in each table? What are the data types of each column? What are the primary keys? What are the foreign keys? What are the relationships between the tables?
Data Wrangling#
Handling Nested Data#
Let’s take a look at the data. We will start by looking at the movies
table. We can use the %%sql
magic to run SQL queries in our notebook. We will use the SELECT
statement to select all columns from the movies
table. We will then use the LIMIT
statement to limit the number of rows returned to 5.
%%sql
SELECT *
FROM movies
LIMIT 2;
genre_ids | id | original_language | overview | popularity | release_date | title | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|
[28, 878, 27] | 615656 | en | An exploratory dive into the deepest depths of the ocean of a daring research team spirals into chaos when a malevolent mining operation threatens their mission and forces them into a high-stakes battle for survival. | 4546.63 | 2023-08-02 00:00:00 | Meg 2: The Trench | 6.9 | 894 |
[16, 35, 10751, 14, 10749] | 976573 | en | In a city where fire, water, land and air residents live together, a fiery young woman and a go-with-the-flow guy will discover something elemental: how much they have in common. | 2723.167 | 2023-06-14 00:00:00 | Elemental | 7.8 | 1526 |
We see that the genre_ids
column contains a lists of integers.
%%sql
SELECT *
FROM genres
LIMIT 2;
id | name |
---|---|
28 | Action |
12 | Adventure |
We see the genre ids in the movies
table correspond to the id
column in the genres
table. Let’s do some data wrangling:
%%sql
WITH ExpandedGenres AS (
SELECT
m.id AS movie_id,
mg.movie_genre_id,
g.name AS genre_name
FROM
(SELECT UNNEST(movies.genre_ids) as movie_genre_id, movies.id FROM movies) AS mg
JOIN
movies m ON mg.id = m.id
JOIN
genres g ON mg.movie_genre_id = g.id
)
SELECT
movie_id,
STRING_AGG(genre_name, ', ') AS genre_names
FROM
ExpandedGenres
GROUP BY
movie_id;
movie_id | genre_names |
---|---|
615656 | Action, Science Fiction, Horror |
976573 | Animation, Comedy, Family, Fantasy, Romance |
724209 | Thriller, Action |
569094 | Animation, Action, Adventure |
346698 | Comedy, Adventure, Fantasy |
667538 | Action, Adventure, Science Fiction |
298618 | Action, Adventure, Science Fiction |
758769 | Horror |
709631 | Horror |
1121575 | Animation, Science Fiction, Action, Adventure |
The query is transforming a structure where movies have a list of genre IDs into a more readable format where you get a single row for each movie and a comma-separated string of all its genres by name.
In Depth Explanation:
Common Table Expression (CTE) -
ExpandedGenres
:
This CTE is aiming to “expand” or “flatten” movies based on their genres. It looks like each movie in the movies table has an array (or similar list-type structure) of genre IDs in
genre_ids
.SELECT UNNEST(movies.genre_ids) as movie_genre_id, movies.id FROM movies
: This line is unnesting (or exploding) thegenre_ids
list for each movie. This means if a movie has multiple genre IDs in itsgenre_ids
, each genre ID will become a separate row, paired with the movie’s ID.The resulting table of
movie_id
andmovie_genre_id
is then joined with the original movies table (to fetch the movie’s full details, although only the id is used in this CTE) and the genres table to fetch the genre’s name corresponding to eachmovie_genre_id
.The result of the CTE will be a table with movie_id,
movie_genre_id
, and the genre’s name (genre_name
) for every movie. Note that if a movie has multiple genres, it will appear in multiple rows, one for each genre.
Main Query:
The main query then operates on the
ExpandedGenres
CTE.It groups the rows by
movie_id
(i.e., each movie will only appear once in the final output).For each
movie_id
, it aggregates the genre names usingSTRING_AGG
. TheSTRING_AGG
function is concatenating the genre names together with a comma and space (’, ‘) in between them. So, for each movie, you’ll get a single string that lists all its genres.The result will be a table where each row contains a movie’s ID and a concatenated string of all its genres.
Creating a table with movie and genres#
We can now create a new table in the DuckDB instance that contains detailed information about movies, including a concatenated string of their genre names. We will call it movie_genre_data
.
%%sql --no-execute
CREATE TABLE IF NOT EXISTS movie_genre_data AS
WITH ExpandedGenres AS (
SELECT
m.id AS movie_id,
mg.movie_genre_id,
g.name AS genre_name
FROM
(SELECT UNNEST(movies.genre_ids) as movie_genre_id, movies.id FROM movies) AS mg
JOIN
movies m ON mg.id = m.id
JOIN
genres g ON mg.movie_genre_id = g.id
),
genre_names AS (
SELECT
movie_id,
STRING_AGG(genre_name, ', ') AS genre_names
FROM
ExpandedGenres
GROUP BY
movie_id
)
SELECT gn.genre_names, m.id, m.original_language,
m.overview, m.popularity, m.release_date,
m.title, m.vote_average, m.vote_count
FROM genre_names gn
JOIN movies m
ON gn.movie_id = m.id
WHERE m.vote_count != 0
The query is creating a new table where each row represents a movie that has received at least one vote. For each movie, the table will have the movie’s details and a concatenated string of all its genres by name.
In Depth Explanation:
Table Creation:
`CREATE TABLE IF NOT EXISTS movie_genre_data AS``: This creates a new table named movie_genre_data if it doesn’t already exist.
Common Table Expression (CTE) -
ExpandedGenres
:
As in the previous query, the
ExpandedGenres
CTE is responsible for “expanding” or “flattening” movies based on their genres. Each movie in the movies table has an array (or similar list-type structure) of genre IDs ingenre_ids
.The result of the CTE is a table with columns:
movie_id
,movie_genre_id
, andgenre_name
.
CTE -
genre_names
:
This CTE operates on the
ExpandedGenres
CTE.It groups the rows by movie_id (i.e., each movie will only appear once in the output of this CTE).
For each
movie_id
, it aggregates the genre names usingSTRING_AGG
. This function concatenates the genre names together, separated by a comma and space.The result will be a table where each row contains a movie’s ID (
movie_id
) and a concatenated string of all its genres (genre_names
).
Main Query:
The main query operates on the
genre_names
CTE and the original movies table.It joins the
genre_names
CTE with the movies table on the movie_id field.For each movie, it selects: Concatenated genre names (
genre_names
fromgenre_names
CTE). The movie’s ID, original language, overview, popularity, release date, title, average vote, and vote count.The
WHERE m.vote_count != 0
condition ensures that only movies with at least one vote are included in the final table.
%%sql
SELECT *
FROM movie_genre_data
LIMIT 2;
genre_names | id | original_language | overview | popularity | release_date | title | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|
Animation, Comedy, Family, Fantasy, Romance | 976573 | en | In a city where fire, water, land and air residents live together, a fiery young woman and a go-with-the-flow guy will discover something elemental: how much they have in common. | 3413.036 | 2023-06-14 00:00:00 | Elemental | 7.8 | 1318 |
Thriller, Action | 724209 | en | An intelligence operative for a shadowy global peacekeeping agency races to stop a hacker from stealing its most valuable — and dangerous — weapon. | 2813.299 | 2023-08-09 00:00:00 | Heart of Stone | 6.9 | 700 |
⚡⚡ Challenge ⚡⚡#
Perform data visualization and analysis on the movie_genre_data
table. Use the %sqlplot
magic to plot the data. What are the most popular genres? What are the most popular movies? What are the most popular movies by genre? What are the most popular movies by year? What are the most popular movies by decade? What are the most popular movies by genre and decade?
A sample notebook can be found here: eda.ipynb
Upgrading our pipeline#
Now that we have our EDA and data wrangling in a Jupyter notebook, we can upgrade our pipeline to incorporate this step. Since Ploomber supports Jupyter notebooks, we can simply add the notebook to our pipeline. We will add the notebook to the etl
folder and call it eda.ipynb
. We will then add the following to our pipeline.yaml
file:
tasks:
- source: movie_rec_system/etl/extract.py
product:
nb: movie_rec_system/products/extract-pipeline.ipynb
data: movies_data.duckdb
- source: movie_rec_system/etl/eda.ipynb
static_analysis: disable
product:
nb: movie_rec_system/products/eda-pipeline.ipynb
Important
Since the pipeline is being executed from the mini-projects/movie-rec-system
folder, we need to update the paths in the eda.ipynb
file to reflect this.
%sql duckdb:///movies_data.duckdb
Execute the pipeline again:
cd mini-projects/movie-rec-system
ploomber build
This will execute the eda.ipynb
notebook and save the output to eda-pipeline.ipynb
in the products
folder. We can now use this notebook to create a recommender system. We will do this in the next tutorial.
Loading pipeline...
Notebook movie_rec_system/etl/extract.py is missing the parameters cell, adding it at the top of the file...
Notebook movie_rec_system/etl/eda.ipynb is missing the parameters cell, adding it at the top of the file...
Executing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.15s/cell]
Executing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:10<00:00, 1.75cell/s]
Building task 'eda': 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.22s/it]
name Ran? Elapsed (s) Percentage
------- ------ ------------- ------------
extract True 11.5321 51.4146
eda True 10.8975 48.5854
⚡⚡ Challenge ⚡⚡#
Explore the SQL Pipelines example in the Ploomber documentation. In particular, take a look at this template that uses .sql
files. How would you incorporate this into your pipeline? What are the advantages and disadvantages of using .sql
files vs. Jupyter notebooks?
Conclusion#
In this tutorial, we used a Jupyter notebook to perform data wrangling and exploratory data analysis. We then incorporated this notebook into our pipeline. In the next tutorial, we will set up a recommendation system with the clean data.