Setting up a recommendation system#

Recall from the previous section that we have already extracted movie data from an API and stored it in a DuckDB database. We will be using this database to build our recommendation system. For a refresher on recommendation systems, visit this guide.

Important

This guide assumes you have executed the Ploomber pipeline introduced in the previous section. If you haven’t, please do so before proceeding. Visit this guide and this guide for a refresher.

A Closer Look at Content-Based Recommendation Systems#

For the remainder of this article, we’ll focus on one specific type: content-based recommendation systems. Instead of relying on user interaction history, these systems suggest items based on their content. Think of it like this: if you’ve enjoyed a book because of its genre and author, a content-based recommender might suggest another book with a similar genre and author.

How do they Work?#

Imagine you’re building a movie recommendation system. The system would look at various movie attributes like its genre, director, lead actors, and even its synopsis. It then compares these attributes with movies you’ve previously shown interest in. In essence, it’s matching movie attributes with your preferences.

Content-based recommenders thrive on the principle of similarity. If you liked a particular item, you’re likely to enjoy another item that’s similar to it. This method shines especially when each item has clear, descriptive attributes. For instance, movies have metadata like genre, director, and cast, which can be compared to find resemblances.

Diving Deeper: Using TF-IDF and Cosine Similarity#

What’s TF-IDF?#

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a method to quantify the importance of words in a document relative to a collection of documents. Here’s a simple way to understand it:

  • Term Frequency (TF): How often a word appears in a document.

  • Inverse Document Frequency (IDF): Reduces the weight of words that appear frequently across many documents (like “the” or “and”).

For instance, if you’re searching for “latest European soccer games” on a search engine, the word “the” might appear more often than “soccer games”. But for our search, “soccer games” is clearly more important. TF-IDF helps in weighing words in such a manner.

What’s Cosine Similarity?#

Once we’ve transformed our content into numerical form using TF-IDF, we need a way to determine similarity between items. That’s where cosine similarity comes in. It measures the cosine of the angle between two vectors. If the vectors are identical, the cosine is 1, and if they’re completely different, the cosine is 0.

In the context of our movie recommender, after converting movie descriptions into vectors using TF-IDF, we use cosine similarity to find movies that have similar descriptions.

By combining TF-IDF with cosine similarity, we can find movies that are closely related based on their descriptions. This method can be further enhanced by adding more attributes like genres, cast, and crew to the mix.

Setting up a Content-Based Movie Recommender#

Let’s take a closer look at the code that powers our content-based recommendation system.

Let’s start by setting up our environment and importing necessary libraries:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

1. The Core Recommender: content_movie_recommender#

This function is the heart of our recommendation system. Given an input movie, it uses a precomputed similarity matrix to find movies that are most similar to the input.

Parameters:

  • input_movie: The movie title we want recommendations for.

  • similarity_database: A precomputed matrix that stores the similarity scores between movies.

  • movie_database_list: A list of all movie titles in our database.

  • top_n: The number of recommendations we want.

The function first fetches the similarity scores of the input movie with all other movies. It then sorts these scores in descending order and fetches the top n movies.

def content_movie_recommender(
    input_movie: str,
    similarity_database: pd.DataFrame,
    movie_database_list: list,
    top_n=10,
) -> list:
    """
    Function that uses a similarity matrix to find similar movies

    Parameters
    ----------
    input_movie : str
        reference movie to find similarities
    similarity_database : pandas.DataFrame
        similarity matrix of movies
    movie_database_list : numpy.ndarray
        movies in our similarity matrix
    top_n : int
        number of similar movies to output
    """
    try:
        # get movie similarity records
        movie_sim = similarity_database[
            similarity_database.index == input_movie
        ].values[0]

        # get movies sorted by similarity
        sorted_movie_ids = np.argsort(movie_sim)[::-1]
        recommended_movies = movie_database_list[
            sorted_movie_ids[1 : top_n + 1]  # noqa E203
        ]  # noqa E501
        return list(recommended_movies)
    except IndexError:
        return []

2. Evaluating Recommendations: RMSE Functions#

To evaluate how good our recommendations are, we use the Root Mean Square Error (RMSE). RMSE is a standard metric used to measure the differences between predicted and observed values. In our context, we’re using it to measure how similar our recommended movies are to the input movie in terms of popularity, vote average, and vote count.

get_popularity_rmse: This function computes the RMSE between the popularity of the input movie and the popularity of the recommended movies.

This function converts the movie titles in the dataframe (df) and the sample_movie string to lowercase. This ensures that when we compare movie titles, the comparison is case-insensitive, preventing potential mismatches due to different letter cases. It then filters the dataframe to retrieve rows where the movie title matches the sample_movie. The result is stored in filtered_df. We then check if the chosen movie is in the dataframe. If the sample_movie exists in the dataframe, the code fetches its popularity value. It also retrieves the popularity values of all movies in the recommendations list. If it doesn’t exist, a NaN value is returned instead.

The RMSE is a measure of the differences between values. Here, it’s used to measure the difference between the popularity of the sample_movie and the popularity of the recommended movies. The differences are squared, averaged, and then the square root is taken to compute the RMSE.

def get_popularity_rmse(
    df: pd.DataFrame, sample_movie: str, recommendations: list
) -> float:
    """
    Compute RMSE for popularity, vote average, and vote count
    for the provided movie and recommendations.
    
    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame which must contain
        a "title" column.
    
    sample_movie : str
        The title of the movie for which
        recommendations are to be generated.
        
    recommendations : list
        A list of recommended movies.
        
    Returns
    -------
    popularity_rmse : float
        The RMSE for popularity.
        """
    # Convert titles in dataframe and sample_movie to lowercase
    df["title"] = df["title"].str.lower()
    sample_movie = sample_movie.lower()

    filtered_df = df[df["title"] == sample_movie]

    if not filtered_df.empty:
        sample_movie_popularity = filtered_df.popularity.iloc[0]
        recommendations_popularity = df[
            df["title"].isin(recommendations)
        ].popularity.values

        squared_diffs = (
            sample_movie_popularity - recommendations_popularity
        ) ** 2  # noqa E501
        rmse = np.sqrt(squared_diffs.mean())

        return round(float(rmse), 3)
    else:
        return float("nan")

get_vote_avg_rmse: This function calculates the Root Mean Square Error (RMSE) between the vote average of a given movie (sample_movie) and the vote averages of a list of recommended movies.

This function starts by retrieving the vote average for the sample_movie from the dataframe df, it then fetches the vote averages for all the movies in the recommendations list. It then computes the RMSE. The RMSE is calculated by:

  1. Finding the squared differences between the vote average of the sample_movie and the vote averages of the recommended movies.

  2. Averaging these squared differences.

  3. Taking the square root of the average to get the RMSE.

The function then returns the RMSE value rounded to three decimal places.

def get_vote_avg_rmse(
    df: pd.DataFrame, sample_movie: str, recommendations: list
) -> float:
    """
    Compute RMSE for popularity, vote average, and vote count
    for the provided movie and recommendations.
    
    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame which must contain
        a "title" column.
        
    sample_movie : str
        The title of the movie for which
        recommendations are to be generated.
        
    recommendations : list
        A list of recommended movies.
        
    Returns
    -------
    popularity_rmse : float
        The RMSE for popularity."""
    sample_movie_vote_average = df[
        df["title"] == sample_movie
    ].vote_average.iloc[  # noqa E501
        0
    ]
    recommendations_vote_average = df[
        df["title"].isin(recommendations)
    ].vote_average.values

    squared_diffs = (
        sample_movie_vote_average - recommendations_vote_average
    ) ** 2  # noqa E501
    rmse = np.sqrt(squared_diffs.mean())

    return round(float(rmse), 3)

get_vote_count_rmse: This function calculates the Root Mean Square Error (RMSE) between the vote count of a given movie (sample_movie) and the vote counts of a list of recommended movies.

It first retrieves the vote count (which is a measure of popularity) for the sample_movie from the dataframe df, it then fetches the vote counts (popularity measures) for all the movies in the recommendations list.

The RMSE is calculated by:

  1. Finding the squared differences between the vote count of the sample_movie and the vote counts of the recommended movies.

  2. Averaging these squared differences.

  3. Taking the square root of the average to get the RMSE.

def get_vote_count_rmse(
    df: pd.DataFrame, sample_movie: str, recommendations: list
) -> float:
    """
    
    Compute RMSE for popularity, vote average, and vote count
    for the provided movie and recommendations.
    
    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame which must contain
        a "title" column.
        
    sample_movie : str
        The title of the movie for which
        recommendations are to be generated.
        
    recommendations : list
        A list of recommended movies.
        
    Returns
    -------
        popularity_rmse : float
        The RMSE for popularity.
    """
    sample_movie_popularity = df[df["title"] == sample_movie].vote_count.iloc[
        0
    ]  # noqa E501
    recommendations_popularity = df[
        df["title"].isin(recommendations)
    ].vote_count.values  # noqa E501

    squared_diffs = (recommendations_popularity - sample_movie_popularity) ** 2
    rmse = np.sqrt(squared_diffs.mean())

    return round(float(rmse), 3)

Each of these functions works similarly:

  1. Fetch the attribute (popularity, vote average, or vote count) of the input movie.

  2. Fetch the same attribute for all recommended movies.

  3. Compute the squared differences between the input movie’s attribute and the recommended movies’ attributes.

  4. Calculate the mean of these squared differences.

  5. Return the square root of this mean as the RMSE.

In essence, these functions give us a numerical measure of how close our recommendations are to the input movie in terms of popularity, vote average, and vote count. The lower the RMSE, the closer the recommended movies are to the input movie in terms of these attributes.

By combining a robust recommendation function with evaluation metrics, we can ensure that our system not only provides relevant recommendations but also allows us to measure and improve its performance over time.

3. Extracting Movie Data#

In previous sections, we build an ETL pipeline to extract, transform, and load movie data from an API file into a DuckDB database. We’ll use the same database to fetch movie data for our recommender. By this point we have performed all data wrangling operations we need. If you need a refresher, please consult this guide

def get_data() -> pd.DataFrame:
    """
    Function that automatically connects
    to duckdb as a GET call upon launch
    of FastAPI
    """
    con = duckdb.connect("./movies_data.duckdb")
    query = "SELECT * FROM movie_genre_data"
    df = con.execute(query).fetchdf()
    con.close()
    return df

Let’s add a few more functions. First, we’ll need to ensure that df contains a combined column

def create_combined(df: pd.DataFrame, weight=2):
    df["combined"] = df["overview"] + " " + (df["genre_names"] + ", ") * weight
    return df

We can now fetch the data and begin processing it, the function below enforces all titles to be lowercase and adds the combined column:

def retrieve_and_transform_data() -> pd.DataFrame:
    """
    Retrieve data from duckdb and transform it
    into a format that can be used for generating
    movie recommendations.

    Returns
    -------
    pd.DataFrame
        The transformed DataFrame with an additional "combined" column.
    """
    df = get_data()
    df["title"] = df["title"].str.lower()
    df = create_combined(df)
    return df

Once the data is retrieved and transformed, we can compute the TF-IDF vectorization of the combined column:

def compute_tfidf_vectorization(df, stop_words="english"):
    """
    Compute TF-IDF vectorization of the "combined" column
    in the provided DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame which must contain
        a "combined" column.

    stop_words : str, optional
        The language of stop words to be
        used when vectorizing the "combined" column.
        Default is "english".

    Returns
    -------
    tfidf_matrix:    scipy.sparse.csr.csr_matrix
        The TF-IDF vectorization of the "combined" column."""
    tfidf = TfidfVectorizer(stop_words=stop_words)
    tfidf_matrix = tfidf.fit_transform(df["combined"])
    return tfidf_matrix

We can now use the metrics functions we defined previously and compute RMSE for popularity, vote average, and vote count for the input movie and the recommended movies:

def compute_metrics(df, movie, recommendations):
    """
    Compute RMSE for popularity, vote average, and vote count
    for the provided movie and recommendations.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame which must contain
        a "combined" column.

    movie : str
        The title of the movie for which
        recommendations are to be generated.

    recommendations : list
        A list of recommended movies.

    Returns
    -------
    popularity_rmse : float
        The RMSE for popularity.

    vote_avg_rmse : float
        The RMSE for vote average.

        ote_count_rmse : float
        The RMSE for vote count.
    """
    popularity_rmse = get_popularity_rmse(df, movie, recommendations)
    vote_avg_rmse = get_vote_avg_rmse(df, movie, recommendations)
    vote_count_rmse = get_vote_count_rmse(df, movie, recommendations)
    return popularity_rmse, vote_avg_rmse, vote_count_rmse

We are now ready to obtain recommendations. Our goal will be to provide as input a movie title, and as a result obtain a JSON object containing the recommendations.

The key steps in this function are:

  1. Retrieve and transform data

  2. Compute TF-IDF vectorization

  3. Compute cosine similarity

  4. Create Similarity DataFrame

  5. Generate recommendations

  6. Compute metrics

  7. Prepare results

  8. Convert the results to JSON

Note the structure of the results:

result = {
    "movie": movie,
    "recommendations": recommendations,
    "metrics": {
        "popularity": popularity_rmse,
        "vote_avg": vote_avg_rmse,
        "vote_count": vote_count_rmse,
    },
}

This format will be useful once we start developing an API to serve our recommendations.

def get_recommendation(movie: str, num_rec: int = 10, stop_words="english"):
    """
    Generate movie recommendations based on
    content similarity and computes associated metrics.

    This function retrieves movie data,
    calculates cosine similarity between movies using
    TF-IDF vectorization of their combined overview
    and genre, and returns a list of recommended
    movies along with certain metrics
    (popularity, vote average, and vote count RMSE).

    Parameters
    ----------
    movie : str
        The title of the movie for which
        recommendations are to be generated.

    num_rec : int, optional
        The number of movie recommendations
        to generate. Default is 10.

    stop_words : str, optional
        The language of stop words to be
        used when vectorizing the "combined" column.
        Default is "english".

    Returns
    -------
    str
        A JSON-formatted string containing
        the original movie, a list of recommendations,
        and associated metrics
        (popularity, vote average, and vote count RMSE).

    Examples
    --------
    >>> result = get_recommendation("Inception", num_rec=5)
    >>> print(json.loads(result))
    {
        "movie": "Inception",
        "recommendations": [...],
        "metrics": {
            "popularity": ...,
            "vote_avg": ...,
            "vote_count": ...
        }
    }

    """
    movie = movie.lower()
    df = retrieve_and_transform_data()

    tfidf_matrix = compute_tfidf_vectorization(df, stop_words)
    similarity = cosine_similarity(tfidf_matrix)

    similarity_df = pd.DataFrame(
        similarity, index=df.title.values, columns=df.title.values
    )
    movie_list = similarity_df.columns.values
    recommendations = content_movie_recommender(
        movie, similarity_df, movie_list, num_rec
    )

    if not recommendations:
        return None

    popularity_rmse, vote_avg_rmse, vote_count_rmse = compute_metrics(
        df, movie, recommendations
    )

    result = {
        "movie": movie,
        "recommendations": recommendations,
        "metrics": {
            "popularity": popularity_rmse,
            "vote_avg": vote_avg_rmse,
            "vote_count": vote_count_rmse,
        },
    }

    result_json = json.dumps(result)
    return result_json

Bringing all pieces together#

We can package the functions above as follows. We will create a new folder called app and a set of files called recommender.py and recommenderhelper.py

mini-projects
├──movie-rec-system
├──├── movies_data.duckdb
├──├── pipeline.yaml
├──├── pyproject.toml
├──├── README.md
├──├── app
├──│   └── recommender.py
├──│   └── recommenderhelper.py
├──├── etl
├──│   └── extract.py
├──│   └── eda.ipynb
├──├── products
├──     └── extract-pipeline.ipynb
├──└── tests
├──│       └── __init__.py
├──└── .env

Where:

  • recommender.py contains the get_recommendation function

  • recommenderhelper.py contains all other functions

To import the functions from recommenderhelper.py we need to add the following to recommender.py:

from .recommenderhelper import (
    content_movie_recommender,
    compute_metrics,
    retrieve_and_transform_data,
    compute_tfidf_vectorization,
)

You can find a full version of recommender.py here and a full version of recommenderhelper.py here

Summary#

In this section, we learned how to build a content-based movie recommendation system. We started by understanding the core concepts behind content-based recommendation systems. We then built a recommendation system using TF-IDF and cosine similarity. Finally, we packaged our recommendation system into a Python package. In the next section, we will learn how to build a FastAPI application that can serve our recommendations as an API.