Aggregate functions in SQL#

In the last section, we learned the basics of SQL. We now build upon those basics to learn a common use case of SQL: aggregation functions.

Aggregation functions are useful for summarizing your data and for finding meaningful insights. The most common of these functions are COUNT(), AVG(), SUM(), MIN(), and MAX(). We will go over each of these functions in detail with our bank marketing dataset.

We will also introduce the important clause GROUP BY. This clause can only be used if, and only if, aggregation functions are used. However, note that aggregation functions can be used without GROUP BY. GROUP BY groups unique values in columns together and runs an aggregation function on each unique group. Examples of using GROUP BY will be provided after introducing aggregate functions.

Let’s first run the installations and setup before running any queries, just like the last lesson making-your-first-query.

Install - execute this once.#

Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql --quiet

Load the data#

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

The data is downloaded from https://archive-beta.ics.uci.edu/dataset/222/bank+marketing via the url https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip.

We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv.

Dataset citation:

Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

Helper script#

We developed a banking.py script to help you extract the data from the URL and load it into a DuckDB database. This script is located here

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

After running this code, you should have bank_cleaned.csv in the current directory.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'
Settings changed:
Config value
displaycon False
feedback True
autopandas False
named_parameters True

Let’s now return to our initial dataset of bank marketing records.

Queries#

Creating Table#

Let’s start off with loading our bank_cleaned.csv file from our local directory to our newly created DuckDB database. Here we CREATE OR REPLACE TABLE in DuckDB called ‘bank’ FROM our bank_cleaned.csv file. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count

Count and Distinct Count#

COUNT() allows users to query the count of records in a given query. A common argument to this function is *. COUNT(*) tells SQL to return the total number of rows from our query, including NULL values. This can be combined with queries that filter for certain conditions.

A common function to be used with COUNT() is DISTINCT(). For example, selecting COUNT(DISTINCT(job)) would return the count of distinct values under the “job” column.

%%sql 
SELECT COUNT(DISTINCT(job))
FROM bank 
count(DISTINCT job)
12

The output of the above query can be an eyesore. Instead of getting SQL’s default column name when running these aggregation function, users can utilize the AS clause. The text following AS changes the output of the SELECT clause to that text. Our next example with demonstrate this.

Here’s an example with COUNT() that counts the number of rows of our query WHERE we filter for “balance” greather than or equal to 500 AND where “marital” equals married. We also use the AS clause to change the column of our query to “Count”.

%%sql 
SELECT COUNT(*) AS Count
FROM bank 
WHERE balance > 500 AND marital = 'married'
Count
1331

COUNT() can also have the arguments of a single row, such as COUNT(job). COUNT(job) would count the number of rows of just the “job” column. If “job” were to have any NULL values in its query, those NULL values would be subtracted from the total row counts of the query.

Average#

AVG() allows users to take the average of columns. This clause can also be used with filtering. An example of finding the average balance of unemployed observations is as follows:

%%sql 
SELECT AVG(balance) AS average_unemployed_balance
FROM bank 
WHERE job = 'unemployed'
average_unemployed_balance
1089.421875

The value of our AVG(balance) function can be rounded to better represent a monetary balance with ROUND(). ROUND() accepts two arguments. The first is the actual value to round and the second is the number of decimal places to round to. We apply ROUND() to the same query below.

Note that aliasing query column outputs with the AS clause should not have any spaces. By convention, this makes it easier for SQL to later reference these aliases and avoid ambiguity for the parser.

%%sql 
SELECT ROUND(AVG(balance),2) AS average_unemployed_balance
FROM bank 
WHERE job = 'unemployed'
average_unemployed_balance
1089.42

Sum#

SUM() aggregates the sum of columns. Below, we find the SUM() of the balance column WHERE “job” equals ‘management’ OR WHERE “job” equals ‘services’.

%%sql 
SELECT SUM(balance) AS sum_balance_of_managers
FROM bank 
WHERE job = 'management' OR job = 'services'
sum_balance_of_managers
2172504

Minimum and Maximum#

The MIN() and MAX() functions do exactly what you would think. Below we find the MIN() and MAX() of “balance”.

%%sql 
SELECT MIN(balance) AS minimum_balance
FROM bank 
minimum_balance
-3313
%%sql 
SELECT MAX(balance) AS maximum_balance
FROM bank 
maximum_balance
71188

Accumulating everything we have learned so far, can you think of another way of finding the minimum and maximum of balance without MIN() and MAX()? Hint: try recreating these queries with ORDER BY.

Grouping#

Grouping is an extremely useful clause. It allows users to examine the results of aggregate functions within each unique group. Note that grouping with GROUP BY comes after filtering with WHERE. Below, we find the COUNT() of all rows GROUP BY “housing”. Since housing only has the unique values of ‘yes’ and ‘no’, there will be only two groups.

%%sql 
SELECT COUNT(*) AS Count
FROM bank 
GROUP BY housing
Count
1962
2559

The problem with the above query is we can not determine what value belongs to which group. To fix this, SELECT both the “housing” variable and the COUNT(*) function. Let’s also change ‘housing’ to ‘Housing’ in our final output just for demonstration purposes.

%%sql 
SELECT housing as Housing, COUNT(*) AS Count
FROM bank 
GROUP BY housing
Housing Count
no 1962
yes 2559

We can now clearly see which count belongs to which group under “housing”.

GROUP BY also allows for grouping with several variables. For instance, let’s first GROUP BY “housing” and then GROUP BY “marital”. Then, find the COUNT() of these groups.

%%sql 
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank 
GROUP BY housing, marital
Housing Marital Count
no married 1172
yes married 1625
yes single 636
no single 560
yes divorced 298
no divorced 230

There are six total groups from our query. “housing” has two groups and “marital” has three groups. Since “housing” has only two groups, ‘yes’ and ‘no’, let’s think about these groups as two seperate blocks. The ‘yes’ and ‘no’ blocks will then each be GROUP BY the groups under the “marital” column, which are ‘married’, ‘single’, and ‘divorced’. Thus, since each ‘yes’ and ‘no’ group has three groups each, there are a total of six groups.

diagram

Having#

The HAVING clause filters for a query post-aggregation. Unlike the WHERE query, HAVING is exclusively used when a GROUP BY clause is present in the SQL query. We demonstrate its functionality below.

%%sql 
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank 
GROUP BY housing, marital
HAVING Count < 700
Housing Marital Count
yes single 636
no single 560
yes divorced 298
no divorced 230

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Find the average “duration” of phone calls. Name the column of your output as “Average Phone Call Length”. Round to 0 decimal places.

Answers

You can use the ROUND clause to specify where 0 decimal places.

%%sql 
SELECT ROUND(AVG(duration),0) AS average_phone_call_length
FROM bank 
average_phone_call_length
264.0

Question 2 (Medium):#

Show each education group’s count of whether they have housing or not. Also, make it so that this query only include married individuals.

Answers

The WHERE clause must appear first before the GROUP BY clause. Then, grouping by first education and then housing gives the correct COUNT() of each group. Notice that ‘primary’ under the “education” column does not have a “housing” group of ‘yes’ If no records exist where an observation with an “education” of ‘primary’ has a “housing” value of ‘yes’, then it is omitted from GROUP BY queries. This is saying that this particular group does not exist.

%%sql
SELECT education, housing, COUNT(*)
FROM bank  
WHERE marital = 'married'
GROUP BY education, housing
education housing count_star()
primary no 230
secondary yes 920
tertiary yes 360
primary yes 296
tertiary no 367
secondary no 507
unknown no 68
unknown yes 49

Question 3 (Medium):#

Find the average, minimum, and maximum of balance and the count of records where there has not been a default. Group this query by “job” and “married”. Round the averages by 2 decimal places.

Hint COUNT() is a aggregating function in SQL (more on aggregation later!). Try experimenting with COUNT() in your SELECT clause to see if you can find the correct count.

Answers

You may have had some problems with the WHERE clause. If you had “default” as just default, you encountered an error. Reading this error would inform that DEFAULT is an existing SQL clause and having it in our WHERE clause is not acceptable. In situations like this, you can double quotes “” surrounding the column name. This helps SQL distinguish clauses and literal strings of text.

%%sql
SELECT job, marital, AVG(balance), COUNT(*), MIN(balance), MAX(balance)
FROM bank 
WHERE "default" = 'no'
GROUP BY job, marital
job marital avg(balance) count_star() min(balance) max(balance)
unemployed married 1131.5866666666666 75 -872 8131
services married 1252.271551724138 232 -1202 26394
management single 1771.4006968641115 287 -1313 19358
management married 1878.038112522686 551 -1746 27359
blue-collar married 1079.1854014598541 685 -1400 16353
self-employed married 1410.0564516129032 124 -370 16430
technician married 1463.7407407407406 405 -1310 27069
entrepreneur married 1665.84496124031 129 -139 42045
admin. married 1365.4772727272727 264 -967 22171
student single 1689.162162162162 74 -148 11555
Truncated to displaylimit of 10.

Wrapping Up#

In this section, we introduced aggregate functions and the GROUP BY utility. To summarize:

  • COUNT() : Returns the number of rows in our query. This function can count the number of rows of a specific column or of the entire query by passing “*” into the function.

  • AVG() : Returns the average of a numeric column in our query. ROUND() is a useful function often applied with this aggregate function.

  • MIN() and MAX() : Returns the minimum and maximum of a numeric column in our query.

  • GROUP BY : A powerful clause that groups the data based on a given column. This clause must be used with aggregation functions.

We have also learned other details from this section, such as how WHERE must appear before any GROUP BY clause and how to tell SQL to distinguish a literal string when that string is a clause.

Next up, we will use everything we have thus far learned in joining tables.

Delete table

%%sql
DROP TABLE bank;
Success