Aggregate functions in SQL

Aggregate functions in SQL#

In the last section, we learned the basics of SQL. We now build upon those basics to learn a common use case of SQL: aggregation functions.

Aggregation functions are useful for summarizing your data and for finding meaningful insights. The most common of these functions are COUNT(), AVG(), SUM(), MIN(), and MAX(). We will go over each of these functions in detail with our bank marketing dataset.

We will also introduce the important clause GROUP BY. This clause can only be used if, and only if, aggregation functions are used. However, note that aggregation functions can be used without GROUP BY. GROUP BY groups unique values in columns together and runs an aggregation function on each unique group. Examples of using GROUP BY will be provided after introducing aggregate functions.

Let’s first run the installations and setup before running any queries, just like the last lesson making-your-first-query.

Install - execute this once.#

Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql --quiet

Load the data#

Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.

The data is downloaded from https://archive-beta.ics.uci.edu/dataset/222/bank+marketing via the url https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip.

We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv.

Dataset citation:

Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

Helper script#

We developed a banking.py script to help you extract the data from the URL and load it into a DuckDB database. This script is located here

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

After running this code, you should have bank_cleaned.csv in the current directory.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Let’s now return to our initial dataset of bank marketing records.

Queries#

Creating Table#

Let’s start off with loading our bank_cleaned.csv file from our local directory to our newly created DuckDB database. Here we CREATE OR REPLACE TABLE in DuckDB called ‘bank’ FROM our bank_cleaned.csv file. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')

Count

Count and Distinct Count#

COUNT() allows users to query the count of records in a given query. A common argument to this function is *. COUNT(*) tells SQL to return the total number of rows from our query, including NULL values. This can be combined with queries that filter for certain conditions.

A common function to be used with COUNT() is DISTINCT(). For example, selecting COUNT(DISTINCT(job)) would return the count of distinct values under the “job” column.

%%sql 
SELECT COUNT(DISTINCT(job))
FROM bank 

count(DISTINCT job)
12

The output of the above query can be an eyesore. Instead of getting SQL’s default column name when running these aggregation function, users can utilize the AS clause. The text following AS changes the output of the SELECT clause to that text. Our next example with demonstrate this.

Here’s an example with COUNT() that counts the number of rows of our query WHERE we filter for “balance” greather than or equal to 500 AND where “marital” equals married. We also use the AS clause to change the column of our query to “Count”.

%%sql 
SELECT COUNT(*) AS Count
FROM bank 
WHERE balance > 500 AND marital = 'married'

Count
1331

COUNT() can also have the arguments of a single row, such as COUNT(job). COUNT(job) would count the number of rows of just the “job” column. If “job” were to have any NULL values in its query, those NULL values would be subtracted from the total row counts of the query.

Average#

AVG() allows users to take the average of columns. This clause can also be used with filtering. An example of finding the average balance of unemployed observations is as follows:

%%sql 
SELECT AVG(balance) AS average_unemployed_balance
FROM bank 
WHERE job = 'unemployed'

average_unemployed_balance
1089.421875

The value of our AVG(balance) function can be rounded to better represent a monetary balance with ROUND(). ROUND() accepts two arguments. The first is the actual value to round and the second is the number of decimal places to round to. We apply ROUND() to the same query below.

Note that aliasing query column outputs with the AS clause should not have any spaces. By convention, this makes it easier for SQL to later reference these aliases and avoid ambiguity for the parser.

%%sql 
SELECT ROUND(AVG(balance),2) AS average_unemployed_balance
FROM bank 
WHERE job = 'unemployed'

average_unemployed_balance
1089.42

Sum#

SUM() aggregates the sum of columns. Below, we find the SUM() of the balance column WHERE “job” equals ‘management’ OR WHERE “job” equals ‘services’.

%%sql 
SELECT SUM(balance) AS sum_balance_of_managers
FROM bank 
WHERE job = 'management' OR job = 'services'

sum_balance_of_managers
2172504

Minimum and Maximum#

The MIN() and MAX() functions do exactly what you would think. Below we find the MIN() and MAX() of “balance”.

%%sql 
SELECT MIN(balance) AS minimum_balance
FROM bank 

minimum_balance
-3313

%%sql 
SELECT MAX(balance) AS maximum_balance
FROM bank 

maximum_balance
71188

Accumulating everything we have learned so far, can you think of another way of finding the minimum and maximum of balance without MIN() and MAX()? Hint: try recreating these queries with ORDER BY.

Grouping#

Grouping is an extremely useful clause. It allows users to examine the results of aggregate functions within each unique group. Note that grouping with GROUP BY comes after filtering with WHERE. Below, we find the COUNT() of all rows GROUP BY “housing”. Since housing only has the unique values of ‘yes’ and ‘no’, there will be only two groups.

%%sql 
SELECT COUNT(*) AS Count
FROM bank 
GROUP BY housing

Count
1962
2559

The problem with the above query is we can not determine what value belongs to which group. To fix this, SELECT both the “housing” variable and the COUNT(*) function. Let’s also change ‘housing’ to ‘Housing’ in our final output just for demonstration purposes.

%%sql 
SELECT housing as Housing, COUNT(*) AS Count
FROM bank 
GROUP BY housing

Housing	Count
no	1962
yes	2559

We can now clearly see which count belongs to which group under “housing”.

GROUP BY also allows for grouping with several variables. For instance, let’s first GROUP BY “housing” and then GROUP BY “marital”. Then, find the COUNT() of these groups.

%%sql 
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank 
GROUP BY housing, marital

Housing	Marital	Count
no	married	1172
yes	married	1625
yes	single	636
no	single	560
yes	divorced	298
no	divorced	230

There are six total groups from our query. “housing” has two groups and “marital” has three groups. Since “housing” has only two groups, ‘yes’ and ‘no’, let’s think about these groups as two seperate blocks. The ‘yes’ and ‘no’ blocks will then each be GROUP BY the groups under the “marital” column, which are ‘married’, ‘single’, and ‘divorced’. Thus, since each ‘yes’ and ‘no’ group has three groups each, there are a total of six groups.

diagram

Having#

The HAVING clause filters for a query post-aggregation. Unlike the WHERE query, HAVING is exclusively used when a GROUP BY clause is present in the SQL query. We demonstrate its functionality below.

%%sql 
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank 
GROUP BY housing, marital
HAVING Count < 700

Housing	Marital	Count
yes	single	636
no	single	560
yes	divorced	298
no	divorced	230

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Find the average “duration” of phone calls. Name the column of your output as “Average Phone Call Length”. Round to 0 decimal places.

Answers

You can use the ROUND clause to specify where 0 decimal places.

%%sql 
SELECT ROUND(AVG(duration),0) AS average_phone_call_length
FROM bank 

average_phone_call_length
264.0

Question 2 (Medium):#

Show each education group’s count of whether they have housing or not. Also, make it so that this query only include married individuals.

Answers

The WHERE clause must appear first before the GROUP BY clause. Then, grouping by first education and then housing gives the correct COUNT() of each group. Notice that ‘primary’ under the “education” column does not have a “housing” group of ‘yes’ If no records exist where an observation with an “education” of ‘primary’ has a “housing” value of ‘yes’, then it is omitted from GROUP BY queries. This is saying that this particular group does not exist.

%%sql
SELECT education, housing, COUNT(*)
FROM bank  
WHERE marital = 'married'
GROUP BY education, housing

education	housing	count_star()
primary	no	230
secondary	yes	920
tertiary	yes	360
primary	yes	296
tertiary	no	367
secondary	no	507
unknown	no	68
unknown	yes	49

Question 3 (Medium):#

Find the average, minimum, and maximum of balance and the count of records where there has not been a default. Group this query by “job” and “married”. Round the averages by 2 decimal places.

Hint COUNT() is a aggregating function in SQL (more on aggregation later!). Try experimenting with COUNT() in your SELECT clause to see if you can find the correct count.

Answers

You may have had some problems with the WHERE clause. If you had “default” as just default, you encountered an error. Reading this error would inform that DEFAULT is an existing SQL clause and having it in our WHERE clause is not acceptable. In situations like this, you can double quotes “” surrounding the column name. This helps SQL distinguish clauses and literal strings of text.

%%sql
SELECT job, marital, AVG(balance), COUNT(*), MIN(balance), MAX(balance)
FROM bank 
WHERE "default" = 'no'
GROUP BY job, marital

job	marital	avg(balance)	count_star()	min(balance)	max(balance)
unemployed	married	1131.5866666666666	75	-872	8131
services	married	1252.271551724138	232	-1202	26394
management	single	1771.4006968641115	287	-1313	19358
management	married	1878.038112522686	551	-1746	27359
blue-collar	married	1079.1854014598541	685	-1400	16353
self-employed	married	1410.0564516129032	124	-370	16430
technician	married	1463.7407407407406	405	-1310	27069
entrepreneur	married	1665.84496124031	129	-139	42045
admin.	married	1365.4772727272727	264	-967	22171
student	single	1689.162162162162	74	-148	11555

Truncated to displaylimit of 10.

Wrapping Up#

In this section, we introduced aggregate functions and the GROUP BY utility. To summarize:

COUNT() : Returns the number of rows in our query. This function can count the number of rows of a specific column or of the entire query by passing “*” into the function.
AVG() : Returns the average of a numeric column in our query. ROUND() is a useful function often applied with this aggregate function.
MIN() and MAX() : Returns the minimum and maximum of a numeric column in our query.
GROUP BY : A powerful clause that groups the data based on a given column. This clause must be used with aggregation functions.

We have also learned other details from this section, such as how WHERE must appear before any GROUP BY clause and how to tell SQL to distinguish a literal string when that string is a clause.

Next up, we will use everything we have thus far learned in joining tables.

Delete table

%%sql
DROP TABLE bank;

Success

Aggregate functions in SQL

Contents

Aggregate functions in SQL#

Install - execute this once.#

Load the data#

Helper script#

Load Engine#

Queries#

Creating Table#

Count and Distinct Count#

Average#

Sum#

Minimum and Maximum#

Grouping#

Having#

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Question 2 (Medium):#

Question 3 (Medium):#

Wrapping Up#