Aggregate functions in SQL#
In the last section, we learned the basics of SQL. We now build upon those basics to learn a common use case of SQL: aggregation functions.
Aggregation functions are useful for summarizing your data and for finding meaningful insights. The most common of these functions are COUNT()
, AVG()
, SUM()
, MIN()
, and MAX()
. We will go over each of these functions in detail with our bank marketing dataset.
We will also introduce the important clause GROUP BY
. This clause can only be used if, and only if, aggregation functions are used. However, note that aggregation functions can be used without GROUP BY
. GROUP BY
groups unique values in columns together and runs an aggregation function on each unique group. Examples of using GROUP BY
will be provided after introducing aggregate functions.
Let’s first run the installations and setup before running any queries, just like the last lesson making-your-first-query
.
Install - execute this once.#
Note: If you are following these lessons locally and not on Google Colab, then there is no need to reinstall these packages.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql --quiet
Load the data#
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
The data is downloaded from https://archive-beta.ics.uci.edu/dataset/222/bank+marketing via the url https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip.
We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv
.
Dataset citation:
Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
Helper script#
We developed a banking.py
script to help you extract the data from the URL and load it into a DuckDB database. This script is located here
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()
After running this code, you should have bank_cleaned.csv
in the current directory.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Let’s now return to our initial dataset of bank marketing records.
Queries#
Creating Table#
Let’s start off with loading our bank_cleaned.csv
file from our local directory to our newly created DuckDB database. Here we CREATE OR REPLACE TABLE
in DuckDB called ‘bank’ FROM
our bank_cleaned.csv
file. The read_csv_auto
is a function that helps SQL understand our local .csv file for creation into our database.
%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count |
---|
Count and Distinct Count#
COUNT()
allows users to query the count of records in a given query. A common argument to this function is *
. COUNT(*)
tells SQL to return the total number of rows from our query, including NULL values. This can be combined with queries that filter for certain conditions.
A common function to be used with COUNT()
is DISTINCT()
. For example, selecting COUNT(DISTINCT(job))
would return the count of distinct values under the “job” column.
%%sql
SELECT COUNT(DISTINCT(job))
FROM bank
count(DISTINCT job) |
---|
12 |
The output of the above query can be an eyesore. Instead of getting SQL’s default column name when running these aggregation function, users can utilize the AS
clause. The text following AS
changes the output of the SELECT
clause to that text. Our next example with demonstrate this.
Here’s an example with COUNT()
that counts the number of rows of our query WHERE
we filter for “balance” greather than or equal to 500 AND
where “marital” equals married. We also use the AS
clause to change the column of our query to “Count”.
%%sql
SELECT COUNT(*) AS Count
FROM bank
WHERE balance > 500 AND marital = 'married'
Count |
---|
1331 |
COUNT()
can also have the arguments of a single row, such as COUNT(job)
. COUNT(job)
would count the number of rows of just the “job” column. If “job” were to have any NULL values in its query, those NULL values would be subtracted from the total row counts of the query.
Average#
AVG()
allows users to take the average of columns. This clause can also be used with filtering. An example of finding the average balance of unemployed observations is as follows:
%%sql
SELECT AVG(balance) AS average_unemployed_balance
FROM bank
WHERE job = 'unemployed'
average_unemployed_balance |
---|
1089.421875 |
The value of our AVG(balance)
function can be rounded to better represent a monetary balance with ROUND()
. ROUND()
accepts two arguments. The first is the actual value to round and the second is the number of decimal places to round to. We apply ROUND()
to the same query below.
Note that aliasing query column outputs with the AS clause should not have any spaces. By convention, this makes it easier for SQL to later reference these aliases and avoid ambiguity for the parser.
%%sql
SELECT ROUND(AVG(balance),2) AS average_unemployed_balance
FROM bank
WHERE job = 'unemployed'
average_unemployed_balance |
---|
1089.42 |
Sum#
SUM()
aggregates the sum of columns. Below, we find the SUM()
of the balance column WHERE
“job” equals ‘management’ OR
WHERE
“job” equals ‘services’.
%%sql
SELECT SUM(balance) AS sum_balance_of_managers
FROM bank
WHERE job = 'management' OR job = 'services'
sum_balance_of_managers |
---|
2172504 |
Minimum and Maximum#
The MIN()
and MAX()
functions do exactly what you would think. Below we find the MIN()
and MAX()
of “balance”.
%%sql
SELECT MIN(balance) AS minimum_balance
FROM bank
minimum_balance |
---|
-3313 |
%%sql
SELECT MAX(balance) AS maximum_balance
FROM bank
maximum_balance |
---|
71188 |
Accumulating everything we have learned so far, can you think of another way of finding the minimum and maximum of balance without MIN()
and MAX()
? Hint: try recreating these queries with ORDER BY
.
Grouping#
Grouping is an extremely useful clause. It allows users to examine the results of aggregate functions within each unique group. Note that grouping with GROUP BY
comes after filtering with WHERE
. Below, we find the COUNT()
of all rows GROUP BY
“housing”. Since housing only has the unique values of ‘yes’ and ‘no’, there will be only two groups.
%%sql
SELECT COUNT(*) AS Count
FROM bank
GROUP BY housing
Count |
---|
1962 |
2559 |
The problem with the above query is we can not determine what value belongs to which group. To fix this, SELECT
both the “housing” variable and the COUNT(*)
function. Let’s also change ‘housing’ to ‘Housing’ in our final output just for demonstration purposes.
%%sql
SELECT housing as Housing, COUNT(*) AS Count
FROM bank
GROUP BY housing
Housing | Count |
---|---|
no | 1962 |
yes | 2559 |
We can now clearly see which count belongs to which group under “housing”.
GROUP BY
also allows for grouping with several variables. For instance, let’s first GROUP BY
“housing” and then GROUP BY
“marital”. Then, find the COUNT()
of these groups.
%%sql
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank
GROUP BY housing, marital
Housing | Marital | Count |
---|---|---|
no | married | 1172 |
yes | married | 1625 |
yes | single | 636 |
no | single | 560 |
yes | divorced | 298 |
no | divorced | 230 |
There are six total groups from our query. “housing” has two groups and “marital” has three groups. Since “housing” has only two groups, ‘yes’ and ‘no’, let’s think about these groups as two seperate blocks. The ‘yes’ and ‘no’ blocks will then each be GROUP BY
the groups under the “marital” column, which are ‘married’, ‘single’, and ‘divorced’. Thus, since each ‘yes’ and ‘no’ group has three groups each, there are a total of six groups.
Having#
The HAVING
clause filters for a query post-aggregation. Unlike the WHERE
query, HAVING
is exclusively used when a GROUP BY
clause is present in the SQL query. We demonstrate its functionality below.
%%sql
SELECT housing AS Housing, marital AS Marital, COUNT(*) AS Count
FROM bank
GROUP BY housing, marital
HAVING Count < 700
Housing | Marital | Count |
---|---|---|
yes | single | 636 |
no | single | 560 |
yes | divorced | 298 |
no | divorced | 230 |
You try: Use JupySQL to perform the queries and answer the questions.#
Question 1 (Easy):#
Find the average “duration” of phone calls. Name the column of your output as “Average Phone Call Length”. Round to 0 decimal places.
Answers
You can use the ROUND
clause to specify where 0 decimal places.
%%sql
SELECT ROUND(AVG(duration),0) AS average_phone_call_length
FROM bank
average_phone_call_length |
---|
264.0 |
Question 2 (Medium):#
Show each education group’s count of whether they have housing or not. Also, make it so that this query only include married individuals.
Answers
The WHERE
clause must appear first before the GROUP BY
clause. Then, grouping by first education and then housing gives the correct COUNT()
of each group. Notice that ‘primary’ under the “education” column does not have a “housing” group of ‘yes’ If no records exist where an observation with an “education” of ‘primary’ has a “housing” value of ‘yes’, then it is omitted from GROUP BY
queries. This is saying that this particular group does not exist.
%%sql
SELECT education, housing, COUNT(*)
FROM bank
WHERE marital = 'married'
GROUP BY education, housing
education | housing | count_star() |
---|---|---|
primary | no | 230 |
secondary | yes | 920 |
tertiary | yes | 360 |
primary | yes | 296 |
tertiary | no | 367 |
secondary | no | 507 |
unknown | no | 68 |
unknown | yes | 49 |
Question 3 (Medium):#
Find the average, minimum, and maximum of balance and the count of records where there has not been a default. Group this query by “job” and “married”. Round the averages by 2 decimal places.
Hint COUNT()
is a aggregating function in SQL (more on aggregation later!). Try experimenting with COUNT()
in your SELECT
clause to see if you can find the correct count.
Answers
You may have had some problems with the WHERE
clause. If you had “default” as just default, you encountered an error. Reading this error would inform that DEFAULT
is an existing SQL clause and having it in our WHERE
clause is not acceptable. In situations like this, you can double quotes “” surrounding the column name. This helps SQL distinguish clauses and literal strings of text.
%%sql
SELECT job, marital, AVG(balance), COUNT(*), MIN(balance), MAX(balance)
FROM bank
WHERE "default" = 'no'
GROUP BY job, marital
job | marital | avg(balance) | count_star() | min(balance) | max(balance) |
---|---|---|---|---|---|
unemployed | married | 1131.5866666666666 | 75 | -872 | 8131 |
services | married | 1252.271551724138 | 232 | -1202 | 26394 |
management | single | 1771.4006968641115 | 287 | -1313 | 19358 |
management | married | 1878.038112522686 | 551 | -1746 | 27359 |
blue-collar | married | 1079.1854014598541 | 685 | -1400 | 16353 |
self-employed | married | 1410.0564516129032 | 124 | -370 | 16430 |
technician | married | 1463.7407407407406 | 405 | -1310 | 27069 |
entrepreneur | married | 1665.84496124031 | 129 | -139 | 42045 |
admin. | married | 1365.4772727272727 | 264 | -967 | 22171 |
student | single | 1689.162162162162 | 74 | -148 | 11555 |
Wrapping Up#
In this section, we introduced aggregate functions and the GROUP BY
utility. To summarize:
COUNT()
: Returns the number of rows in our query. This function can count the number of rows of a specific column or of the entire query by passing “*” into the function.AVG()
: Returns the average of a numeric column in our query.ROUND()
is a useful function often applied with this aggregate function.MIN()
andMAX()
: Returns the minimum and maximum of a numeric column in our query.GROUP BY
: A powerful clause that groups the data based on a given column. This clause must be used with aggregation functions.
We have also learned other details from this section, such as how WHERE
must appear before any GROUP BY
clause and how to tell SQL to distinguish a literal string when that string is a clause.
Next up, we will use everything we have thus far learned in joining tables.
Delete table
%%sql
DROP TABLE bank;
Success |
---|