Advanced aggregation operations in SQL#
In the last section, we learned about advanced joins. In similar fashion, aggregate functions, which were introduced in the Introduction to SQL module, also have advanced operations.
Recall that aggregation functions are useful for summarizing your data and for finding meaningful insights. The most common of these functions are COUNT()
, AVG()
, SUM()
, MIN()
, MAX()
, GROUPBY()
, and HAVING
. However, in this section, we will focus on operations that help us handle tasks that are hard to implement efficiently with
basic aggregation features.
Specifically, we will learn about ranking (RANK()
), windowing (OVER()
), pivoting (PIVOT()
), and rollup (ROLLUP()
).
Let’s first run the installations and setup before running any queries.
Set up and data access#
Important
Note: The --save
and %sqlcmd
features used require the latest JupySQL version. Ensure you run the code below to update JupySQL.
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql pandas --quiet
Load the data#
Note: If you are following these lessons locally and not on Google Colab, then there is no need to load the data again.
This section was covered in detail in the tutorial: Joining Data in SQL. We will be using the same data in this tutorial as well.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.MarketData(
"https://web.archive.org/web/20070214120527/http://lisp.vse.cz/pkdd99/DATA/data_berka.zip", # noqa E501
"expanded_data",
)
_.convert_asc_to_csv(banking.district_column_names)
Converted expanded_data/trans.asc to CSV.
Converted expanded_data/order.asc to CSV.
Converted expanded_data/loan.asc to CSV.
Converted expanded_data/district.asc to CSV.
Converted expanded_data/disp.asc to CSV.
Converted expanded_data/client.asc to CSV.
Converted expanded_data/card.asc to CSV.
Converted expanded_data/account.asc to CSV.
All ASC files converted to CSV.
If you ran the above cell, you should have a folder expanded_data
in your current directory that contains the .csv
files we will be using. However, in this tutorial, we will focus on one file: loan.csv
.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank_data.duck.db' to run SQL queries
%sql duckdb:///bank_data.duck.db
There's a new jupysql version available (0.10.7), you're running 0.10.0. To upgrade: pip install jupysql --upgrade
Deploy AI and data apps for free on Ploomber Cloud! Learn more: https://docs.cloud.ploomber.io/en/latest/quickstart/signup.html
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Let’s now return to our initial dataset of bank marketing records.
Queries#
Creating Table#
Let’s start off with loading the loan.csv
file from the expanded_data
folder in the current directory to our newly created DuckDB database. Because we will be working with one table, creating a schema is not required.
%%sql
CREATE TABLE loan AS
FROM read_csv_auto('expanded_data/loan.csv', header=True, sep=',');
Count |
---|
Ranking#
RANK()
finds the position of a value within a result set. For instance, we may wish to assign customers a rank based on the date when the loan was granted, with the rank 1 going to the customer with the highest date, the rank 2 to the customer with the next highest date, and so on.
Note: The date
variable in the loan
data is an integer, in the format YYMMDD, and will be used throughout the examples below.
General Syntax#
The general syntax of the rank function is:
%%sql
RANK() OVER (ORDER BY expr [ASC|DESC] [, expr [ASC|DESC]] ...)
RANK()
is executed by the attributes (or expressions) specified in the OVER (ORDER BY())
clause. Without this clause, RANK()
will return the same value for each row.
Each order by expression optionally it can be followed by ASC
or DESC
to indicate the sort direction. The default is ASC
if no direction is specified. NULL
values are sorted first for ascending sorts and last for descending sorts.
The following query gives the rank of each customer:
%%sql
SELECT account_id, RANK() OVER (ORDER BY (date) DESC) AS c_rank
FROM loan;
account_id | c_rank |
---|---|
8645 | 1 |
6922 | 2 |
1928 | 2 |
105 | 4 |
1284 | 4 |
2262 | 6 |
276 | 6 |
9156 | 8 |
3293 | 9 |
309 | 10 |
Although the tuples above are ordered by rank, sometimes they may not. Therefore, an extra, outer ORDER BY()
clause is needed to ensure they are:
%%sql
SELECT account_id, RANK() OVER (ORDER BY (date) DESC) as c_rank
FROM loan
ORDER BY c_rank;
Important
The RANK()
function gives the same rank to all tuples that are equal on the ORDER BY()
attributes. For instance, if the highest date is shared by two customers, both would get rank 1. The next rank given would be 3, not 2, so if three customers have the next highest date, they would all get rank 3, and the next 10 customer(s) would get rank 6, and so on.
There is also a DENSE_RANK
clause that does not create gaps in the ordering and an example is as follows:
%%sql
SELECT account_id, RANK() OVER (ORDER BY (date) DESC) as c_rank, DENSE_RANK() OVER (ORDER BY (date) DESC) AS d_rank
FROM loan
ORDER BY c_rank;
account_id | c_rank | d_rank |
---|---|---|
8645 | 1 | 1 |
6922 | 2 | 2 |
1928 | 2 | 2 |
105 | 4 | 3 |
1284 | 4 | 3 |
2262 | 6 | 4 |
276 | 6 | 4 |
9156 | 8 | 5 |
3293 | 9 | 6 |
309 | 10 | 7 |
Several other functions can be used apart from RANK
or DENSE_RANK
:
PERCENT_RANK
returns the rank of each tuple as a fraction between 0 and 1CUME_DIST
returns the cumulative distribution of each tupleNTILE
takes tuples in each partition (more below!) and divides them into bucketsROW_NUMBER
sorts the rows and gives each row a unique number corresponding to its position in the sort order
Partitioning/Grouping#
RANK()
can also be used to rank tuples within groups. For instance, we may wish to rank customers by the date when the loan was granted, grouped by their status of paying off the loan. This can be done by partitioning the tuples, with a PARTITION BY
clause within OVER()
, by status
(A, B, C, or D) and then ordering them by date
within each status
:
%%sql
SELECT account_id, status, DENSE_RANK() OVER (PARTITION BY status ORDER BY (date) DESC) AS grouped_rank
FROM loan
ORDER BY grouped_rank;
account_id | status | grouped_rank |
---|---|---|
8645 | C | 1 |
5196 | A | 1 |
37 | D | 1 |
3273 | B | 1 |
1928 | C | 2 |
6922 | C | 2 |
8505 | A | 2 |
4858 | D | 2 |
1888 | B | 2 |
1284 | C | 3 |
For reference, the different statuses represent:
‘A’ stands for contract finished, no problems
‘B’ stands for contract finished, loan not payed
‘C’ stands for running contract, OK so far
‘D’ stands for running contract, client in debt
Question 1 (Medium):#
Rank, in descending order, the customers by the date
when the loan was granted, grouped by their status
of paying off the loan. Also, find the average loan amount
of customers by status
and date
and round it to 0 decimal places.
Hint Think about which basic aggregation clauses you will need to use to find the average loan amount by status
and date
. Once that is done, arrange the tuples in descending order of their rank.
Answers
The SQL query here is very similar to the DENSE_RANK
query above. The only difference is that we need to find the average loan amount by status
and date
and round it to 0 decimal places. To do this we need two basic aggregation clauses: AVG()
and GROUP BY()
. Without the GROUP BY()
clause, the appropriate average amount cannot be found. Moreover, the correct columns need to be specified in it, including account_id
, status
, and date
, to obtain the rank of each customer by status
and date
. Lastly, to order the tuples in descending order of their rank, we need to add the DESC
clause.
%%sql
SELECT account_id, status, ROUND(AVG(amount), 0) as avg_amount, DENSE_RANK() OVER (PARTITION BY status ORDER BY date DESC) AS grouped_rank
FROM loan
GROUP BY account_id, status, date
ORDER BY grouped_rank DESC;
account_id | status | avg_amount | grouped_rank |
---|---|---|---|
1071 | C | 253200.0 | 345 |
5313 | C | 300660.0 | 344 |
10079 | C | 167100.0 | 343 |
5385 | C | 149340.0 | 342 |
8321 | C | 89040.0 | 341 |
2933 | C | 272520.0 | 340 |
8558 | C | 288360.0 | 339 |
1811 | C | 239460.0 | 338 |
4260 | C | 244560.0 | 337 |
2824 | C | 50460.0 | 336 |
Important
Try changing the DENSE_RANK()
clause to RANK()
. What do you notice? Why do you think this is the case?
Windowing#
It is relatively easy to write an SQL query using those features we have already studied to compute an aggregate over one window, for example, loan amounts over a fixed 3-day period. However, if we want to do this for every 3-day period, like a moving-average, the query becomes cumbersome.
Window queries compute an aggregate function over ranges of tuples by accessing the records right before or after the current record. A set of rows to which this aggregation function applies to is referred to as a window.
Important
Windows may overlap, in which case a tuple may contribute to more than one window. This is unlike the partitions we learnt about earlier, where a tuple could contribute to only one partition.
General Syntax#
Windowing is performed with the following syntax: OVER
( [partition] [order] [frame] )
It allows a frame to move within a partition depending on the position of the current row within its partition. The offsets of the current row and frame rows are the row numbers if the frame unit is ROWS
and row values if the frame unit is RANGE
.
Frames and its operations can be represented with the following:
frame: {frame_start | frame_between}
frame_between:
BETWEEN
frame_start AND frame_endframe_start, frame_end: {
CURRENT ROW
|UNBOUNDED PRECEDING
|UNBOUNDED FOLLOWING
| exprPRECEDING
| exprFOLLOWING
}
Examples#
The first 14 rows of the
loan
table have unique, non-consecutive dates ranging from July 5th 1993 to December 1st 1993. Therefore, we can compute the average loan amount over the three preceding tuples in the specified sort order. Note that this example makes sense only because each date, for the first 14 records, appears only once inloan
. The example is as follows:
%%sql
SELECT date, AVG(amount) OVER (ORDER BY date ROWS 3 PRECEDING) AS avg_amount
FROM loan
WHERE date <= 931201;
date | avg_amount |
---|---|
930705 | 96396.0 |
930711 | 131178.0 |
930728 | 129812.0 |
930803 | 123810.0 |
930906 | 168396.0 |
930913 | 148866.0 |
930915 | 130293.0 |
930924 | 147528.0 |
931013 | 117447.0 |
931104 | 124743.0 |
This is not the case across the whole table, which means there are several possible orderings of tuples since tuples for the same date could be in any order. To tackle this, we introduce in the example below a windowing query that uses a range of values instead of a specific number of tuples.
Suppose that instead of going back a fixed number of tuples, we want the window to consist of all prior dates. That means the number of prior dates considered is not fixed and, hence, we can perform this on the entire dataset. To get the average amount over all prior dates, we write:
%%sql
SELECT date, AVG(amount) OVER (ORDER BY date RANGE UNBOUNDED PRECEDING) AS avg_amount
FROM loan;
date | avg_amount |
---|---|
930705 | 96396.0 |
930711 | 131178.0 |
930728 | 129812.0 |
930803 | 123810.0 |
930906 | 153996.0 |
930913 | 142970.0 |
930915 | 130086.85714285714 |
930924 | 135669.0 |
931013 | 137752.0 |
931104 | 135679.2 |
Important
The ROUND()
function is not recognized as an aggregate function in DuckDB. In the above query, If we modified the above SELECT
clause to have ROUND(AVG(amount),0)
, we would get an error because we tried to use the ROUND()
function within the OVER
clause, which is not supported in DuckDB. To fix it, use a subquery in the FROM
clause as follows:
%%sql
SELECT date, ROUND(avg_amount, 2) AS rounded_avg_amount
FROM (SELECT date, AVG(amount) OVER (ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS avg_amount FROM loan);
date | rounded_avg_amount |
---|---|
930705 | 151410.18 |
930711 | 151490.96 |
930728 | 151469.68 |
930803 | 151505.6 |
930906 | 151573.01 |
930913 | 151391.08 |
930915 | 151485.09 |
930924 | 151631.31 |
931013 | 151597.01 |
931104 | 151592.83 |
Note: It is possible to use the keyword FOLLOWING
in place of PRECEDING
. If we did this in our example, the year value specifies the beginning of the window instead of the end. However, we will need to specify it using BETWEEN
, explained below!
Using
BETWEEN
forUNBOUNDED FOLLOWING
is necessary because DuckDB does not allow a frame to start withUNBOUNDED FOLLOWING
:
%%sql
SELECT date, ROUND(avg_amount, 2) AS rounded_avg_amount
FROM (SELECT date, AVG(amount) OVER (ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS avg_amount FROM loan);
date | rounded_avg_amount |
---|---|
930705 | 151410.18 |
930711 | 151490.96 |
930728 | 151469.68 |
930803 | 151505.6 |
930906 | 151573.01 |
930913 | 151391.08 |
930915 | 151485.09 |
930924 | 151631.31 |
931013 | 151597.01 |
931104 | 151592.83 |
Question 2 (Hard):#
Return the maximum loan amount
, rounded to 0 decimals, by the customer’s status
for paying off the loan for every three dates preceding and every two dates following the current date of a tuple . Also, order the output by ascending order of date
.
Hint Think about which advanced aggregation clause is needed to return the loan amount
by status
. Then, think about how to build the query for rounding the values.
Answers
The SQL query here is very similar to the third windowing function example. We can write windowing queries that treat each status separately by partitioning by status
. Therefore, we need to add a PARTITION BY
clause before the ORDER BY
clause, like the ranking example, to account for status
, use the MAX
aggregation clause instead of AVG
, add an extra ORDER BY
clause at the end of the query, and change the frame selection to include both PRECEDING
and FOLLOWING
as follows:
%%sql
SELECT date, status, ROUND(max_amount, 0) AS rounded_max_amount
FROM (SELECT date, status, MAX(amount) OVER (PARTITION BY status ORDER BY date ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING) AS max_amount FROM loan)
ORDER BY date;
date | status | rounded_max_amount |
---|---|---|
930705 | B | 464520.0 |
930711 | A | 165960.0 |
930728 | A | 274740.0 |
930803 | A | 274740.0 |
930906 | A | 274740.0 |
930913 | A | 274740.0 |
930915 | A | 274740.0 |
930924 | B | 464520.0 |
931013 | A | 274740.0 |
931104 | A | 154416.0 |
Try playing around with different frame selections and aggregation functions to develop your intuition!
Pivoting#
Using SQL aggregation functions, we can easily create cross-tabulations (or cross-tabs for short), which takes a column of data and turns it into multiple columns, one for each unique value in the original column. It can also perform aggregations on the data, such as calculating the average or sum of the values in each column, useful for summarizing data.
Important
DuckDB supports the PIVOT
operator, but other databases, such as MySQL, do not. Yet, pivoting can be made possible and we shall explore below.
General Syntax#
The syntax for PIVOT
is as follows:
PIVOT [dataset]
ON [column(s)]
USING [value(s)]
GROUP BY [row(s)]
For other databases (and DuckDB), the CASE WHEN
clause can be used to pivot data:
SELECT pivot_column,
SUM(
CASE
WHEN pivot_column = pivot_value THEN aggregate_column
WHEN pivot_column = pivot_value THEN aggregate_column
ELSE 0
END
) AS alias
FROM table
GROUP BY pivot_column;
Examples#
Suppose we want to obtain the total loan amount
by both the customer’s status
of paying off the loan and the duration
of the loan. We can do this by pivoting the status
column, summing the amount
column, and grouping the duration
column:
%%sql
PIVOT loan ON status USING SUM(amount) GROUP BY duration;
duration | A | B | C | D |
---|
Using, CASE WHEN
we get identical results:
%%sql
SELECT duration,
SUM(CASE WHEN status = 'A' THEN amount ELSE 0 END) AS A,
SUM(CASE WHEN status = 'B' THEN amount ELSE 0 END) AS B,
SUM(CASE WHEN status = 'C' THEN amount ELSE 0 END) AS C,
SUM(CASE WHEN status = 'D' THEN amount ELSE 0 END) AS D,
FROM loan
GROUP BY duration;
duration | A | B | C | D |
---|---|---|---|---|
12 | 5136444 | 584256 | 1269348 | 36204 |
24 | 5966688 | 1502016 | 5598336 | 625032 |
36 | 5108508 | 1516212 | 10627164 | 1474380 |
48 | 1730016 | 295344 | 22383024 | 3963408 |
60 | 661560 | 464520 | 29200500 | 5118780 |
Question 3 (Medium):#
For all the customers’ status
of paying off the loan, duration
of the loan, and date
of the loan, return both the average loan amount
. Make sure to return date
in the first column and only those columns that have combinations of status
and duration
values in them (i.e no columns with only None
values should be displayed).
Hint Because the output should contain the date
column first, we use a GROUP BY
for it. Multiple ON
columns and expressions can be specified in the PIVOT
clause along with multiple USING
expressions. The expression ||'_'||
can be used not only to concatenate the columns, but also to pivot only the combinations of values that are present in the data. The order of the columns will, hence, be: date, A_12_sum(amount), A_24_sum(amount), A_36_sum(amount),…, D_60_sum(amount)
Answers
%%sql
PIVOT loan ON status ||'_'|| duration USING SUM(amount) GROUP BY date;
date | A_12 | A_24 | A_36 | A_48 | A_60 | B_12 | B_24 | B_36 | B_48 | B_60 | C_12 | C_24 | C_36 | C_48 | C_60 | D_12 | D_24 | D_36 | D_48 | D_60 |
---|
Try playing with multiple variables in the ON
, USING
, and GROUP BY
clauses and see if you can explore the data more closely!
Grouping Sets, Rollup, and Cube#
To perform a grouping over multiple dimensions within the same query, the following clauses can be used with GROUP BY
:
GROUPING SETS
perform the same aggregate across differentGROUP BY
clauses in a single query.ROLLUP
produces all “sub-groups” of a grouping set, e.g.ROLLUP (country, city, zip)
produces the grouping sets (country, city, zip), (country, city), (country), () where () denotes an empty group by list. Therefore, placement of variables matters here because only the first variable’s individual aggregation is output. This produces n+1 grouping sets where n is the number of terms in theROLLUP
clause.CUBE
: produces grouping sets for all combinations of the inputs, e.g.CUBE (country, city, zip)
will produce (country, city, zip), (country, city), (country, zip), (city, zip), (country), (city), (zip), (). This produces 2^n grouping sets.
Important
Neither the ROLLUP
nor the CUBEclause gives complete control on the groupings that are generated. Therefore,
GROUPING SETS` is the most flexible of the three.
Examples#
Suppose we want to obtain the number of account_id
’s by the customer’s status
of paying off the loan and the duration
of the loan together and separately. We can do this by using GROUPING SETS
:
%%sql
SELECT status, duration, COUNT(account_id) AS count_account_id
FROM loan
GROUP BY GROUPING SETS ((), status, (status, duration), duration);
status | duration | count_account_id |
---|---|---|
None | None | 682 |
B | None | 31 |
A | None | 203 |
C | None | 403 |
D | None | 45 |
B | 12 | 10 |
A | 36 | 32 |
A | 60 | 3 |
A | 24 | 64 |
A | 12 | 93 |
Using CUBE
, which helps us condense the above GROUP BY
clause, we can get identical results:
%%sql
SELECT status, duration, COUNT(account_id) AS count_account_id
FROM loan
GROUP BY CUBE (status, duration);
status | duration | count_account_id |
---|---|---|
None | None | 682 |
B | None | 31 |
A | None | 203 |
C | None | 403 |
D | None | 45 |
B | 12 | 10 |
A | 36 | 32 |
A | 60 | 3 |
A | 24 | 64 |
A | 12 | 93 |
Question 4 (Easy):#
Find the maximum loan amount
in the following groupings: {(date, duration), (status, duration)}.
Answers
Recall the note at the beginning of this section, which stated that neither ROLLUPS
nor CUBE
can be used to specify restricted groupings, like in the question above. Therefore, we use GROUPING SETS
:
%%sql
SELECT date, duration, status, MAX(amount) AS max_amount
FROM loan
GROUP BY GROUPING SETS ((date, duration), (status, duration));
date | duration | status | max_amount |
---|---|---|---|
930705 | 12 | None | 96396 |
930711 | 36 | None | 165960 |
930728 | 60 | None | 127080 |
930803 | 36 | None | 105804 |
930906 | 60 | None | 274740 |
930913 | 24 | None | 87840 |
930915 | 12 | None | 52788 |
930924 | 24 | None | 174744 |
931013 | 48 | None | 154416 |
931104 | 24 | None | 117024 |
Delete table
%%sql
DROP TABLE IF EXISTS loan;
DROP SCHEMA IF EXISTS s1;
Success |
---|
Wrapping Up#
In this section, we introduced advanced aggregation functions. To summarize:
RANK()
: Returns the rank of each row in the result set. It needs to be used with theOVER (ORDER BY())
clause. It can also be used withPARTITION BY
to rank within a group.Windowing : Helps obtain moving-aggregations either through the whole dataset or a subset of it, depending on the frame selection. Executed with the syntax
OVER
( [partition] [order] [frame] ).PIVOT()
: Produces a cross-tab for summarizing datasets based on one or many column(s). Can be emulated withCASE WHEN
statements for compatibility with other SQL dialects.Groupings :
GROUPING SETS
provides the most flexibility out ofROLLUP
andCUBE
.
This ends the module Advanced querying techniques and we hope you enjoyed it! Next, we will learn about how to visualize your queries using popular Python libraries, including matplotlib
and seaborn
, and ggplot
!
References#
Silberschatz, A., Korth, H. F., & Sudarshan, S. (2020). Database system concepts. McGraw-Hill.
DuckDB. (n.d.). https://duckdb.org/docs/