Making your first SQL query#
SQL Overview#
SQL (Structured Query Language) is the widely adopted language used for managing and manipulating data. It’s the language that inspired its other popular variants you may have heard of, such as PostgreSQL, MySQL, and more.
In this lesson, you will learn how to make your first SQL query.
Dataset#
To perform your first SQL query, we will be working with one main dataset throughout this course:
Bank Marketing Data
Source: UCI Machine Learning Repository
URL: https://archive-beta.ics.uci.edu/dataset/222/bank+marketing
Data Citation
Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
Data Description#
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
The data contains the following categories:
age (numeric)
job: type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
marital: marital status (categorical: ‘divorced’,’married’,’single’,’unknown’)
education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
default: has credit in default? (categorical: ‘no’,’yes’,’unknown’)
housing: has housing loan? (categorical: ‘no’,’yes’,’unknown’)
loan: has personal loan? (categorical: ‘no’,’yes’,’unknown’)
contact: contact communication type (categorical: ‘cellular’,’telephone’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
duration: last contact duration, in seconds (numeric)
campaign: number of contacts performed during this campaign and for this client (numeric)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
y: has the client subscribed a term deposit? (binary: ‘yes’,’no’)
Install - execute this once.#
This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.
%pip install jupysql duckdb-engine --quiet
Helper script#
We developed a banking.py
script to help you extract the data from the URL and load it into a DuckDB database. This script is located here
Load the data#
We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv
.
import sys
sys.path.insert(0, "../../")
import banking # noqa: E402
_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()
After running this code, you should have bank_cleaned.csv
in the current directory.
Load Engine#
We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db
Config | value |
---|---|
displaycon | False |
feedback | True |
autopandas | False |
named_parameters | True |
Queries#
Creating Table#
Let’s start off with loading our bank_cleaned.csv
file from our local directory to our newly created DuckDB database. Here we CREATE OR REPLACE TABLE
in DuckDB called ‘bank’ FROM
our bank_cleaned.csv
file. The read_csv_auto
is a function that helps SQL understand our local .csv file for creation into our database.
%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')
Count |
---|
Simple Query#
Now that we have our bank
table in our DuckDB database, we can run our first query on the table. Let’s start off with a simple query that looks at the first five rows from our table.
%%sql
SELECT *
FROM bank
LIMIT 5
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
SELECT
, FROM
and LIMIT
are considered “clauses” in SQL. You can think of these clauses as functions that serve specific task. SELECT
is used to specify what the user wants FROM
the table. The “*” next to our SELECT
clause means to “select all” FROM
our bank table. The LIMIT
clause tells SQL to show only the top 5 rows from our SELECT
clause.
Filtering#
The WHERE
clause allows users to filter the query on specific conditions. Below, we query the table WHERE
the “job” variable equals ‘unemployed’.
%%sql
SELECT *
FROM bank
WHERE job = 'unemployed'
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
40 | unemployed | married | secondary | no | 219 | yes | no | cellular | 17 | nov | 204 | 2 | 196 | 1 | failure | no |
31 | unemployed | single | primary | no | 406 | no | no | cellular | 4 | feb | 736 | 1 | -1 | 0 | unknown | yes |
56 | unemployed | married | primary | no | 3391 | no | no | cellular | 21 | apr | 243 | 1 | -1 | 0 | unknown | yes |
24 | unemployed | single | secondary | no | 119 | no | no | unknown | 15 | sep | 51 | 1 | -1 | 0 | unknown | no |
45 | unemployed | divorced | primary | yes | -249 | yes | yes | unknown | 1 | jul | 92 | 1 | -1 | 0 | unknown | no |
51 | unemployed | married | tertiary | no | 1634 | yes | no | cellular | 22 | jul | 168 | 4 | -1 | 0 | unknown | no |
32 | unemployed | single | secondary | no | 0 | no | no | cellular | 27 | may | 123 | 2 | -1 | 0 | unknown | no |
28 | unemployed | single | secondary | no | 16 | no | no | cellular | 12 | aug | 119 | 4 | -1 | 0 | unknown | no |
38 | unemployed | married | primary | no | 1147 | yes | yes | unknown | 8 | may | 249 | 5 | -1 | 0 | unknown | no |
The ‘unemployed’ is in quotes because the “job” variable has values which are strings. If you unfamiliar with strings, you can find a quick introduction here.
We can extend filtering even further by filtering on two or more conditions. This introduces the AND
and OR
clauses.
%%sql
SELECT *
FROM bank
WHERE job = 'unemployed' AND education = 'primary'
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
31 | unemployed | single | primary | no | 406 | no | no | cellular | 4 | feb | 736 | 1 | -1 | 0 | unknown | yes |
56 | unemployed | married | primary | no | 3391 | no | no | cellular | 21 | apr | 243 | 1 | -1 | 0 | unknown | yes |
45 | unemployed | divorced | primary | yes | -249 | yes | yes | unknown | 1 | jul | 92 | 1 | -1 | 0 | unknown | no |
38 | unemployed | married | primary | no | 1147 | yes | yes | unknown | 8 | may | 249 | 5 | -1 | 0 | unknown | no |
45 | unemployed | divorced | primary | no | 757 | no | no | cellular | 20 | apr | 103 | 1 | 325 | 4 | failure | no |
59 | unemployed | married | primary | no | 0 | no | no | cellular | 30 | jan | 3025 | 2 | -1 | 0 | unknown | no |
47 | unemployed | married | primary | no | 168 | yes | no | telephone | 30 | jan | 66 | 1 | 241 | 1 | other | no |
32 | unemployed | divorced | primary | no | 0 | no | no | cellular | 15 | jul | 175 | 4 | -1 | 0 | unknown | no |
54 | unemployed | married | primary | no | 541 | yes | no | cellular | 3 | feb | 237 | 3 | -1 | 0 | unknown | no |
This query filters the data WHERE
“job” equals ‘unemployed’ AND
where “education” equals ‘primary’. The OR
clause behaves identically to its verbal counterpart.
%%sql
SELECT *
FROM bank
WHERE job = 'unemployed' OR job = 'blue-collar'
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
40 | unemployed | married | secondary | no | 219 | yes | no | cellular | 17 | nov | 204 | 2 | 196 | 1 | failure | no |
31 | unemployed | single | primary | no | 406 | no | no | cellular | 4 | feb | 736 | 1 | -1 | 0 | unknown | yes |
56 | unemployed | married | primary | no | 3391 | no | no | cellular | 21 | apr | 243 | 1 | -1 | 0 | unknown | yes |
24 | unemployed | single | secondary | no | 119 | no | no | unknown | 15 | sep | 51 | 1 | -1 | 0 | unknown | no |
45 | unemployed | divorced | primary | yes | -249 | yes | yes | unknown | 1 | jul | 92 | 1 | -1 | 0 | unknown | no |
51 | unemployed | married | tertiary | no | 1634 | yes | no | cellular | 22 | jul | 168 | 4 | -1 | 0 | unknown | no |
32 | unemployed | single | secondary | no | 0 | no | no | cellular | 27 | may | 123 | 2 | -1 | 0 | unknown | no |
28 | unemployed | single | secondary | no | 16 | no | no | cellular | 12 | aug | 119 | 4 | -1 | 0 | unknown | no |
38 | unemployed | married | primary | no | 1147 | yes | yes | unknown | 8 | may | 249 | 5 | -1 | 0 | unknown | no |
Sorting#
We can sort the outputs of our query based on certain conditions. Below, we sort our query by “balance” using the ORDER BY
clause in DESC
(descending) order.
%%sql
SELECT *
FROM bank
ORDER BY balance DESC
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60 | retired | married | primary | no | 71188 | no | no | cellular | 6 | oct | 205 | 1 | -1 | 0 | unknown | no |
42 | entrepreneur | married | tertiary | no | 42045 | no | no | cellular | 8 | aug | 205 | 2 | -1 | 0 | unknown | no |
43 | technician | single | tertiary | no | 27733 | yes | no | unknown | 3 | jun | 164 | 7 | -1 | 0 | unknown | no |
36 | management | married | tertiary | no | 27359 | yes | no | unknown | 3 | jun | 71 | 2 | -1 | 0 | unknown | no |
57 | technician | married | tertiary | no | 27069 | no | yes | unknown | 20 | jun | 174 | 3 | -1 | 0 | unknown | no |
31 | housemaid | single | primary | no | 26965 | no | no | cellular | 21 | apr | 654 | 2 | -1 | 0 | unknown | yes |
75 | retired | married | secondary | no | 26452 | no | no | telephone | 15 | jul | 219 | 2 | -1 | 0 | unknown | no |
50 | services | married | secondary | no | 26394 | no | no | cellular | 25 | aug | 206 | 4 | -1 | 0 | unknown | no |
54 | management | divorced | tertiary | no | 26306 | yes | no | cellular | 11 | feb | 27 | 1 | 84 | 3 | failure | no |
49 | retired | single | primary | no | 25824 | no | no | unknown | 17 | jun | 94 | 1 | -1 | 0 | unknown | no |
To order the query by ascending order, you can omit the DESC
or add ASC
in the above SQL statement.
You try: Use JupySQL to perform the queries and answer the questions.#
Example: show the first 5 rows of the “job” variable.
%%sql
SELECT job
FROM bank
LIMIT 5
job |
---|
unemployed |
services |
management |
management |
blue-collar |
Question 1 (Easy):#
Query records where the month is in April (“apr”)
Answers
You can use the WHERE
clause to specify where month equals ‘apr’.
%%sql
SELECT *
FROM bank
WHERE month = 'apr'
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
43 | admin. | married | secondary | no | 264 | yes | no | cellular | 17 | apr | 113 | 2 | -1 | 0 | unknown | no |
20 | student | single | secondary | no | 502 | no | no | cellular | 30 | apr | 261 | 1 | -1 | 0 | unknown | yes |
37 | admin. | single | tertiary | no | 2317 | yes | no | cellular | 20 | apr | 114 | 1 | 152 | 2 | failure | no |
38 | admin. | married | secondary | no | 424 | yes | no | cellular | 17 | apr | 279 | 1 | -1 | 0 | unknown | no |
32 | management | single | tertiary | no | 574 | yes | no | cellular | 14 | apr | 259 | 2 | 145 | 3 | failure | no |
56 | unemployed | married | primary | no | 3391 | no | no | cellular | 21 | apr | 243 | 1 | -1 | 0 | unknown | yes |
77 | retired | divorced | tertiary | no | 4659 | no | no | cellular | 14 | apr | 161 | 1 | -1 | 0 | unknown | yes |
34 | technician | married | secondary | no | 1641 | yes | no | cellular | 17 | apr | 380 | 1 | -1 | 0 | unknown | no |
Question 2 (Medium):#
Query the first 5 records where “balance” is greater than or equal to 1000. Sorted this query by ascending order. Hint Equal, greater than, and less can be represented as =, <, >, respectively. Greater than or equal to can be represented as >=
Answers
You can use the WHERE
clause and with the greater than operator “>=” to declare records with a balance greater than or equal to 1000. The query is then sorted by balance in descending order in the last line with ORDER BY
and DESC
.
%%sql
SELECT *
FROM bank
WHERE balance >= 1000
ORDER BY balance DESC
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60 | retired | married | primary | no | 71188 | no | no | cellular | 6 | oct | 205 | 1 | -1 | 0 | unknown | no |
42 | entrepreneur | married | tertiary | no | 42045 | no | no | cellular | 8 | aug | 205 | 2 | -1 | 0 | unknown | no |
43 | technician | single | tertiary | no | 27733 | yes | no | unknown | 3 | jun | 164 | 7 | -1 | 0 | unknown | no |
36 | management | married | tertiary | no | 27359 | yes | no | unknown | 3 | jun | 71 | 2 | -1 | 0 | unknown | no |
57 | technician | married | tertiary | no | 27069 | no | yes | unknown | 20 | jun | 174 | 3 | -1 | 0 | unknown | no |
31 | housemaid | single | primary | no | 26965 | no | no | cellular | 21 | apr | 654 | 2 | -1 | 0 | unknown | yes |
75 | retired | married | secondary | no | 26452 | no | no | telephone | 15 | jul | 219 | 2 | -1 | 0 | unknown | no |
50 | services | married | secondary | no | 26394 | no | no | cellular | 25 | aug | 206 | 4 | -1 | 0 | unknown | no |
54 | management | divorced | tertiary | no | 26306 | yes | no | cellular | 11 | feb | 27 | 1 | 84 | 3 | failure | no |
49 | retired | single | primary | no | 25824 | no | no | unknown | 17 | jun | 94 | 1 | -1 | 0 | unknown | no |
Question 3 (BONUS):#
Show the count of records where ‘housing’ is ‘no’ and where ‘loan’ is yes.
Hint COUNT()
is a aggregating function in SQL (more on aggregation later!). Try experimenting with COUNT()
in your SELECT
clause to see if you can find the correct count.
Answers
You can use COUNT(*)
to get the count of total records after filtering WHERE
‘housing’ is ‘no’ and ‘loan’ is yes.
%%sql
SELECT COUNT(*)
FROM bank
WHERE housing = 'no' AND loan = 'yes'
count_star() |
---|
285 |
Delete table
%%sql
DROP TABLE bank;
Success |
---|
Wrapping Up#
To summarize this section, we first introduced our primary dataset we will be using for the next few sections. Then, we ran our first query by first installing JupySQL
and other packages into our notebook, properly loaded our data with some Python
, and established a connection to a DuckDB database.
We learned the basics of SQL
by going over some of its most necessary clauses:
SELECT
: “Selects” what to extract from the query. This clause can be followed by a specific variable name or by using “*” to select the whole table.FROM
: Tells SQL what table in the database to run our query on. We used this clause primarily on ‘bank’ which we first created when setting up our DuckDB database.LIMIT
: Limits the number of rows from our queryWHERE
: Filters the query on specific conditions. This clause can be combined withAND
andOR
clauses for more complex filters.ORDER BY
: Sorts the output on variables in our query. This clause can includeDESC
to sort by descending order.
These clauses lay the foundation of SQL
. They will be necessary for our next section, which will introduce aggregation functions. We showed a sneak peek of an aggregation function COUNT()
in Question 3. More on that in the next section!
A Little Extra#
As you may have noticed, SQL code is straight forward. It’s clauses translate well to what you want SQL to do in natural verbal terms. These clauses make it so easy it’s like you are “declaring” SQL to do what you would like it to do. This nature is what defines SQL to be a “declarative programming language”.
This article is a great resource if you’re curious on this topic.