Making your first SQL query

Making your first SQL query#

SQL Overview#

SQL (Structured Query Language) is the widely adopted language used for managing and manipulating data. It’s the language that inspired its other popular variants you may have heard of, such as PostgreSQL, MySQL, and more.

In this lesson, you will learn how to make your first SQL query.

Dataset#

To perform your first SQL query, we will be working with one main dataset throughout this course:

Bank Marketing Data

Source: UCI Machine Learning Repository

URL: https://archive-beta.ics.uci.edu/dataset/222/bank+marketing

Data Citation

Moro,S., Rita,P., and Cortez,P.. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

Data Description#

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

The data contains the following categories:

age (numeric)
job: type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
marital: marital status (categorical: ‘divorced’,’married’,’single’,’unknown’)
education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
default: has credit in default? (categorical: ‘no’,’yes’,’unknown’)
housing: has housing loan? (categorical: ‘no’,’yes’,’unknown’)
loan: has personal loan? (categorical: ‘no’,’yes’,’unknown’)
contact: contact communication type (categorical: ‘cellular’,’telephone’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
duration: last contact duration, in seconds (numeric)
campaign: number of contacts performed during this campaign and for this client (numeric)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
y: has the client subscribed a term deposit? (binary: ‘yes’,’no’)

Install - execute this once.#

This code installs JupySQL, DuckDB, and Pandas in your environment. We will be using these moving forward.

%pip install jupysql duckdb-engine --quiet

Helper script#

We developed a banking.py script to help you extract the data from the URL and load it into a DuckDB database. This script is located here

Load the data#

We extract the bank marketing data by retrieving it from it’s URL download link. The link may be a zip file (which it is in this case), so we extract the zip file, read the file containing the data within the zip file, and clean the data. Finally, we save this cleaned data to it’s own seperate file called bank_cleaned.csv.

import sys

sys.path.insert(0, "../../")
import banking  # noqa: E402

_ = banking.BankingData("https://tinyurl.com/jb-bank", "bank")
_.extract_to_csv()

After running this code, you should have bank_cleaned.csv in the current directory.

Load Engine#

We now load in our SQL extension that allows us to execute SQL queries in Jupyter Notebooks.

Note Ensure you restart any previous notebook that has the same database name as the one initialized below.

# Loading in SQL extension
%reload_ext sql
# Initiating a DuckDB database named 'bank.duck.db' to run our SQL queries on
%sql duckdb:///bank.duck.db

Found pyproject.toml from '/home/docs/checkouts/readthedocs.org/user_builds/ploomber-sql/checkouts/latest'

Settings changed:

Config	value
displaycon	False
feedback	True
autopandas	False
named_parameters	True

Queries#

Creating Table#

Let’s start off with loading our bank_cleaned.csv file from our local directory to our newly created DuckDB database. Here we CREATE OR REPLACE TABLE in DuckDB called ‘bank’ FROM our bank_cleaned.csv file. The read_csv_auto is a function that helps SQL understand our local .csv file for creation into our database.

%%sql
CREATE OR REPLACE TABLE bank AS
FROM read_csv_auto('bank_cleaned.csv', header=True, sep=',')

Count

Simple Query#

Now that we have our bank table in our DuckDB database, we can run our first query on the table. Let’s start off with a simple query that looks at the first five rows from our table.

%%sql
SELECT *
FROM bank
LIMIT 5

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
33	services	married	secondary	no	4789	yes	yes	cellular	11	may	220	1	339	4	failure	no
35	management	single	tertiary	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure	no
30	management	married	tertiary	no	1476	yes	yes	unknown	3	jun	199	4	-1	0	unknown	no
59	blue-collar	married	secondary	no	0	yes	no	unknown	5	may	226	1	-1	0	unknown	no

SELECT, FROM and LIMIT are considered “clauses” in SQL. You can think of these clauses as functions that serve specific task. SELECT is used to specify what the user wants FROM the table. The “*” next to our SELECT clause means to “select all” FROM our bank table. The LIMIT clause tells SQL to show only the top 5 rows from our SELECT clause.

Filtering#

The WHERE clause allows users to filter the query on specific conditions. Below, we query the table WHERE the “job” variable equals ‘unemployed’.

%%sql 
SELECT *
FROM bank 
WHERE job = 'unemployed'

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
40	unemployed	married	secondary	no	219	yes	no	cellular	17	nov	204	2	196	1	failure	no
31	unemployed	single	primary	no	406	no	no	cellular	4	feb	736	1	-1	0	unknown	yes
56	unemployed	married	primary	no	3391	no	no	cellular	21	apr	243	1	-1	0	unknown	yes
24	unemployed	single	secondary	no	119	no	no	unknown	15	sep	51	1	-1	0	unknown	no
45	unemployed	divorced	primary	yes	-249	yes	yes	unknown	1	jul	92	1	-1	0	unknown	no
51	unemployed	married	tertiary	no	1634	yes	no	cellular	22	jul	168	4	-1	0	unknown	no
32	unemployed	single	secondary	no	0	no	no	cellular	27	may	123	2	-1	0	unknown	no
28	unemployed	single	secondary	no	16	no	no	cellular	12	aug	119	4	-1	0	unknown	no
38	unemployed	married	primary	no	1147	yes	yes	unknown	8	may	249	5	-1	0	unknown	no

Truncated to displaylimit of 10.

The ‘unemployed’ is in quotes because the “job” variable has values which are strings. If you unfamiliar with strings, you can find a quick introduction here.

We can extend filtering even further by filtering on two or more conditions. This introduces the AND and OR clauses.

%%sql 
SELECT *
FROM bank 
WHERE job = 'unemployed' AND education = 'primary'

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
31	unemployed	single	primary	no	406	no	no	cellular	4	feb	736	1	-1	0	unknown	yes
56	unemployed	married	primary	no	3391	no	no	cellular	21	apr	243	1	-1	0	unknown	yes
45	unemployed	divorced	primary	yes	-249	yes	yes	unknown	1	jul	92	1	-1	0	unknown	no
38	unemployed	married	primary	no	1147	yes	yes	unknown	8	may	249	5	-1	0	unknown	no
45	unemployed	divorced	primary	no	757	no	no	cellular	20	apr	103	1	325	4	failure	no
59	unemployed	married	primary	no	0	no	no	cellular	30	jan	3025	2	-1	0	unknown	no
47	unemployed	married	primary	no	168	yes	no	telephone	30	jan	66	1	241	1	other	no
32	unemployed	divorced	primary	no	0	no	no	cellular	15	jul	175	4	-1	0	unknown	no
54	unemployed	married	primary	no	541	yes	no	cellular	3	feb	237	3	-1	0	unknown	no

Truncated to displaylimit of 10.

This query filters the data WHERE “job” equals ‘unemployed’ AND where “education” equals ‘primary’. The OR clause behaves identically to its verbal counterpart.

%%sql 
SELECT *
FROM bank 
WHERE job = 'unemployed' OR job = 'blue-collar'

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
40	unemployed	married	secondary	no	219	yes	no	cellular	17	nov	204	2	196	1	failure	no
31	unemployed	single	primary	no	406	no	no	cellular	4	feb	736	1	-1	0	unknown	yes
56	unemployed	married	primary	no	3391	no	no	cellular	21	apr	243	1	-1	0	unknown	yes
24	unemployed	single	secondary	no	119	no	no	unknown	15	sep	51	1	-1	0	unknown	no
45	unemployed	divorced	primary	yes	-249	yes	yes	unknown	1	jul	92	1	-1	0	unknown	no
51	unemployed	married	tertiary	no	1634	yes	no	cellular	22	jul	168	4	-1	0	unknown	no
32	unemployed	single	secondary	no	0	no	no	cellular	27	may	123	2	-1	0	unknown	no
28	unemployed	single	secondary	no	16	no	no	cellular	12	aug	119	4	-1	0	unknown	no
38	unemployed	married	primary	no	1147	yes	yes	unknown	8	may	249	5	-1	0	unknown	no

Truncated to displaylimit of 10.

Sorting#

We can sort the outputs of our query based on certain conditions. Below, we sort our query by “balance” using the ORDER BY clause in DESC (descending) order.

%%sql 
SELECT *
FROM bank 
ORDER BY balance DESC

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
60	retired	married	primary	no	71188	no	no	cellular	6	oct	205	1	-1	0	unknown	no
42	entrepreneur	married	tertiary	no	42045	no	no	cellular	8	aug	205	2	-1	0	unknown	no
43	technician	single	tertiary	no	27733	yes	no	unknown	3	jun	164	7	-1	0	unknown	no
36	management	married	tertiary	no	27359	yes	no	unknown	3	jun	71	2	-1	0	unknown	no
57	technician	married	tertiary	no	27069	no	yes	unknown	20	jun	174	3	-1	0	unknown	no
31	housemaid	single	primary	no	26965	no	no	cellular	21	apr	654	2	-1	0	unknown	yes
75	retired	married	secondary	no	26452	no	no	telephone	15	jul	219	2	-1	0	unknown	no
50	services	married	secondary	no	26394	no	no	cellular	25	aug	206	4	-1	0	unknown	no
54	management	divorced	tertiary	no	26306	yes	no	cellular	11	feb	27	1	84	3	failure	no
49	retired	single	primary	no	25824	no	no	unknown	17	jun	94	1	-1	0	unknown	no

Truncated to displaylimit of 10.

To order the query by ascending order, you can omit the DESC or add ASC in the above SQL statement.

You try: Use JupySQL to perform the queries and answer the questions.#

Example: show the first 5 rows of the “job” variable.

%%sql
SELECT job
FROM bank 
LIMIT 5

job
unemployed
services
management
management
blue-collar

Question 1 (Easy):#

Query records where the month is in April (“apr”)

Answers

You can use the WHERE clause to specify where month equals ‘apr’.

%%sql
SELECT *
FROM bank
WHERE month = 'apr'

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
35	management	single	tertiary	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure	no
43	services	married	primary	no	-88	yes	yes	cellular	17	apr	313	1	147	2	failure	no
43	admin.	married	secondary	no	264	yes	no	cellular	17	apr	113	2	-1	0	unknown	no
20	student	single	secondary	no	502	no	no	cellular	30	apr	261	1	-1	0	unknown	yes
37	admin.	single	tertiary	no	2317	yes	no	cellular	20	apr	114	1	152	2	failure	no
38	admin.	married	secondary	no	424	yes	no	cellular	17	apr	279	1	-1	0	unknown	no
32	management	single	tertiary	no	574	yes	no	cellular	14	apr	259	2	145	3	failure	no
56	unemployed	married	primary	no	3391	no	no	cellular	21	apr	243	1	-1	0	unknown	yes
77	retired	divorced	tertiary	no	4659	no	no	cellular	14	apr	161	1	-1	0	unknown	yes
34	technician	married	secondary	no	1641	yes	no	cellular	17	apr	380	1	-1	0	unknown	no

Truncated to displaylimit of 10.

Question 2 (Medium):#

Query the first 5 records where “balance” is greater than or equal to 1000. Sorted this query by ascending order. Hint Equal, greater than, and less can be represented as =, <, >, respectively. Greater than or equal to can be represented as >=

Answers

You can use the WHERE clause and with the greater than operator “>=” to declare records with a balance greater than or equal to 1000. The query is then sorted by balance in descending order in the last line with ORDER BY and DESC.

%%sql
SELECT *
FROM bank
WHERE balance >= 1000
ORDER BY balance DESC

age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
60	retired	married	primary	no	71188	no	no	cellular	6	oct	205	1	-1	0	unknown	no
42	entrepreneur	married	tertiary	no	42045	no	no	cellular	8	aug	205	2	-1	0	unknown	no
43	technician	single	tertiary	no	27733	yes	no	unknown	3	jun	164	7	-1	0	unknown	no
36	management	married	tertiary	no	27359	yes	no	unknown	3	jun	71	2	-1	0	unknown	no
57	technician	married	tertiary	no	27069	no	yes	unknown	20	jun	174	3	-1	0	unknown	no
31	housemaid	single	primary	no	26965	no	no	cellular	21	apr	654	2	-1	0	unknown	yes
75	retired	married	secondary	no	26452	no	no	telephone	15	jul	219	2	-1	0	unknown	no
50	services	married	secondary	no	26394	no	no	cellular	25	aug	206	4	-1	0	unknown	no
54	management	divorced	tertiary	no	26306	yes	no	cellular	11	feb	27	1	84	3	failure	no
49	retired	single	primary	no	25824	no	no	unknown	17	jun	94	1	-1	0	unknown	no

Truncated to displaylimit of 10.

Question 3 (BONUS):#

Show the count of records where ‘housing’ is ‘no’ and where ‘loan’ is yes.

Hint COUNT() is a aggregating function in SQL (more on aggregation later!). Try experimenting with COUNT() in your SELECT clause to see if you can find the correct count.

Answers

You can use COUNT(*) to get the count of total records after filtering WHERE ‘housing’ is ‘no’ and ‘loan’ is yes.

%%sql
SELECT COUNT(*)
FROM bank
WHERE housing = 'no' AND loan = 'yes'

count_star()
285

Delete table

%%sql
DROP TABLE bank;

Success

Wrapping Up#

To summarize this section, we first introduced our primary dataset we will be using for the next few sections. Then, we ran our first query by first installing JupySQL and other packages into our notebook, properly loaded our data with some Python, and established a connection to a DuckDB database.

We learned the basics of SQL by going over some of its most necessary clauses:

SELECT : “Selects” what to extract from the query. This clause can be followed by a specific variable name or by using “*” to select the whole table.
FROM : Tells SQL what table in the database to run our query on. We used this clause primarily on ‘bank’ which we first created when setting up our DuckDB database.
LIMIT : Limits the number of rows from our query
WHERE : Filters the query on specific conditions. This clause can be combined with AND and OR clauses for more complex filters.
ORDER BY : Sorts the output on variables in our query. This clause can include DESC to sort by descending order.

These clauses lay the foundation of SQL. They will be necessary for our next section, which will introduce aggregation functions. We showed a sneak peek of an aggregation function COUNT() in Question 3. More on that in the next section!

A Little Extra#

As you may have noticed, SQL code is straight forward. It’s clauses translate well to what you want SQL to do in natural verbal terms. These clauses make it so easy it’s like you are “declaring” SQL to do what you would like it to do. This nature is what defines SQL to be a “declarative programming language”.

This article is a great resource if you’re curious on this topic.

Making your first SQL query

Contents

Making your first SQL query#

SQL Overview#

Dataset#

Data Description#

Install - execute this once.#

Helper script#

Load the data#

Load Engine#

Queries#

Creating Table#

Simple Query#

Filtering#

Sorting#

You try: Use JupySQL to perform the queries and answer the questions.#

Question 1 (Easy):#

Question 2 (Medium):#

Question 3 (BONUS):#

Wrapping Up#

A Little Extra#