Exploratory data analysis with JupySQL#
Dataset: Absenteeism at work#
Source: UCI Machine Learning Repository
URL: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
Dataset description#
The data set allows for several new combinations of attributes and attribute exclusions, or the modification of the attribute type (categorical, integer, or real) depending on the purpose of the research.The data set (Absenteeism at work - Part I) was used in academic research at the Universidade Nove de Julho - Postgraduate Program in Informatics and Knowledge Management.
Categorical data information#
The data contains the following categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
Individual identification (ID)
Reason for absence (ICD).
Month of absence
Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
Seasons (summer (1), autumn (2), winter (3), spring (4))
Transportation expense
Distance from Residence to Work (kilometers)
Service time
Age
Work load Average/day
Hit target
Disciplinary failure (yes=1; no=0)
Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
Son (number of children)
Social drinker (yes=1; no=0)
Social smoker (yes=1; no=0)
Pet (number of pet)
Weight
Height
Body mass index
Absenteeism time in hours (target)
5 minute crash course into JupySQL#
Play the following video to get familiar with JupySQL to execute queries on Jupyter using DuckDB.
If you get stuck, join our Slack community! https://ploomber.io/community
Install - execute this once.#
%pip install jupysql --upgrade duckdb-engine pandas --quiet
Load the data#
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd
url = "https://archive.ics.uci.edu/static/public/445/absenteeism+at+work.zip"
# download the file
urlretrieve(url, "./raw-data/Absenteeism_at_work_AAA.zip")
# Extract the CSV file
with ZipFile("./raw-data/Absenteeism_at_work_AAA.zip", "r") as zf:
zf.extractall("./raw-data/")
# Check the extracted CSV file name
# (in this case, it's "Absenteeism_at_work.csv")
csv_file_name = "./raw-data/Absenteeism_at_work.csv"
# Data clean up
df = pd.read_csv(csv_file_name, sep=",")
df.columns = df.columns.str.replace(" ", "_")
# Save the cleaned up CSV file
df.to_csv("Absenteeism_at_work_cleaned.csv", index=False)
Load Engine#
Note Ensure you restart any previous notebook that has the same database name as the one initialized below.
%reload_ext sql
%sql duckdb:///absenteeism.duck.db
create or replace table absenteeism as
from read_csv_auto('Absenteeism_at_work_cleaned.csv', header=True, sep=';')
SELECT count(*) FROM absenteeism
Use JupySQL to perform the queries and answer the questions.#
Example: show the first 5 rows.
SELECT *
FROM absenteeism
LIMIT 5
Question 1.1 (Easy):#
How many records are there in the ‘absenteeism’ table?
Answers
You can use the %%sql
magic and the COUNT(*)
function to count the total number of records.
%%sql
SELECT COUNT(*)
FROM absenteeism
Question 1.2 (Easy):#
How many unique employees are listed in the dataset?
Answers
You can use the %%sql
magic and the COUNT(DISTINCT ID)
function to count the total number of unique instances of the Age
column.
%%sql
SELECT COUNT(DISTINCT ID)
FROM absenteeism;
Question 1.3 (Easy):#
What is the average distance from residence to work?
Show Answers
You can use the %%sql
magic and the AVG(Distance_from_Residence_to_Work)
function to calculate the average distance from residence to work..
%%sql
SELECT AVG(Distance_from_Residence_to_Work)
FROM absenteeism;
Question 2.1 (Medium):#
On which days of the week does the average absenteeism time exceed 4 hours?
Answers
You can use the %%sql
magic and break down the query as follows:
Select the column with name
Day_of_the_week
From the table called
absenteeism
Then group the values by day of the week that have an average value (use
AVG
) of more than 4 hours in absenteeism.
%%sql
SELECT Day_of_the_week
FROM absenteeism
GROUP BY Day_of_the_week
HAVING AVG(Absenteeism_time_in_hours) > 4;
Question 2.2 (Medium):#
What is the average transportation expense for each season?
Answers
You can use the %%sql
magic and. Use the AVG(Transportation_expense)
with the alias AVG_Transportation_Expense
function to count the average transporation expense, then group by seasons.
%%sql
SELECT Seasons, AVG(Transportation_expense) AS AVG_Transportation_Expense
FROM absenteeism
GROUP BY Seasons;
Question 2.3 (Medium):#
What is the average absenteeism time for employees with BMI higher than the average BMI
Show Answers
You can use the %%sql
magic and. Use the AVG(Absenteeism_time_in_hours)
with the alias AVG_Absenteeism_time_in_hours
function to count the average absenteeism (time units hours).
WHERE Body_mass_index > (
: This part begins a condition that the data must meet to be included in our average calculation. Here, we’re only interested in rows where the Body_mass_index
is greater than a certain value.
SELECT AVG(Body_mass_index) FROM absenteeism)
: This is a subquery, a query within a query. It’s calculating the average Body_mass_index
for the entire absenteeism table.
%%sql
SELECT AVG(Absenteeism_time_in_hours) as AVG_Absenteeism_time_in_hours
FROM absenteeism
WHERE Body_mass_index > (
SELECT AVG(Body_mass_index)
FROM absenteeism);
Question 3.1 (Hard):#
Find the top 3 ages with the highest total absenteeism hours, excluding disciplinary failures.
Answers
You can use the %%sql
magic and break down the query as follows:
Select the column with name
Age
, compute the Sum ofAbsenteeism_time_in_hours
. Give this sum an aliasSum_Absenteeism
.From the table called
absenteeism
The keywork WHERE is used to filter the data that meets a specific condition, in this case
Disciplinary_failure
is equal to zero.Group values by the
Age
column.Sort the values by the sum and show the first 3 values.
%%sql
SELECT Age, SUM(Absenteeism_time_in_hours) AS Sum_Absenteeism
FROM absenteeism
WHERE Disciplinary_failure = 0
GROUP BY Age
ORDER BY Sum_Absenteeism
DESC LIMIT 3;
Question 3.2 (Hard):#
Find the age of employees who have been absent for more than 5 hours with an unjustified absence.
Hint: investigate encoding on the data source.
Answers
You can use the %%sql
magic. ‘Unjustified absence’ is coded with 26. From there all that is required is selecting the age, and using WHERE
to set up the appropriate conditions.
%%sql
SELECT Age
FROM absenteeism
WHERE Reason_for_absence = 26 AND Absenteeism_time_in_hours > 5;
Question 3.3 (Hard):#
Which reasons for absence are more frequent for social drinkers than social non-drinkers?
Show Answers
You can use the %%sql
magic. We use SELECT
to extract the Reason_for_absence
from the absenteeism
table.
The column Social_drinker
is encoded using binary notation, 0=is not a social drinker, 1=is a social drinker.
We next group by their reason for absence.
HAVING COUNT() > (
begins the condition that the groups must meet to be included in the results. Only groups where the count of rows (representing the number of instances of each Reason_for_absence
among social drinkers) is greater than a certain value will be included.
SELECT COUNT() FROM absenteeism WHERE Social_drinker = 0 GROUP BY Reason_for_absence)
is a subquery that calculates the count of rows for each Reason_for_absence
where Social_drinker
is 0 (indicating the employee is not a social drinker), effectively giving us the number of instances of each Reason_for_absence
among non-social drinkers.
%%sql
SELECT Reason_for_absence
FROM absenteeism
WHERE Social_drinker = 1
GROUP BY Reason_for_absence
HAVING COUNT() > (
SELECT COUNT()
FROM absenteeism
WHERE Social_drinker = 0
GROUP BY Reason_for_absence);
Bonus: Save the tables you created using the --save
option, use the saved tables to generate visualizations.#
Here are a few tutorials to get you started:
Parameterizing SQL queries: https://jupysql.ploomber.io/en/latest/user-guide/template.html
SQL Plot: https://jupysql.ploomber.io/en/latest/api/magic-plot.html
Organizing Large queries: https://jupysql.ploomber.io/en/latest/compose.html
Plotting with ggplot: https://jupysql.ploomber.io/en/latest/user-guide/ggplot.html
Turning your notebook into a Voila dashboard: https://ploomber.io/blog/voila-tutorial/
References#
Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.
Acknowledgements#
Thank you Mark Needham for producing the 5 minute crash course on using JupySQL.