Machine learning with tidymodels: Binary Classification Model
A gentle introduction
In this post, we’ll learn how to create machine learning models using R. Machine learning is the foundation for predictive modeling and artificial intelligence. We’ll learn the core principles of machine learning and how to use common tools and frameworks to train, evaluate, and use machine learning models. Specifically, we are going to manipulate and visualize a diabetes dataset, then train and evaluate binary classification models. But before we proceed, we need some packages to accomplish the steps in this post. We can have them installed as:
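Assuming the packages used later in this post (tidyverse for data wrangling and visualization, tidymodels for modelling, janitor for cleaning variable names, and ranger as the random forest engine), a minimal installation sketch is:
# Install the packages used in this post (run once)
install.packages(c("tidyverse", "tidymodels", "janitor", "ranger"))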
Then we load the packages into the session:
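A sketch of the loading step (janitor is called later via janitor::, and ranger only needs to be installed as an engine) is:
# Load the packages for this session
library(tidyverse)
library(tidymodels)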
Binary classification
Let’s start by looking at an example of binary classification, where the model must predict a label that belongs to one of two classes. In this exercise, we’ll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.
Explore the data
Let’s first import a file of patient data directly from the internet with the read_csv
function from the readr package (Wickham, Hester, & Francois, 2017), part of the tidyverse ecosystem (Wickham & Wickham, 2017);
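A sketch of the import call follows; the URL below is only a placeholder, so substitute the actual location of the diabetes file:
# Read the patients data from the internet (placeholder URL, replace with the real one)
diabetes = read_csv("https://example.com/path/to/diabetes.csv")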
We then print the dataset to explore the variables and records it contains;
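In R, that is just a matter of typing the object’s name:
# Print the tibble; only the first ten rows are shown
diabetes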
# A tibble: 15,000 x 10
PatientID Pregn~1 Plasm~2 Diast~3 Trice~4 Serum~5 BMI Diabe~6 Age Diabe~7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1354778 0 171 80 34 23 43.5 1.21 21 0
2 1147438 8 92 93 47 36 21.2 0.158 23 0
3 1640031 7 115 47 52 35 41.5 0.0790 23 0
4 1883350 9 103 78 25 304 29.6 1.28 43 1
5 1424119 1 85 59 27 35 42.6 0.550 22 0
6 1619297 0 82 92 9 253 19.7 0.103 26 0
7 1660149 0 133 47 19 227 21.9 0.174 21 0
8 1458769 0 67 87 43 36 18.3 0.236 26 0
9 1201647 8 80 95 33 24 26.6 0.444 53 1
10 1403912 1 72 31 40 42 36.9 0.104 26 0
# ... with 14,990 more rows, and abbreviated variable names 1: Pregnancies,
# 2: PlasmaGlucose, 3: DiastolicBloodPressure, 4: TricepsThickness,
# 5: SerumInsulin, 6: DiabetesPedigree, 7: Diabetic
The printed output tells us that there are fifteen thousand records and ten variables, but sometimes you may wish to explore the internal structure of the dataset. The glimpse
function from the dplyr package does that by listing the available variables and their types;
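A quick sketch of that call:
# Explore the internal structure of the dataset
diabetes %>%
  glimpse()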
Rows: 15,000
Columns: 10
$ PatientID <dbl> 1354778, 1147438, 1640031, 1883350, 1424119, 16~
$ Pregnancies <dbl> 0, 8, 7, 9, 1, 0, 0, 0, 8, 1, 1, 3, 5, 7, 0, 3,~
$ PlasmaGlucose <dbl> 171, 92, 115, 103, 85, 82, 133, 67, 80, 72, 88,~
$ DiastolicBloodPressure <dbl> 80, 93, 47, 78, 59, 92, 47, 87, 95, 31, 86, 96,~
$ TricepsThickness <dbl> 34, 47, 52, 25, 27, 9, 19, 43, 33, 40, 11, 31, ~
$ SerumInsulin <dbl> 23, 36, 35, 304, 35, 253, 227, 36, 24, 42, 58, ~
$ BMI <dbl> 43.50973, 21.24058, 41.51152, 29.58219, 42.6045~
$ DiabetesPedigree <dbl> 1.21319135, 0.15836498, 0.07901857, 1.28286985,~
$ Age <dbl> 21, 23, 23, 43, 22, 26, 21, 26, 53, 26, 22, 23,~
$ Diabetic <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,~
This dataset consists of 15,000 patients with 10 variables that were used to diagnose diabetes. In this post we treat Diabetic as the outcome and the remaining variables as predictors. A predictor variable is used to predict the occurrence and/or level of another variable, called the outcome variable. Let’s tidy and reorganize the data into a format that is easier for the model to work with. The variable names do not follow the recommended naming standard, so we clean them using the clean_names
function from the janitor package (Firke, 2020). We also notice that a variable like PatientID adds nothing to the model, so it is dropped from the dataset.
Further, since we are going to use a classification algorithm, we need to convert the outcome variable, diabetic, from numeric to factor. This is the label we want to predict, while the other variables, such as Pregnancies, PlasmaGlucose, DiastolicBloodPressure, BMI and so on, are the predictors (features) we will use to predict the Diabetic label.
diabetes.tidy = diabetes %>%
janitor::clean_names() %>%
select(-patient_id) %>%
mutate(diabetic = if_else(diabetic == 1, "Diabetic", "Non diabetic"),
diabetic = as.factor(diabetic))
diabetes.tidy
# A tibble: 15,000 x 9
pregnancies plasma_gluc~1 diast~2 trice~3 serum~4 bmi diabe~5 age diabe~6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 0 171 80 34 23 43.5 1.21 21 Non di~
2 8 92 93 47 36 21.2 0.158 23 Non di~
3 7 115 47 52 35 41.5 0.0790 23 Non di~
4 9 103 78 25 304 29.6 1.28 43 Diabet~
5 1 85 59 27 35 42.6 0.550 22 Non di~
6 0 82 92 9 253 19.7 0.103 26 Non di~
7 0 133 47 19 227 21.9 0.174 21 Non di~
8 0 67 87 43 36 18.3 0.236 26 Non di~
9 8 80 95 33 24 26.6 0.444 53 Diabet~
10 1 72 31 40 42 36.9 0.104 26 Non di~
# ... with 14,990 more rows, and abbreviated variable names 1: plasma_glucose,
# 2: diastolic_blood_pressure, 3: triceps_thickness, 4: serum_insulin,
# 5: diabetes_pedigree, 6: diabetic
Our primary goal in data exploration is to understand the patterns of relationships between variables in the dataset. Therefore, we should assess any apparent association visually through data visualization. At this stage the dataset is in wide format, which makes it awkward to plot several variables at once. To plot all the variables in a single figure, we must first convert the dataset from wide to long format, which we can do with the pivot_longer
function from the tidyr package (Wickham, François, Henry, & Müller, 2019).
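A sketch of the pivoting step, which creates the diabetes.tidy.long object used in the plot below (the column names features and values are choices we make here), is:
# Pivot every predictor into a name-value pair, keeping the outcome column
diabetes.tidy.long = diabetes.tidy %>%
  pivot_longer(cols = !diabetic, names_to = "features", values_to = "values")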
Once we have pivoted the data to long format, we can display all the variables in a single plot using facets;
theme_set(theme_light())
diabetes.tidy.long %>%
ggplot(aes(x = diabetic, y = values, fill = features)) +
geom_boxplot() +
facet_wrap(~ features, scales = "free", ncol = 4) +
scale_fill_viridis_d(option = "plasma", end = .7) +
scale_y_continuous(name = "Values", trans = scales::sqrt_trans())+
theme(legend.position = "none", axis.title.x = element_blank(),
strip.background = element_rect(fill = "grey60"),
strip.text = element_text(color = "white", face = "bold"))
The values of the predictors differ between diabetic and non-diabetic individuals. In particular, with the exception of diabetes_pedigree and triceps_thickness, the predictors tend to take higher values in diabetic individuals than in non-diabetic ones (Figure 1). These features may help predict whether or not a patient is diabetic.
Split the data
Our dataset includes known values for the label, so we can use this to train a classifier so that it finds a statistical relationship between the features and the label value; but how will we know if our model is any good? How do we know it will predict correctly when we use it with new data that it wasn’t trained with?
It is best practice to hold out part of the data for testing in order to get a better estimate of how the model will perform on new data. Since we have a fairly large dataset with known label values, we can use only some of it to train the model and hold back the rest to test the trained model, enabling us to compare the predicted labels with the already known labels in the test set.
In R, the tidymodels framework provides a collection of packages for modeling and machine learning using tidyverse principles (Kuhn & Wickham, 2020). For instance, rsample, a package in tidymodels, provides infrastructure for efficient data splitting and resampling (Kuhn, Chow, & Wickham, 2020):
- initial_split(): specifies how the data will be split into a training and a testing set
- training() and testing(): extract the data in each split
# Split data into 70% for training and 30% for testing
set.seed(2056)
diabetes_split <- diabetes.tidy %>%
initial_split(prop = 0.70)
# Extract the data in each split
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
# Print the number of cases in each split
cat("Training cases: ", nrow(diabetes_train), "\n",
"Test cases: ", nrow(diabetes_test), sep = "")
Training cases: 10500
Test cases: 4500
Train and Evaluate a Binary Classification Model
Once we have separated the dataset into training and test sets, we’re ready to train our model by fitting the training features to the training labels (diabetic). There are various algorithms we can use to train the model.
Logistic Regression Algorithm
In this section, we’ll use Logistic Regression, which is a well-established algorithm for classification. Logistic regression is a binary classification algorithm, meaning it predicts two categories. There are quite a number of ways to fit a logistic regression model in tidymodels. For now, let’s fit a logistic regression model via the default stats::glm()
engine.
# Make a model specification
logreg_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Print the model specification
logreg_spec
Logistic Regression Model Specification (classification)
Computational engine: glm
After a model has been specified, the model can be estimated or trained using the fit() function, typically using a symbolic description of the model (a formula) and some data.
# Train a logistic regression model
logreg_fit <- logreg_spec %>%
fit(diabetic ~ ., data = diabetes_train)
# Print the model object
logreg_fit
parsnip model object
Call: stats::glm(formula = diabetic ~ ., family = stats::binomial,
data = data)
Coefficients:
(Intercept) pregnancies plasma_glucose
8.624243 -0.266296 -0.009615
diastolic_blood_pressure triceps_thickness serum_insulin
-0.012297 -0.022807 -0.003932
bmi diabetes_pedigree age
-0.049028 -0.923262 -0.056876
Degrees of Freedom: 10499 Total (i.e. Null); 10491 Residual
Null Deviance: 13290
Residual Deviance: 9221 AIC: 9239
The model printout shows the coefficients learned during training. Now that we’ve trained the model on the training data, we can use the test data we held back to evaluate how well it predicts, using parsnip::predict()
. Let’s start by using the model to predict labels for our test set, and compare the predicted labels to the known labels:
# Make predictions then bind them to the test set
results <- diabetes_test %>%
select(diabetic) %>%
bind_cols(
logreg_fit %>% predict(new_data = diabetes_test)
)
# Compare predictions
results %>%
slice_head(n = 10)
# A tibble: 10 x 2
diabetic .pred_class
<fct> <fct>
1 Non diabetic Non diabetic
2 Non diabetic Non diabetic
3 Non diabetic Non diabetic
4 Non diabetic Non diabetic
5 Diabetic Diabetic
6 Non diabetic Non diabetic
7 Non diabetic Non diabetic
8 Diabetic Non diabetic
9 Non diabetic Non diabetic
10 Non diabetic Non diabetic
Comparing each prediction with its corresponding “ground truth” actual value isn’t a very efficient way to determine how well the model is predicting. Fortunately, tidymodels has a few more tricks up its sleeve: yardstick - a package used to measure the effectiveness of models using performance metrics (Kuhn & Vaughan, 2020). The most obvious thing you might want to do is to check the accuracy of the predictions - in simple terms, what proportion of the labels did the model predict correctly? yardstick::accuracy()
does just that!
# Calculate accuracy: proportion of data predicted correctly
accuracy(
data = results,
truth = diabetic,
estimate = .pred_class
)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.789
The accuracy is returned as a decimal value - a value of 1.0 would mean that the model got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless! Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you need to be careful about drawing too many conclusions from the accuracy of a classifier. Remember that it’s simply a measure of how many cases were predicted correctly. Suppose only 3% of the population is diabetic. You could create a classifier that always just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying patients with diabetes!
Fortunately, there are some other metrics that reveal a little more about how our classification model is performing. One performance metric associated with classification problems is the confusion matrix. A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified. In our case, it shows how many cases were classified as positive (Diabetic) and how many as negative (Non diabetic); the confusion matrix also shows how many were classified into the wrong categories. The conf_mat()
function from yardstick calculates this cross-tabulation of observed and predicted classes.
# Confusion matrix for prediction results
results %>%
conf_mat(truth = diabetic, estimate = .pred_class)
Truth
Prediction Diabetic Non diabetic
Diabetic 897 293
Non diabetic 657 2653
The confusion matrix presented above can also be displayed visually, as shown in Figure 2.
# Visualize conf mat
update_geom_defaults(geom = "rect", new = list(fill = "midnightblue", alpha = 0.7))
results %>%
conf_mat(diabetic, .pred_class) %>%
autoplot()
tidymodels provides yet another succinct way of evaluating all these metrics. Using yardstick::metric_set()
, you can combine multiple metrics together into a new function that calculates all of them at once.
# Combine metrics and evaluate them all at once
eval_metrics <-
metric_set(ppv, recall, accuracy, f_meas)
eval_metrics(
data = results,
truth = diabetic,
estimate = .pred_class
)
# A tibble: 4 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 ppv binary 0.754
2 recall binary 0.577
3 accuracy binary 0.789
4 f_meas binary 0.654
Until now, we’ve considered the predictions from the model as being either 1 or 0 class labels. Actually, things are a little more complex than that. Statistical machine learning algorithms, like logistic regression, are based on probability; so what actually gets predicted by a binary classifier is the probability that the label is true, \(P(y)\), and the probability that the label is false, \(1-P(y)\). A threshold value of 0.5 is used to decide whether the predicted label is a 1 (\(P(y)>0.5\)) or a 0 (\(P(y)\leq 0.5\)). Let’s see the probability pairs for each case:
# Predict class probabilities and bind them to results
results <- results %>%
bind_cols(logreg_fit %>%
predict(new_data = diabetes_test, type = "prob"))
# Print out the results
results %>%
slice_head(n = 10)
# A tibble: 10 x 4
diabetic .pred_class .pred_Diabetic `.pred_Non diabetic`
<fct> <fct> <dbl> <dbl>
1 Non diabetic Non diabetic 0.417 0.583
2 Non diabetic Non diabetic 0.0985 0.902
3 Non diabetic Non diabetic 0.0469 0.953
4 Non diabetic Non diabetic 0.0561 0.944
5 Diabetic Diabetic 0.581 0.419
6 Non diabetic Non diabetic 0.331 0.669
7 Non diabetic Non diabetic 0.288 0.712
8 Diabetic Non diabetic 0.270 0.730
9 Non diabetic Non diabetic 0.275 0.725
10 Non diabetic Non diabetic 0.131 0.869
The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the true positive rate (which is another name for recall) and the false positive rate \((1 - specificity)\) for a range of possible thresholds. These rates are plotted against each other to form a chart known as a receiver operating characteristic (ROC) chart, like this:
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
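We can draw this chart with roc_curve() from yardstick followed by autoplot(); a sketch for our logistic regression results is:
# ROC curve for the logistic regression predictions
results %>%
  roc_curve(truth = diabetic, .pred_Diabetic) %>%
  autoplot()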
Figure 3 shows the curve of the true and false positive rates for different threshold values between 0 and 1. A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line across the chart represents the probability of predicting correctly with a \(\frac{50}{50}\) random prediction; so you obviously want the curve to be higher than that (or your model is no better than simply guessing!).
The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall performance of the model. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. The closer to 1 this value is, the better the model. Once again, tidymodels
includes a function to calculate this metric: yardstick::roc_auc()
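Applied to our logistic regression results, a sketch looks like this:
# Area under the ROC curve for the logistic regression model
results %>%
  roc_auc(truth = diabetic, .pred_Diabetic)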
Random Forest Algorithm
We have been dealing with logistic regression, which is a linear algorithm. tidymodels provides a swift way to change the algorithm used in a model. For instance, we can replace logistic regression with other kinds of classification algorithms, including:
- Support Vector Machine algorithms: Algorithms that define a hyperplane that separates classes.
- Tree-based algorithms: Algorithms that build a decision tree to reach a prediction
- Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.
To make life simple, let us train the model using an ensemble algorithm called random forest, which averages the outputs of multiple random decision trees. Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process. More specifically, instead of considering all predictors when calculating a given split, random forests pick a random sample of predictors to be considered for that split.
Given our experience with the logistic regression model, and because the power of tidymodels lies in its consistency, we do not need to start over; the only thing we need to do is specify and fit a random forest algorithm.
# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>%
step_mutate(age = factor(age)) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(age)
# specify a random forest model specification
rf_spec <- rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# Bundle recipe and model spec into a workflow
rf_wf <- workflow() %>%
add_recipe(diabetes_recipe) %>%
add_model(rf_spec)
# Fit a model
rf_wf_fit <- rf_wf %>%
fit(data = diabetes_train)
Then we make predictions with the fitted model on the test dataset;
rf.prediction = rf_wf_fit %>%
predict(new_data = diabetes_test)
rf.prediction.prob = rf_wf_fit %>%
predict(new_data = diabetes_test, type = "prob")
rf.prediction = rf.prediction %>%
bind_cols(rf.prediction.prob)
rf.result = diabetes_test %>%
select(diabetic) %>%
bind_cols(rf.prediction)
rf.result
# A tibble: 4,500 x 4
diabetic .pred_class .pred_Diabetic `.pred_Non diabetic`
<fct> <fct> <dbl> <dbl>
1 Non diabetic Non diabetic 0.275 0.725
2 Non diabetic Non diabetic 0.0137 0.986
3 Non diabetic Non diabetic 0.00739 0.993
4 Non diabetic Non diabetic 0.0108 0.989
5 Diabetic Diabetic 0.910 0.0902
6 Non diabetic Non diabetic 0.103 0.897
7 Non diabetic Non diabetic 0.237 0.763
8 Diabetic Non diabetic 0.270 0.730
9 Non diabetic Non diabetic 0.0484 0.952
10 Non diabetic Non diabetic 0.0326 0.967
# ... with 4,490 more rows
The printed predictions of the random forest look compelling; let’s evaluate the model’s metrics!
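As before, we cross-tabulate the observed and predicted classes with conf_mat():
# Confusion matrix for the random forest predictions
rf.result %>%
  conf_mat(truth = diabetic, estimate = .pred_class)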
Truth
Prediction Diabetic Non diabetic
Diabetic 1368 99
Non diabetic 186 2847
We notice a considerable increase in the number of true positives and true negatives, which is a step in the right direction. Let’s take a look at some other evaluation metrics.
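Accuracy is computed the same way as for the logistic regression model:
# Accuracy of the random forest predictions
rf.result %>%
  accuracy(truth = diabetic, estimate = .pred_class)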
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.937
There is a considerable improvement in accuracy, from about 0.79 for the logistic regression model to around 0.94 for the random forest. The high accuracy of the random forest is also clearly visible in the ROC curve (Figure 4), with a considerably higher value of the area under the curve (AUC).
roc.values = rf.result %>%
roc_auc(truth = diabetic, .pred_Diabetic) %>%
pull(.estimate) %>%
round(2)
rf.result %>%
roc_curve(truth = diabetic, .pred_Diabetic) %>%
autoplot()+
annotate(geom = "text", x = .75, y = .25, label = paste("AUC-value = ", roc.values))
Figure 5 is a variable importance plot (VIP) that illustrates the importance of each predictor variable used to predict the diabetic outcome in the random forest model.
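One way to produce such a plot (a sketch assuming the vip package, which is not used elsewhere in this post) is to pull the fitted ranger engine out of the workflow and pass it to vip(); this works because we set importance = "impurity" in the engine specification:
# Variable importance from the fitted random forest engine
library(vip)
rf_wf_fit %>%
  extract_fit_engine() %>%
  vip()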
Summary
We notice that the random forest has higher predictive power than the logistic regression model, and hence you can choose the model that gives you the highest accuracy in prediction. The random forest lines of code are bundled into a single chunk;
## splitting
my.split = diabetes.tidy %>%
initial_split(prop = .8)
## test set
my.test = my.split %>% testing()
## train set
my.train = my.split %>% training()
# specify the model
my.fit = rand_forest() %>%
set_engine(engine = "ranger") %>%
set_mode(mode = "classification") %>%
# fit the model
fit(diabetic ~ ., data = my.train)
# test the model
my.prediction = my.fit %>%
predict(new_data = my.test)
my.prediction.prob = my.fit %>%
predict(new_data = my.test, type = "prob")
my.prediction = my.prediction %>%
bind_cols(my.prediction.prob)
my.result = my.test %>%
select(diabetic) %>%
bind_cols(my.prediction)
# validate the model
my.result %>%
conf_mat(truth = diabetic, estimate = .pred_class)
my.result %>%
accuracy(truth = diabetic, estimate = .pred_class)
# Combine metrics and evaluate them all at once
eval_metrics <-
metric_set(ppv, recall, accuracy, f_meas)
eval_metrics(
data = my.result,
truth = diabetic,
estimate = .pred_class
)
## plotting
my.result %>%
roc_curve(truth = diabetic, .pred_Diabetic) %>%
autoplot()
my.result %>%
roc_auc(truth = diabetic, .pred_Diabetic)