Introduction
Summarizing data is a critical first step in any data analysis project. The summarytools package in R is designed to make this step easier, providing descriptive statistics, frequency tables, and cross-tabulations that let you explore a dataset quickly. In this blog post, we will walk through how to use summarytools with the Palmer Penguins dataset from the palmerpenguins package. By the end of this guide, you’ll be able to explore and summarize your own data with ease.
Load Packages and Data
The summarytools package provides several functions that summarize data in different formats, which makes it well suited to exploratory data analysis (EDA). First, let’s load the necessary packages and data. We’ll be using summarytools for summarizing and palmerpenguins for the dataset.
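A minimal sketch of the loading step might look like the following. The first two packages are the ones named above; dplyr, tidyr, and magrittr are assumptions, added only because the later code in this post relies on helpers such as drop_na() and the %$% operator.

```r
# Packages named in the post
library(summarytools)    # dfSummary(), descr(), freq(), ctable()
library(palmerpenguins)  # provides the penguins data frame

# Assumed helpers (not named in the text, but used by later code)
library(dplyr)     # glimpse()
library(tidyr)     # drop_na()
library(magrittr)  # %$% exposition operator

# Load the dataset explicitly from palmerpenguins
data("penguins", package = "palmerpenguins")

dim(penguins)
```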
The penguins dataset consists of observations on three species of penguins from the Palmer Archipelago in Antarctica. It includes measurements such as body mass, bill length, and flipper length.
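The listing that follows has the layout produced by glimpse() from dplyr, so a call along these lines would likely reproduce it. Using glimpse() here is an assumption based on that layout; str(penguins) would give a similar overview in base R.

```r
library(palmerpenguins)  # penguins data frame
library(dplyr)           # glimpse() (assumed; inferred from the output layout)

# Print one line per column: name, type, and the first few values
glimpse(penguins)
```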
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Data Frame Summary with dfSummary()
The dfSummary() function provides a comprehensive summary of the entire dataframe, including the number of missing values, unique values, and the distribution of categorical and numerical variables.
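As a hedged sketch, the call could look like this. The summary below reports 7 columns while the dataset has 8, so dropping year before summarizing is an assumption made to match that output; the object names penguins_no_year and dfs are illustrative.

```r
library(summarytools)
library(palmerpenguins)

# Assumption: year is dropped so the input has the 7 columns reported below
penguins_no_year <- subset(penguins, select = -year)

# Build the summary object: one row per variable
dfs <- dfSummary(penguins_no_year)

# Plain-text summary in the console
print(dfs)

# In an R Markdown or Quarto document, HTML output can be requested with:
# print(dfs, method = "render")
```

Storing the result in an object first keeps the summary separate from how it is displayed, so the same object can be printed to the console or rendered as HTML.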
Data Frame Summary
Dimensions: 344 x 7
Duplicates: 0
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
---|---|---|---|---|---|---|
1 | species [factor] | | | | 344 (100.0%) | 0 (0.0%) |
2 | island [factor] | | | | 344 (100.0%) | 0 (0.0%) |
3 | bill_length_mm [numeric] | | 164 distinct values | | 342 (99.4%) | 2 (0.6%) |
4 | bill_depth_mm [numeric] | | 80 distinct values | | 342 (99.4%) | 2 (0.6%) |
5 | flipper_length_mm [integer] | | 55 distinct values | | 342 (99.4%) | 2 (0.6%) |
6 | body_mass_g [integer] | | 94 distinct values | | 342 (99.4%) | 2 (0.6%) |
7 | sex [factor] | | | | 333 (96.8%) | 11 (3.2%) |
Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17
The output will give you a detailed overview of each column, highlighting potential issues like missing values or unbalanced classes. This is especially useful when working with large datasets, as it provides insights at a glance.
Descriptive Statistics with descr()
For numerical data, descr() is the go-to function for descriptive statistics. It calculates key summary statistics like the mean, median, standard deviation, and more.
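A sketch of the call, under the same assumption as before that year is dropped (it does not appear among the columns of the table below). The names penguins_no_year and stats are illustrative.

```r
library(summarytools)
library(palmerpenguins)

# Assumption: year is excluded, since it is absent from the table below
penguins_no_year <- subset(penguins, select = -year)

# descr() works on the numeric columns; factor columns are skipped
stats <- descr(penguins_no_year)
print(stats)

# Statistics are rows and variables are columns, so single values can be looked up
stats["Mean", "body_mass_g"]

# A shorter table with only the most common statistics:
# descr(penguins_no_year, stats = "common")
```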
Descriptive Statistics
N: 344
 | bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm |
---|---|---|---|---|
Mean | 17.15 | 43.92 | 4201.75 | 200.92 |
Std.Dev | 1.97 | 5.46 | 801.95 | 14.06 |
Min | 13.10 | 32.10 | 2700.00 | 172.00 |
Q1 | 15.60 | 39.20 | 3550.00 | 190.00 |
Median | 17.30 | 44.45 | 4050.00 | 197.00 |
Q3 | 18.70 | 48.50 | 4750.00 | 213.00 |
Max | 21.50 | 59.60 | 6300.00 | 231.00 |
MAD | 2.22 | 7.04 | 889.56 | 16.31 |
IQR | 3.10 | 9.27 | 1200.00 | 23.00 |
CV | 0.12 | 0.12 | 0.19 | 0.07 |
Skewness | -0.14 | 0.05 | 0.47 | 0.34 |
SE.Skewness | 0.13 | 0.13 | 0.13 | 0.13 |
Kurtosis | -0.92 | -0.89 | -0.74 | -1.00 |
N.Valid | 342 | 342 | 342 | 342 |
Pct.Valid | 99.42 | 99.42 | 99.42 | 99.42 |
This table shows the descriptive statistics for numerical variables such as bill_length_mm, flipper_length_mm, and body_mass_g. The function is flexible and can also handle weighted statistics if needed.
Frequency Table with freq()
For categorical variables, the freq() function generates frequency tables that show the count and proportion of each category. This is particularly useful for understanding the distribution of factors like species or island.
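A minimal sketch for the species variable could be written as follows; the object name species_freq is illustrative.

```r
library(summarytools)
library(palmerpenguins)

# Frequency table for a single categorical variable
species_freq <- freq(penguins$species)
print(species_freq)

# Counts can be looked up by category name
species_freq["Adelie", "Freq"]
```

The same call works for other factors, for example freq(penguins$island).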
Frequencies
species
Type: Factor
species | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
Adelie | 152 | 44.19 | 44.19 | 44.19 | 44.19 |
Chinstrap | 68 | 19.77 | 63.95 | 19.77 | 63.95 |
Gentoo | 124 | 36.05 | 100.00 | 36.05 | 100.00 |
<NA> | 0 | | | 0.00 | 100.00 |
Total | 344 | 100.00 | 100.00 | 100.00 | 100.00 |
In the case of the species variable, this function provides the count and percentage of each species of penguin (Adelie, Chinstrap, Gentoo) in the dataset, giving you insights into the dataset’s composition.
Cross-Tabulation with ctable()
The ctable() function from the summarytools package lets you perform cross-tabulation, which helps analyze the relationship between two categorical variables by displaying the frequency distribution of their combinations.
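library(tidyr)     # drop_na()
library(magrittr)  # %$% exposes the data frame's columns by name

penguins |>
  drop_na() %$%    # remove rows with any missing value
  ctable(
    x = species, y = sex,
    OR = TRUE,     # request the odds ratio
    RR = TRUE      # request the risk ratio
  ) |>
  print(method = "render")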
Cross-Tabulation, Row Proportions
species * sex
species | female | male | Total |
---|---|---|---|
Adelie | 73 (50.0%) | 73 (50.0%) | 146 (100.0%) |
Chinstrap | 34 (50.0%) | 34 (50.0%) | 68 (100.0%) |
Gentoo | 58 (48.7%) | 61 (51.3%) | 119 (100.0%) |
Total | 165 (49.5%) | 168 (50.5%) | 333 (100.0%) |
To include the chi-square statistic in the cross-tabulation, you can set the chisq = TRUE argument in the ctable() function. Here’s how you can do it using the Palmer Penguins dataset:
penguins |>
  drop_na() %$%
  ctable(
    x = species, y = sex,
    OR = TRUE,
    RR = TRUE,
    chisq = TRUE
  ) |>
  print(method = "render")
Cross-Tabulation, Row Proportions
species * sex
species | female | male | Total |
---|---|---|---|
Adelie | 73 (50.0%) | 73 (50.0%) | 146 (100.0%) |
Chinstrap | 34 (50.0%) | 34 (50.0%) | 68 (100.0%) |
Gentoo | 58 (48.7%) | 61 (51.3%) | 119 (100.0%) |
Total | 165 (49.5%) | 168 (50.5%) | 333 (100.0%) |
Χ² = 0.0486, df = 2, p = .9760
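To see where the chi-square line comes from, the statistic can be cross-checked in base R. The sketch below is not part of summarytools: it re-enters the counts from the species * sex table by hand and runs chisq.test() on them. The object names counts and res are illustrative.

```r
# Counts copied from the species * sex table above (rows with missing values removed)
counts <- matrix(
  c(73, 73,
    34, 34,
    58, 61),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    sex     = c("female", "male")
  )
)

# Pearson chi-square test of independence between species and sex
res <- chisq.test(counts)
res$statistic  # chi-square statistic
res$parameter  # degrees of freedom
res$p.value    # p-value

# Row percentages, matching the "Row Proportions" layout of ctable()
round(100 * prop.table(counts, margin = 1), 1)
```

With a p-value this close to 1, the counts give no evidence that the sex ratio differs between species.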
Summary and Conclusion
In this blog post, we’ve demonstrated the use of the summarytools package in R with the Palmer Penguins dataset. We explored key functions: dfSummary() for an overall dataframe summary, descr() for descriptive statistics, freq() for frequency tables, and ctable() for cross-tabulations. These tools help you understand your data and prepare it for further analysis.
References
Palmer Penguins Data: https://allisonhorst.github.io/palmerpenguins/
Summarytools Documentation: https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html
Citation
@online{semba2024,
author = {Semba, Masumbuko},
title = {Summarize Data Frame with Easy Using the `Summarytools`
{Package} in {R}},
date = {2024-08-20},
url = {https://lugoga.github.io/kitaa/posts/summarytool/},
langid = {en}
}