Summarize Data Frames with Ease Using the summarytools Package in R

This post shows how to use the summarytools package in R with the Palmer Penguins dataset to summarize data.
Categories: visualization, code, statistics
Author: Masumbuko Semba
Published: August 20, 2024
Modified: August 22, 2024

Introduction

Summarizing data is a critical first step in any data analysis project. The summarytools package in R is designed for exactly this step: it produces whole data frame summaries, descriptive statistics, frequency tables, and cross-tabulations, so you can explore a dataset quickly. In this blog post, we will walk through how to use summarytools with the well-known Palmer Penguins dataset from the palmerpenguins package. By the end of this guide, you’ll be able to explore and summarize your own data with ease.

Load Packages and Data

The summarytools package provides several functions that summarize data in different formats, which makes it well suited to exploratory data analysis (EDA). First, let’s load the necessary packages: summarytools for the summaries, tidyverse for data manipulation, and magrittr for the %$% pipe used later with ctable(). The dataset comes from the palmerpenguins package, which must be installed but does not need to be attached.

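# library() stops with an error if a package is missing, whereas require()
# only returns FALSE and lets the script continue
library(tidyverse)     # select(), drop_na(), glimpse()
library(summarytools)  # dfSummary(), descr(), freq(), ctable()
library(magrittr)      # the %$% exposition pipe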

The penguins dataset consists of observations on three species of penguins from the Palmer Archipelago in Antarctica. It includes various measurements like body mass, bill length, and flipper length.

penguins <- palmerpenguins::penguins
penguins |> glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Data Frame Summary with dfSummary()

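The dfSummary() function provides a comprehensive summary of the entire data frame, including the number of valid and missing values, the number of distinct values, and the distribution of categorical and numerical variables. Printing with method = "render" produces the HTML version of the table, which is the method intended for R Markdown and Quarto documents.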

penguins |> 
  select(-year) |> 
  dfSummary() |> 
  print(method = "render")

Data Frame Summary

Dimensions: 344 x 7
Duplicates: 0

No  Variable                     Stats / Values            Freqs (% of Valid)    Valid          Missing
1   species [factor]             1. Adelie                 152 (44.2%)           344 (100.0%)   0 (0.0%)
                                 2. Chinstrap               68 (19.8%)
                                 3. Gentoo                 124 (36.0%)
2   island [factor]              1. Biscoe                 168 (48.8%)           344 (100.0%)   0 (0.0%)
                                 2. Dream                  124 (36.0%)
                                 3. Torgersen               52 (15.1%)
3   bill_length_mm [numeric]     Mean (sd) : 43.9 (5.5)    164 distinct values   342 (99.4%)    2 (0.6%)
                                 min ≤ med ≤ max:
                                 32.1 ≤ 44.5 ≤ 59.6
                                 IQR (CV) : 9.3 (0.1)
4   bill_depth_mm [numeric]      Mean (sd) : 17.2 (2)      80 distinct values    342 (99.4%)    2 (0.6%)
                                 min ≤ med ≤ max:
                                 13.1 ≤ 17.3 ≤ 21.5
                                 IQR (CV) : 3.1 (0.1)
5   flipper_length_mm [integer]  Mean (sd) : 200.9 (14.1)  55 distinct values    342 (99.4%)    2 (0.6%)
                                 min ≤ med ≤ max:
                                 172 ≤ 197 ≤ 231
                                 IQR (CV) : 23 (0.1)
6   body_mass_g [integer]        Mean (sd) : 4201.8 (802)  94 distinct values    342 (99.4%)    2 (0.6%)
                                 min ≤ med ≤ max:
                                 2700 ≤ 4050 ≤ 6300
                                 IQR (CV) : 1200 (0.2)
7   sex [factor]                 1. female                 165 (49.5%)           333 (96.8%)    11 (3.2%)
                                 2. male                   168 (50.5%)

(The Graph column of the rendered table, which holds one small plot per variable, is not reproduced here.)

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17

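The output gives a detailed overview of each column and highlights potential issues such as missing values or unbalanced classes. Here, for example, sex has 11 missing values (3.2%), each of the four measurement columns has 2, and Chinstrap is the least common species at 19.8%. This is especially useful with wide datasets, where checking every column by hand is tedious.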
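The Missing column can be cross-checked with a couple of lines of base R. The sketch below is an illustration rather than part of the original workflow: it counts NA values per column and then prints a console-friendly variant of the summary. The graph.col = FALSE argument is assumed here as the way to drop the Graph column, which is of little use in plain-text output.

```r
library(summarytools)

penguins <- palmerpenguins::penguins

# Count missing values per column with base R; these counts should agree
# with the "Missing" column reported by dfSummary()
missing_per_column <- colSums(is.na(penguins))
print(missing_per_column)

# Console variant of the summary without the Graph column
summary_table <- dfSummary(penguins, graph.col = FALSE)
print(summary_table)
```

If the two disagree, the usual cause is a placeholder such as an empty string or -999 standing in for a missing value, which is.na() does not detect.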

Descriptive Statistics with descr()

For numerical data, descr() is the go-to function for descriptive statistics. It calculates key summary statistics like the mean, median, standard deviation, and more.

penguins |> 
  select(-year) |> 
  descr() |> 
  print(method = "render")

Descriptive Statistics

N: 344

              bill_depth_mm  bill_length_mm  body_mass_g  flipper_length_mm
Mean                  17.15           43.92      4201.75             200.92
Std.Dev                1.97            5.46       801.95              14.06
Min                   13.10           32.10      2700.00             172.00
Q1                    15.60           39.20      3550.00             190.00
Median                17.30           44.45      4050.00             197.00
Q3                    18.70           48.50      4750.00             213.00
Max                   21.50           59.60      6300.00             231.00
MAD                    2.22            7.04       889.56              16.31
IQR                    3.10            9.27      1200.00              23.00
CV                     0.12            0.12         0.19               0.07
Skewness              -0.14            0.05         0.47               0.34
SE.Skewness            0.13            0.13         0.13               0.13
Kurtosis              -0.92           -0.89        -0.74              -1.00
N.Valid                 342             342          342                342
Pct.Valid             99.42           99.42        99.42              99.42


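This produces a table of descriptive statistics for the four numerical variables: bill_depth_mm, bill_length_mm, body_mass_g, and flipper_length_mm. Non-numeric columns such as species are left out. N.Valid shows that 342 of the 344 rows have a value for each measurement. The function can also compute weighted statistics through its weights argument.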
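When the full table is more than you need, descr() can be asked for a subset of statistics. The following is a minimal sketch, assuming the stats and transpose arguments behave as documented in summarytools. It is followed by a base R cross-check for body_mass_g that spells out how the coefficient of variation and the MAD in the table are obtained.

```r
library(summarytools)

penguins <- palmerpenguins::penguins

# Keep only a few statistics and show one variable per row
compact_stats <- descr(
  penguins[, c("bill_length_mm", "body_mass_g")],
  stats = c("mean", "sd", "min", "med", "max"),
  transpose = TRUE
)
print(compact_stats)

# Cross-check selected values for body_mass_g with base R
mass      <- penguins$body_mass_g
mass_mean <- mean(mass, na.rm = TRUE)
mass_sd   <- sd(mass, na.rm = TRUE)
mass_cv   <- mass_sd / mass_mean      # coefficient of variation = sd / mean
mass_mad  <- mad(mass, na.rm = TRUE)  # median absolute deviation, scaled by 1.4826
mass_iqr  <- IQR(mass, na.rm = TRUE)  # Q3 - Q1

print(round(c(mean = mass_mean, sd = mass_sd, cv = mass_cv,
              mad = mass_mad, iqr = mass_iqr), 2))
```

Note that na.rm = TRUE is needed in every base R call, whereas descr() drops missing values on its own and reports how many remained in N.Valid.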

Frequency Table with freq()

For categorical variables, the freq() function generates frequency tables that show the count and proportion of each category. This is particularly useful for understanding the distribution of factors like species or island.

penguins |> 
  select(species) |> 
  freq() |> 
  print(method = "render")

Frequencies

species
Type: Factor

             Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
Adelie        152     44.19          44.19     44.19          44.19
Chinstrap      68     19.77          63.95     19.77          63.95
Gentoo        124     36.05         100.00     36.05         100.00
<NA>            0                               0.00         100.00
Total         344    100.00         100.00    100.00         100.00


For the species variable, the table shows the count and percentage of each species: 152 Adelie (44.19%), 68 Chinstrap (19.77%), and 124 Gentoo (36.05%). Because species has no missing values, the valid and total percentages are identical.
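The difference between the valid and total percentages only becomes visible when a variable has missing values. As an illustrative sketch, the code below runs freq() on sex, which has 11 missing values, and then reproduces both sets of percentages with base R so the two denominators are explicit.

```r
library(summarytools)

penguins <- palmerpenguins::penguins

# sex has missing values, so "% Valid" and "% Total" differ
sex_freq <- freq(penguins$sex)
print(sex_freq)

# Base R equivalent of the two percentage columns
counts_valid   <- table(penguins$sex)                   # NA excluded
counts_with_na <- table(penguins$sex, useNA = "ifany")  # NA kept as a category

pct_valid <- 100 * prop.table(counts_valid)    # denominator: non-missing rows
pct_total <- 100 * prop.table(counts_with_na)  # denominator: all rows

print(round(pct_valid, 2))
print(round(pct_total, 2))
```

Which percentage to report depends on the question: the valid percentage describes the penguins whose sex was recorded, while the total percentage describes the whole sample.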

Cross-Tabulation with ctable()

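The ctable() function from the summarytools package performs cross-tabulation, which helps you analyze the relationship between two categorical variables by displaying the frequency distribution of their combinations. In the code below, drop_na() removes rows with missing values, and the %$% exposition pipe from magrittr exposes the columns of the data frame so that species and sex can be referenced directly inside ctable().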

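# OR and RR are not requested here: summarytools reports odds ratios and
# risk ratios only for 2 x 2 tables, and species has three levels
penguins |> 
  drop_na() %$%
  ctable(x = species, y = sex) |> 
  print(method = "render")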

Cross-Tabulation, Row Proportions

species * sex

                             sex
species      female         male           Total
Adelie        73 (50.0%)     73 (50.0%)    146 (100.0%)
Chinstrap     34 (50.0%)     34 (50.0%)     68 (100.0%)
Gentoo        58 (48.7%)     61 (51.3%)    119 (100.0%)
Total        165 (49.5%)    168 (50.5%)    333 (100.0%)

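The OR and RR arguments of ctable() are documented as applying to 2 x 2 tables only, which is why no odds ratio or risk ratio appears in a three-species table. The sketch below is a hypothetical variation on the example: it keeps two species so that the table becomes 2 x 2, requests both statistics, and recomputes the odds ratio by hand. The manual formula is the standard cross-product ratio and is assumed, not confirmed from the source, to match what summarytools reports.

```r
library(dplyr)
library(tidyr)
library(summarytools)

# Keep two species and drop the unused factor level to get a 2 x 2 table
two_species <- palmerpenguins::penguins |>
  drop_na() |>
  filter(species %in% c("Adelie", "Gentoo")) |>
  droplevels()

two_by_two <- with(
  two_species,
  ctable(x = species, y = sex, OR = TRUE, RR = TRUE)
)
print(two_by_two)

# Manual odds ratio: (a * d) / (b * c) from the 2 x 2 counts
counts <- table(two_species$species, two_species$sex)
odds_ratio <- (counts["Adelie", "female"] * counts["Gentoo", "male"]) /
  (counts["Adelie", "male"] * counts["Gentoo", "female"])
print(round(odds_ratio, 4))
```

An odds ratio close to 1 is what you would expect here, since the sexes are split almost evenly within both species.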

To include the chi-square statistic in the cross-tabulation, you can set the chisq = TRUE argument in the ctable() function. Here’s how you can do it using the Palmer Penguins dataset:

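# OR and RR are left out because they apply only to 2 x 2 tables;
# chisq = TRUE adds the chi-square test below the table
penguins |> 
  drop_na() %$%
  ctable(
    x = species, y = sex,
    chisq = TRUE
    ) |> 
  print(method = "render")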

Cross-Tabulation, Row Proportions

species * sex

                             sex
species      female         male           Total
Adelie        73 (50.0%)     73 (50.0%)    146 (100.0%)
Chinstrap     34 (50.0%)     34 (50.0%)     68 (100.0%)
Gentoo        58 (48.7%)     61 (51.3%)    119 (100.0%)
Total        165 (49.5%)    168 (50.5%)    333 (100.0%)

Χ² = 0.0486   df = 2   p = .9760

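The chi-square line can be reproduced with base R, which also gives access to the expected counts behind the statistic. This is a minimal sketch that assumes ctable() relies on the same Pearson chi-square test as chisq.test(); the object names are illustrative.

```r
library(tidyr)

penguins_complete <- palmerpenguins::penguins |>
  drop_na()

# Contingency table of species by sex on the complete rows
species_by_sex <- table(penguins_complete$species, penguins_complete$sex)

# Pearson chi-square test of independence
test_result <- chisq.test(species_by_sex)
print(test_result)

# Counts expected in each cell if species and sex were independent
print(round(test_result$expected, 1))
```

With a p-value this large, there is no evidence that the sex ratio differs between species. The observed counts sit very close to the expected ones.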

Summary and Conclusion

In this blog post, we’ve demonstrated the use of the summarytools package in R with the Palmer Penguins dataset. We explored four key functions: dfSummary() for an overall data frame summary, descr() for descriptive statistics, freq() for frequency tables, and ctable() for cross-tabulations with an optional chi-square test. Together they give you a quick, structured view of your data before you move on to further analysis.

References

  • Palmer Penguins Data: https://allisonhorst.github.io/palmerpenguins/

  • Summarytools Documentation: https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html

Citation

BibTeX citation:
@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Summarize Data Frame with Easy Using the `Summarytools`
    {Package} in {R}},
  date = {2024-08-20},
  url = {https://lugoga.github.io/kitaa/posts/summarytool/},
  langid = {en}
}
For attribution, please cite this work as:
Semba, M., 2024. Summarize data frame with easy using the `summarytools` Package in R [WWW Document]. URL https://lugoga.github.io/kitaa/posts/summarytool/