DATIKA - Summarize data frames with ease using the summarytools package in R

Introduction

Summarizing data is a critical first step in any data analysis project. The summarytools package in R is designed to make this step easier: it provides data frame summaries, descriptive statistics, frequency tables, and cross-tabulations that let you explore a dataset quickly. In this blog post, we will walk through how to use summarytools with the well-known Palmer Penguins dataset from the palmerpenguins package. By the end of this guide, you’ll be able to explore and summarize your own data with ease.

Load Packages and Data

The summarytools package provides several functions that summarize data in different formats, which makes it well suited to exploratory data analysis (EDA). First, let’s load the necessary packages and data. We’ll use summarytools for the summaries, tidyverse for data manipulation, magrittr for its %$% exposition pipe (used later with ctable()), and palmerpenguins for the dataset.

library(tidyverse)     # glimpse(), select(), drop_na()
library(summarytools)  # dfSummary(), descr(), freq(), ctable()
library(magrittr)      # %$% exposition pipe

The penguins dataset consists of observations on three species of penguins from the Palmer Archipelago in Antarctica. It includes various measurements like body mass, bill length, and flipper length.

penguins <- palmerpenguins::penguins
penguins |> glimpse()

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Data Frame Summary with `dfSummary()`

The dfSummary() function provides a comprehensive summary of the entire data frame: for each column it reports the type, summary statistics or category frequencies, the number of distinct values, and the counts of valid and missing values.

penguins |> 
  select(-year) |> 
  dfSummary() |> 
  print(method = "render")

Data Frame Summary

Dimensions: 344 x 7
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
species [factor]	1. Adelie; 2. Chinstrap; 3. Gentoo	152 (44.2%); 68 (19.8%); 124 (36.0%)	344 (100.0%)	0 (0.0%)
island [factor]	1. Biscoe; 2. Dream; 3. Torgersen	168 (48.8%); 124 (36.0%); 52 (15.1%)	344 (100.0%)	0 (0.0%)
bill_length_mm [numeric]	Mean (sd): 43.9 (5.5); min ≤ med ≤ max: 32.1 ≤ 44.5 ≤ 59.6; IQR (CV): 9.3 (0.1)	164 distinct values	342 (99.4%)	2 (0.6%)
bill_depth_mm [numeric]	Mean (sd): 17.2 (2); min ≤ med ≤ max: 13.1 ≤ 17.3 ≤ 21.5; IQR (CV): 3.1 (0.1)	80 distinct values	342 (99.4%)	2 (0.6%)
flipper_length_mm [integer]	Mean (sd): 200.9 (14.1); min ≤ med ≤ max: 172 ≤ 197 ≤ 231; IQR (CV): 23 (0.1)	55 distinct values	342 (99.4%)	2 (0.6%)
body_mass_g [integer]	Mean (sd): 4201.8 (802); min ≤ med ≤ max: 2700 ≤ 4050 ≤ 6300; IQR (CV): 1200 (0.2)	94 distinct values	342 (99.4%)	2 (0.6%)
sex [factor]	1. female; 2. male	165 (49.5%); 168 (50.5%)	333 (96.8%)	11 (3.2%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17

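The output gives a detailed overview of each column and highlights potential issues such as missing values or unbalanced classes. Here, for example, each of the four measurement columns has 2 missing values and sex has 11, while the three species are unevenly represented (Adelie 44.2%, Gentoo 36.0%, Chinstrap 19.8%). This is especially useful with large datasets, where scanning the raw rows is impractical.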
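The missing-value counts reported by dfSummary() can be cross-checked with a few lines of base R. The following is a minimal sketch that assumes only that the palmerpenguins package is installed; the object names na_counts and n_complete are illustrative, not part of either package.

```r
library(palmerpenguins)

# Count missing values in every column
na_counts <- colSums(is.na(penguins))

# Show only the columns that actually contain missing values
na_counts[na_counts > 0]

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(penguins))
n_complete
```

The counts should agree with the Missing column of the summary, and the number of complete rows is the sample size left once incomplete rows are dropped.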

Descriptive Statistics with `descr()`

For numerical data, descr() is the go-to function for descriptive statistics. It calculates key summary statistics like the mean, median, standard deviation, and more.

penguins |> 
  select(-year) |> 
  descr() |> 
  print(method = "render")

Descriptive Statistics

N: 344

	bill_depth_mm	bill_length_mm	body_mass_g	flipper_length_mm
Mean	17.15	43.92	4201.75	200.92
Std.Dev	1.97	5.46	801.95	14.06
Min	13.10	32.10	2700.00	172.00
Q1	15.60	39.20	3550.00	190.00
Median	17.30	44.45	4050.00	197.00
Q3	18.70	48.50	4750.00	213.00
Max	21.50	59.60	6300.00	231.00
MAD	2.22	7.04	889.56	16.31
IQR	3.10	9.27	1200.00	23.00
CV	0.12	0.12	0.19	0.07
Skewness	-0.14	0.05	0.47	0.34
SE.Skewness	0.13	0.13	0.13	0.13
Kurtosis	-0.92	-0.89	-0.74	-1.00
N.Valid	342	342	342	342
Pct.Valid	99.42	99.42	99.42	99.42

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17

This produces a table of descriptive statistics for the numerical variables: bill_depth_mm, bill_length_mm, body_mass_g, and flipper_length_mm. Non-numeric columns such as species, island, and sex are left out. The function can also compute weighted statistics through its weights argument.
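When the full table is more than you need, descr() can be limited to a subset of statistics and transposed so that each variable sits in its own row. The sketch below assumes the stats and transpose arguments behave as described in the summarytools documentation; penguins_stats, mass, and cv_mass are illustrative names. It also recomputes one reported value, the coefficient of variation (standard deviation divided by mean), in base R.

```r
library(summarytools)
library(palmerpenguins)

# Restrict descr() to a handful of statistics, one variable per row
penguins_stats <- descr(
  penguins[, c("bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g")],
  stats     = c("mean", "sd", "min", "med", "max", "cv"),
  transpose = TRUE
)
penguins_stats

# Cross-check the CV of body mass: sd divided by mean, ignoring NAs
mass    <- penguins$body_mass_g
cv_mass <- sd(mass, na.rm = TRUE) / mean(mass, na.rm = TRUE)
round(cv_mass, 2)
```

Outside an R Markdown or Quarto document, printing the object directly (as here) gives a plain-text table, so print(method = "render") is not required.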

Frequency Table with `freq()`

For categorical variables, the freq() function generates frequency tables that show the count and proportion of each category. This is particularly useful for understanding the distribution of factors like species or island.

penguins |> 
  select(species) |> 
  freq() |> 
  print(method = "render")

Frequencies

species

Type: Factor

species	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
Adelie	152	44.19	44.19	44.19	44.19
Chinstrap	68	19.77	63.95	19.77	63.95
Gentoo	124	36.05	100.00	36.05	100.00
<NA>	0			0.00	100.00
Total	344	100.00	100.00	100.00	100.00

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17

For the species variable, the table gives the count and percentage of each species: Adelie is the most common (152 penguins, 44.19%), followed by Gentoo (124, 36.05%) and Chinstrap (68, 19.77%). Because species has no missing values, the Valid and Total percentages are identical.
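freq() also accepts a single vector and a few arguments that trim the table. The sketch below assumes the order, report.nas, and cumul arguments as documented for summarytools; species_pct is an illustrative name. It sorts the island levels by frequency, hides the <NA> row and the cumulative columns, and then reproduces the species percentages with base R as a sanity check.

```r
library(summarytools)
library(palmerpenguins)

# Frequency table for island: most frequent level first,
# no <NA> row and no cumulative columns
freq(penguins$island, order = "freq", report.nas = FALSE, cumul = FALSE)

# Base R cross-check of the species percentages
species_pct <- round(100 * prop.table(table(penguins$species)), 2)
species_pct
```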

Cross tabulation

The ctable() function from the summarytools package performs cross-tabulation, which helps you analyze the relationship between two categorical variables by displaying the frequency distribution of their combinations. Rows with missing values are dropped first with drop_na(), which leaves 333 penguins.

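# %$% exposes the columns so ctable() can refer to them by name.
# OR = TRUE and RR = TRUE apply to 2 x 2 tables only, so they are not used for this 3 x 2 table.
penguins |> 
  drop_na() %$%
  ctable(x = species, y = sex) |> 
  print(method = "render")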

Cross-Tabulation, Row Proportions

species * sex

	sex
species	female	male	Total
Adelie	73 (50.0%)	73 (50.0%)	146 (100.0%)
Chinstrap	34 (50.0%)	34 (50.0%)	68 (100.0%)
Gentoo	58 (48.7%)	61 (51.3%)	119 (100.0%)
Total	165 (49.5%)	168 (50.5%)	333 (100.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17
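The table above shows row proportions, which is the default. If the question is instead how each sex is distributed across species, column proportions are more natural. The sketch below assumes the prop argument of ctable() as documented ("r" for rows, "c" for columns, "t" for totals, "n" for none); penguins_complete and col_props are illustrative names, and with() stands in for the %$% pipe so that magrittr is not needed.

```r
library(summarytools)
library(palmerpenguins)

# Keep only rows with no missing values
penguins_complete <- na.omit(penguins)

# Column proportions instead of the default row proportions
with(penguins_complete, ctable(x = species, y = sex, prop = "c"))

# Base R cross-check: share of each species within each sex
col_props <- prop.table(
  table(penguins_complete$species, penguins_complete$sex),
  margin = 2
)
round(100 * col_props, 1)
```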

To include the chi-square statistic in the cross-tabulation, you can set the chisq = TRUE argument in the ctable() function. Here’s how you can do it using the Palmer Penguins dataset:

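# chisq = TRUE adds the chi-square test of independence below the table.
# OR = TRUE and RR = TRUE apply to 2 x 2 tables only, so they are not used for this 3 x 2 table.
penguins |> 
  drop_na() %$%
  ctable(x = species, y = sex, chisq = TRUE) |> 
  print(method = "render")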

Cross-Tabulation, Row Proportions

species * sex

	sex
species	female	male	Total
Adelie	73 (50.0%)	73 (50.0%)	146 (100.0%)
Chinstrap	34 (50.0%)	34 (50.0%)	68 (100.0%)
Gentoo	58 (48.7%)	61 (51.3%)	119 (100.0%)
Total	165 (49.5%)	168 (50.5%)	333 (100.0%)
Χ² = 0.0486 df = 2 p = .9760

Generated by summarytools 1.0.1 (R version 4.3.0)
2024-09-17
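With p = .9760, the test gives no evidence that the sex ratio differs between species. The same statistic can be reproduced with base R's chisq.test(), which is a handy check on the ctable() output. This is a minimal sketch assuming only the palmerpenguins package; species_by_sex and chi_result are illustrative names.

```r
library(palmerpenguins)

# Same data as the cross-tabulation: rows with missing values removed
penguins_complete <- na.omit(penguins)

# 3 x 2 contingency table of species by sex
species_by_sex <- table(penguins_complete$species, penguins_complete$sex)

# Pearson chi-square test of independence
chi_result <- chisq.test(species_by_sex)
chi_result$statistic  # chi-square statistic
chi_result$parameter  # degrees of freedom
chi_result$p.value    # p-value
```

The degrees of freedom follow from the table shape: (3 - 1) x (2 - 1) = 2.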

Summary and Conclusion

In this blog post, we’ve demonstrated the use of the summarytools package in R with the Palmer Penguins dataset. We explored how to use key functions like dfSummary() for an overall data frame summary, descr() for descriptive statistics, freq() for frequency tables, and ctable() for cross-tabulations with an optional chi-square test. These tools are invaluable for gaining a deeper understanding of your data and preparing it for further analysis.

References

Palmer Penguins Data: https://allisonhorst.github.io/palmerpenguins/
Summarytools Documentation: https://cran.r-project.org/web/packages/summarytools/vignettes/Introduction.html

Citation

BibTeX citation:

@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Summarize Data Frame with Easy Using the `Summarytools`
    {Package} in {R}},
  date = {2024-08-20},
  url = {https://lugoga.github.io/kitaa/posts/summarytool/},
  langid = {en}
}

For attribution, please cite this work as:

Semba, M., 2024. Summarize data frame with easy using the `summarytools` Package in R [WWW Document]. URL https://lugoga.github.io/kitaa/posts/summarytool/
