Tidying Data frame

visualization
code
analysis
Author
Affiliation
Published

April 29, 2024

1 Introduction

The post Data cleaning, merging, and appending demonstrates data manipulation with a dplyr package (Wickham et al., 2019) and Basic plots with ggplot2 introduce plotting data using a grammar of graphics package ggplot2 (Wickham, 2016). However, before you manipulate and present your result in plots, your dataframe must be arranged in format that R can read and understand it. This post introduce and discuss an important concept in data management widely refereed to tidy.

2 Tidy data

The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers don’t view information the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fits together. Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to organize our data for the computer to be able to understand it.

This is why it’s extremely important to structure well-formatted tables from the outset - before you even start entering data from your very first preliminary experiment. Data organization is the foundation of any research project. It can make it easier or harder to work with the data throughout your analysis, so it’s worth thinking about when you’re doing your data entry or setting up your experiment. You can organize and structure data in different ways in spreadsheets, but some of these choices can limit your ability to work with the data in other programs.

3 What is Tidy data?

Tidy data is a specific way of organizing data into a consistent format where each variable is a column, each row an observation, and each cell contains a single value (Figure 1). The benefits of data being tidy are myriad which include;

  • Organizing datasets as tidy data makes data cleaning efforts easier
  • Broad range of analytical tools are built upon the assumption to consume tidy data
  • Sharing tidy data increases re-use
Figure 1: Long and wide data organization commonly used to tidy data frames

4 Objectives

In this post we’ll learn some tools to help make our data tidy and more coder-friendly. Those include:

  • Use tidyr::pivot_wider() and tidyr::pivot_longer() to reshape data frames
  • janitor::clean_names() to make column headers more manageable
  • tidyr::unite() and tidyr::separate() to merge or separate information from different columns
  • Detect or replace a string with stringr functions

Lets first load the package we are going to use in this session

library(tidyverse)
library(janitor)
library(readxl)

Then we use read_csv function to load a file called fao_capture.csvv from our working directory and assign as fao.capture.wt.

fao.capture.wt = read_csv("../data/tidy/fao_capture.csv")

We’ve used read_csv() to import data from spreadsheets into R. Before we go further with the imported dataset in (fao-capture?), we need to check whether the fao.capture.wt dataset we have just loaded into our session is tidy. There are three assumptions that makes a dataset tidy, which are;

  • Each variable is its its own column
  • Each records is its own row, and
  • Each value is in its own cell
Table 1: Capture fisheries since 2012. Source FAO

ASFIS species

2012

2013

2014

2015

2016

2017

2018

2019

Marine fishes nei-Osteichthyes

137 683

112 095

110 535

118 078

118 204

124 895

109 938

415 031

Gulf menhaden-Brevoortia patronus

500 162

440 709

385 022

539 198

618 563

461 189

529 231

336 221

Round sardinella-Sardinella aurita

42 344

46 299

50 698

79 851

123 377

126 766

126 400

126 400

American cupped oyster-Crassostrea virginica

77 900

53 866

57 930

55 041

44 313

78 961

49 631

47 723

Ark clams nei-Arca spp

12 687

15 000

17 384

29 073

29 076

29 970

29 970

29 970

Stromboid conchs nei-Strombus spp

36 855

36 610

30 691

34 388

28 774

35 273

34 856

28 308

Blue crab-Callinectes sapidus

43 997

40 570

44 361

50 781

53 616

54 628

56 689

25 937

Yellowfin tuna-Thunnus albacares

19 087

21 865

25 831

27 036

34487

29 310

30 966

24 503

Caribbean spiny lobster-Panulirus argus

31 102

28 468

28 298

31 358

32 410

30 204

28 325

24 311

Atlantic seabob-Xiphopenaeus kroyeri

36 073

34 243

25 203

25 322

30 147

31 319

31 625

22 497

Mexican four-eyed octopus-Octopus maya

12 629

8 806

15 403

23 441

25 722

26 020

28809

20 119

A careful check of fao.capture.wt dataset presented in Table 1 and adhering to tidy assumptions, a fair question would be, is this dataset tidy? The answer is BIG NO. This is not tidy data but rather a MESSY data. We are going to look why we have given that answer that our fao.capture.wt dataset is messy!

5 Reshape from wide-to-long format

Although looking Table 1 our eyes can visually see the way the data is structured, but computers don’t view information the same way, we need to organize the data that computer can understand. If we look Table 1, we can see that the year variable is actually split over eight columns, so we’d say this is currently in wide format (?@fig-fide-long). There may be times when you want to have data in wide format, but often with code it is more efficient to convert to long format by gathering together observations for a variable that is currently split into multiple columns. Schematically, converting from wide to long format using pivot_longer() looks like this:

Conceptual diagram shwoing wide and long formats{# fig-fide-long}

We’ll use pivot_longer() function from dityr package (Wickham and Henry (2018)) to gather data from all years in inverts (columns 2012 to 2019) into two columns: one called year, which contains the year, another called weight containing the weight of each fish species landed and assign a name of new dataset as fao.capture.long.

The new data frame will be stored as inverts_long:

fao.capture.long = fao.capture.wt |>
  pivot_longer(
    cols = '2012':'2019',
    values_to = "wt_mt",
    names_to = "year"
    )

The outcome is the new long-format fao.capture.long data frame:

fao.capture.long |> 
  sample_n(size = 15) |> 
  flextable::flextable() |> 
  flextable::autofit()

ASFIS species

year

wt_mt

Gulf menhaden-Brevoortia patronus

2014

385 022

Marine fishes nei-Osteichthyes

2017

124 895

Stromboid conchs nei-Strombus spp

2015

34 388

Gulf menhaden-Brevoortia patronus

2017

461 189

Blue crab-Callinectes sapidus

2018

56 689

Atlantic seabob-Xiphopenaeus kroyeri

2012

36 073

American cupped oyster-Crassostrea virginica

2014

57 930

Mexican four-eyed octopus-Octopus maya

2017

26 020

American cupped oyster-Crassostrea virginica

2015

55 041

Round sardinella-Sardinella aurita

2012

42 344

Marine fishes nei-Osteichthyes

2013

112 095

Gulf menhaden-Brevoortia patronus

2012

500 162

Caribbean spiny lobster-Panulirus argus

2014

28 298

American cupped oyster-Crassostrea virginica

2013

53 866

Yellowfin tuna-Thunnus albacares

2018

30 966

let’s use a glimpse function from dplyr package to look at the internal structure of the of the fao.capture.long format, we just created. One thing that isn’t obvious at first (but would become obvious if you continued working with this data) is that since those year numbers were initially column names (characters), when they are stacked into the year column, their class wasn’t auto-updated to numeric. Similar way, the weight are separated with a space and are treated as character instead of being numeric.

fao.capture.long |>
  glimpse()
Rows: 88
Columns: 3
$ `ASFIS species` <chr> "Marine fishes nei-Osteichthyes", "Marine fishes nei-O…
$ year            <chr> "2012", "2013", "2014", "2015", "2016", "2017", "2018"…
$ wt_mt           <chr> "137 683", "112 095", "110 535", "118 078", "118 204",…

That’s a good thing! We don’t want R to update classes of our data without our instruction. We’ll use dplyr::mutate() in a different way here: to create a new column (that’s how we’ve used mutate() in ?@sec-dplyr that has the same name of an existing column, in order to update and overwrite the existing column. In this case, we’ll mutate() to add a column called year, which contains an as.integer() version of the existing year variable:

fao.capture.long = fao.capture.long |>
  mutate(year = as.integer(year))

Checking the class again, we see that year has been updated to a numeric variable:

fao.capture.long |>
  glimpse()
Rows: 88
Columns: 3
$ `ASFIS species` <chr> "Marine fishes nei-Osteichthyes", "Marine fishes nei-O…
$ year            <int> 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2012, …
$ wt_mt           <chr> "137 683", "112 095", "110 535", "118 078", "118 204",…

6 Reshape from long-to-wide format

In the previous example, we had information spread over multiple columns that we wanted to gather. Sometimes, we’ll have data that we want to spread over multiple columns. For example, imagine that starting from inverts_long we want each species in the common_name column to exist as its own column. In that case, we would be converting from a longer to a wider format, and will use tidyr::pivot_wider(). Specifically for our data, we’ll use pivot_wider() to spread the common_name across multiple columns as follows:

fao.capture.wide = fao.capture.long |>
  pivot_wider(
    names_from = "ASFIS species",
    values_from = "wt_mt"
    )
fao.capture.wide |> 
  flextable::flextable() |> 
  flextable::autofit()

year

Marine fishes nei-Osteichthyes

Gulf menhaden-Brevoortia patronus

Round sardinella-Sardinella aurita

American cupped oyster-Crassostrea virginica

Ark clams nei-Arca spp

Stromboid conchs nei-Strombus spp

Blue crab-Callinectes sapidus

Yellowfin tuna-Thunnus albacares

Caribbean spiny lobster-Panulirus argus

Atlantic seabob-Xiphopenaeus kroyeri

Mexican four-eyed octopus-Octopus maya

2,012

137 683

500 162

42 344

77 900

12 687

36 855

43 997

19 087

31 102

36 073

12 629

2,013

112 095

440 709

46 299

53 866

15 000

36 610

40 570

21 865

28 468

34 243

8 806

2,014

110 535

385 022

50 698

57 930

17 384

30 691

44 361

25 831

28 298

25 203

15 403

2,015

118 078

539 198

79 851

55 041

29 073

34 388

50 781

27 036

31 358

25 322

23 441

2,016

118 204

618 563

123 377

44 313

29 076

28 774

53 616

34487

32 410

30 147

25 722

2,017

124 895

461 189

126 766

78 961

29 970

35 273

54 628

29 310

30 204

31 319

26 020

2,018

109 938

529 231

126 400

49 631

29 970

34 856

56 689

30 966

28 325

31 625

28809

2,019

415 031

336 221

126 400

47 723

29 970

28 308

25 937

24 503

24 311

22 497

20 119

We can see that now each species has its own column (wider format). But also notice that those column headers (since they have spaces) might not be in the most coder-friendly format…

7 Clean variable (column) names

The janitor package by Sam Firke is a great collection of functions for some quick data cleaning, like:

  • janitor::clean_names(): update column headers to a case of your choosing
  • janitor::get_dupes(): see all rows that are duplicates within variables you choose
  • janitor::remove_empty(): remove empty rows and/or columns
  • janitor::adorn_*(): jazz up tables

Here, we’ll use janitor::clean_names() to convert all of our column headers to a more convenient case - the default is lower_snake_case, which means all spaces and symbols are replaced with an underscore (or a word describing the symbol), all characters are lowercase, and a few other nice adjustments. For example, janitor::clean_names() update ASFIS species variable name into much nicer asfis_species.

fao.capture.long = fao.capture.long |>
  janitor::clean_names()

fao.capture.long |> 
  sample_n(size = 10) |> 
  flextable::flextable() |> 
  flextable::autofit()

asfis_species

year

wt_mt

Ark clams nei-Arca spp

2,015

29 073

Stromboid conchs nei-Strombus spp

2,015

34 388

Ark clams nei-Arca spp

2,012

12 687

Round sardinella-Sardinella aurita

2,012

42 344

Yellowfin tuna-Thunnus albacares

2,017

29 310

Marine fishes nei-Osteichthyes

2,019

415 031

Blue crab-Callinectes sapidus

2,016

53 616

Atlantic seabob-Xiphopenaeus kroyeri

2,013

34 243

Gulf menhaden-Brevoortia patronus

2,015

539 198

Marine fishes nei-Osteichthyes

2,012

137 683

8 Separate merged variables

Sometimes we’ll want to separate contents of a single column into multiple columns, or combine entries from different columns into a single column. For example, the fao.capture.long has a asfis_species variable, which combine an english and scientific name of the particular fish group. That variable should be separated into two variables, one variable we will call english_name and the other species. tidyr package has a function called separate that is dedicated to divide variables that are combined. For our case, a asfis_species variable can be separate with a hyphen, which was used to distinguish between an english and species names. The separate() function accepts arguments for the name of the variable to separate. You also need to specify the names of the variable to separate into, and an optional separator.

fao.capture.long.sep = fao.capture.long %>% 
  separate(
    col = asfis_species, 
    into = c("english_name", "species"),
    sep = "-"
    )
fao.capture.long.sep|> 
  sample_n(size = 10) |> 
  flextable::flextable() |> 
  flextable::autofit()

english_name

species

year

wt_mt

Gulf menhaden

Brevoortia patronus

2,015

539 198

Ark clams nei

Arca spp

2,018

29 970

Mexican four

eyed octopus

2,013

8 806

American cupped oyster

Crassostrea virginica

2,016

44 313

Yellowfin tuna

Thunnus albacares

2,018

30 966

Ark clams nei

Arca spp

2,013

15 000

Ark clams nei

Arca spp

2,012

12 687

Marine fishes nei

Osteichthyes

2,012

137 683

Yellowfin tuna

Thunnus albacares

2,013

21 865

Yellowfin tuna

Thunnus albacares

2,012

19 087

9 Merge separate variables

The unite() function is the exact opposite of separate() in that it combines multiple columns into a single column. While not used nearly as often as separate() , there may be times when you need the functionality provided by unite(). For example, we use tidyr::unite() to combine the variable english_name and english_name to form common_species, and separate them with a hyphen - symbol between the two variables and assign the output as fao.capture.long.merge.

fao.capture.long.merge = fao.capture.long.sep |>
  unite(
    col = "common_species", # What to name the new united column
    c(english_name,  species), # The columns we'll unite (site, year)
    sep = "-" # How to separate the things we're uniting
    )

fao.capture.long.merge|> 
  sample_n(size = 10) |> 
  flextable::flextable() |> 
  flextable::autofit()

common_species

year

wt_mt

Gulf menhaden-Brevoortia patronus

2,013

440 709

Atlantic seabob-Xiphopenaeus kroyeri

2,016

30 147

Ark clams nei-Arca spp

2,014

17 384

Blue crab-Callinectes sapidus

2,018

56 689

Caribbean spiny lobster-Panulirus argus

2,017

30 204

Marine fishes nei-Osteichthyes

2,016

118 204

Gulf menhaden-Brevoortia patronus

2,015

539 198

Stromboid conchs nei-Strombus spp

2,014

30 691

Blue crab-Callinectes sapidus

2,016

53 616

Yellowfin tuna-Thunnus albacares

2,017

29 310

10 Replace a pattern

One thing we noticed in the dataset we imported is that the weight are not in numeric but rather as character. The reason for that is simple because for eye to read long numbers easily, we often separate a thousand with comma or space, which is the case for the data we loaded. though the eye can easily see and distinguish the numbers, that is hard computers can understand, as a key issue in tidying is that a cell should contain a single value. Therefore, since this weigh is in character format, there is nothing we can do to get any statistic values unless we convert them into numeric format. That’s simple and straight forward as R has a as.numeric function that convert character into numeric format values. Unfortunate, we do not get the numeric as we expected but instead we are given an NA values, which simply indicate that process isn’t fit for that value of the data type.

fao.capture.long |>
  mutate(wt_mt = as.numeric(wt_mt))|> 
  sample_n(size = 10) |> 
  flextable::flextable() |> 
  flextable::autofit()

asfis_species

year

wt_mt

Ark clams nei-Arca spp

2,016

Stromboid conchs nei-Strombus spp

2,016

Yellowfin tuna-Thunnus albacares

2,015

Marine fishes nei-Osteichthyes

2,016

Caribbean spiny lobster-Panulirus argus

2,019

Round sardinella-Sardinella aurita

2,017

Mexican four-eyed octopus-Octopus maya

2,018

28,809

Marine fishes nei-Osteichthyes

2,018

Round sardinella-Sardinella aurita

2,019

Stromboid conchs nei-Strombus spp

2,014

When you face that situation, you must keen look at the internal structure of the dataset to understand what lies behind the problem you encounter. Using a glimpse function, we notice the wt_mt variable is character format with number values, but why does the values fail to convert to numeric the answer is simple–presence of space to separate the values into thousands. We must get rid of the space before we convert the character into numeric values.

fao.capture.long |>
  glimpse()
Rows: 88
Columns: 3
$ asfis_species <chr> "Marine fishes nei-Osteichthyes", "Marine fishes nei-Ost…
$ year          <int> 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2012, 20…
$ wt_mt         <chr> "137 683", "112 095", "110 535", "118 078", "118 204", "…

A stringr package has several function that works well with strings values in R. Among these function is str_remove(), which remove a specified string pattern. In our case we remove a space in wt_mt variable and assign the variable name as wt_mt_free

fao.capture.long.free = fao.capture.long |>
  mutate(
    wt_mt_free = str_remove(string = wt_mt, 
                             pattern = " ")
    )

A glimpse of the fao.capture.long.free indicate that a wt_mt_free variable we just created has no spaces to separate values into thousands.

Rows: 88
Columns: 4
$ asfis_species <chr> "Marine fishes nei-Osteichthyes", "Marine fishes nei-Ost…
$ year          <int> 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2012, 20…
$ wt_mt         <chr> "137 683", "112 095", "110 535", "118 078", "118 204", "…
$ wt_mt_free    <chr> "137683", "112095", "110535", "118078", "118204", "12489…

Therefore, once we have removed space in the values, the variable is in the format the R will understand that this is values and can easily convert from character into numeric with as.numeric function;

fao.capture.long.free = fao.capture.long.free |>
  mutate(wt_mt = as.numeric(wt_mt_free)) |>
  select(-wt_mt_free)

fao.capture.long.free|> 
  sample_n(size = 10) |> 
  flextable::flextable() |> 
  flextable::autofit()

asfis_species

year

wt_mt

Mexican four-eyed octopus-Octopus maya

2,017

26,020

Ark clams nei-Arca spp

2,012

12,687

Atlantic seabob-Xiphopenaeus kroyeri

2,019

22,497

American cupped oyster-Crassostrea virginica

2,018

49,631

Atlantic seabob-Xiphopenaeus kroyeri

2,016

30,147

Caribbean spiny lobster-Panulirus argus

2,014

28,298

Caribbean spiny lobster-Panulirus argus

2,018

28,325

Caribbean spiny lobster-Panulirus argus

2,017

30,204

Yellowfin tuna-Thunnus albacares

2,019

24,503

Marine fishes nei-Osteichthyes

2,018

109,938

11 summary

The ultimate of tidy dataset is in the analysis and plotting. When the dataset is in the right format, it makes advanced analysis easy and smooth while exploring the data and expose the inherited pattern is also much easier. This is because the tools in many programming languages and database require tidy structure. For example, Figure 2 is the result of tidy dataset that allows to compute the mean and standard deviation and then use the metric computed to make a plot. We will cover ploting in much detail in ?@sec-ggplot.

##| echo: true
#| 

fao.capture.long.free %>% 
  separate(col = asfis_species, 
           into = c("name", "species"), 
           sep = "-") %>% 
  group_by(species) %>% 
  summarise(
    mean_wt = mean(wt_mt, na.rm = TRUE),
    std = sd(wt_mt, na.rm = TRUE)
    ) %>% 
  arrange(mean_wt) %>% 
  ggplot(aes(x = reorder(species, mean_wt), y = mean_wt))+
  geom_point(size = 3)+
  geom_errorbar(aes(ymin = mean_wt+std, 
                    ymax = mean_wt - std), 
                width = .5)+
  scale_y_continuous(name = "Landed Weight (MT)", 
                     position = "right", 
                     breaks = scales::pretty_breaks(n = 5),
                     labels = scales::label_number(big.mark = ","))+
  scale_x_discrete(name = "Species")+
  coord_flip() +
  theme_bw(base_size = 12)+
  theme(panel.grid = element_blank(), axis.text.y = element_text(face = "italic"))
Figure 2: Mean weight of selected species from Capture fisheries

12 References

Wickham, H., 2016. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York.
Wickham, H., François, R., Henry, L., Müller, K., 2019. Dplyr: A grammar of data manipulation.
Wickham, H., Henry, L., 2018. Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions.

Citation

BibTeX citation:
@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Tidying {Data} Frame},
  date = {2024-04-29},
  url = {https://lugoga.github.io/kitaa/posts/tidying/},
  langid = {en}
}
For attribution, please cite this work as:
Semba, M., 2024. Tidying Data frame [WWW Document]. URL https://lugoga.github.io/kitaa/posts/tidying/