Data cleaning, merging, and appending

visualization
code
analysis
Author
Affiliation
Published

May 1, 2024

1 Introduction

In reality, data cleaning is not so much its own step in the data wrangling process as it is a constant activity that accompanies every other step, both because most data is not clean when we encounter it, and because how a data set (or part of it) needs to be “cleaned” is often revealed progressively as we work. At a high level, clean data might be summarized as being free from errors or typos – such as mismatched units, multiple spellings of the same word or term, and fields that are not well-separated – and missing or impossible values.

While many of these are at least somewhat straightforward to recognize (though not always to correct), however, deeper data problems may still persist. Measurement changes, calculation errors and other oversights – especially in system-generated data – often don’t reveal themselves until some level of analysis has been done and the data has been “reality-checked” with folks who significant expertise and/or first-hand experience with the subject.

flowchart LR
  A[Main variable types] --> B{Catrgorical}
  A[Main variable types] --> C{Numeric}
  B{Catrgorical} --> D[ordinal]
  B{Catrgorical} --> E[non-ordinal]
  C{Numeric} --> F[continuous]
  C{Numeric} --> G[discrete]

The iterative nature of data cleaning is an example of why data wrangling is a cycle rather than a linear series of steps: as your work reveals more about the data and your understanding of its relationship to the world deepens, you may find that you need to revisit earlier work that you’ve done and repeat or adjust certain aspects of the data

1.1 Importing data

Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you’ll need to have your data correctly formatted and saved in a file format that R is able to recognize. Fortunately for us, R is able to recognize a wide variety of file formats, although in reality you’ll probably end up only using two or three regularly—comma separated, excel spreadsheet and tab-delimited.

The easiest method of creating a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a comma separated (.csv), excel spreadsheet (.xls or .xlsx) and tab-delimited (.txt). We then read the file into our session, for this mode, we are going to use the LFQ_sample_1.xls, which is an Excel spreadsheet file. To load into the R session, we use a read_excel function from readxl package (Wickham and Bryan, 2018). Before we import the data, we need to load the packages that we will use their functions in this chapter

require(tidyverse)
require(magrittr)
lfq = readxl::read_excel(path = "../data/LFQ_sample_1.xls",
                         sheet = 1, 
                         skip = 0)

There are a few things to note about the above command. First, the file path and the filename (including the file extension) needs to be enclosed in either single or double quotes (i.e. the data/flower.txt bit) as the read_excel() function expects this to be a character string. If your working directory is already set to the directory which contains the file, you don’t need to include the entire file path just the filename.

In the example above, the file path is separated with a single forward slash /. This will work regardless of the operating system you are using. The argument sheet = 1 specify the function to pick the data from first sheet as spreadsheet store data in multiple sheet. The argument skip = 0 specify the function not to skip any row in the datase. This used to specify when the data has metadata information at the top.

After importing our data into R it doesn’t appear that R has done much, at least nothing appears in the R Console! To see the contents of the data frame we need to tell R to do so using a print function.

lfq %>% print()
# A tibble: 779 × 6
   Date                Species `Size (cm)` `Size Class` `Gear type` Landing_site
   <dttm>              <chr>         <dbl>        <dbl> <chr>       <chr>       
 1 2019-05-30 00:00:00 Siganu…        16.5           18 Speargun    Pwani       
 2 2019-05-30 00:00:00 Siganu…        12             12 Speargun    Pwani       
 3 2019-05-30 00:00:00 Siganu…        15             15 Speargun    Pwani       
 4 2019-05-30 00:00:00 Siganu…        33             33 Speargun    Pwani       
 5 2019-05-30 00:00:00 Siganu…        11.2           12 hook and l… Pwani       
 6 2019-05-30 00:00:00 Siganu…        17             18 hook and l… Pwani       
 7 2019-05-30 00:00:00 Siganu…        19             21 Gill net    Pwani       
 8 2019-05-30 00:00:00 Lethri…        10             12 Traps       Pwani       
 9 2019-05-30 00:00:00 Lethri…        11             12 Traps       Pwani       
10 2019-05-30 00:00:00 Lethri…         9             12 Traps       Pwani       
# ℹ 769 more rows

The data frame appears in the console and that prove to us that we successfully imported the data into R. But printing the imported data into console is only useful for small dataset and for large dataset that makes the console messy and there is nothing you can glean by simply printing the dataset in the console. The other common practice often used is looking on the internal structure of the dataset using str() function

lfq %>% 
  str()
tibble [779 × 6] (S3: tbl_df/tbl/data.frame)
 $ Date        : POSIXct[1:779], format: "2019-05-30" "2019-05-30" ...
 $ Species     : chr [1:779] "Siganus sutor" "Siganus sutor" "Siganus sutor" "Siganus sutor" ...
 $ Size (cm)   : num [1:779] 16.5 12 15 33 11.2 17 19 10 11 9 ...
 $ Size Class  : num [1:779] 18 12 15 33 12 18 21 12 12 12 ...
 $ Gear type   : chr [1:779] "Speargun" "Speargun" "Speargun" "Speargun" ...
 $ Landing_site: chr [1:779] "Pwani" "Pwani" "Pwani" "Pwani" ...

and a better solution is to use a tidyverse approach of glimpse function

lfq %>% 
  glimpse()
Rows: 779
Columns: 6
$ Date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ Species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ `Size (cm)`  <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ `Size Class` <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ `Gear type`  <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ Landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

The output in the console tell us that lfq dataset we just imported has six columns (variables) and 779 rows (records). Each of the variable are listed along with their data types and few records of the data The first row of the data contains the variable (column) names.

1.2 Common import frustrations

It’s quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is

lfq = readxl::read_excel(path = "assets/fao_paul_dataset/LFQ_sample_1",
                         sheet = 1, 
                         skip = 0)

This error message is telling you that R cannot find the file you are trying to import. Several reasons can lead to that information error.The first is that you’ve made a mistake in the spelling of either the filename or file path. Another common mistake is that you have forgotten to include the file extension in the filename (i.e. .txt). Lastly, the file is not where you say it is or you’ve used an incorrect file path. Using RStudio Projects and having a logical directory structure goes a long way to avoiding these types of errors.

1.3 clean variable names

lfq %>% 
  names()
[1] "Date"         "Species"      "Size (cm)"    "Size Class"   "Gear type"   
[6] "Landing_site"
lfq.clean = lfq %>% 
  janitor::clean_names() %>% 
  janitor::remove_empty()

1.4 check the structure

lfq.clean %>% 
  glimpse()
Rows: 779
Columns: 6
$ date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm      <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class   <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type    <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

1.5 creating variables

You notice that the date is datetime and date but the display is date. Thus, we need to create three variables that represent day, month, and year. We can extract these variables from date variables, We can use the day(), month(), and year() functions from lubridate package (Grolemund and Wickham, 2011);

lfq.time = lfq.clean %>% 
  mutate(
    day = lubridate::day(date),
    month = lubridate::month(date),
    year = lubridate::year(date)
    ) %>% 
  relocate(c(landing_site, day, month, year), .before = date)

lfq.time |> 
  sample_n(30) |> 
  select(-c(day, month, year)) |> 
  flextable::flextable() |> 
  flextable::autofit()

landing_site

date

species

size_cm

size_class

gear_type

Pwani

2019-07-12 00:00:00

Anampses spp.

14.0

15

Gill net

Bahari

2020-04-08 00:00:00

Leptoscarus vaigiensis

19.8

21

Speargun

Pwani

2019-07-11 00:00:00

Leptoscarus vaigiensis

25.5

27

Speargun

Pwani

2019-07-11 00:00:00

Lutjanus fulviflamma

27.5

30

hook and line

Bahari

2020-07-25 00:00:00

Siganus sutor

13.8

15

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

16.0

18

Speargun

Pwani

2019-05-30 00:00:00

Leptoscarus vaigiensis

15.6

18

Speargun

Pwani

2019-07-12 00:00:00

Siganus sutor

17.5

18

Traps

Pwani

2019-06-25 00:00:00

Siganus sutor

18.4

21

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

15.5

18

Beach seine

Pwani

2019-05-30 00:00:00

Lethrinus mahsena

11.1

12

hook and line

Pwani

2019-07-12 00:00:00

Siganus sutor

27.5

30

Gill net

Bahari

2020-07-25 00:00:00

Siganus sutor

15.5

18

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

16.0

18

Beach seine

Bahari

2020-07-25 00:00:00

Siganus sutor

16.0

18

Speargun

Bahari

2020-04-08 00:00:00

Siganus sutor

12.0

12

Traps

Bahari

2020-07-25 00:00:00

Siganus sutor

14.2

15

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

17.0

18

Beach seine

Pwani

2019-07-11 00:00:00

Calotomus carolinus

14.3

15

Speargun

Pwani

2019-05-30 00:00:00

Leptoscarus vaigiensis

16.5

18

Speargun

Pwani

2019-05-30 00:00:00

Leptoscarus vaigiensis

30.0

30

Speargun

Pwani

2019-07-11 00:00:00

Calotomus carolinus

18.0

18

Beach seine

Pwani

2019-05-30 00:00:00

Lethrinus mahsena

12.0

12

hook and line

Bahari

2020-07-25 00:00:00

Siganus sutor

14.3

15

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

16.0

18

Beach seine

Pwani

2019-07-11 00:00:00

Calotomus carolinus

13.5

15

Beach seine

Pwani

2019-07-11 00:00:00

Calotomus carolinus

9.5

12

Beach seine

Pwani

2019-05-30 00:00:00

Lethrinus mahsena

14.0

15

hook and line

Pwani

2019-07-11 00:00:00

Calotomus carolinus

17.5

18

Beach seine

Bahari

2020-07-25 00:00:00

Siganus sutor

15.6

18

Traps

1.6 Dropping and keeping variables

lfq.time %>% 
  select(-date)
# A tibble: 779 × 8
   landing_site   day month  year species           size_cm size_class gear_type
   <chr>        <int> <dbl> <dbl> <chr>               <dbl>      <dbl> <chr>    
 1 Pwani           30     5  2019 Siganus sutor        16.5         18 Speargun 
 2 Pwani           30     5  2019 Siganus sutor        12           12 Speargun 
 3 Pwani           30     5  2019 Siganus sutor        15           15 Speargun 
 4 Pwani           30     5  2019 Siganus sutor        33           33 Speargun 
 5 Pwani           30     5  2019 Siganus sutor        11.2         12 hook and…
 6 Pwani           30     5  2019 Siganus sutor        17           18 hook and…
 7 Pwani           30     5  2019 Siganus sutor        19           21 Gill net 
 8 Pwani           30     5  2019 Lethrinus mahsena    10           12 Traps    
 9 Pwani           30     5  2019 Lethrinus mahsena    11           12 Traps    
10 Pwani           30     5  2019 Lethrinus mahsena     9           12 Traps    
# ℹ 769 more rows

1.7 Missing values

R handles missing values differently than some other programs, including Stata. Missing values will appear as NA (whereas in Stata these will appear as large numeric values). Note, though, that NA is not a string, it is a symbol. If you try to conduct logical tests with NA, you are likely to get errors or NULL.

2 Length frequency

Let’s first look at the length data from the catch, which gives an indication of the size structure and health of the population. Using several case studies, we will filter the dataset to choose certain species and then plot histograms of the length data, which reveal the number of individual of each size class were measured in a sample or catch data.

You have several options for dealing with NA values. - na.omit() or na.exclude() will row-wise delete missing values in your dataset - na.fail() will keep an object only if no missing values are present - na.action= is a common option in many functions, for example in a linear model where you might write model <- lm(y~x, data = data, na.action = na.omit). - is.na allows you to logically test for NA values, for example when subsetting

2.1 Merging and combining data

You may want to draw on data from multiple sources or at differing levels of aggregation. To do this, you first must know what your data look like and know what format you ultimately want them to take. Whatever you do, do not attempt to merge datasets by hand—this leads to mistakes and frustration.

left join bind_row bind_col

2.2 Reshaping data

Reshaping data into different formats is a common task. With rectangular type data (data frames have the same number of rows in each column) there are two main data frame shapes that you will come across: the long format (sometimes called stacked) and the wide format (Wickham and Henry, 2018). An example of a long format data frame is given in Table 1. We can see that each row is a single observation from an individual species and each species can have multiple rows with different values of year and catch.

set.seed(123)

aa = read_csv("../data/tidy/fao_capture.csv")  %>% 
  sample_n(size = 5) %>% 
  separate(col = 'ASFIS species', into = c("english_name", "species"), sep = "-") %>% dplyr::select(1:2, 7:10) 
Table 1: Long format data frame data format

english_name

species

year

catch_mt

Round sardinella

Sardinella aurita

2016

123 377

Round sardinella

Sardinella aurita

2017

126 766

Round sardinella

Sardinella aurita

2018

126 400

Round sardinella

Sardinella aurita

2019

126 400

Mexican four

eyed octopus

2016

25 722

Mexican four

eyed octopus

2017

26 020

Mexican four

eyed octopus

2018

28809

Mexican four

eyed octopus

2019

20 119

Gulf menhaden

Brevoortia patronus

2016

618 563

Gulf menhaden

Brevoortia patronus

2017

461 189

Gulf menhaden

Brevoortia patronus

2018

529 231

Gulf menhaden

Brevoortia patronus

2019

336 221

Stromboid conchs nei

Strombus spp

2016

28 774

Stromboid conchs nei

Strombus spp

2017

35 273

Stromboid conchs nei

Strombus spp

2018

34 856

Stromboid conchs nei

Strombus spp

2019

28 308

Atlantic seabob

Xiphopenaeus kroyeri

2016

30 147

Atlantic seabob

Xiphopenaeus kroyeri

2017

31 319

Atlantic seabob

Xiphopenaeus kroyeri

2018

31 625

Atlantic seabob

Xiphopenaeus kroyeri

2019

22 497

We can also format the same data in the wide format as in Table 2. In this format the year are treated as variables and catches are values within the years.

Table 2: Wide format data frame data format

english_name

species

2016

2017

2018

2019

Round sardinella

Sardinella aurita

123 377

126 766

126 400

126 400

Mexican four

eyed octopus

25 722

26 020

28809

20 119

Gulf menhaden

Brevoortia patronus

618 563

461 189

529 231

336 221

Stromboid conchs nei

Strombus spp

28 774

35 273

34 856

28 308

Atlantic seabob

Xiphopenaeus kroyeri

30 147

31 319

31 625

22 497

Whilst there’s no inherent problem with either of these formats we will sometimes need to convert between the two because some functions will require a specific format for them to work. But for the modern tidyverse function, only the dataset that is long format accepted.

There are many ways to convert between these two formats but we’ll use the pivot_longer() and pivot_wider() functions from the tidyr package (Wickham and Henry, 2018). We will use fao_capture.csv dataset in this session, we can simply load it with read_csv function and specify the path our dataset reside;

capture.wide = read_csv("../data/tidy/fao_capture.csv")
capture.wide  |> 
  flextable::flextable() |> 
  flextable::autofit()

ASFIS species

2012

2013

2014

2015

2016

2017

2018

2019

Marine fishes nei-Osteichthyes

137 683

112 095

110 535

118 078

118 204

124 895

109 938

415 031

Gulf menhaden-Brevoortia patronus

500 162

440 709

385 022

539 198

618 563

461 189

529 231

336 221

Round sardinella-Sardinella aurita

42 344

46 299

50 698

79 851

123 377

126 766

126 400

126 400

American cupped oyster-Crassostrea virginica

77 900

53 866

57 930

55 041

44 313

78 961

49 631

47 723

Ark clams nei-Arca spp

12 687

15 000

17 384

29 073

29 076

29 970

29 970

29 970

Stromboid conchs nei-Strombus spp

36 855

36 610

30 691

34 388

28 774

35 273

34 856

28 308

Blue crab-Callinectes sapidus

43 997

40 570

44 361

50 781

53 616

54 628

56 689

25 937

Yellowfin tuna-Thunnus albacares

19 087

21 865

25 831

27 036

34487

29 310

30 966

24 503

Caribbean spiny lobster-Panulirus argus

31 102

28 468

28 298

31 358

32 410

30 204

28 325

24 311

Atlantic seabob-Xiphopenaeus kroyeri

36 073

34 243

25 203

25 322

30 147

31 319

31 625

22 497

Mexican four-eyed octopus-Octopus maya

12 629

8 806

15 403

23 441

25 722

26 020

28809

20 119

The capture.wideobject we just created is the dataframe but is in wide format. To change from wide to long format data table, we use the pivot_longer function. The first argument specified in pivot_longer is cols = 2:9 are index value of the variable in column 2 to colum 9 of the dataset we want to stack, names_to = "year" argument specify that the stacked variables will all be under a new variable name called year and the values in the stacked variables will all be under one variable called catch

capture.long = capture.wide %>% 
  pivot_longer(cols = 2:9, names_to = "year", values_to = "catch")
capture.long |> 
  FSA::headtail(n = 4) |> 
  flextable::flextable() |> 
  flextable::autofit()
Table 3: Data values arranged in tidy wide format

ASFIS species

year

catch

Marine fishes nei-Osteichthyes

2012

137 683

Marine fishes nei-Osteichthyes

2013

112 095

Marine fishes nei-Osteichthyes

2014

110 535

Marine fishes nei-Osteichthyes

2015

118 078

Mexican four-eyed octopus-Octopus maya

2016

25 722

Mexican four-eyed octopus-Octopus maya

2017

26 020

Mexican four-eyed octopus-Octopus maya

2018

28809

Mexican four-eyed octopus-Octopus maya

2019

20 119

The pivot_wider() function is used to convert from a long format data frame to a wide format data frame. The first argument is names_from = year is the variable of the data frame for which we want spread across and the second argument values_from = catch is the corresponding values will reside in each casted variable.

capture.wide = capture.long %>% 
  pivot_wider(names_from = year, values_from = catch)
Table 4: Data values arranged in tidy wide format

ASFIS species

2012

2013

2014

2015

2016

2017

2018

2019

Marine fishes nei-Osteichthyes

137 683

112 095

110 535

118 078

118 204

124 895

109 938

415 031

Gulf menhaden-Brevoortia patronus

500 162

440 709

385 022

539 198

618 563

461 189

529 231

336 221

Round sardinella-Sardinella aurita

42 344

46 299

50 698

79 851

123 377

126 766

126 400

126 400

American cupped oyster-Crassostrea virginica

77 900

53 866

57 930

55 041

44 313

78 961

49 631

47 723

Ark clams nei-Arca spp

12 687

15 000

17 384

29 073

29 076

29 970

29 970

29 970

Stromboid conchs nei-Strombus spp

36 855

36 610

30 691

34 388

28 774

35 273

34 856

28 308

Blue crab-Callinectes sapidus

43 997

40 570

44 361

50 781

53 616

54 628

56 689

25 937

Yellowfin tuna-Thunnus albacares

19 087

21 865

25 831

27 036

34487

29 310

30 966

24 503

Caribbean spiny lobster-Panulirus argus

31 102

28 468

28 298

31 358

32 410

30 204

28 325

24 311

Atlantic seabob-Xiphopenaeus kroyeri

36 073

34 243

25 203

25 322

30 147

31 319

31 625

22 497

Mexican four-eyed octopus-Octopus maya

12 629

8 806

15 403

23 441

25 722

26 020

28809

20 119

2.3 Saving transformed data

Once you have transformed your data and edited variables, you may want to save your new dataframe as an external data file that your collaborators or other researchers can use. As with reading in data, for saving out data you will need the foreign package. Similarly, you can choose from a number of different data formats to export to as well. Most commonly you will want to save out data as a .csv or a tab-delited text file.

myfile %>% write_csv("my.directory/file_name.csv")

3 Transforming, summarising, and analysing data

The R universe basically builds upon two (seemingly contradictive) approaches: base R and the tidyverse. While these two approaches are often seen as two different philosophies, they can form a symbiosis. We therefore recommend to pick whichever works best for you – or to combine the two.Whereas base R is already implemented in R, using the tidyverse requires users to load new packages. People often find base R unintuitive and hard to read. This is why Hadley Wickham developed and introduced the tidyverse – a more intuitive approach to managing and wrangling data (Wickham and Wickham, 2017).

Code written before 2014 was usually written in base R whereas the tidyverse style is becoming increasingly widespread. Again, which approach you prefer is rather a matter of personal taste than a decision between “right or wrong”. We first familiarize ourselves with the basic logic of the tidyverse style using some examples. For this, we use LFQ_sample_1.xls. since is an Excel format, lets import it in our session;

lfq.sample1 = readxl::read_excel("../data/LFQ_sample_1.xls")

Let’s print the assigned lfq.sample1

lfq.sample1 |> 
  FSA::headtail() |> 
  flextable::flextable() |> 
  flextable::autofit()

Date

Species

Size (cm)

Size Class

Gear type

Landing_site

2019-05-30 00:00:00

Siganus sutor

16.5

18

Speargun

Pwani

2019-05-30 00:00:00

Siganus sutor

12.0

12

Speargun

Pwani

2019-05-30 00:00:00

Siganus sutor

15.0

15

Speargun

Pwani

2020-07-25 00:00:00

Lutjanus fulviflamma

13.0

15

Speargun

Bahari

2020-07-25 00:00:00

Lutjanus fulviflamma

17.0

18

Speargun

Bahari

2020-07-25 00:00:00

Lutjanus fulviflamma

18.5

21

Speargun

Bahari

As you can see from the output generated in your console, the lfq.sample1 data frame object contains length information of several species. The fisheries data collected in this dataset ensured that each record on a separate row and each column represents a variable. The dataset contains six variables (columns) and 779 records (rows). These variables includes;

  • Date a sample was collected. The date is in the YYYY-MM-DD, which is the format that accepted
  • Species the scientific name of the species
  • Size (cm) the length of the fish species measured in centimeter
  • Size Class of the individual
  • Gear type used to catch the recorded fish, and
  • Landing-site a landing station where the fish was recorded

This dataset is arranged in appropriate format. The data was entered in spreadsheet with a few basics. These correspond to: what, where, when it only miss who collect the data or the originator of this dataset. The way this dataset is organized is the classical example of how set out our data in a similar fashion. This makes manipulating the data more straightforward and also maintains the multi-purpose nature of the analysis work. Having your data organized is really important!

Warning

Though the size and unit of the dataaset is provide, but it does not tell us much because we do not know whether the measurement is total length (TL), standard length (SL) or fork length (FL). It is imperative to specify in the dataset variable names what was measured and the unit

3.1 Renaming variables

Looking lfq.sample1 dataset presented in ?@tbl-case1, we notice that variable names do not adhere to standard variable names. The tidyverse style guide recommends snake case (words separated by underscores tl_cm) for object and column names. Let’s look back at our column names for a minute. Using the names() function from the and pipe operator %>% from the magrittr package simultaneously serves as the first pipe in a chain of commands and print names in the dataset

lfq.sample1 %>% 
  names()
[1] "Date"         "Species"      "Size (cm)"    "Size Class"   "Gear type"   
[6] "Landing_site"

There are all sorts of capital letters and spaces (e.g. “Gear type” ,” “Size Class” ) as well as symbols (“Size (cm)”). Therefore we mustl convert all of these to snake case for us. We can rename specific variables using the rename function from the dplyr package (Wickham et al., 2019). Since dplyr is part of the tidyverse package (Wickham and Wickham, 2017) that was loaded, we do not need to load it again but rather use its function to rename the variables. We’ll begin by specifying the name of our data frame object lfq.sample1, followed by the = operator so that we can overwrite the existing lfq.sample1 frame object with one that contains the renamed variables.

Next type lfq.sample1 followed by pipe operator %>% and then call arename function, where you parse the new names to their corresponding old names. As the first argument, let’s change the Date variable to date by typing the name of our new variable followed by = and, in quotation marks (” “), the name of the original variable date = "Date". As the second argument, let’s apply the same process as the second argument and change the Species variable to species by typing the name of our new variable followed by = and, in quotation marks (” “), the name of the original variable species = "Species". We follow the same naming procedure to the six variables as the chunk below highlight;

lfq.sample1 = lfq.sample1 %>% 
  rename(
    date = "Date",
    species = "Species",
    size_cm = "Size (cm)",
    size_class = "Size Class",
    gear_type = "Gear type",
    landing_site = "Landing_site"
    )

Using the head function from base R, let’s verify that we renamed the two variables successfully.

lfq.sample1
# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-05-30 00:00:00 Siganus sutor      16.5         18 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      12           12 Speargun  Pwani       
 3 2019-05-30 00:00:00 Siganus sutor      15           15 Speargun  Pwani       
 4 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 5 2019-05-30 00:00:00 Siganus sutor      11.2         12 hook and… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor      17           18 hook and… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor      19           21 Gill net  Pwani       
 8 2019-05-30 00:00:00 Lethrinus mahs…    10           12 Traps     Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…    11           12 Traps     Pwani       
10 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
# ℹ 769 more rows

As you can see, the new names have no spaces and all are written in small letters. Spaces in variables names were replaced with the underscore symbol.

Though the rename function is great tool to rename variables in the dataset, however there are times when you a dataset has hundred of variables names to change and using rename function from dplyr package become tedious. Furtunate , a janitor package has a nifty clean_names() function that clean all the variable names with a single line of code. I highly recommend incorporating it into your workflow.

lfq.sample1 = readxl::read_excel("../data/LFQ_sample_1.xls") %>% 
  janitor::clean_names()

lfq.sample1
# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-05-30 00:00:00 Siganus sutor      16.5         18 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      12           12 Speargun  Pwani       
 3 2019-05-30 00:00:00 Siganus sutor      15           15 Speargun  Pwani       
 4 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 5 2019-05-30 00:00:00 Siganus sutor      11.2         12 hook and… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor      17           18 hook and… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor      19           21 Gill net  Pwani       
 8 2019-05-30 00:00:00 Lethrinus mahs…    10           12 Traps     Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…    11           12 Traps     Pwani       
10 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
# ℹ 769 more rows

As you can see above, the clean_names() function handled every kind of messy variable name that was present in our lfq.sample1 dataset. Everything now looks neat and tidy

3.2 Filtering Data

Our lfq.sample1 contains six variables – date, species, size_cm, size_class, gear_type and landing_site;

lfq.sample1 %>% 
  glimpse()
Rows: 779
Columns: 6
$ date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm      <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class   <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type    <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

Suppose we are only interested in Siganus sutor species only from the dataset. For this purpose, we can use the filter() function from dplyr package (Wickham et al., 2019)

lfq.sample1 %>% 
  filter(species == "Siganus sutor")
# A tibble: 196 × 6
   date                species       size_cm size_class gear_type   landing_site
   <dttm>              <chr>           <dbl>      <dbl> <chr>       <chr>       
 1 2019-05-30 00:00:00 Siganus sutor    16.5         18 Speargun    Pwani       
 2 2019-05-30 00:00:00 Siganus sutor    12           12 Speargun    Pwani       
 3 2019-05-30 00:00:00 Siganus sutor    15           15 Speargun    Pwani       
 4 2019-05-30 00:00:00 Siganus sutor    33           33 Speargun    Pwani       
 5 2019-05-30 00:00:00 Siganus sutor    11.2         12 hook and l… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor    17           18 hook and l… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor    19           21 Gill net    Pwani       
 8 2019-06-25 00:00:00 Siganus sutor    36.5         39 Speargun    Pwani       
 9 2019-06-25 00:00:00 Siganus sutor    11           12 hook and l… Pwani       
10 2019-06-25 00:00:00 Siganus sutor    11           12 hook and l… Pwani       
# ℹ 186 more rows

3.3 Select

Sometimes, you want to select specific variables. For instance, We are interested in three variables – date, landing_site, species, size_cm but not so much in the other variables. A select() function dplyr package allows you to do exactly this.

lfq.sample1 %>% 
  select(date, landing_site, species, size_cm)
# A tibble: 779 × 4
   date                landing_site species           size_cm
   <dttm>              <chr>        <chr>               <dbl>
 1 2019-05-30 00:00:00 Pwani        Siganus sutor        16.5
 2 2019-05-30 00:00:00 Pwani        Siganus sutor        12  
 3 2019-05-30 00:00:00 Pwani        Siganus sutor        15  
 4 2019-05-30 00:00:00 Pwani        Siganus sutor        33  
 5 2019-05-30 00:00:00 Pwani        Siganus sutor        11.2
 6 2019-05-30 00:00:00 Pwani        Siganus sutor        17  
 7 2019-05-30 00:00:00 Pwani        Siganus sutor        19  
 8 2019-05-30 00:00:00 Pwani        Lethrinus mahsena    10  
 9 2019-05-30 00:00:00 Pwani        Lethrinus mahsena    11  
10 2019-05-30 00:00:00 Pwani        Lethrinus mahsena     9  
# ℹ 769 more rows

3.4 Arranging variables

Let’s say we want to know the species with larges and smallest size throughout the entire dataset. This can be done with the arrange() function.

lfq.sample1 %>% 
  arrange(size_cm)
# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-07-12 00:00:00 Lethrinus harak     6            6 Traps     Pwani       
 2 2020-04-08 00:00:00 Lethrinus mahs…     8.3         12 Traps     Bahari      
 3 2020-04-08 00:00:00 Lethrinus mahs…     8.4         12 Traps     Bahari      
 4 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
 5 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
 6 2020-06-15 00:00:00 Siganus sutor       9           12 Traps     Bahari      
 7 2020-06-15 00:00:00 Lethrinus harak     9           12 Traps     Bahari      
 8 2019-05-30 00:00:00 Lethrinus mahs…     9.3         12 hook and… Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…     9.3         12 hook and… Pwani       
10 2020-04-08 00:00:00 Lethrinus mahs…     9.4         12 Traps     Bahari      
# ℹ 769 more rows

As we can see, Lethrinus are the species with the smallest size in the datase. The smallest size in the dataset is 6cm. By default, the arrange() function sorts the data in ascending order. To display the data in descending order, we add desc().

lfq.sample1 %>% 
  arrange(desc(size_cm))
# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-06-25 00:00:00 Siganus sutor      36.5         39 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 3 2020-04-08 00:00:00 Calotomus caro…    30.4         33 Traps     Bahari      
 4 2020-06-15 00:00:00 Lethrinus mahs…    30.2         33 hook and… Bahari      
 5 2019-05-30 00:00:00 Leptoscarus va…    30           30 Speargun  Pwani       
 6 2020-04-08 00:00:00 Anampses spp.      30           30 Speargun  Bahari      
 7 2019-07-11 00:00:00 Lutjanus fulvi…    27.5         30 hook and… Pwani       
 8 2019-07-12 00:00:00 Siganus sutor      27.5         30 Gill net  Pwani       
 9 2019-07-11 00:00:00 Calotomus caro…    27           27 Beach se… Pwani       
10 2019-07-12 00:00:00 Siganus sutor      27           27 Gill net  Pwani       
# ℹ 769 more rows

As we can see, Siganus sutor are the species with the larges size in the dataset. The highest size in the dataset is 36.50cm.

3.5 Extracting unique observations

Which unique species are included in the dataset? To get this information, we use the distinct() function in dplyr.

lfq.sample1 %>% 
  distinct(species)
# A tibble: 9 × 1
  species               
  <chr>                 
1 Siganus sutor         
2 Lethrinus mahsena     
3 Leptoscarus vaigiensis
4 Calotomus carolinus   
5 Lutjanus fulviflamma  
6 Lethrinus harak       
7 Anampses spp.         
8 Acanthurus chronixis  
9 Cheilinus chlorourus  

As we see, there are nine species in the dataset arranged in the species variable of the dataframe. However, We might need these species in vector instead of data frame. We use the function pull for that, which convert from dataframe to vector dataset.

lfq.sample1 %>% 
  distinct(species) %>% 
  pull()
[1] "Siganus sutor"          "Lethrinus mahsena"      "Leptoscarus vaigiensis"
[4] "Calotomus carolinus"    "Lutjanus fulviflamma"   "Lethrinus harak"       
[7] "Anampses spp."          "Acanthurus chronixis"   "Cheilinus chlorourus"  

3.6 Creating new variables

The mutate() command allows you to generate a new variable from existing. Let’s say you want to generate a day, month and year from date variables. The lubridate packaged (Grolemund and Wickham, 2011) has functions dedicated for dealing with dates, some of these functions include day(), month(), and year() that are used to extract values.

lfq.sample1 %>% 
  select(date, species, size_cm) %>% 
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    day = lubridate::day(date)
    )
# A tibble: 779 × 6
   date                species           size_cm  year month   day
   <dttm>              <chr>               <dbl> <dbl> <dbl> <int>
 1 2019-05-30 00:00:00 Siganus sutor        16.5  2019     5    30
 2 2019-05-30 00:00:00 Siganus sutor        12    2019     5    30
 3 2019-05-30 00:00:00 Siganus sutor        15    2019     5    30
 4 2019-05-30 00:00:00 Siganus sutor        33    2019     5    30
 5 2019-05-30 00:00:00 Siganus sutor        11.2  2019     5    30
 6 2019-05-30 00:00:00 Siganus sutor        17    2019     5    30
 7 2019-05-30 00:00:00 Siganus sutor        19    2019     5    30
 8 2019-05-30 00:00:00 Lethrinus mahsena    10    2019     5    30
 9 2019-05-30 00:00:00 Lethrinus mahsena    11    2019     5    30
10 2019-05-30 00:00:00 Lethrinus mahsena     9    2019     5    30
# ℹ 769 more rows

3.7 Group-wise operations

Suppose we are interested to know the sample size for each species in the dataset along with the mean, median, and standard deviation of the body size. That’s is not possible with single function in the tidyverse but rather a combination of functions. We use a combination of group_by() (to group our results by month), and parse the specific function we want to compute in the summarise() function.

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    ) 
# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Acanthurus chronixis      21    13.7      12.5  2.65 
2 Anampses spp.             17    17.5      16.5  4.43 
3 Calotomus carolinus      176    16.4      16.5  3.05 
4 Cheilinus chlorourus       4    13.9      14    0.810
5 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
6 Lethrinus harak           37    12.6      12    2.72 
7 Lethrinus mahsena        135    11.9      11.9  2.39 
8 Lutjanus fulviflamma      47    16.7      16.9  2.67 
9 Siganus sutor            196    18.2      17.5  4.66 

The computed variables are grouped after performing transformations. In this case, you can pipe another command with ungroup() to ungroup your result.

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    ) %>% 
  ungroup()
# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Acanthurus chronixis      21    13.7      12.5  2.65 
2 Anampses spp.             17    17.5      16.5  4.43 
3 Calotomus carolinus      176    16.4      16.5  3.05 
4 Cheilinus chlorourus       4    13.9      14    0.810
5 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
6 Lethrinus harak           37    12.6      12    2.72 
7 Lethrinus mahsena        135    11.9      11.9  2.39 
8 Lutjanus fulviflamma      47    16.7      16.9  2.67 
9 Siganus sutor            196    18.2      17.5  4.66 

In some cases, you might wish to arrange the sample size from lowest to highest sample size along with the computed values. You just add the arrange function

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    ) %>% 
  arrange(n)
# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Cheilinus chlorourus       4    13.9      14    0.810
2 Anampses spp.             17    17.5      16.5  4.43 
3 Acanthurus chronixis      21    13.7      12.5  2.65 
4 Lethrinus harak           37    12.6      12    2.72 
5 Lutjanus fulviflamma      47    16.7      16.9  2.67 
6 Lethrinus mahsena        135    11.9      11.9  2.39 
7 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
8 Calotomus carolinus      176    16.4      16.5  3.05 
9 Siganus sutor            196    18.2      17.5  4.66 

You notice that Cheilinus chlorourus has a only four observation where as Siganus sutor has 196 records in the dataset. You can also explore the summary statistics to see whether the mean and median values are close (indicator of normal distribution) or differ (indication of skewness)

4 References

Grolemund, G., Wickham, H., 2011. Dates and times made easy with lubridate. Journal of Statistical Software 40, 1–25.
Wickham, H., Bryan, J., 2018. Readxl: Read excel files.
Wickham, H., François, R., Henry, L., Müller, K., 2019. Dplyr: A grammar of data manipulation.
Wickham, H., Henry, L., 2018. Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions.
Wickham, H., Wickham, M.H., 2017. Tidyverse: Easily install and load the ’tidyverse’.

Citation

BibTeX citation:
@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Data Cleaning, Merging, and Appending},
  date = {2024-05-01},
  url = {https://lugoga.github.io/kitaa/posts/manipulate/},
  langid = {en}
}
For attribution, please cite this work as:
Semba, M., 2024. Data cleaning, merging, and appending [WWW Document]. URL https://lugoga.github.io/kitaa/posts/manipulate/