DATIKA - Data cleaning, merging, and appending

1 Introduction

In reality, data cleaning is not so much its own step in the data wrangling process as it is a constant activity that accompanies every other step, both because most data is not clean when we encounter it, and because how a data set (or part of it) needs to be “cleaned” is often revealed progressively as we work. At a high level, clean data might be summarized as being free from errors or typos – such as mismatched units, multiple spellings of the same word or term, and fields that are not well-separated – and missing or impossible values.

While many of these are at least somewhat straightforward to recognize (though not always to correct), however, deeper data problems may still persist. Measurement changes, calculation errors and other oversights – especially in system-generated data – often don’t reveal themselves until some level of analysis has been done and the data has been “reality-checked” with folks who significant expertise and/or first-hand experience with the subject.

flowchart LR
  A[Main variable types] --> B{Catrgorical}
  A[Main variable types] --> C{Numeric}
  B{Catrgorical} --> D[ordinal]
  B{Catrgorical} --> E[non-ordinal]
  C{Numeric} --> F[continuous]
  C{Numeric} --> G[discrete]

The iterative nature of data cleaning is an example of why data wrangling is a cycle rather than a linear series of steps: as your work reveals more about the data and your understanding of its relationship to the world deepens, you may find that you need to revisit earlier work that you’ve done and repeat or adjust certain aspects of the data

1.1 Importing data

Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you’ll need to have your data correctly formatted and saved in a file format that R is able to recognize. Fortunately for us, R is able to recognize a wide variety of file formats, although in reality you’ll probably end up only using two or three regularly—comma separated, excel spreadsheet and tab-delimited.

The easiest method of creating a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a comma separated (.csv), excel spreadsheet (.xls or .xlsx) and tab-delimited (.txt). We then read the file into our session, for this mode, we are going to use the LFQ_sample_1.xls, which is an Excel spreadsheet file. To load into the R session, we use a read_excel function from readxl package (Wickham and Bryan, 2018). Before we import the data, we need to load the packages that we will use their functions in this chapter

require(tidyverse)
require(magrittr)

lfq = readxl::read_excel(path = "../data/LFQ_sample_1.xls",
                         sheet = 1, 
                         skip = 0)

There are a few things to note about the above command. First, the file path and the filename (including the file extension) needs to be enclosed in either single or double quotes (i.e. the data/flower.txt bit) as the read_excel() function expects this to be a character string. If your working directory is already set to the directory which contains the file, you don’t need to include the entire file path just the filename.

In the example above, the file path is separated with a single forward slash /. This will work regardless of the operating system you are using. The argument sheet = 1 specify the function to pick the data from first sheet as spreadsheet store data in multiple sheet. The argument skip = 0 specify the function not to skip any row in the datase. This used to specify when the data has metadata information at the top.

After importing our data into R it doesn’t appear that R has done much, at least nothing appears in the R Console! To see the contents of the data frame we need to tell R to do so using a print function.

lfq %>% print()

# A tibble: 779 × 6
   Date                Species `Size (cm)` `Size Class` `Gear type` Landing_site
   <dttm>              <chr>         <dbl>        <dbl> <chr>       <chr>       
 1 2019-05-30 00:00:00 Siganu…        16.5           18 Speargun    Pwani       
 2 2019-05-30 00:00:00 Siganu…        12             12 Speargun    Pwani       
 3 2019-05-30 00:00:00 Siganu…        15             15 Speargun    Pwani       
 4 2019-05-30 00:00:00 Siganu…        33             33 Speargun    Pwani       
 5 2019-05-30 00:00:00 Siganu…        11.2           12 hook and l… Pwani       
 6 2019-05-30 00:00:00 Siganu…        17             18 hook and l… Pwani       
 7 2019-05-30 00:00:00 Siganu…        19             21 Gill net    Pwani       
 8 2019-05-30 00:00:00 Lethri…        10             12 Traps       Pwani       
 9 2019-05-30 00:00:00 Lethri…        11             12 Traps       Pwani       
10 2019-05-30 00:00:00 Lethri…         9             12 Traps       Pwani       
# ℹ 769 more rows

The data frame appears in the console and that prove to us that we successfully imported the data into R. But printing the imported data into console is only useful for small dataset and for large dataset that makes the console messy and there is nothing you can glean by simply printing the dataset in the console. The other common practice often used is looking on the internal structure of the dataset using str() function

lfq %>% 
  str()

tibble [779 × 6] (S3: tbl_df/tbl/data.frame)
 $ Date        : POSIXct[1:779], format: "2019-05-30" "2019-05-30" ...
 $ Species     : chr [1:779] "Siganus sutor" "Siganus sutor" "Siganus sutor" "Siganus sutor" ...
 $ Size (cm)   : num [1:779] 16.5 12 15 33 11.2 17 19 10 11 9 ...
 $ Size Class  : num [1:779] 18 12 15 33 12 18 21 12 12 12 ...
 $ Gear type   : chr [1:779] "Speargun" "Speargun" "Speargun" "Speargun" ...
 $ Landing_site: chr [1:779] "Pwani" "Pwani" "Pwani" "Pwani" ...

and a better solution is to use a tidyverse approach of glimpse function

lfq %>% 
  glimpse()

Rows: 779
Columns: 6
$ Date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ Species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ `Size (cm)`  <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ `Size Class` <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ `Gear type`  <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ Landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

The output in the console tell us that lfq dataset we just imported has six columns (variables) and 779 rows (records). Each of the variable are listed along with their data types and few records of the data The first row of the data contains the variable (column) names.

1.2 Common import frustrations

It’s quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is

lfq = readxl::read_excel(path = "assets/fao_paul_dataset/LFQ_sample_1",
                         sheet = 1, 
                         skip = 0)

This error message is telling you that R cannot find the file you are trying to import. Several reasons can lead to that information error.The first is that you’ve made a mistake in the spelling of either the filename or file path. Another common mistake is that you have forgotten to include the file extension in the filename (i.e. .txt). Lastly, the file is not where you say it is or you’ve used an incorrect file path. Using RStudio Projects and having a logical directory structure goes a long way to avoiding these types of errors.

1.3 clean variable names

lfq %>% 
  names()

[1] "Date"         "Species"      "Size (cm)"    "Size Class"   "Gear type"   
[6] "Landing_site"

lfq.clean = lfq %>% 
  janitor::clean_names() %>% 
  janitor::remove_empty()

1.4 check the structure

lfq.clean %>% 
  glimpse()

Rows: 779
Columns: 6
$ date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm      <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class   <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type    <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

1.5 creating variables

You notice that the date is datetime and date but the display is date. Thus, we need to create three variables that represent day, month, and year. We can extract these variables from date variables, We can use the day(), month(), and year() functions from lubridate package (Grolemund and Wickham, 2011);

lfq.time = lfq.clean %>% 
  mutate(
    day = lubridate::day(date),
    month = lubridate::month(date),
    year = lubridate::year(date)
    ) %>% 
  relocate(c(landing_site, day, month, year), .before = date)

lfq.time |> 
  sample_n(30) |> 
  select(-c(day, month, year)) |> 
  flextable::flextable() |> 
  flextable::autofit()

landing_site	date	species	size_cm	size_class	gear_type
Pwani	2019-07-12 00:00:00	Anampses spp.	14.0	15	Gill net
Bahari	2020-04-08 00:00:00	Leptoscarus vaigiensis	19.8	21	Speargun
Pwani	2019-07-11 00:00:00	Leptoscarus vaigiensis	25.5	27	Speargun
Pwani	2019-07-11 00:00:00	Lutjanus fulviflamma	27.5	30	hook and line
Bahari	2020-07-25 00:00:00	Siganus sutor	13.8	15	Traps
Pwani	2019-07-11 00:00:00	Calotomus carolinus	16.0	18	Speargun
Pwani	2019-05-30 00:00:00	Leptoscarus vaigiensis	15.6	18	Speargun
Pwani	2019-07-12 00:00:00	Siganus sutor	17.5	18	Traps
Pwani	2019-06-25 00:00:00	Siganus sutor	18.4	21	Traps
Pwani	2019-07-11 00:00:00	Calotomus carolinus	15.5	18	Beach seine
Pwani	2019-05-30 00:00:00	Lethrinus mahsena	11.1	12	hook and line
Pwani	2019-07-12 00:00:00	Siganus sutor	27.5	30	Gill net
Bahari	2020-07-25 00:00:00	Siganus sutor	15.5	18	Traps
Pwani	2019-07-11 00:00:00	Calotomus carolinus	16.0	18	Beach seine
Bahari	2020-07-25 00:00:00	Siganus sutor	16.0	18	Speargun
Bahari	2020-04-08 00:00:00	Siganus sutor	12.0	12	Traps
Bahari	2020-07-25 00:00:00	Siganus sutor	14.2	15	Traps
Pwani	2019-07-11 00:00:00	Calotomus carolinus	17.0	18	Beach seine
Pwani	2019-07-11 00:00:00	Calotomus carolinus	14.3	15	Speargun
Pwani	2019-05-30 00:00:00	Leptoscarus vaigiensis	16.5	18	Speargun
Pwani	2019-05-30 00:00:00	Leptoscarus vaigiensis	30.0	30	Speargun
Pwani	2019-07-11 00:00:00	Calotomus carolinus	18.0	18	Beach seine
Pwani	2019-05-30 00:00:00	Lethrinus mahsena	12.0	12	hook and line
Bahari	2020-07-25 00:00:00	Siganus sutor	14.3	15	Traps
Pwani	2019-07-11 00:00:00	Calotomus carolinus	16.0	18	Beach seine
Pwani	2019-07-11 00:00:00	Calotomus carolinus	13.5	15	Beach seine
Pwani	2019-07-11 00:00:00	Calotomus carolinus	9.5	12	Beach seine
Pwani	2019-05-30 00:00:00	Lethrinus mahsena	14.0	15	hook and line
Pwani	2019-07-11 00:00:00	Calotomus carolinus	17.5	18	Beach seine
Bahari	2020-07-25 00:00:00	Siganus sutor	15.6	18	Traps

1.6 Dropping and keeping variables

lfq.time %>% 
  select(-date)

# A tibble: 779 × 8
   landing_site   day month  year species           size_cm size_class gear_type
   <chr>        <int> <dbl> <dbl> <chr>               <dbl>      <dbl> <chr>    
 1 Pwani           30     5  2019 Siganus sutor        16.5         18 Speargun 
 2 Pwani           30     5  2019 Siganus sutor        12           12 Speargun 
 3 Pwani           30     5  2019 Siganus sutor        15           15 Speargun 
 4 Pwani           30     5  2019 Siganus sutor        33           33 Speargun 
 5 Pwani           30     5  2019 Siganus sutor        11.2         12 hook and…
 6 Pwani           30     5  2019 Siganus sutor        17           18 hook and…
 7 Pwani           30     5  2019 Siganus sutor        19           21 Gill net 
 8 Pwani           30     5  2019 Lethrinus mahsena    10           12 Traps    
 9 Pwani           30     5  2019 Lethrinus mahsena    11           12 Traps    
10 Pwani           30     5  2019 Lethrinus mahsena     9           12 Traps    
# ℹ 769 more rows

1.7 Missing values

R handles missing values differently than some other programs, including Stata. Missing values will appear as NA (whereas in Stata these will appear as large numeric values). Note, though, that NA is not a string, it is a symbol. If you try to conduct logical tests with NA, you are likely to get errors or NULL.

2 Length frequency

Let’s first look at the length data from the catch, which gives an indication of the size structure and health of the population. Using several case studies, we will filter the dataset to choose certain species and then plot histograms of the length data, which reveal the number of individual of each size class were measured in a sample or catch data.

You have several options for dealing with NA values. - na.omit() or na.exclude() will row-wise delete missing values in your dataset - na.fail() will keep an object only if no missing values are present - na.action= is a common option in many functions, for example in a linear model where you might write model <- lm(y~x, data = data, na.action = na.omit). - is.na allows you to logically test for NA values, for example when subsetting

2.1 Merging and combining data

You may want to draw on data from multiple sources or at differing levels of aggregation. To do this, you first must know what your data look like and know what format you ultimately want them to take. Whatever you do, do not attempt to merge datasets by hand—this leads to mistakes and frustration.

left join bind_row bind_col

2.2 Reshaping data

Reshaping data into different formats is a common task. With rectangular type data (data frames have the same number of rows in each column) there are two main data frame shapes that you will come across: the long format (sometimes called stacked) and the wide format (Wickham and Henry, 2018). An example of a long format data frame is given in Table 1. We can see that each row is a single observation from an individual species and each species can have multiple rows with different values of year and catch.

set.seed(123)

aa = read_csv("../data/tidy/fao_capture.csv")  %>% 
  sample_n(size = 5) %>% 
  separate(col = 'ASFIS species', into = c("english_name", "species"), sep = "-") %>% dplyr::select(1:2, 7:10)

Table 1: Long format data frame data format

english_name	species	year	catch_mt
Round sardinella	Sardinella aurita	2016	123 377
Round sardinella	Sardinella aurita	2017	126 766
Round sardinella	Sardinella aurita	2018	126 400
Round sardinella	Sardinella aurita	2019	126 400
Mexican four	eyed octopus	2016	25 722
Mexican four	eyed octopus	2017	26 020
Mexican four	eyed octopus	2018	28809
Mexican four	eyed octopus	2019	20 119
Gulf menhaden	Brevoortia patronus	2016	618 563
Gulf menhaden	Brevoortia patronus	2017	461 189
Gulf menhaden	Brevoortia patronus	2018	529 231
Gulf menhaden	Brevoortia patronus	2019	336 221
Stromboid conchs nei	Strombus spp	2016	28 774
Stromboid conchs nei	Strombus spp	2017	35 273
Stromboid conchs nei	Strombus spp	2018	34 856
Stromboid conchs nei	Strombus spp	2019	28 308
Atlantic seabob	Xiphopenaeus kroyeri	2016	30 147
Atlantic seabob	Xiphopenaeus kroyeri	2017	31 319
Atlantic seabob	Xiphopenaeus kroyeri	2018	31 625
Atlantic seabob	Xiphopenaeus kroyeri	2019	22 497

We can also format the same data in the wide format as in Table 2. In this format the year are treated as variables and catches are values within the years.

Table 2: Wide format data frame data format

english_name	species	2016	2017	2018	2019
Round sardinella	Sardinella aurita	123 377	126 766	126 400	126 400
Mexican four	eyed octopus	25 722	26 020	28809	20 119
Gulf menhaden	Brevoortia patronus	618 563	461 189	529 231	336 221
Stromboid conchs nei	Strombus spp	28 774	35 273	34 856	28 308
Atlantic seabob	Xiphopenaeus kroyeri	30 147	31 319	31 625	22 497

Whilst there’s no inherent problem with either of these formats we will sometimes need to convert between the two because some functions will require a specific format for them to work. But for the modern tidyverse function, only the dataset that is long format accepted.

There are many ways to convert between these two formats but we’ll use the pivot_longer() and pivot_wider() functions from the tidyr package (Wickham and Henry, 2018). We will use fao_capture.csv dataset in this session, we can simply load it with read_csv function and specify the path our dataset reside;

capture.wide = read_csv("../data/tidy/fao_capture.csv")

capture.wide  |> 
  flextable::flextable() |> 
  flextable::autofit()

ASFIS species	2012	2013	2014	2015	2016	2017	2018	2019
Marine fishes nei-Osteichthyes	137 683	112 095	110 535	118 078	118 204	124 895	109 938	415 031
Gulf menhaden-Brevoortia patronus	500 162	440 709	385 022	539 198	618 563	461 189	529 231	336 221
Round sardinella-Sardinella aurita	42 344	46 299	50 698	79 851	123 377	126 766	126 400	126 400
American cupped oyster-Crassostrea virginica	77 900	53 866	57 930	55 041	44 313	78 961	49 631	47 723
Ark clams nei-Arca spp	12 687	15 000	17 384	29 073	29 076	29 970	29 970	29 970
Stromboid conchs nei-Strombus spp	36 855	36 610	30 691	34 388	28 774	35 273	34 856	28 308
Blue crab-Callinectes sapidus	43 997	40 570	44 361	50 781	53 616	54 628	56 689	25 937
Yellowfin tuna-Thunnus albacares	19 087	21 865	25 831	27 036	34487	29 310	30 966	24 503
Caribbean spiny lobster-Panulirus argus	31 102	28 468	28 298	31 358	32 410	30 204	28 325	24 311
Atlantic seabob-Xiphopenaeus kroyeri	36 073	34 243	25 203	25 322	30 147	31 319	31 625	22 497
Mexican four-eyed octopus-Octopus maya	12 629	8 806	15 403	23 441	25 722	26 020	28809	20 119

The capture.wideobject we just created is the dataframe but is in wide format. To change from wide to long format data table, we use the pivot_longer function. The first argument specified in pivot_longer is cols = 2:9 are index value of the variable in column 2 to colum 9 of the dataset we want to stack, names_to = "year" argument specify that the stacked variables will all be under a new variable name called year and the values in the stacked variables will all be under one variable called catch

capture.long = capture.wide %>% 
  pivot_longer(cols = 2:9, names_to = "year", values_to = "catch")

capture.long |> 
  FSA::headtail(n = 4) |> 
  flextable::flextable() |> 
  flextable::autofit()

Table 3: Data values arranged in tidy wide format

ASFIS species	year	catch
Marine fishes nei-Osteichthyes	2012	137 683
Marine fishes nei-Osteichthyes	2013	112 095
Marine fishes nei-Osteichthyes	2014	110 535
Marine fishes nei-Osteichthyes	2015	118 078
Mexican four-eyed octopus-Octopus maya	2016	25 722
Mexican four-eyed octopus-Octopus maya	2017	26 020
Mexican four-eyed octopus-Octopus maya	2018	28809
Mexican four-eyed octopus-Octopus maya	2019	20 119

The pivot_wider() function is used to convert from a long format data frame to a wide format data frame. The first argument is names_from = year is the variable of the data frame for which we want spread across and the second argument values_from = catch is the corresponding values will reside in each casted variable.

capture.wide = capture.long %>% 
  pivot_wider(names_from = year, values_from = catch)

Table 4: Data values arranged in tidy wide format

ASFIS species	2012	2013	2014	2015	2016	2017	2018	2019
Marine fishes nei-Osteichthyes	137 683	112 095	110 535	118 078	118 204	124 895	109 938	415 031
Gulf menhaden-Brevoortia patronus	500 162	440 709	385 022	539 198	618 563	461 189	529 231	336 221
Round sardinella-Sardinella aurita	42 344	46 299	50 698	79 851	123 377	126 766	126 400	126 400
American cupped oyster-Crassostrea virginica	77 900	53 866	57 930	55 041	44 313	78 961	49 631	47 723
Ark clams nei-Arca spp	12 687	15 000	17 384	29 073	29 076	29 970	29 970	29 970
Stromboid conchs nei-Strombus spp	36 855	36 610	30 691	34 388	28 774	35 273	34 856	28 308
Blue crab-Callinectes sapidus	43 997	40 570	44 361	50 781	53 616	54 628	56 689	25 937
Yellowfin tuna-Thunnus albacares	19 087	21 865	25 831	27 036	34487	29 310	30 966	24 503
Caribbean spiny lobster-Panulirus argus	31 102	28 468	28 298	31 358	32 410	30 204	28 325	24 311
Atlantic seabob-Xiphopenaeus kroyeri	36 073	34 243	25 203	25 322	30 147	31 319	31 625	22 497
Mexican four-eyed octopus-Octopus maya	12 629	8 806	15 403	23 441	25 722	26 020	28809	20 119

2.3 Saving transformed data

Once you have transformed your data and edited variables, you may want to save your new dataframe as an external data file that your collaborators or other researchers can use. As with reading in data, for saving out data you will need the foreign package. Similarly, you can choose from a number of different data formats to export to as well. Most commonly you will want to save out data as a .csv or a tab-delited text file.

myfile %>% write_csv("my.directory/file_name.csv")

3 Transforming, summarising, and analysing data

The R universe basically builds upon two (seemingly contradictive) approaches: base R and the tidyverse. While these two approaches are often seen as two different philosophies, they can form a symbiosis. We therefore recommend to pick whichever works best for you – or to combine the two.Whereas base R is already implemented in R, using the tidyverse requires users to load new packages. People often find base R unintuitive and hard to read. This is why Hadley Wickham developed and introduced the tidyverse – a more intuitive approach to managing and wrangling data (Wickham and Wickham, 2017).

Code written before 2014 was usually written in base R whereas the tidyverse style is becoming increasingly widespread. Again, which approach you prefer is rather a matter of personal taste than a decision between “right or wrong”. We first familiarize ourselves with the basic logic of the tidyverse style using some examples. For this, we use LFQ_sample_1.xls. since is an Excel format, lets import it in our session;

lfq.sample1 = readxl::read_excel("../data/LFQ_sample_1.xls")

Let’s print the assigned lfq.sample1

lfq.sample1 |> 
  FSA::headtail() |> 
  flextable::flextable() |> 
  flextable::autofit()

Date	Species	Size (cm)	Size Class	Gear type	Landing_site
2019-05-30 00:00:00	Siganus sutor	16.5	18	Speargun	Pwani
2019-05-30 00:00:00	Siganus sutor	12.0	12	Speargun	Pwani
2019-05-30 00:00:00	Siganus sutor	15.0	15	Speargun	Pwani
2020-07-25 00:00:00	Lutjanus fulviflamma	13.0	15	Speargun	Bahari
2020-07-25 00:00:00	Lutjanus fulviflamma	17.0	18	Speargun	Bahari
2020-07-25 00:00:00	Lutjanus fulviflamma	18.5	21	Speargun	Bahari

As you can see from the output generated in your console, the lfq.sample1 data frame object contains length information of several species. The fisheries data collected in this dataset ensured that each record on a separate row and each column represents a variable. The dataset contains six variables (columns) and 779 records (rows). These variables includes;

Date a sample was collected. The date is in the YYYY-MM-DD, which is the format that accepted
Species the scientific name of the species
Size (cm) the length of the fish species measured in centimeter
Size Class of the individual
Gear type used to catch the recorded fish, and
Landing-site a landing station where the fish was recorded

This dataset is arranged in appropriate format. The data was entered in spreadsheet with a few basics. These correspond to: what, where, when it only miss who collect the data or the originator of this dataset. The way this dataset is organized is the classical example of how set out our data in a similar fashion. This makes manipulating the data more straightforward and also maintains the multi-purpose nature of the analysis work. Having your data organized is really important!

Warning

Though the size and unit of the dataaset is provide, but it does not tell us much because we do not know whether the measurement is total length (TL), standard length (SL) or fork length (FL). It is imperative to specify in the dataset variable names what was measured and the unit

3.1 Renaming variables

Looking lfq.sample1 dataset presented in ?@tbl-case1, we notice that variable names do not adhere to standard variable names. The tidyverse style guide recommends snake case (words separated by underscores tl_cm) for object and column names. Let’s look back at our column names for a minute. Using the names() function from the and pipe operator %>% from the magrittr package simultaneously serves as the first pipe in a chain of commands and print names in the dataset

lfq.sample1 %>% 
  names()

[1] "Date"         "Species"      "Size (cm)"    "Size Class"   "Gear type"   
[6] "Landing_site"

There are all sorts of capital letters and spaces (e.g. “Gear type” ,” “Size Class” ) as well as symbols (“Size (cm)”). Therefore we mustl convert all of these to snake case for us. We can rename specific variables using the rename function from the dplyr package (Wickham et al., 2019). Since dplyr is part of the tidyverse package (Wickham and Wickham, 2017) that was loaded, we do not need to load it again but rather use its function to rename the variables. We’ll begin by specifying the name of our data frame object lfq.sample1, followed by the = operator so that we can overwrite the existing lfq.sample1 frame object with one that contains the renamed variables.

Next type lfq.sample1 followed by pipe operator %>% and then call arename function, where you parse the new names to their corresponding old names. As the first argument, let’s change the Date variable to date by typing the name of our new variable followed by = and, in quotation marks (” “), the name of the original variable date = "Date". As the second argument, let’s apply the same process as the second argument and change the Species variable to species by typing the name of our new variable followed by = and, in quotation marks (” “), the name of the original variable species = "Species". We follow the same naming procedure to the six variables as the chunk below highlight;

lfq.sample1 = lfq.sample1 %>% 
  rename(
    date = "Date",
    species = "Species",
    size_cm = "Size (cm)",
    size_class = "Size Class",
    gear_type = "Gear type",
    landing_site = "Landing_site"
    )

Using the head function from base R, let’s verify that we renamed the two variables successfully.

lfq.sample1

# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-05-30 00:00:00 Siganus sutor      16.5         18 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      12           12 Speargun  Pwani       
 3 2019-05-30 00:00:00 Siganus sutor      15           15 Speargun  Pwani       
 4 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 5 2019-05-30 00:00:00 Siganus sutor      11.2         12 hook and… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor      17           18 hook and… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor      19           21 Gill net  Pwani       
 8 2019-05-30 00:00:00 Lethrinus mahs…    10           12 Traps     Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…    11           12 Traps     Pwani       
10 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
# ℹ 769 more rows

As you can see, the new names have no spaces and all are written in small letters. Spaces in variables names were replaced with the underscore symbol.

Though the rename function is great tool to rename variables in the dataset, however there are times when you a dataset has hundred of variables names to change and using rename function from dplyr package become tedious. Furtunate , a janitor package has a nifty clean_names() function that clean all the variable names with a single line of code. I highly recommend incorporating it into your workflow.

lfq.sample1 = readxl::read_excel("../data/LFQ_sample_1.xls") %>% 
  janitor::clean_names()

lfq.sample1

# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-05-30 00:00:00 Siganus sutor      16.5         18 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      12           12 Speargun  Pwani       
 3 2019-05-30 00:00:00 Siganus sutor      15           15 Speargun  Pwani       
 4 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 5 2019-05-30 00:00:00 Siganus sutor      11.2         12 hook and… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor      17           18 hook and… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor      19           21 Gill net  Pwani       
 8 2019-05-30 00:00:00 Lethrinus mahs…    10           12 Traps     Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…    11           12 Traps     Pwani       
10 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
# ℹ 769 more rows

As you can see above, the clean_names() function handled every kind of messy variable name that was present in our lfq.sample1 dataset. Everything now looks neat and tidy

3.2 Filtering Data

Our lfq.sample1 contains six variables – date, species, size_cm, size_class, gear_type and landing_site;

lfq.sample1 %>% 
  glimpse()

Rows: 779
Columns: 6
$ date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm      <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class   <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type    <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

Suppose we are only interested in Siganus sutor species only from the dataset. For this purpose, we can use the filter() function from dplyr package (Wickham et al., 2019)

lfq.sample1 %>% 
  filter(species == "Siganus sutor")

# A tibble: 196 × 6
   date                species       size_cm size_class gear_type   landing_site
   <dttm>              <chr>           <dbl>      <dbl> <chr>       <chr>       
 1 2019-05-30 00:00:00 Siganus sutor    16.5         18 Speargun    Pwani       
 2 2019-05-30 00:00:00 Siganus sutor    12           12 Speargun    Pwani       
 3 2019-05-30 00:00:00 Siganus sutor    15           15 Speargun    Pwani       
 4 2019-05-30 00:00:00 Siganus sutor    33           33 Speargun    Pwani       
 5 2019-05-30 00:00:00 Siganus sutor    11.2         12 hook and l… Pwani       
 6 2019-05-30 00:00:00 Siganus sutor    17           18 hook and l… Pwani       
 7 2019-05-30 00:00:00 Siganus sutor    19           21 Gill net    Pwani       
 8 2019-06-25 00:00:00 Siganus sutor    36.5         39 Speargun    Pwani       
 9 2019-06-25 00:00:00 Siganus sutor    11           12 hook and l… Pwani       
10 2019-06-25 00:00:00 Siganus sutor    11           12 hook and l… Pwani       
# ℹ 186 more rows

3.3 Select

Sometimes, you want to select specific variables. For instance, We are interested in three variables – date, landing_site, species, size_cm but not so much in the other variables. A select() function dplyr package allows you to do exactly this.

lfq.sample1 %>% 
  select(date, landing_site, species, size_cm)

# A tibble: 779 × 4
   date                landing_site species           size_cm
   <dttm>              <chr>        <chr>               <dbl>
 1 2019-05-30 00:00:00 Pwani        Siganus sutor        16.5
 2 2019-05-30 00:00:00 Pwani        Siganus sutor        12  
 3 2019-05-30 00:00:00 Pwani        Siganus sutor        15  
 4 2019-05-30 00:00:00 Pwani        Siganus sutor        33  
 5 2019-05-30 00:00:00 Pwani        Siganus sutor        11.2
 6 2019-05-30 00:00:00 Pwani        Siganus sutor        17  
 7 2019-05-30 00:00:00 Pwani        Siganus sutor        19  
 8 2019-05-30 00:00:00 Pwani        Lethrinus mahsena    10  
 9 2019-05-30 00:00:00 Pwani        Lethrinus mahsena    11  
10 2019-05-30 00:00:00 Pwani        Lethrinus mahsena     9  
# ℹ 769 more rows

3.4 Arranging variables

Let’s say we want to know the species with larges and smallest size throughout the entire dataset. This can be done with the arrange() function.

lfq.sample1 %>% 
  arrange(size_cm)

# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-07-12 00:00:00 Lethrinus harak     6            6 Traps     Pwani       
 2 2020-04-08 00:00:00 Lethrinus mahs…     8.3         12 Traps     Bahari      
 3 2020-04-08 00:00:00 Lethrinus mahs…     8.4         12 Traps     Bahari      
 4 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
 5 2019-05-30 00:00:00 Lethrinus mahs…     9           12 Traps     Pwani       
 6 2020-06-15 00:00:00 Siganus sutor       9           12 Traps     Bahari      
 7 2020-06-15 00:00:00 Lethrinus harak     9           12 Traps     Bahari      
 8 2019-05-30 00:00:00 Lethrinus mahs…     9.3         12 hook and… Pwani       
 9 2019-05-30 00:00:00 Lethrinus mahs…     9.3         12 hook and… Pwani       
10 2020-04-08 00:00:00 Lethrinus mahs…     9.4         12 Traps     Bahari      
# ℹ 769 more rows

As we can see, Lethrinus are the species with the smallest size in the datase. The smallest size in the dataset is 6cm. By default, the arrange() function sorts the data in ascending order. To display the data in descending order, we add desc().

lfq.sample1 %>% 
  arrange(desc(size_cm))

# A tibble: 779 × 6
   date                species         size_cm size_class gear_type landing_site
   <dttm>              <chr>             <dbl>      <dbl> <chr>     <chr>       
 1 2019-06-25 00:00:00 Siganus sutor      36.5         39 Speargun  Pwani       
 2 2019-05-30 00:00:00 Siganus sutor      33           33 Speargun  Pwani       
 3 2020-04-08 00:00:00 Calotomus caro…    30.4         33 Traps     Bahari      
 4 2020-06-15 00:00:00 Lethrinus mahs…    30.2         33 hook and… Bahari      
 5 2019-05-30 00:00:00 Leptoscarus va…    30           30 Speargun  Pwani       
 6 2020-04-08 00:00:00 Anampses spp.      30           30 Speargun  Bahari      
 7 2019-07-11 00:00:00 Lutjanus fulvi…    27.5         30 hook and… Pwani       
 8 2019-07-12 00:00:00 Siganus sutor      27.5         30 Gill net  Pwani       
 9 2019-07-11 00:00:00 Calotomus caro…    27           27 Beach se… Pwani       
10 2019-07-12 00:00:00 Siganus sutor      27           27 Gill net  Pwani       
# ℹ 769 more rows

As we can see, Siganus sutor are the species with the larges size in the dataset. The highest size in the dataset is 36.50cm.

3.5 Extracting unique observations

Which unique species are included in the dataset? To get this information, we use the distinct() function in dplyr.

lfq.sample1 %>% 
  distinct(species)

# A tibble: 9 × 1
  species               
  <chr>                 
1 Siganus sutor         
2 Lethrinus mahsena     
3 Leptoscarus vaigiensis
4 Calotomus carolinus   
5 Lutjanus fulviflamma  
6 Lethrinus harak       
7 Anampses spp.         
8 Acanthurus chronixis  
9 Cheilinus chlorourus

As we see, there are nine species in the dataset arranged in the species variable of the dataframe. However, We might need these species in vector instead of data frame. We use the function pull for that, which convert from dataframe to vector dataset.

lfq.sample1 %>% 
  distinct(species) %>% 
  pull()

[1] "Siganus sutor"          "Lethrinus mahsena"      "Leptoscarus vaigiensis"
[4] "Calotomus carolinus"    "Lutjanus fulviflamma"   "Lethrinus harak"       
[7] "Anampses spp."          "Acanthurus chronixis"   "Cheilinus chlorourus"

3.6 Creating new variables

The mutate() command allows you to generate a new variable from existing. Let’s say you want to generate a day, month and year from date variables. The lubridate packaged (Grolemund and Wickham, 2011) has functions dedicated for dealing with dates, some of these functions include day(), month(), and year() that are used to extract values.

lfq.sample1 %>% 
  select(date, species, size_cm) %>% 
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    day = lubridate::day(date)
    )

# A tibble: 779 × 6
   date                species           size_cm  year month   day
   <dttm>              <chr>               <dbl> <dbl> <dbl> <int>
 1 2019-05-30 00:00:00 Siganus sutor        16.5  2019     5    30
 2 2019-05-30 00:00:00 Siganus sutor        12    2019     5    30
 3 2019-05-30 00:00:00 Siganus sutor        15    2019     5    30
 4 2019-05-30 00:00:00 Siganus sutor        33    2019     5    30
 5 2019-05-30 00:00:00 Siganus sutor        11.2  2019     5    30
 6 2019-05-30 00:00:00 Siganus sutor        17    2019     5    30
 7 2019-05-30 00:00:00 Siganus sutor        19    2019     5    30
 8 2019-05-30 00:00:00 Lethrinus mahsena    10    2019     5    30
 9 2019-05-30 00:00:00 Lethrinus mahsena    11    2019     5    30
10 2019-05-30 00:00:00 Lethrinus mahsena     9    2019     5    30
# ℹ 769 more rows

3.7 Group-wise operations

Suppose we are interested to know the sample size for each species in the dataset along with the mean, median, and standard deviation of the body size. That’s is not possible with single function in the tidyverse but rather a combination of functions. We use a combination of group_by() (to group our results by month), and parse the specific function we want to compute in the summarise() function.

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    )

# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Acanthurus chronixis      21    13.7      12.5  2.65 
2 Anampses spp.             17    17.5      16.5  4.43 
3 Calotomus carolinus      176    16.4      16.5  3.05 
4 Cheilinus chlorourus       4    13.9      14    0.810
5 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
6 Lethrinus harak           37    12.6      12    2.72 
7 Lethrinus mahsena        135    11.9      11.9  2.39 
8 Lutjanus fulviflamma      47    16.7      16.9  2.67 
9 Siganus sutor            196    18.2      17.5  4.66

The computed variables are grouped after performing transformations. In this case, you can pipe another command with ungroup() to ungroup your result.

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    ) %>% 
  ungroup()

# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Acanthurus chronixis      21    13.7      12.5  2.65 
2 Anampses spp.             17    17.5      16.5  4.43 
3 Calotomus carolinus      176    16.4      16.5  3.05 
4 Cheilinus chlorourus       4    13.9      14    0.810
5 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
6 Lethrinus harak           37    12.6      12    2.72 
7 Lethrinus mahsena        135    11.9      11.9  2.39 
8 Lutjanus fulviflamma      47    16.7      16.9  2.67 
9 Siganus sutor            196    18.2      17.5  4.66

In some cases, you might wish to arrange the sample size from lowest to highest sample size along with the computed values. You just add the arrange function

lfq.sample1 %>% 
  group_by(species) %>% 
  summarise(
    n = n(),
    mean_cm = mean(size_cm),
    median_cm = median(size_cm),
    std_cm = sd(size_cm)
    ) %>% 
  arrange(n)

# A tibble: 9 × 5
  species                    n mean_cm median_cm std_cm
  <chr>                  <int>   <dbl>     <dbl>  <dbl>
1 Cheilinus chlorourus       4    13.9      14    0.810
2 Anampses spp.             17    17.5      16.5  4.43 
3 Acanthurus chronixis      21    13.7      12.5  2.65 
4 Lethrinus harak           37    12.6      12    2.72 
5 Lutjanus fulviflamma      47    16.7      16.9  2.67 
6 Lethrinus mahsena        135    11.9      11.9  2.39 
7 Leptoscarus vaigiensis   146    18.1      17.5  3.53 
8 Calotomus carolinus      176    16.4      16.5  3.05 
9 Siganus sutor            196    18.2      17.5  4.66

You notice that Cheilinus chlorourus has a only four observation where as Siganus sutor has 196 records in the dataset. You can also explore the summary statistics to see whether the mean and median values are close (indicator of normal distribution) or differ (indication of skewness)

4 References

Grolemund, G., Wickham, H., 2011. Dates and times made easy with lubridate. Journal of Statistical Software 40, 1–25.

Wickham, H., Bryan, J., 2018. Readxl: Read excel files.

Wickham, H., François, R., Henry, L., Müller, K., 2019. Dplyr: A grammar of data manipulation.

Wickham, H., Henry, L., 2018. Tidyr: Easily tidy data with ’spread()’ and ’gather()’ functions.

Wickham, H., Wickham, M.H., 2017. Tidyverse: Easily install and load the ’tidyverse’.

Citation

BibTeX citation:

@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Data Cleaning, Merging, and Appending},
  date = {2024-05-01},
  url = {https://lugoga.github.io/kitaa/posts/manipulate/},
  langid = {en}
}

For attribution, please cite this work as:

Semba, M., 2024. Data cleaning, merging, and appending [WWW Document]. URL https://lugoga.github.io/kitaa/posts/manipulate/