1 Introduction
In reality, data cleaning
is not so much its own step in the data wrangling process as it is a constant activity that accompanies every other step, both because most data is not clean when we encounter it, and because how a data set (or part of it) needs to be “cleaned” is often revealed progressively as we work. At a high level, clean data might be summarized as being free from errors or typos – such as mismatched units, multiple spellings of the same word or term, and fields that are not well-separated – and missing or impossible values.
While many of these problems are at least somewhat straightforward to recognize (though not always to correct), deeper data problems may still persist. Measurement changes, calculation errors and other oversights – especially in system-generated data – often don’t reveal themselves until some level of analysis has been done and the data has been “reality-checked” with folks who have significant expertise and/or first-hand experience with the subject.
The iterative nature of data cleaning is an example of why data wrangling is a cycle rather than a linear series of steps: as your work reveals more about the data and your understanding of its relationship to the world deepens, you may find that you need to revisit earlier work that you’ve done and repeat or adjust certain aspects of the data cleaning.
1.1 Importing data
Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you’ll need to have your data correctly formatted and saved in a file format that R is able to recognize. Fortunately for us, R is able to recognize a wide variety of file formats, although in reality you’ll probably end up using only two or three regularly: comma separated, Excel spreadsheet and tab-delimited.
The easiest method of creating a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a comma separated (.csv), Excel spreadsheet (.xls or .xlsx) or tab-delimited (.txt) file. We then read the file into our session. In this chapter we are going to use LFQ_sample_1.xls, which is an Excel spreadsheet file. To load it into the R session, we use the read_excel() function from the readxl package (Wickham and Bryan, 2018). Before we import the data, we need to load the packages whose functions we will use in this chapter.
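The original code chunk is not shown, so here is a minimal sketch of the import step. The file path data/LFQ_sample_1.xls and the object name lfq are assumptions; adjust the path to wherever the file lives on your machine.

```r
# load the packages whose functions we use in this chapter
library(tidyverse)
library(readxl)

# read the first sheet of the Excel file into a data frame
# (path is an assumption -- point it at your copy of the file)
lfq = read_excel("data/LFQ_sample_1.xls", sheet = 1, skip = 0)
```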
There are a few things to note about the above command. First, the file path and the filename (including the file extension) need to be enclosed in either single or double quotes, as the read_excel() function expects a character string. If your working directory is already set to the directory that contains the file, you don’t need to include the entire file path, just the filename.
In the example above, the file path is separated with a single forward slash /. This will work regardless of the operating system you are using. The argument sheet = 1 tells the function to pick the data from the first sheet, since a spreadsheet can store data in multiple sheets. The argument skip = 0 tells the function not to skip any rows in the dataset; this argument is used when the data has metadata information at the top.
After importing our data into R it doesn’t appear that R has done much; at least nothing appears in the R Console! To see the contents of the data frame we need to tell R to do so, for example by printing the object.
# A tibble: 779 × 6
Date Species `Size (cm)` `Size Class` `Gear type` Landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-05-30 00:00:00 Siganu… 16.5 18 Speargun Pwani
2 2019-05-30 00:00:00 Siganu… 12 12 Speargun Pwani
3 2019-05-30 00:00:00 Siganu… 15 15 Speargun Pwani
4 2019-05-30 00:00:00 Siganu… 33 33 Speargun Pwani
5 2019-05-30 00:00:00 Siganu… 11.2 12 hook and l… Pwani
6 2019-05-30 00:00:00 Siganu… 17 18 hook and l… Pwani
7 2019-05-30 00:00:00 Siganu… 19 21 Gill net Pwani
8 2019-05-30 00:00:00 Lethri… 10 12 Traps Pwani
9 2019-05-30 00:00:00 Lethri… 11 12 Traps Pwani
10 2019-05-30 00:00:00 Lethri… 9 12 Traps Pwani
# ℹ 769 more rows
The data frame appears in the console, which confirms that we successfully imported the data into R. But printing the imported data to the console is only useful for a small dataset; for a large dataset it makes the console messy, and there is little you can glean by simply printing the dataset. Another common practice is to look at the internal structure of the dataset using the str() function:
tibble [779 × 6] (S3: tbl_df/tbl/data.frame)
$ Date : POSIXct[1:779], format: "2019-05-30" "2019-05-30" ...
$ Species : chr [1:779] "Siganus sutor" "Siganus sutor" "Siganus sutor" "Siganus sutor" ...
$ Size (cm) : num [1:779] 16.5 12 15 33 11.2 17 19 10 11 9 ...
$ Size Class : num [1:779] 18 12 15 33 12 18 21 12 12 12 ...
$ Gear type : chr [1:779] "Speargun" "Speargun" "Speargun" "Speargun" ...
$ Landing_site: chr [1:779] "Pwani" "Pwani" "Pwani" "Pwani" ...
A better solution is the tidyverse approach using the glimpse() function:
Rows: 779
Columns: 6
$ Date <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ Species <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ `Size (cm)` <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ `Size Class` <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ `Gear type` <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ Landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…
The output in the console tells us that the lfq dataset we just imported has six columns (variables) and 779 rows (records). Each variable is listed along with its data type and the first few records. The first row of the data contains the variable (column) names.
1.2 Common import frustrations
It’s quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is an error telling you that R cannot find the file you are trying to import. Several reasons can lead to this error. The first is that you’ve made a mistake in the spelling of either the filename or the file path. Another common mistake is that you have forgotten to include the file extension in the filename (i.e. .txt). Lastly, the file is not where you say it is, or you’ve used an incorrect file path. Using RStudio Projects and having a logical directory structure goes a long way towards avoiding these types of errors.
1.3 Clean variable names
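The cleaning chunk itself is not shown; a minimal sketch using janitor::clean_names() follows. The object names lfq (the imported data) and lfq.clean (the cleaned copy, which reappears later in the chapter) are assumptions.

```r
library(janitor)

# convert names such as `Size (cm)` and `Gear type` to snake case
lfq.clean = lfq %>%
  janitor::clean_names()
```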
1.4 Check the structure
Rows: 779
Columns: 6
$ date <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…
1.5 Creating variables
Notice that the date variable is stored as a datetime, although only the date portion is displayed. We need to create three variables that represent the day, month, and year. We can extract these from the date variable using the day(), month(), and year() functions from the lubridate package (Grolemund and Wickham, 2011);
lfq.time = lfq.clean %>%
mutate(
day = lubridate::day(date),
month = lubridate::month(date),
year = lubridate::year(date)
) %>%
relocate(c(landing_site, day, month, year), .before = date)
lfq.time |>
sample_n(30) |>
select(-c(day, month, year)) |>
flextable::flextable() |>
flextable::autofit()
landing_site | date | species | size_cm | size_class | gear_type |
---|---|---|---|---|---|
Pwani | 2019-07-12 00:00:00 | Anampses spp. | 14.0 | 15 | Gill net |
Bahari | 2020-04-08 00:00:00 | Leptoscarus vaigiensis | 19.8 | 21 | Speargun |
Pwani | 2019-07-11 00:00:00 | Leptoscarus vaigiensis | 25.5 | 27 | Speargun |
Pwani | 2019-07-11 00:00:00 | Lutjanus fulviflamma | 27.5 | 30 | hook and line |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 13.8 | 15 | Traps |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 16.0 | 18 | Speargun |
Pwani | 2019-05-30 00:00:00 | Leptoscarus vaigiensis | 15.6 | 18 | Speargun |
Pwani | 2019-07-12 00:00:00 | Siganus sutor | 17.5 | 18 | Traps |
Pwani | 2019-06-25 00:00:00 | Siganus sutor | 18.4 | 21 | Traps |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 15.5 | 18 | Beach seine |
Pwani | 2019-05-30 00:00:00 | Lethrinus mahsena | 11.1 | 12 | hook and line |
Pwani | 2019-07-12 00:00:00 | Siganus sutor | 27.5 | 30 | Gill net |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 15.5 | 18 | Traps |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 16.0 | 18 | Beach seine |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 16.0 | 18 | Speargun |
Bahari | 2020-04-08 00:00:00 | Siganus sutor | 12.0 | 12 | Traps |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 14.2 | 15 | Traps |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 17.0 | 18 | Beach seine |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 14.3 | 15 | Speargun |
Pwani | 2019-05-30 00:00:00 | Leptoscarus vaigiensis | 16.5 | 18 | Speargun |
Pwani | 2019-05-30 00:00:00 | Leptoscarus vaigiensis | 30.0 | 30 | Speargun |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 18.0 | 18 | Beach seine |
Pwani | 2019-05-30 00:00:00 | Lethrinus mahsena | 12.0 | 12 | hook and line |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 14.3 | 15 | Traps |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 16.0 | 18 | Beach seine |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 13.5 | 15 | Beach seine |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 9.5 | 12 | Beach seine |
Pwani | 2019-05-30 00:00:00 | Lethrinus mahsena | 14.0 | 15 | hook and line |
Pwani | 2019-07-11 00:00:00 | Calotomus carolinus | 17.5 | 18 | Beach seine |
Bahari | 2020-07-25 00:00:00 | Siganus sutor | 15.6 | 18 | Traps |
1.6 Dropping and keeping variables
# A tibble: 779 × 8
landing_site day month year species size_cm size_class gear_type
<chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Pwani 30 5 2019 Siganus sutor 16.5 18 Speargun
2 Pwani 30 5 2019 Siganus sutor 12 12 Speargun
3 Pwani 30 5 2019 Siganus sutor 15 15 Speargun
4 Pwani 30 5 2019 Siganus sutor 33 33 Speargun
5 Pwani 30 5 2019 Siganus sutor 11.2 12 hook and…
6 Pwani 30 5 2019 Siganus sutor 17 18 hook and…
7 Pwani 30 5 2019 Siganus sutor 19 21 Gill net
8 Pwani 30 5 2019 Lethrinus mahsena 10 12 Traps
9 Pwani 30 5 2019 Lethrinus mahsena 11 12 Traps
10 Pwani 30 5 2019 Lethrinus mahsena 9 12 Traps
# ℹ 769 more rows
1.7 Missing values
R handles missing values differently than some other programs, including Stata. Missing values will appear as NA (whereas in Stata these will appear as large numeric values). Note, though, that NA is not a string, it is a symbol. If you try to conduct logical tests with NA, you are likely to get errors or NULL.
2 Length frequency
Let’s first look at the length data from the catch, which gives an indication of the size structure and health of the population. Using several case studies, we will filter the dataset to choose certain species and then plot histograms of the length data, which reveal how many individuals of each size class were measured in a sample or catch.
You have several options for dealing with NA values:

- na.omit() or na.exclude() will row-wise delete missing values in your dataset
- na.fail() will keep an object only if no missing values are present
- na.action = is a common option in many functions, for example in a linear model where you might write model <- lm(y ~ x, data = data, na.action = na.omit)
- is.na() allows you to logically test for NA values, for example when subsetting
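A small sketch of the last point, using a made-up vector with one missing value:

```r
x = c(16.5, NA, 15, 33)

# is.na() returns a logical vector we can use for subsetting
is.na(x)               # FALSE TRUE FALSE FALSE
x[!is.na(x)]           # drop the missing value: 16.5 15 33

# many summary functions also accept na.rm to ignore NA values
mean(x, na.rm = TRUE)
```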
2.1 Merging and combining data
You may want to draw on data from multiple sources or at differing levels of aggregation. To do this, you first must know what your data look like and know what format you ultimately want them to take. Whatever you do, do not attempt to merge datasets by hand—this leads to mistakes and frustration.
left_join() bind_rows() bind_cols()
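A minimal sketch of these three operations, using two small hypothetical tables (the table contents are invented for illustration):

```r
library(dplyr)

# two hypothetical tables sharing a `species` key
sizes  = tibble(species = c("Siganus sutor", "Lethrinus harak"),
                size_cm = c(16.5, 6))
traits = tibble(species = c("Siganus sutor", "Lethrinus harak"),
                family  = c("Siganidae", "Lethrinidae"))

# left_join() matches rows on the shared key and keeps all rows of `sizes`
left_join(sizes, traits, by = "species")

# bind_rows() stacks data frames with the same columns on top of each other;
# bind_cols() places data frames with the same number of rows side by side
bind_rows(sizes, sizes)
```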
2.2 Reshaping data
Reshaping data into different formats is a common task. With rectangular data (data frames with the same number of rows in each column) there are two main data frame shapes that you will come across: the long format (sometimes called stacked) and the wide format (Wickham and Henry, 2018). An example of a long format data frame is given in Table 1. We can see that each row is a single observation from an individual species, and each species can have multiple rows with different values of year and catch.
english_name | species | year | catch_mt |
---|---|---|---|
Round sardinella | Sardinella aurita | 2016 | 123 377 |
Round sardinella | Sardinella aurita | 2017 | 126 766 |
Round sardinella | Sardinella aurita | 2018 | 126 400 |
Round sardinella | Sardinella aurita | 2019 | 126 400 |
Mexican four-eyed octopus | Octopus maya | 2016 | 25 722 |
Mexican four-eyed octopus | Octopus maya | 2017 | 26 020 |
Mexican four-eyed octopus | Octopus maya | 2018 | 28 809 |
Mexican four-eyed octopus | Octopus maya | 2019 | 20 119 |
Gulf menhaden | Brevoortia patronus | 2016 | 618 563 |
Gulf menhaden | Brevoortia patronus | 2017 | 461 189 |
Gulf menhaden | Brevoortia patronus | 2018 | 529 231 |
Gulf menhaden | Brevoortia patronus | 2019 | 336 221 |
Stromboid conchs nei | Strombus spp | 2016 | 28 774 |
Stromboid conchs nei | Strombus spp | 2017 | 35 273 |
Stromboid conchs nei | Strombus spp | 2018 | 34 856 |
Stromboid conchs nei | Strombus spp | 2019 | 28 308 |
Atlantic seabob | Xiphopenaeus kroyeri | 2016 | 30 147 |
Atlantic seabob | Xiphopenaeus kroyeri | 2017 | 31 319 |
Atlantic seabob | Xiphopenaeus kroyeri | 2018 | 31 625 |
Atlantic seabob | Xiphopenaeus kroyeri | 2019 | 22 497 |
We can also format the same data in the wide format, as in Table 2. In this format the years are treated as variables and the catches are the values within the year columns.
english_name | species | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|
Round sardinella | Sardinella aurita | 123 377 | 126 766 | 126 400 | 126 400 |
Mexican four-eyed octopus | Octopus maya | 25 722 | 26 020 | 28 809 | 20 119 |
Gulf menhaden | Brevoortia patronus | 618 563 | 461 189 | 529 231 | 336 221 |
Stromboid conchs nei | Strombus spp | 28 774 | 35 273 | 34 856 | 28 308 |
Atlantic seabob | Xiphopenaeus kroyeri | 30 147 | 31 319 | 31 625 | 22 497 |
Whilst there’s no inherent problem with either of these formats, we will sometimes need to convert between the two because some functions require a specific format in order to work. In particular, modern tidyverse functions generally accept only datasets in the long format.
There are many ways to convert between these two formats, but we’ll use the pivot_longer() and pivot_wider() functions from the tidyr package (Wickham and Henry, 2018). We will use the fao_capture.csv dataset in this session, which we can simply load with the read_csv() function by specifying the path where the dataset resides;
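A sketch of the loading step; the path data/fao_capture.csv and the object name capture.wide (used below) are assumptions.

```r
library(tidyverse)

# read the comma separated file into a wide-format data frame
# (path is an assumption -- point it at your copy of the file)
capture.wide = read_csv("data/fao_capture.csv")
```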
ASFIS species | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|
Marine fishes nei-Osteichthyes | 137 683 | 112 095 | 110 535 | 118 078 | 118 204 | 124 895 | 109 938 | 415 031 |
Gulf menhaden-Brevoortia patronus | 500 162 | 440 709 | 385 022 | 539 198 | 618 563 | 461 189 | 529 231 | 336 221 |
Round sardinella-Sardinella aurita | 42 344 | 46 299 | 50 698 | 79 851 | 123 377 | 126 766 | 126 400 | 126 400 |
American cupped oyster-Crassostrea virginica | 77 900 | 53 866 | 57 930 | 55 041 | 44 313 | 78 961 | 49 631 | 47 723 |
Ark clams nei-Arca spp | 12 687 | 15 000 | 17 384 | 29 073 | 29 076 | 29 970 | 29 970 | 29 970 |
Stromboid conchs nei-Strombus spp | 36 855 | 36 610 | 30 691 | 34 388 | 28 774 | 35 273 | 34 856 | 28 308 |
Blue crab-Callinectes sapidus | 43 997 | 40 570 | 44 361 | 50 781 | 53 616 | 54 628 | 56 689 | 25 937 |
Yellowfin tuna-Thunnus albacares | 19 087 | 21 865 | 25 831 | 27 036 | 34 487 | 29 310 | 30 966 | 24 503 |
Caribbean spiny lobster-Panulirus argus | 31 102 | 28 468 | 28 298 | 31 358 | 32 410 | 30 204 | 28 325 | 24 311 |
Atlantic seabob-Xiphopenaeus kroyeri | 36 073 | 34 243 | 25 203 | 25 322 | 30 147 | 31 319 | 31 625 | 22 497 |
Mexican four-eyed octopus-Octopus maya | 12 629 | 8 806 | 15 403 | 23 441 | 25 722 | 26 020 | 28 809 | 20 119 |
The capture.wide object we just created is a data frame, but it is in wide format. To change from the wide to the long format, we use the pivot_longer() function. The first argument specified in pivot_longer() is cols = 2:9, the index values of the variables in column 2 to column 9 of the dataset that we want to stack. The names_to = "year" argument specifies that the stacked variable names will all go under a new variable called year, and the values in the stacked variables will all go under one variable called catch.
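The chunk was presumably something like this sketch (the values_to argument name for the catch column is stated in the text; the object name capture.long is an assumption):

```r
capture.long = capture.wide %>%
  pivot_longer(
    cols = 2:9,           # the year columns to stack
    names_to = "year",    # old column names go into `year`
    values_to = "catch"   # cell values go into `catch`
  )
```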
ASFIS species | year | catch |
---|---|---|
Marine fishes nei-Osteichthyes | 2012 | 137 683 |
Marine fishes nei-Osteichthyes | 2013 | 112 095 |
Marine fishes nei-Osteichthyes | 2014 | 110 535 |
Marine fishes nei-Osteichthyes | 2015 | 118 078 |
Mexican four-eyed octopus-Octopus maya | 2016 | 25 722 |
Mexican four-eyed octopus-Octopus maya | 2017 | 26 020 |
Mexican four-eyed octopus-Octopus maya | 2018 | 28 809 |
Mexican four-eyed octopus-Octopus maya | 2019 | 20 119 |
The pivot_wider() function is used to convert from a long format data frame to a wide format data frame. The first argument, names_from = year, is the variable of the data frame that we want to spread across columns, and the second argument, values_from = catch, supplies the corresponding values that will reside in each new column.
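A sketch of the reverse operation, assuming the long-format object is named capture.long as above:

```r
capture.long %>%
  pivot_wider(
    names_from = year,    # each year becomes its own column
    values_from = catch   # filled with the corresponding catch values
  )
```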
ASFIS species | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|
Marine fishes nei-Osteichthyes | 137 683 | 112 095 | 110 535 | 118 078 | 118 204 | 124 895 | 109 938 | 415 031 |
Gulf menhaden-Brevoortia patronus | 500 162 | 440 709 | 385 022 | 539 198 | 618 563 | 461 189 | 529 231 | 336 221 |
Round sardinella-Sardinella aurita | 42 344 | 46 299 | 50 698 | 79 851 | 123 377 | 126 766 | 126 400 | 126 400 |
American cupped oyster-Crassostrea virginica | 77 900 | 53 866 | 57 930 | 55 041 | 44 313 | 78 961 | 49 631 | 47 723 |
Ark clams nei-Arca spp | 12 687 | 15 000 | 17 384 | 29 073 | 29 076 | 29 970 | 29 970 | 29 970 |
Stromboid conchs nei-Strombus spp | 36 855 | 36 610 | 30 691 | 34 388 | 28 774 | 35 273 | 34 856 | 28 308 |
Blue crab-Callinectes sapidus | 43 997 | 40 570 | 44 361 | 50 781 | 53 616 | 54 628 | 56 689 | 25 937 |
Yellowfin tuna-Thunnus albacares | 19 087 | 21 865 | 25 831 | 27 036 | 34 487 | 29 310 | 30 966 | 24 503 |
Caribbean spiny lobster-Panulirus argus | 31 102 | 28 468 | 28 298 | 31 358 | 32 410 | 30 204 | 28 325 | 24 311 |
Atlantic seabob-Xiphopenaeus kroyeri | 36 073 | 34 243 | 25 203 | 25 322 | 30 147 | 31 319 | 31 625 | 22 497 |
Mexican four-eyed octopus-Octopus maya | 12 629 | 8 806 | 15 403 | 23 441 | 25 722 | 26 020 | 28 809 | 20 119 |
2.3 Saving transformed data
Once you have transformed your data and edited variables, you may want to save your new data frame as an external data file that your collaborators or other researchers can use. As with reading in data, you can choose from a number of different data formats to export to. Most commonly you will want to save out data as a .csv or a tab-delimited text file, neither of which requires any package beyond base R (write.csv()) or readr (write_csv(), write_tsv()); the foreign package is only needed for other statistical packages’ formats.
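A minimal sketch of both export formats; the output paths are assumptions, and lfq.clean is the cleaned data frame from earlier in the chapter.

```r
library(readr)

# comma separated values (path is an assumption)
write_csv(lfq.clean, "data/lfq_clean.csv")

# tab-delimited text file
write_tsv(lfq.clean, "data/lfq_clean.txt")
```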
3 Transforming, summarising, and analysing data
The R universe basically builds upon two (seemingly contradictory) approaches: base R and the tidyverse. While these two approaches are often seen as two different philosophies, they can form a symbiosis. We therefore recommend picking whichever works best for you – or combining the two. Whereas base R is already implemented in R, using the tidyverse requires users to load new packages. People often find base R unintuitive and hard to read. This is why Hadley Wickham developed and introduced the tidyverse – a more intuitive approach to managing and wrangling data (Wickham and Wickham, 2017).
Code written before 2014 was usually written in base R, whereas the tidyverse style is becoming increasingly widespread. Again, which approach you prefer is more a matter of personal taste than a decision between right and wrong. We first familiarize ourselves with the basic logic of the tidyverse style using some examples. For this, we use LFQ_sample_1.xls. Since it is an Excel file, let’s import it into our session and then print the assigned lfq.sample1 object:
Date | Species | Size (cm) | Size Class | Gear type | Landing_site |
---|---|---|---|---|---|
2019-05-30 00:00:00 | Siganus sutor | 16.5 | 18 | Speargun | Pwani |
2019-05-30 00:00:00 | Siganus sutor | 12.0 | 12 | Speargun | Pwani |
2019-05-30 00:00:00 | Siganus sutor | 15.0 | 15 | Speargun | Pwani |
2020-07-25 00:00:00 | Lutjanus fulviflamma | 13.0 | 15 | Speargun | Bahari |
2020-07-25 00:00:00 | Lutjanus fulviflamma | 17.0 | 18 | Speargun | Bahari |
2020-07-25 00:00:00 | Lutjanus fulviflamma | 18.5 | 21 | Speargun | Bahari |
As you can see from the output generated in your console, the lfq.sample1 data frame object contains length information for several species. The fisheries data collected in this dataset ensure that each record is on a separate row and each column represents a variable. The dataset contains six variables (columns) and 779 records (rows). These variables include:
- Date: the date a sample was collected, in the YYYY-MM-DD format, which is the accepted format
- Species: the scientific name of the species
- Size (cm): the length of the fish species measured in centimeters
- Size Class: the size class of the individual
- Gear type: the gear used to catch the recorded fish, and
- Landing_site: the landing station where the fish was recorded
This dataset is arranged in an appropriate format. The data were entered in a spreadsheet covering a few basics, corresponding to: what, where and when; it only misses who collected the data, i.e. the originator of the dataset. The way this dataset is organized is a classic example of how to set out your data; doing so makes manipulating the data more straightforward and also maintains the multi-purpose nature of the analysis work. Having your data organized is really important!

Though the size and its unit are provided in the dataset, they do not tell us much, because we do not know whether the measurement is total length (TL), standard length (SL) or fork length (FL). It is imperative to specify in the dataset variable names what was measured and in which unit.
3.1 Renaming variables
Looking at the lfq.sample1 dataset presented above, we notice that the variable names do not adhere to standard naming conventions. The tidyverse style guide recommends snake case (words separated by underscores, e.g. tl_cm) for object and column names. Let’s look back at our column names for a minute. The names() function, combined with the pipe operator %>% from the magrittr package as the first pipe in a chain of commands, prints the names in the dataset:
[1] "Date" "Species" "Size (cm)" "Size Class" "Gear type"
[6] "Landing_site"
There are all sorts of capital letters and spaces (e.g. “Gear type”, “Size Class”) as well as symbols (“Size (cm)”). Therefore we must convert all of these to snake case. We can rename specific variables using the rename() function from the dplyr package (Wickham et al., 2019). Since dplyr is part of the tidyverse package (Wickham and Wickham, 2017) that was loaded, we do not need to load it again, but rather use its function to rename the variables. We’ll begin by specifying the name of our data frame object lfq.sample1, followed by the = operator, so that we can overwrite the existing lfq.sample1 data frame object with one that contains the renamed variables.
Next, type lfq.sample1 followed by the pipe operator %>% and then call the rename() function, where you pass the new names with their corresponding old names. As the first argument, let’s change the Date variable to date by typing the name of our new variable followed by = and, in quotation marks, the name of the original variable: date = "Date". As the second argument, let’s apply the same process and change the Species variable to species: species = "Species". We follow the same naming procedure for all six variables, as the chunk below highlights;
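The chunk is missing from the page, so here is a sketch reconstructed from the description and the column names shown earlier:

```r
lfq.sample1 = lfq.sample1 %>%
  rename(
    date         = "Date",
    species      = "Species",
    size_cm      = "Size (cm)",
    size_class   = "Size Class",
    gear_type    = "Gear type",
    landing_site = "Landing_site"
  )
```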
Using the head() function from base R, let’s verify that we renamed the variables successfully.
# A tibble: 779 × 6
date species size_cm size_class gear_type landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-05-30 00:00:00 Siganus sutor 16.5 18 Speargun Pwani
2 2019-05-30 00:00:00 Siganus sutor 12 12 Speargun Pwani
3 2019-05-30 00:00:00 Siganus sutor 15 15 Speargun Pwani
4 2019-05-30 00:00:00 Siganus sutor 33 33 Speargun Pwani
5 2019-05-30 00:00:00 Siganus sutor 11.2 12 hook and… Pwani
6 2019-05-30 00:00:00 Siganus sutor 17 18 hook and… Pwani
7 2019-05-30 00:00:00 Siganus sutor 19 21 Gill net Pwani
8 2019-05-30 00:00:00 Lethrinus mahs… 10 12 Traps Pwani
9 2019-05-30 00:00:00 Lethrinus mahs… 11 12 Traps Pwani
10 2019-05-30 00:00:00 Lethrinus mahs… 9 12 Traps Pwani
# ℹ 769 more rows
As you can see, the new names have no spaces and all are written in small letters. Spaces in variables names were replaced with the underscore symbol.
Though the rename() function is a great tool for renaming variables in a dataset, there are times when a dataset has hundreds of variable names to change, and using the rename() function from the dplyr package becomes tedious. Fortunately, the janitor package has a nifty clean_names() function that cleans all the variable names with a single line of code. I highly recommend incorporating it into your workflow.
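The chunk was presumably a single pipe like this sketch:

```r
# clean every column name in one go
lfq.sample1 = lfq.sample1 %>%
  janitor::clean_names()
```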
# A tibble: 779 × 6
date species size_cm size_class gear_type landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-05-30 00:00:00 Siganus sutor 16.5 18 Speargun Pwani
2 2019-05-30 00:00:00 Siganus sutor 12 12 Speargun Pwani
3 2019-05-30 00:00:00 Siganus sutor 15 15 Speargun Pwani
4 2019-05-30 00:00:00 Siganus sutor 33 33 Speargun Pwani
5 2019-05-30 00:00:00 Siganus sutor 11.2 12 hook and… Pwani
6 2019-05-30 00:00:00 Siganus sutor 17 18 hook and… Pwani
7 2019-05-30 00:00:00 Siganus sutor 19 21 Gill net Pwani
8 2019-05-30 00:00:00 Lethrinus mahs… 10 12 Traps Pwani
9 2019-05-30 00:00:00 Lethrinus mahs… 11 12 Traps Pwani
10 2019-05-30 00:00:00 Lethrinus mahs… 9 12 Traps Pwani
# ℹ 769 more rows
As you can see above, the clean_names() function handled every kind of messy variable name that was present in our lfq.sample1 dataset. Everything now looks neat and tidy.
3.2 Filtering data
Our lfq.sample1
contains six variables – date, species, size_cm, size_class, gear_type and landing_site;
Rows: 779
Columns: 6
$ date <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ species <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ size_cm <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ size_class <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ gear_type <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…
Suppose we are interested in the species Siganus sutor only. For this purpose, we can use the filter() function from the dplyr package (Wickham et al., 2019).
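The filtering chunk was presumably something like this sketch:

```r
# keep only the rows recorded for Siganus sutor
lfq.sample1 %>%
  filter(species == "Siganus sutor")
```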
# A tibble: 196 × 6
date species size_cm size_class gear_type landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-05-30 00:00:00 Siganus sutor 16.5 18 Speargun Pwani
2 2019-05-30 00:00:00 Siganus sutor 12 12 Speargun Pwani
3 2019-05-30 00:00:00 Siganus sutor 15 15 Speargun Pwani
4 2019-05-30 00:00:00 Siganus sutor 33 33 Speargun Pwani
5 2019-05-30 00:00:00 Siganus sutor 11.2 12 hook and l… Pwani
6 2019-05-30 00:00:00 Siganus sutor 17 18 hook and l… Pwani
7 2019-05-30 00:00:00 Siganus sutor 19 21 Gill net Pwani
8 2019-06-25 00:00:00 Siganus sutor 36.5 39 Speargun Pwani
9 2019-06-25 00:00:00 Siganus sutor 11 12 hook and l… Pwani
10 2019-06-25 00:00:00 Siganus sutor 11 12 hook and l… Pwani
# ℹ 186 more rows
3.3 Select
Sometimes you want to select specific variables. For instance, we may be interested in four variables – date, landing_site, species and size_cm – but not so much in the others. The select() function from the dplyr package allows you to do exactly this.
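The selection chunk was presumably something like this sketch:

```r
# keep only the four variables of interest, in this order
lfq.sample1 %>%
  select(date, landing_site, species, size_cm)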
# A tibble: 779 × 4
date landing_site species size_cm
<dttm> <chr> <chr> <dbl>
1 2019-05-30 00:00:00 Pwani Siganus sutor 16.5
2 2019-05-30 00:00:00 Pwani Siganus sutor 12
3 2019-05-30 00:00:00 Pwani Siganus sutor 15
4 2019-05-30 00:00:00 Pwani Siganus sutor 33
5 2019-05-30 00:00:00 Pwani Siganus sutor 11.2
6 2019-05-30 00:00:00 Pwani Siganus sutor 17
7 2019-05-30 00:00:00 Pwani Siganus sutor 19
8 2019-05-30 00:00:00 Pwani Lethrinus mahsena 10
9 2019-05-30 00:00:00 Pwani Lethrinus mahsena 11
10 2019-05-30 00:00:00 Pwani Lethrinus mahsena 9
# ℹ 769 more rows
3.4 Arranging variables
Let’s say we want to know the species with the largest and smallest sizes throughout the entire dataset. This can be done with the arrange() function.
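The sorting chunk was presumably something like this sketch:

```r
# sort the records by body size, smallest first
lfq.sample1 %>%
  arrange(size_cm)
```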
# A tibble: 779 × 6
date species size_cm size_class gear_type landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-07-12 00:00:00 Lethrinus harak 6 6 Traps Pwani
2 2020-04-08 00:00:00 Lethrinus mahs… 8.3 12 Traps Bahari
3 2020-04-08 00:00:00 Lethrinus mahs… 8.4 12 Traps Bahari
4 2019-05-30 00:00:00 Lethrinus mahs… 9 12 Traps Pwani
5 2019-05-30 00:00:00 Lethrinus mahs… 9 12 Traps Pwani
6 2020-06-15 00:00:00 Siganus sutor 9 12 Traps Bahari
7 2020-06-15 00:00:00 Lethrinus harak 9 12 Traps Bahari
8 2019-05-30 00:00:00 Lethrinus mahs… 9.3 12 hook and… Pwani
9 2019-05-30 00:00:00 Lethrinus mahs… 9.3 12 hook and… Pwani
10 2020-04-08 00:00:00 Lethrinus mahs… 9.4 12 Traps Bahari
# ℹ 769 more rows
As we can see, Lethrinus harak is the species with the smallest size in the dataset: 6 cm. By default, the arrange() function sorts the data in ascending order. To display the data in descending order, we add desc().
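Presumably the descending version looked like this sketch:

```r
# sort the records by body size, largest first
lfq.sample1 %>%
  arrange(desc(size_cm))
```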
# A tibble: 779 × 6
date species size_cm size_class gear_type landing_site
<dttm> <chr> <dbl> <dbl> <chr> <chr>
1 2019-06-25 00:00:00 Siganus sutor 36.5 39 Speargun Pwani
2 2019-05-30 00:00:00 Siganus sutor 33 33 Speargun Pwani
3 2020-04-08 00:00:00 Calotomus caro… 30.4 33 Traps Bahari
4 2020-06-15 00:00:00 Lethrinus mahs… 30.2 33 hook and… Bahari
5 2019-05-30 00:00:00 Leptoscarus va… 30 30 Speargun Pwani
6 2020-04-08 00:00:00 Anampses spp. 30 30 Speargun Bahari
7 2019-07-11 00:00:00 Lutjanus fulvi… 27.5 30 hook and… Pwani
8 2019-07-12 00:00:00 Siganus sutor 27.5 30 Gill net Pwani
9 2019-07-11 00:00:00 Calotomus caro… 27 27 Beach se… Pwani
10 2019-07-12 00:00:00 Siganus sutor 27 27 Gill net Pwani
# ℹ 769 more rows
As we can see, Siganus sutor is the species with the largest size in the dataset: 36.5 cm.
3.5 Extracting unique observations
Which unique species are included in the dataset? To get this information, we use the distinct()
function in dplyr.
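The chunk was presumably something like this sketch:

```r
# list each species once
lfq.sample1 %>%
  distinct(species)
```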
# A tibble: 9 × 1
species
<chr>
1 Siganus sutor
2 Lethrinus mahsena
3 Leptoscarus vaigiensis
4 Calotomus carolinus
5 Lutjanus fulviflamma
6 Lethrinus harak
7 Anampses spp.
8 Acanthurus chronixis
9 Cheilinus chlorourus
As we see, there are nine species in the dataset, listed in the species variable of the data frame. However, we might need these species as a vector instead of a data frame. We use the pull() function for that, which extracts a column from a data frame as a vector.
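A sketch combining the two steps:

```r
# extract the unique species names as a character vector
lfq.sample1 %>%
  distinct(species) %>%
  pull(species)
```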
3.6 Creating new variables
The mutate()
command allows you to generate a new variable from existing. Let’s say you want to generate a day, month and year from date variables. The lubridate packaged (Grolemund and Wickham, 2011) has functions dedicated for dealing with dates, some of these functions include day()
, month()
, and year()
that are used to extract values.
lfq.sample1 %>%
select(date, species, size_cm) %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date)
)
# A tibble: 779 × 6
date species size_cm year month day
<dttm> <chr> <dbl> <dbl> <dbl> <int>
1 2019-05-30 00:00:00 Siganus sutor 16.5 2019 5 30
2 2019-05-30 00:00:00 Siganus sutor 12 2019 5 30
3 2019-05-30 00:00:00 Siganus sutor 15 2019 5 30
4 2019-05-30 00:00:00 Siganus sutor 33 2019 5 30
5 2019-05-30 00:00:00 Siganus sutor 11.2 2019 5 30
6 2019-05-30 00:00:00 Siganus sutor 17 2019 5 30
7 2019-05-30 00:00:00 Siganus sutor 19 2019 5 30
8 2019-05-30 00:00:00 Lethrinus mahsena 10 2019 5 30
9 2019-05-30 00:00:00 Lethrinus mahsena 11 2019 5 30
10 2019-05-30 00:00:00 Lethrinus mahsena 9 2019 5 30
# ℹ 769 more rows
3.7 Group-wise operations
Suppose we are interested in the sample size for each species in the dataset along with the mean, median, and standard deviation of the body size. That is not possible with a single function in the tidyverse, but rather a combination of functions. We use group_by() (to group our results by species) and pass the specific functions we want to compute to the summarise() function.
lfq.sample1 %>%
group_by(species) %>%
summarise(
n = n(),
mean_cm = mean(size_cm),
median_cm = median(size_cm),
std_cm = sd(size_cm)
)
# A tibble: 9 × 5
species n mean_cm median_cm std_cm
<chr> <int> <dbl> <dbl> <dbl>
1 Acanthurus chronixis 21 13.7 12.5 2.65
2 Anampses spp. 17 17.5 16.5 4.43
3 Calotomus carolinus 176 16.4 16.5 3.05
4 Cheilinus chlorourus 4 13.9 14 0.810
5 Leptoscarus vaigiensis 146 18.1 17.5 3.53
6 Lethrinus harak 37 12.6 12 2.72
7 Lethrinus mahsena 135 11.9 11.9 2.39
8 Lutjanus fulviflamma 47 16.7 16.9 2.67
9 Siganus sutor 196 18.2 17.5 4.66
The result can remain grouped after performing such transformations. In that case, you can pipe another command, ungroup(), to ungroup your result.
lfq.sample1 %>%
group_by(species) %>%
summarise(
n = n(),
mean_cm = mean(size_cm),
median_cm = median(size_cm),
std_cm = sd(size_cm)
) %>%
ungroup()
# A tibble: 9 × 5
species n mean_cm median_cm std_cm
<chr> <int> <dbl> <dbl> <dbl>
1 Acanthurus chronixis 21 13.7 12.5 2.65
2 Anampses spp. 17 17.5 16.5 4.43
3 Calotomus carolinus 176 16.4 16.5 3.05
4 Cheilinus chlorourus 4 13.9 14 0.810
5 Leptoscarus vaigiensis 146 18.1 17.5 3.53
6 Lethrinus harak 37 12.6 12 2.72
7 Lethrinus mahsena 135 11.9 11.9 2.39
8 Lutjanus fulviflamma 47 16.7 16.9 2.67
9 Siganus sutor 196 18.2 17.5 4.66
In some cases, you might wish to order the results from the lowest to the highest sample size along with the computed values. You simply add the arrange() function:
lfq.sample1 %>%
group_by(species) %>%
summarise(
n = n(),
mean_cm = mean(size_cm),
median_cm = median(size_cm),
std_cm = sd(size_cm)
) %>%
arrange(n)
# A tibble: 9 × 5
species n mean_cm median_cm std_cm
<chr> <int> <dbl> <dbl> <dbl>
1 Cheilinus chlorourus 4 13.9 14 0.810
2 Anampses spp. 17 17.5 16.5 4.43
3 Acanthurus chronixis 21 13.7 12.5 2.65
4 Lethrinus harak 37 12.6 12 2.72
5 Lutjanus fulviflamma 47 16.7 16.9 2.67
6 Lethrinus mahsena 135 11.9 11.9 2.39
7 Leptoscarus vaigiensis 146 18.1 17.5 3.53
8 Calotomus carolinus 176 16.4 16.5 3.05
9 Siganus sutor 196 18.2 17.5 4.66
You notice that Cheilinus chlorourus has only four observations, whereas Siganus sutor has 196 records in the dataset. You can also explore the summary statistics to see whether the mean and median values are close (an indicator of a normal distribution) or differ (an indication of skewness).
4 References
Citation
@online{semba2024,
author = {Semba, Masumbuko},
title = {Data Cleaning, Merging, and Appending},
date = {2024-05-01},
url = {https://lugoga.github.io/kitaa/posts/manipulate/},
langid = {en}
}