Importing Data
So far, we’ve looked at several dataset in previous chapter and we have also created ourselves some datasets. While you can do all your data entry work in R or Excel, it is much more common to load data from other sources. Local and international organization have been collecting fisheries dependent and fisheries independent data for years. These historical dataset with fisheries information like fish catch, effort, landing sites, fishing ground and critical habitats can be obtained from several databases—some are open and other closed.
Much of the data we download or receive from is either comma-separated value files .csv
or and Excel spreadsheets, .xlsx.
.csv files are spreadsheets stored as text files - basically Excel files stripped down to the bare minimum - no formatting, no formulas, no macros. You can open and edit them in spreadsheet software like LibreOffice Calc, Google Sheets or Microsoft Excel. Many devices and databases can export data in .csv format, making it a commonly used file format that you are likely to encounter sooner rather than later.
Whether that be a comma separated (csv) or a tab delimited file, there are multiple functions that can read these data into R. We will stick to loading these data from the tidyverse packages (Wickham and Wickham, 2017) but be aware these are not the only methods for doing this. We will use the tidyverse functions just to maintain consistency with everything else we do. The first package in tidyverse we will use is called readr (Wickham et al., 2017), which is a collection of functions to load the tabular data from working directory in our machine into R session. Some of its functions include:
read_csv()
: comma separated (CSV) filesread_tsv()
: tab separated filesread_delim()
: general delimited filesread_fwf()
: fixed width filesread_table()
: tabular files where columns are separated by white-space.read_log()
: web log files- readxl reads in Excel files.
Before we import the data, we need to load the packages that we will use their functions in this chapter
Importing csv files
A CSV
file is a type of file where each line contains a single record, and all the columns are separated from each other via a comma. In order to load data from a file into R, you need its path - that is, you need to tell R where to find the file. Unless you specify otherwise, R will look for files in its current working directory. You can read .csv
file using read_csv()
function of the readr package (Wickham et al., 2017) as shown in the chunk below;
We imported tidy_LFQ_sample_4.csv
from working directory into R using read_csv()
and specify the path to the file in your working directory and store as imported.lfq. If you get an error message, it means that
tidy_LFQ_sample_4.csvis not in your working directory. Either move the file to the right directory (remember, you can use run
getwd()` to see what your working directory is) or change your working directory.
If you glimpse the dataframe with glimpse()
function, you should see the internal structure of the imported.lfq
object we just loaded;
Rows: 6,185
Columns: 6
$ site <chr> "Mombasa", "Mombasa", "Mombasa", "Mombasa", "Mombasa", "Mombasa"…
$ date <date> 2019-04-05, 2019-04-05, 2019-04-05, 2019-04-05, 2019-04-05, 201…
$ tl_mm <dbl> 184, 185, 145, 189, 175, 165, 181, 176, 164, 154, 188, 186, 179,…
$ fl_mm <dbl> 169, 169, 134, 173, 161, 153, 165, 163, 148, 142, 173, 173, 164,…
$ wt_gm <dbl> 59.50, 54.71, 24.15, 61.36, 49.31, 38.54, 49.68, 45.27, 36.26, 3…
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
The dataset contains six variables and 6,185 records. The variables site
and sex
both contain text, and have been imported as character vectors4. The date
column has been imported as date format, the variable tl_mm
and fl_mm
are measured length and have been imported as numeric vector measured in millimeters. The variable wt_gm
is the weight of fish measured in grams and also have been imported as numeric vector.
So, what can you do in case you need to import data from a file that is not in your working directory? This is a common problem, as many of us store script files and data files in separate folders (or even on separate drives). One option is to use file.choose
, which opens a pop-up window that lets you choose which file to open using a graphical interface:
This solution work just fine if you just want to open a single file once. But if you want to reuse your code or run it multiple times, you probably don’t want to have to click and select your file each time. Instead, you can specify the path to your file in the call to read_csv
.
Importing Excel files
Commonly our data is stored as a Excel file. There are several packages that can be used to import Excel files to R. I prefer the readxl package (Wickham and Bryan, 2018), so let’s install that:
The package has read_exel()
function that allows us to specify which sheet within the Excel file to read. The function automatically convert the worksheet into a .csv
file and read it. Let’s us import the the data in first sheet of the tidy_LFQ_sample_4.xlsx
. Is a similar dataset that just imported in the previous section, but is in Excel format. We will use this file to illustrate how to import the excel file into R workspace with readxl
package (Wickham and Bryan, 2018).
# A tibble: 6,185 × 6
site date tl_mm fl_mm wt_gm sex
<chr> <dttm> <dbl> <dbl> <dbl> <chr>
1 Mombasa 2019-04-05 00:00:00 184 169 59.5 M
2 Mombasa 2019-04-05 00:00:00 185 169 54.7 M
3 Mombasa 2019-04-05 00:00:00 145 134 24.2 M
4 Mombasa 2019-04-05 00:00:00 189 173 61.4 M
5 Mombasa 2019-04-05 00:00:00 175 161 49.3 M
6 Mombasa 2019-04-05 00:00:00 165 153 38.5 M
7 Mombasa 2019-04-05 00:00:00 181 165 49.7 M
8 Mombasa 2019-04-05 00:00:00 176 163 45.3 M
9 Mombasa 2019-04-05 00:00:00 164 148 36.3 M
10 Mombasa 2019-04-05 00:00:00 154 142 31.9 M
# ℹ 6,175 more rows
Name | Piped data |
Number of rows | 6185 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 3 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
site | 0 | 1 | 3 | 7 | 0 | 2 | 0 |
sex | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
tl_mm | 0 | 1 | 170.77 | 21.08 | 97.0 | 157.00 | 171.00 | 183.00 | 269.00 | ▁▅▇▁▁ |
fl_mm | 0 | 1 | 156.00 | 19.26 | 18.1 | 144.00 | 156.00 | 168.00 | 241.00 | ▁▁▅▇▁ |
wt_gm | 0 | 1 | 46.03 | 19.51 | 7.0 | 32.77 | 43.59 | 55.28 | 194.18 | ▇▆▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2016-03-31 | 2020-09-11 | 2020-02-25 | 42 |
Saving and exporting your data
In many a case, data manipulation is a huge part of statistical work, and of course you want to be able to save a data frame after manipulating it. There are two options for doing this in R - you can either export the data as e.g. a .csv or a .xlsx file, or save it in R format as an .RData file.
Exporting data
Just as we used the functions read_csv
and read_excel
to import data, we can use write_csv
to export it. The code below saves the bookstore data frame as a .csv file file, which will be created in the current working directory. If you wish to store
Saving and loading R data
Being able to export to different spreadsheet formats is very useful, but sometimes you want to save an object that can’t be saved in a spreadsheet format. For instance, you may wish to save a multiple processed data, functions and formula that you’ve created. .RData
files can be used to store one or more R objects. To save the objects bookstore and age in a .Rdata
file, we can use the save function:
References
Citation
@online{semba2024,
author = {Semba, Masumbuko},
title = {Importing Table Files into {R}},
date = {2024-02-26},
url = {https://lugoga.github.io/kitaa/posts/importingData/},
langid = {en}
}