Understanding vector and dataframe

Understanding vectors and data frames as the core data storage structures in R is an important step toward data analysis and visualization
Published

February 12, 2024

Introduction

R is a flexible language that works with many kinds of data (R Core Team, 2023), including integer, numeric, character, complex, date and logical values. The default type for numbers in R is double-precision numeric. Datasets in R are often a combination of several of these data types; the common ones are highlighted in Figure 1.

Figure 1: Common data types often collected and stored for analysis and modelling

Vectors

Often we want to store a set of numbers in one place. One way to do this is with a vector, the most basic data structure in R. A vector is a sequence of elements of the same data type. If the elements are of different data types, they are coerced to a common type that can accommodate all of them. Vectors are generally created with the c() function, short for concatenate, though other functions can be used depending on the type of vector being created. A vector stores several values in one container. Let us look at the example below:

id = c(1,2,3,4,5)
mean.tl = c(158,659,782,659,759)
country = c("Somalia", "Kenya", "Mauritius", "Seychelles",  "Mozambique")

Notice that the c() function, short for concatenate, wraps the list of values and combines them into one container. Notice also that the individual values are separated by commas. The comma acts as an item delimiter that lets R hold each value separately. This is vital: without the delimiter, R cannot tell where one value ends and the next begins, and an expression such as c(1 2 3) is a syntax error.
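The coercion rule mentioned above can be seen in a small sketch. The vector names here are hypothetical and chosen only for illustration; the point is that c() silently converts mixed elements to the most flexible type present.

```r
# Mixing types in c() triggers coercion to the most flexible type present
mixed.num <- c(1L, 2.5)          # integer + double    -> double (numeric)
mixed.lgl <- c(TRUE, 3)          # logical + double    -> double (numeric)
mixed.chr <- c(1, "two", TRUE)   # anything + character -> character

class(mixed.num)   # "numeric"
class(mixed.lgl)   # "numeric"
class(mixed.chr)   # "character"
mixed.chr          # "1" "two" "TRUE"
```

Because the conversion happens without a warning, checking class() after building a vector from mixed inputs is a cheap safeguard.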

Numeric

The most common data type in R is numeric. The numeric class holds real numbers, that is, numbers that may have decimal places. We create a numeric vector with the c() function, but any function that returns a sequence of numbers also works. For example, we can create a numeric vector of sea surface temperature (SST) as follows:

sst = c(25.4, 26, 28, 27.8, 29, 24.8, 22.3)

We can check whether the variable sst is numeric with the is.numeric() function:

is.numeric(sst)
[1] TRUE

Integer

An integer vector is a special case of numeric data. Unlike numeric values, integers have no decimal places. They are commonly used for counting or indexing. Creating an integer vector is similar to creating a numeric vector, except that we need to instruct R to treat the data as integer rather than numeric (double). To do so, we add the suffix L to each element:

depth = c(5L, 10L, 15L, 20L, 25L,30L)
is.vector(depth);class(depth)
[1] TRUE
[1] "integer"
Note

Even if your values have no decimals, R stores them as numeric (double) rather than integer unless you add the L suffix or convert them with as.integer().

aa = c(20,68,78,50)

You can check whether the data are integer with is.integer() and convert numeric values to integer with as.integer():

is.integer(aa)
[1] FALSE

You can query the class of the object with class():

class(aa)
[1] "numeric"

Although the values in aa are whole numbers, is.integer() returns FALSE and class() reports numeric. This is because the default type for numbers in R is numeric. However, you can use as.integer() to convert numeric values to integer:

class(as.integer(aa))
[1] "integer"
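The distinction between double and integer storage can be made explicit with typeof(). The sketch below uses a hypothetical vector named whole and also shows one detail worth knowing about as.integer(): it truncates toward zero instead of rounding.

```r
whole <- c(20, 68, 78, 50)   # no L suffix, so the values are stored as double
typeof(whole)                # "double"
typeof(c(20L, 68L))          # "integer"
typeof(1:4)                  # "integer": the colon operator yields integers

# as.integer() drops the decimal part (truncation toward zero), it does not round
as.integer(c(3.9, -3.9))     # 3 -3
```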

Character

In programming terms, text is usually called a string. Strings are often text data such as names. A character vector may contain a single character, a word or a group of words. The elements must be enclosed in single or double quotation marks.

sites = c("Pemba Channel", "Zanzibar Channel", "Pemba Channel")
is.vector(sites); class(sites)
[1] TRUE
[1] "character"

We can be sure whether the object is a string with is.character() or check the class of the object with class().

countries = c("Kenya", "Uganda", "Rwanda", "Tanzania")
class(countries)
[1] "character"
Note

Everything inside "" is treated as character, whether or not it looks like a number.
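A short sketch of the note above, using hypothetical tag numbers: values typed inside quotes are stored as text, so arithmetic on them fails until they are converted with as.numeric().

```r
tag.id <- c("101", "102", "103")   # hypothetical ids typed inside quotes
class(tag.id)                      # "character"

# convert the text to numbers before doing arithmetic
tag.num <- as.numeric(tag.id)
class(tag.num)                     # "numeric"
sum(tag.num)                       # 306
```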

Factor

Factor variables are a special case of character variables in the sense that they also contain text. However, factor variables are used when there is a limited number of unique character strings. A factor often represents a categorical variable. For instance, gender will usually take on only two values, "female" or "male" (and will be treated as a factor variable), whereas a name will generally have many possibilities (and will be treated as a character variable). To create a factor variable, use the factor() function:

maturity.stage <- factor(c("I", "II", "III", "IV", "V"))
maturity.stage
[1] I   II  III IV  V  
Levels: I II III IV V

To know the different levels of a factor variable, use levels():

levels(maturity.stage)
[1] "I"   "II"  "III" "IV"  "V"  

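By default, the levels are sorted alphabetically. You can set the order of the levels with the levels argument of the factor() function. Any value that is not listed in levels becomes NA, so the example below keeps only stages V and III:

mature <- factor(maturity.stage, levels = c("V", "III"))
levels(mature)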
[1] "V"   "III"
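To reorder levels without losing any values, every level has to be listed. The sketch below contrasts a full reordering with a partial one; the object names stage, reversed and partial are hypothetical.

```r
stage <- c("I", "II", "III", "IV", "V")

# list every level to change the order without dropping values
reversed <- factor(stage, levels = c("V", "IV", "III", "II", "I"))
levels(reversed)        # "V" "IV" "III" "II" "I"
sum(is.na(reversed))    # 0

# listing only some levels turns the remaining values into NA
partial <- factor(stage, levels = c("V", "III"))
as.character(partial)   # NA NA "III" NA "V"
sum(is.na(partial))     # 3
```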

Character strings can be converted to factors with as.factor():

text <- c("test1", "test2", "test1", "test1") # create a character vector
class(text) # to know the class
[1] "character"
text_factor <- as.factor(text) # transform to factor
class(text_factor) # recheck the class
[1] "factor"

The character strings have been transformed to a factor, as shown by the class of text_factor.

Often we wish to take a continuous numerical vector and transform it into a factor. The function cut() takes a vector of numerical data and creates a factor based on the cut-points you give. Let us simulate the total length of 508 bigeye tuna with the rnorm() function.

tl.cm = rnorm(n = 508, mean = 40, sd = 18)

# mosaic::plotDist(dist = "norm", mean = 40, sd = 18, under = F, kind = "cdf", add = TRUE)

tl.cm |>
  tibble::as_tibble() |>
  ggstatsplot::gghistostats(x = value, binwidth = 10, test.value = 40.2, type = "n", normal.curve = TRUE, centrality.type = "p", xlab = "Total length (cm)")
Figure 2: Normal distribution of bigeye tuna’s total length

We can now break the distribution into groups and make a simple plot, as shown in Figure 3, where the frequency of bigeye tuna is color coded by size group.

group = cut(tl.cm, breaks = c(0,30,60,110),
            labels = c("Below 30", "30-60", "Above 60"))
is.factor(group)
[1] TRUE
levels(group)
[1] "Below 30" "30-60"    "Above 60"
barplot(table(group), las = 1, horiz = FALSE, col = c("blue", "green", "red"), ylab = "Frequency", xlab = "")
Figure 3: Length frequency of bigeye tuna
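It helps to know which group a value that sits exactly on a cut-point falls into. The sketch below uses a small hypothetical vector len with the same breaks as above and no custom labels, so that cut() prints the interval notation. By default the intervals are closed on the right; setting right = FALSE closes them on the left instead.

```r
len <- c(10, 30, 45, 60, 75)   # hypothetical lengths, two of them on a cut-point

# default: intervals are open on the left and closed on the right
grp <- cut(len, breaks = c(0, 30, 60, 110))
as.character(grp)
# "(0,30]" "(0,30]" "(30,60]" "(30,60]" "(60,110]"

# right = FALSE: intervals are closed on the left and open on the right
grp.left <- cut(len, breaks = c(0, 30, 60, 110), right = FALSE)
as.character(grp.left)
# "[0,30)" "[30,60)" "[30,60)" "[60,110)" "[60,110)"
```

Values outside the outermost breaks are returned as NA, which is worth checking when the data come from rnorm() and can fall below zero.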

Logical

Logical data (or simply logicals) represent the logical TRUE state and the logical FALSE state. Logical variables are the variables in which logical data are stored. Logical variables can assume only two states:

  • FALSE, which corresponds to 0 when converted to a number;
  • TRUE, which corresponds to 1 when converted to a number. Going the other way, any nonzero number converts to TRUE.

We can create logical variables indirectly, through logical operations, such as the result of a comparison between two numbers. These operations return logical values. For example, type the following statement at the R console:

5 > 3;
[1] TRUE
5 < 3
[1] FALSE

Since 5 is indeed greater than 3, the result of the first comparison is TRUE; 5 is not less than 3, so the second comparison is FALSE. The signs > and < are relational operators, which return logical values.

value1 <- 7
value2 <- 9
greater <- value1 > value2
greater
[1] FALSE
class(greater)
[1] "logical"
# is value1 less than or equal to value2?
less <- value1 <= value2
less
[1] TRUE
class(less)
[1] "logical"

It is also possible to transform logical data into numeric data. After the transformation with the as.numeric() command, FALSE values become 0 and TRUE values become 1:

greater_num <- as.numeric(greater)
greater_num
[1] 0
less_num <- as.numeric(less)
less_num
[1] 1

Conversely, numeric data can be converted to logical data, with FALSE for all values equal to 0 and TRUE for all other values.

x <- 0
as.logical(x)
[1] FALSE
y <- 5
as.logical(y)
[1] TRUE
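Because TRUE counts as 1 and FALSE as 0, a comparison applied to a whole vector can be summed to count matches or averaged to get a proportion. A minimal sketch, reusing the sea surface temperature values from the Numeric section under the hypothetical name sst.obs and an arbitrary threshold of 26:

```r
sst.obs <- c(25.4, 26, 28, 27.8, 29, 24.8, 22.3)

# the comparison is applied element by element and returns a logical vector
warm <- sst.obs > 26
warm         # FALSE FALSE TRUE TRUE TRUE FALSE FALSE

sum(warm)    # 3, the number of observations above 26
mean(warm)   # the proportion of observations above 26 (3 out of 7)
```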

Date and Time

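Dates and times are also stored as vectors in R. The dates below are passed to dmy() as quoted strings, because a bare number such as 010121 loses its leading zero before the function sees it.

date.time = seq(lubridate::dmy("010121"), 
                lubridate::dmy("250121"), 
                length.out = 5)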
date.time
[1] "2021-01-01" "2021-01-07" "2021-01-13" "2021-01-19" "2021-01-25"
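The same kind of date vector can be built with base R alone. The sketch below is an alternative to the lubridate call rather than a replacement for it; the object names start and weekly are hypothetical. It also shows that subtracting two dates gives the number of days between them.

```r
start <- as.Date("2021-01-01")
class(start)                           # "Date"

# a regular sequence of four dates, one week apart
weekly <- seq(from = start, by = "week", length.out = 4)
weekly
# "2021-01-01" "2021-01-08" "2021-01-15" "2021-01-22"

# date arithmetic: days between the last and the first date
as.numeric(weekly[4] - weekly[1])      # 21
```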

Generating vectors

Sequence Numbers

There are a few R operators and functions designed for creating vectors of non-random numbers. They provide multiple ways of generating sequences of numbers.

The colon operator (:) generates a regular sequence of numbers between the specified lower and upper bounds. For example, to generate the numbers between 0 and 10, we simply write:

vector.seq = 0:10
vector.seq
 [1]  0  1  2  3  4  5  6  7  8  9 10

However, if you want to generate a sequence with a specified interval, say the numbers between 0 and 10 with an interval of 2, use the seq() function:

regular.vector = seq(from = 0,to = 10, by = 2)
regular.vector
[1]  0  2  4  6  8 10

Unlike the seq() function and the : operator, which work with numbers only, the rep() function creates a vector by repeating numbers or strings:

id = rep(x = 3, each = 4)
station = rep(x = "Station1", each = 4)
id;station
[1] 3 3 3 3
[1] "Station1" "Station1" "Station1" "Station1"

Sequence characters

The rep() function accepts the each and times arguments. The each argument creates a vector that repeats each element of the input vector the specified number of times.

sampled.months = c("January", "March", "May")
rep(x = sampled.months, each = 3)
[1] "January" "January" "January" "March"   "March"   "March"   "May"    
[8] "May"     "May"    

The times argument, in contrast, repeats the whole vector the specified number of times:

rep(x = sampled.months, times = 3)
[1] "January" "March"   "May"     "January" "March"   "May"     "January"
[8] "March"   "May"    
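The two arguments can also be combined, in which case each is applied first and the result is then repeated times times. The sketch below shows this with hypothetical values, along with two related helpers: the length.out argument of seq(), which fixes the number of elements instead of the step, and seq_len(), which counts from 1.

```r
# each is applied first, then the whole result is repeated
rep(x = c("A", "B"), each = 2, times = 2)
# "A" "A" "B" "B" "A" "A" "B" "B"

# fix the number of elements rather than the interval
seq(from = 0, to = 1, length.out = 5)
# 0.00 0.25 0.50 0.75 1.00

# integers from 1 up to a given length
seq_len(4)
# 1 2 3 4
```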

Generating normal distribution

The normal distribution is central to statistics. R has the rnorm() function, which generates a vector of normally distributed random values. For example, to generate a vector of 40 sea surface temperature values from a normal distribution with a mean of 25 and a standard deviation of 1.58, we type this expression in the console:

sst = rnorm(n = 40, mean = 25,sd = 1.58)
sst
 [1] 23.04693 24.99349 25.68869 23.84683 25.69666 24.93500 23.44773 26.62016
 [9] 27.67181 26.30010 22.03781 25.77229 23.92286 23.35629 27.66600 28.08170
[17] 22.16890 24.93247 24.46477 25.94592 24.50469 28.61894 21.42219 26.88232
[25] 26.96524 22.87907 26.34715 22.76567 24.19697 25.49118 29.21119 22.55112
[33] 23.87877 25.75880 24.54350 23.59964 22.44975 25.43948 25.33276 23.46390
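Because rnorm() draws random values, running the expression again gives different numbers from those printed above. A minimal sketch of how to make the draw reproducible with set.seed(); the seed value 2024 and the names draw1 and draw2 are arbitrary choices for illustration.

```r
set.seed(2024)   # arbitrary seed; any integer works
draw1 <- rnorm(n = 40, mean = 25, sd = 1.58)

set.seed(2024)   # resetting the same seed reproduces the same values
draw2 <- rnorm(n = 40, mean = 25, sd = 1.58)

identical(draw1, draw2)   # TRUE

# the sample mean and standard deviation are close to,
# but not exactly equal to, the values requested
mean(draw1)
sd(draw1)
```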

Common tasks

Rounding off numbers

There are several ways to round numbers to the nearest integer or to a specified number of decimal places. The code block below illustrates the most common one, round():

chl = rnorm(n = 20, mean = .55, sd = .2)
chl |> round(digits = 2)
 [1] 0.32 0.74 0.58 0.31 0.59 0.85 0.76 0.53 0.32 0.72 0.63 0.38 0.47 0.52 0.98
[16] 0.77 0.87 0.73 0.48 0.53
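Besides round(), base R offers a few related functions. The sketch below compares them on a small hypothetical vector x; the last line illustrates that round() sends values exactly halfway between two integers to the even one, as described in the R documentation for round().

```r
x <- c(0.3249, 1.2361, 7.8)

round(x, digits = 1)     # 0.3 1.2 7.8   one decimal place
signif(x, digits = 2)    # 0.32 1.20 7.80   two significant digits
floor(x)                 # 0 1 7   largest integer not above the value
ceiling(x)               # 1 2 8   smallest integer not below the value

# values exactly halfway are rounded to the even integer
round(c(0.5, 1.5, 2.5))  # 0 2 2
```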

Number of elements in a vector

Sometimes you may have a long vector and want to know the number of elements it contains. The length() function queries the vector and returns that number:

length(chl)
[1] 20

Data Frame

One of R’s greatest strengths is in manipulating data. One of the primary structures for storing data in R is the data frame, and much of your work in R will involve manipulating data frames. Data frames are made up of rows and columns. The top row is a header that names each variable. Each subsequent row represents an individual measured or observed record. Records can also have names. Each record contains multiple cells of values. Let’s illustrate a data frame using historical catch data for the Western Indian Ocean (WIO) region from FAO. This dataset is called landings_wio_country.csv and contains total landed catches of ten countries in the WIO region reported to FAO between 1950 and 2015 (Table 1).

Table 1: Landing of fish by country

country       year     catch
Kenya         1950    19,154
Kenya         1951    21,318
Kenya         1952    19,126
Kenya         1953    20,989
Kenya         1954    17,541
Madagascar    2011   180,052
Madagascar    2012   369,808
Madagascar    2013   266,953
Madagascar    2014   138,478
Madagascar    2015   145,629

A data frame is very much like a simple Excel spreadsheet where each column represents a variable and each row represents an observation. It is the most common way of storing data in R and, generally, the data structure most often used for data analysis. A data frame is a list of equal-length vectors with rows as records and columns as variables. What makes data frames distinctive is that each column can hold a different class of object (numeric, character, factor, logical, etc.).

In this section, we will create data frames and add attributes to data frames. Perhaps the easiest way to create a data frame is to pass vectors to the data.frame() function. For instance, here we create a simple data frame dt and assess its internal structure:

# create vectors
country  = c('Kenya','Mozambique','Seychelles')
weight = c(90, 75, 92)
maturity = c("I", "II", "V")

## use the vectors to make a data frame
dt = data.frame(country, weight, maturity)

## assess the internal structure
str(dt)
'data.frame':   3 obs. of  3 variables:
 $ country : chr  "Kenya" "Mozambique" "Seychelles"
 $ weight  : num  90 75 92
 $ maturity: chr  "I" "II" "V"

Note that the character columns country and maturity in dt remain of class chr. In versions of R before 4.0.0, data.frame() converted character columns to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. You can still set the argument explicitly to make the behaviour clear:

## use the vectors to make a data frame
df = data.frame(country, weight, maturity, stringsAsFactors = FALSE)
df |> str()
'data.frame':   3 obs. of  3 variables:
 $ country : chr  "Kenya" "Mozambique" "Seychelles"
 $ weight  : num  90 75 92
 $ maturity: chr  "I" "II" "V"
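To see what the stringsAsFactors argument actually changes, the sketch below builds the same one-column data frame twice. The column name site and the object names df.chr and df.fct are hypothetical, and the first result assumes R 4.0.0 or later.

```r
# default behaviour (R >= 4.0.0): character columns stay character
df.chr <- data.frame(site = c("A", "B", "A"))
class(df.chr$site)     # "character"

# stringsAsFactors = TRUE converts character columns to factors
df.fct <- data.frame(site = c("A", "B", "A"), stringsAsFactors = TRUE)
class(df.fct$site)     # "factor"
levels(df.fct$site)    # "A" "B"
```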

The character variables are of character class in the data frame. The historical tendency of data.frame() to convert character columns into factors was also addressed by the tibble (Müller and Wickham, 2019), an enhanced data frame that provides stricter checking and better formatting than the traditional data.frame.

## use the vectors to make a tibble
tb = tibble::tibble(country, weight, maturity) 
## check the internal structure of the tibble
tb |> dplyr::glimpse()
Rows: 3
Columns: 3
$ country  <chr> "Kenya", "Mozambique", "Seychelles"
$ weight   <dbl> 90, 75, 92
$ maturity <chr> "I", "II", "V"

The table below shows the data frame created by combining the three vectors.

country       weight   maturity
Kenya             90   I
Mozambique        75   II
Seychelles        92   V

Because the columns have meaning and we have given them names, it is desirable to access an element by the name of its column rather than by the column number. In large Excel spreadsheets I often get annoyed trying to remember which column something was in. In R, the $ sign and [] are used to select a variable from a data frame.

dt$country
[1] "Kenya"      "Mozambique" "Seychelles"
dt[,1]
[1] "Kenya"      "Mozambique" "Seychelles"
dt$weight
[1] 90 75 92
dt[,2]
[1] 90 75 92
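Columns can also be selected by name inside the brackets, and the row position of [] can take a logical condition to keep only the matching records. The sketch below rebuilds the same three-row data frame so that it runs on its own; the weight threshold of 80 is an arbitrary value for illustration.

```r
dt <- data.frame(country  = c("Kenya", "Mozambique", "Seychelles"),
                 weight   = c(90, 75, 92),
                 maturity = c("I", "II", "V"))

# select a column by name instead of by position
dt[["weight"]]                    # 90 75 92
dt[, "weight"]                    # 90 75 92

# keep only the rows that satisfy a condition
heavy <- dt[dt$weight > 80, ]
heavy$country                     # "Kenya" "Seychelles"

# combine a row condition with a selection of columns
dt[dt$weight > 80, c("country", "maturity")]
```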

The FSA package in R has built-in datasets that we can use for illustration. For example, the ChinookArg dataset contains the total length and weight of 112 Chinook salmon collected at three sites in Argentina (Table 2).

chinook = FSA::ChinookArg



chinook |>
  dplyr::sample_n(size = 12) |>
  flextable::flextable() |> 
  flextable::autofit()
Table 2: Length and weight of Chinook salmon sampled at three different sites

tl       w      loc
25.2     0.3    Puyehue
112.9    16.0   Petrohue
108.1    13.3   Petrohue
68.1     7.3    Argentina
82.2     6.7    Argentina
78.8     8.4    Argentina
86.0     6.8    Petrohue
92.1     14.8   Argentina
99.4     10.2   Petrohue
85.1     9.0    Argentina
79.0     6.6    Puyehue
103.0    12.6   Petrohue

Sometimes you may need to create sets of values, store them in vectors, and then combine the vectors into a data frame. Let us see how this can be done. First create three vectors. The first contains the id of ten vessels, the second holds the time each vessel departed for fishing, and the third holds the distance from each vessel to the fishing ground. We can concatenate the sets of values to make the vectors.

vessel.id  = c(1,2,3,4,5,6,7,8,9,10)

departure.time = lubridate::ymd_hms(c("2018-11-20 06:35:25 EAT", "2018-11-20 06:52:05 EAT", 
                 "2018-11-20 07:08:45 EAT", "2018-11-20 07:25:25 EAT", 
                 "2018-11-20 07:42:05 EAT", "2018-11-20 07:58:45 EAT", 
                 "2018-11-20 08:15:25 EAT", "2018-11-20 08:32:05 EAT", 
                 "2018-11-20 08:48:45 EAT", "2018-11-20 09:05:25 EAT"), tz = "")

distance.ground = c(20, 85, 45, 69, 42,  52, 6, 45, 36, 7)

Once we have vectors of the same length, we can use the data.frame() function to combine the three vectors into one data frame, shown in Table 3:

fishing.dep = data.frame(vessel.id, 
                     departure.time, 
                     distance.ground)
Table 3: The time fishers departed for fishing with the distance to the fishing ground

vessel.id   date         time       distance.ground
1           2018-11-20   06:35:25   20
2           2018-11-20   06:52:05   85
3           2018-11-20   07:08:45   45
4           2018-11-20   07:25:25   69
5           2018-11-20   07:42:05   42
6           2018-11-20   07:58:45   52
7           2018-11-20   08:15:25    6
8           2018-11-20   08:32:05   45
9           2018-11-20   08:48:45   36
10          2018-11-20   09:05:25    7
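Table 3 shows separate date and time columns, whereas the data frame holds a single date-time column. One way such columns could be derived is sketched below with base R; this is an assumption about the approach, not necessarily how the table was produced. The object name trips is hypothetical, only the first three vessels are used, and the time zone "Africa/Nairobi" is assumed as the zone matching the EAT abbreviation.

```r
trips <- data.frame(
  vessel.id = 1:3,
  departure.time = as.POSIXct(c("2018-11-20 06:35:25",
                                "2018-11-20 06:52:05",
                                "2018-11-20 07:08:45"),
                              tz = "Africa/Nairobi"),   # assumed time zone
  distance.ground = c(20, 85, 45)
)

# basic checks on the size and column names of the data frame
nrow(trips)    # 3
ncol(trips)    # 3
names(trips)   # "vessel.id" "departure.time" "distance.ground"

# add new columns derived from the date-time column
trips$date <- as.Date(trips$departure.time, tz = "Africa/Nairobi")
trips$time <- format(trips$departure.time, "%H:%M:%S")

trips[, c("vessel.id", "date", "time", "distance.ground")]
```

Assigning to a new name with $ appends the column to the data frame, which is the simplest way to extend an existing table with computed values.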

References

Müller, K., Wickham, H., 2019. Tibble: Simple data frames.
R Core Team, 2023. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Citation

BibTeX citation:
@online{semba2024,
  author = {Semba, Masumbuko},
  title = {Understanding Vector and Dataframe},
  date = {2024-02-12},
  url = {https://lugoga.github.io/kitaa/posts/vectorDataframe/},
  langid = {en}
}
For attribution, please cite this work as:
Semba, M., 2024. Understanding vector and dataframe [WWW Document]. URL https://lugoga.github.io/kitaa/posts/vectorDataframe/