3  Vectors

The primary objectives of this chapter revolve around three key areas. Firstly, we aim to equip readers with the skills to effectively work with R code files, allowing them to efficiently manage and execute their code. Secondly, we focus on familiarizing readers with the fundamental data structure in R, namely the vector. By understanding the intricacies and functionalities of vectors, readers will gain a solid foundation for handling and manipulating data in R. Lastly, we delve into the essential concept of subsetting, which plays a pivotal role in data analysis. Through this chapter, readers will acquire the necessary knowledge and techniques to extract and manipulate subsets of data, empowering them to conduct insightful and meaningful analyses in R.

3.0.1 Using code files

In Chapter 2, we typed short and simple expressions into the R console.As we advance in our learning journey, we will transition from typing short and simple expressions directly into the R console to working with longer and more complex code. To facilitate the editing and saving of such code, we store it in dedicated code files. Code files provide a structured and organized approach to managing our R code, allowing us to easily modify, save, and reuse our code for future use. This practice becomes particularly crucial as our projects and analyses become more intricate, enabling us to maintain clarity and efficiency in our coding endeavors..

When it comes to writing computer code, it is crucial to use a plain text editor, such as Notepad++ or RStudio, rather than a word processor like Microsoft Word. The reasons for avoiding word processors when writing code are as follows:

  1. Formatting and Encoding: Word processors often add formatting and special characters that are not compatible with code. These additional formatting elements can introduce errors or cause the code to be misinterpreted by the compiler or interpreter.

  2. Rich Text and Styling: Word processors allow for rich text formatting, including different fonts, sizes, colors, and styles. However, these formatting options are not relevant or recognized by code editors, and they may result in syntax errors or unexpected behavior.

  3. Hidden Characters: Word processors may include hidden characters for formatting purposes, such as line breaks or tab spaces. These characters can interfere with the intended structure and functionality of the code.

  4. File Format: Word processors typically save files in proprietary formats like .doc or .docx, which are not plain text formats. Code files, on the other hand, are commonly saved with plain text extensions like .R or .txt. Using a plain text editor ensures that the code is saved in the appropriate format.

  5. Integration with Development Environment: Specialized code editors, such as RStudio (See Section 2.2), offer features and functions specifically designed for coding, such as syntax highlighting, auto-completion, and debugging tools. These features greatly enhance the coding experience and make it easier to write, debug, and manage code.

3.1 RStudio keyboard stortcuts

RStudio provides several handy keyboard shortcuts that streamline the process of editing and executing code files. Here are some useful RStudio keyboard shortcuts:

  1. Ctrl + Enter: Executes the currently selected line or the highlighted code chunk.

  2. Ctrl + Shift + Enter: Executes the current line or selected code chunk and moves the cursor to the next line.

  3. Ctrl + Shift + Up/Down Arrow: Moves the current line or selected lines up or down within the script.

  4. Ctrl + L: Clears the console.

  5. Ctrl + 1/2/3/4: Switches between the different panes in RStudio (Source, Console, Environment, and Plots).

  6. Ctrl + Shift + M: Inserts a new Markdown chunk in an R Markdown document.

  7. Ctrl + Shift + K: Knits the current R Markdown document to the specified output format.

  8. Ctrl + Shift + C: Comments or uncomments the selected lines or the current line.

  9. Ctrl + Shift + R: Runs the current line or selected code as R code (useful for running lines in R Markdown documents).

  10. Ctrl + Shift + T: Surrounds the selected text with a tag in an R Markdown document.

These are just a few examples of the keyboard shortcuts available in RStudio. To explore more shortcuts or customize them according to your preferences, you can go to the “Tools” menu in RStudio, select “Modify Keyboard Shortcuts,” and browse through the available options.

3.2 Vectors

3.2.1 What is a vector?

The vector is indeed the simplest data structure in R and serves as our introductory topic in this book. In R, a vector is an ordered collection of values that share the same data type. It can contain elements of various types, such as numeric, character, logical, or factors. Here are a few examples illustrating the concept of a vector:

  • Numeric vector: c(1.5, 2.7, 3.9, 4.2)

  • Character vector: c(“apple”, “banana”, “orange”)

  • Logical vector: c(TRUE, FALSE, TRUE, FALSE)

  • Factor vector: factor(c(“low”, “medium”, “high”, “low”))

Each of these examples represents a vector with elements of the same data type. The elements within a vector maintain their order, allowing for easy indexing and manipulation. Understanding vectors is fundamental as they serve as building blocks for more complex data structures and facilitate data analysis and manipulation in R.

3.2.2 The c function

Vectors can be created with the c function, which combines the given vectors, in the specified order. For example:

Code
x = c(1, 2, 3) 
x 
[1] 1 2 3

The c() function in R is not limited to combining individual values but can also be used to combine multiple vectors of different lengths into a new vector. Here’s an example that demonstrates this capability:

Code
v1 <- c(1, 2, 3)
v2 <- c(4)
v3 <- c(5, 6, 7)
v4 <- c(8, 9)

combined_vector <- c(v1, v2, v3, v4)
combined_vector
[1] 1 2 3 4 5 6 7 8 9

The c() function in R can indeed be used to combine vectors of different lengths into a new vector. Here’s the corrected example:

Code
v1 <- c(1, 2, 3)
v2 <- c(4)
v3 <- c(5, 6, 7)
v4 <- c(8, 9)

combined_vector <- c(v1, v2, v3, v4)
combined_vector
[1] 1 2 3 4 5 6 7 8 9

The expression c(x, 84, x, c(-1, -2)) combines the values from multiple vectors into a new vector. However, since the value of x is not provided, I’ll demonstrate the general concept using a placeholder value.

Here is another example of using the c function, this time to combine four character values into a vector of length 4.

Code
y = c("cat", "dog", "mouse", "apple") 
y
[1] "cat"   "dog"   "mouse" "apple"

3.2.3 Vector subsetting (individual elements)

In R, individual elements of a vector can be accessed using the [ operator along with a numeric index. This indexing mechanism allows us to retrieve a specific element from the vector, effectively creating a subset containing only that element. By specifying the desired index within square brackets, we can extract the value associated with that position in the vector. For example, my_vector[3] would retrieve the third element of the vector my_vector. This capability provides flexibility in working with vectors, allowing us to selectively access and manipulate individual elements as needed for our data analysis and computations.

Code
y[1] 
[1] "cat"
Code
y[2] 
[1] "dog"
Code
y[3] 
[1] "mouse"
Code
y[4]
[1] "apple"

Note that numeric indices in R start at 110!

Here is another example:

Code
counts = c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2) 
counts[4] 
[1] 1

We can also make an assignment into a vector subset, for example to replace an individual element:

Code
x = c(1, 2, 3)
x 
[1] 1 2 3
Code
x[2] = 300 
x 
[1]   1 300   3

Indeed, in the example you provided, we made an assignment into a subset with a single element. However, as we progress in our learning, we will discover that assigning values into subsets can be done for subsets of any length. This flexibility allows us to modify specific elements or even replace entire subsets within a vector. By specifying the desired subset using indexing techniques, we can assign new values to those elements, effectively updating or altering the vector according to our requirements.

This capability becomes particularly valuable when performing data manipulations, transformations, or when updating specific data points within a larger dataset. As we delve deeper into R, we will explore various techniques and approaches for assigning values into subsets of different lengths, empowering us to efficiently manipulate and update our data as needed.

3.2.4 Calling functions on a vector

R provides a wide range of functions that allow us to calculate various properties of vectors. Some commonly used functions include length, min, max, range, mean, and sum. Here’s an example that showcases their usage:

Code
x = c(1, 6, 3, -8, 2) 
x 
[1]  1  6  3 -8  2
Code
length(x)  
[1] 5
Code
min(x)    
[1] -8
Code
max(x)     
[1] 6
Code
range(x)   
[1] -8  6
Code
mean(x)    
[1] 0.8
Code
sum(x)     
[1] 4
Note

In what way is the range function different from the other functions shown above?

Contrariwise, there are functions that operate on each vector element, separately, returning a vector of results having the same length as the input:

Code
abs(x)   # Absolute value ## [1] 1 6 3 8 2
[1] 1 6 3 8 2
Code
sqrt(x)  
[1] 1.000000 2.449490 1.732051      NaN 1.414214
Code
x
[1]  1  6  3 -8  2

3.2.5 The recycling rule (arithmetic)

When binary operations, such as arithmetic operators (+, -, *, /), are applied to two vectors in R, the operations are performed element-by-element. This means that each corresponding pair of elements from the two vectors is operated upon independently, resulting in a new vector of the respective results. Here’s an example that demonstrates this behavior:

Code
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
Code
vector1 + vector2
[1] 5 7 9
Code
vector1 - vector2
[1] -3 -3 -3
Code
vector1 * vector2
[1]  4 10 18
Code
vector1 / vector2
[1] 0.25 0.40 0.50
Code
vector1 > vector2
[1] FALSE FALSE FALSE
Code
vector1 < vector2
[1] TRUE TRUE TRUE

What happens when the input vector lengths do not match? In such case, the shorter vector gets “recycled”. For example, when one of the vectors is of length 3 and the other vector is of length 6, then the shorter vector (of length 3) is replicated two times, until it matches the longer vector (Figure 2.7). Thus, the expression:

Code
c(1, 2, 3)  + c(1, 2, 3, 4, 5, 6) 
[1] 2 4 6 5 7 9

is similar to the expression:

Code
c(1, 2, 3, 1, 2, 3) + c(1, 2, 3, 4, 5, 6) 
[1] 2 4 6 5 7 9

When one of the vectors is of length 1 and the other is of length 4, the shorter vector (of length 1) is replicated 4 times:

Code
2  * c(1, 2, 3, 4) 
[1] 2 4 6 8
Code
c(2, 2, 2, 2) * c(1, 2, 3, 4) 
[1] 2 4 6 8

When one of the vectors is of length 2 and the other is of length 6, the shorter vector (of length 2) is replicated 3 times:

Code
c(10, 100) + c(1, 2, 3, 4, 5, 6) 
[1]  11 102  13 104  15 106
Code
c(10, 100, 10, 100, 10, 100) + c(1, 2, 3, 4, 5, 6) 
[1]  11 102  13 104  15 106

What happens when the longer vector length is not a multiple of the shorter vector length? When the longer vector length is not a multiple of the shorter vector length, a situation arises where recycling is incomplete. This means that the elements of the shorter vector cannot be evenly distributed or recycled to match the length of the longer vector. As a result, some elements from the shorter vector will be used to fill up the longer vector, but there will be remaining elements in the shorter vector that do not have corresponding positions in the longer vector.

In programming languages like R, this incomplete recycling is indicated by a warning message, which alerts the user to the fact that the two vectors are not compatible for recycling. It is important to ensure that vector lengths are compatible before attempting to recycle elements, as incomplete recycling

Code
c(1, 2) * c(1, 2, 3) 
[1] 1 4 3
Code
c(1, 2, 1) * c(1, 2, 3) 
[1] 1 4 3

3.2.6 Consecutive and repetitive vectors

In addition to the c function, there are three commonly used methods for creating consecutive or repetitive vectors: the seq and rep functions.

The seq function generates a sequence of numbers that can be either increasing or decreasing. It takes arguments such as the starting point, ending point, and the increment or decrement value. For example, seq(1, 10, 2) creates a vector with the numbers 1, 3, 5, 7, 9, indicating an increasing sequence with a step of 2.

Code
seq(1, 10, 2)
[1] 1 3 5 7 9

Generate a decreasing sequence from 10 to 1 with a step of 1

Code
seq(10, 1, -1)
 [1] 10  9  8  7  6  5  4  3  2  1

The rep function, on the other hand, replicates elements of a vector. It can replicate a single element or an entire vector multiple times. By specifying the times argument, you can control the number of repetitions. For instance, rep(c(1, 2, 3), times = 3) generates a vector containing three repetitions of the sequence 1, 2, 3: 1, 2, 3, 1, 2, 3, 1, 2, 3.

Code
rep(1:3, times = 3)
[1] 1 2 3 1 2 3 1 2 3

Replicate a vector multiple times

Code
rep(c("A", "B", "C"), each = 3)
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

These functions provide flexible ways to create consecutive or repetitive vectors in programming languages like R, allowing for efficient generation of data for various purposes.

3.2.7 Function calls

By utilizing the seq function, we will illustrate three characteristics of function calls. Initially, it is possible to exclude parameter names provided that the arguments are provided in the default order. For instance, the subsequent function calls are equivalent as the default order of the first three parameters of the seq function is from, to, and by:

Code
seq(from = 5, to = 10, by = 1) ## [1]  5  6  7  8  9 10
[1]  5  6  7  8  9 10
Code
seq(5, 10, 1) ## [1]  5  6  7  8  9 10
[1]  5  6  7  8  9 10

Furthermore, we can utilize any argument order by explicitly specifying parameter names. This allows for flexibility in function calls. Even if the order of arguments is different, the following three function calls are equivalent due to the usage of parameter names:

Code
seq(to = 10, by = 1, from = 5) 
[1]  5  6  7  8  9 10
Code
seq(by = 1, from = 5, to = 10) 
[1]  5  6  7  8  9 10
Code
seq(from = 5, by = 1, to = 10) 
[1]  5  6  7  8  9 10

Lastly, it is possible to exclude parameters that have a default argument defined in the function. A default argument is a predefined value specified in the function definition. As an illustration, the by parameter in the seq function has a default value of 1. Hence, if the by argument is omitted in the function call, it will automatically be assigned the default value of 1:

Code
seq(5, 10, 1) 
[1]  5  6  7  8  9 10
Code
seq(5, 10) 
[1]  5  6  7  8  9 10

The order of parameters, their default values (if any), and other relevant information about a specific function can be found in the help file associated with that function. The help file provides comprehensive documentation that describes the function’s usage, arguments, return values, and often includes examples and additional explanations.

It is a valuable resource for understanding and utilizing functions effectively in programming languages like R. To access the help file for a specific function, you can use the ? operator followed by the function name in the R console or refer to the documentation provided by the programming environment or package documentation.

?seq

3.2.8 Vector subsetting (general)

Vector subsetting refers to the process of selecting specific elements from a vector based on certain criteria or indices. It allows you to extract a subset of data from a vector that meets your desired conditions or requirements. There are different ways to perform vector subsetting in various programming languages, including R.

In R, vector subsetting can be done using square brackets [ ] and various indexing methods. Here are a few commonly used approaches:

Code
x = c(43, 85, 10) 
x[3] 
[1] 10

We can also use a vector of length >1 as an index. For example, the following expression returns the first and second elements of x, since the index is the the vector c(1,2) (which we create using the : operator) (Section 2.3.6.2):

Code
x[1:2] 
[1] 43 85

Note that the vector of indices can consist of any combination of indices whatsoever. It does not have to be consecutive, and it can even include repetitions:

Code
x[c(1, 1, 3, 2)] 
[1] 43 43 10 85

Here is another example:

Code
counts = c(2, 0, 3, 1, 3) 
counts[1:3] 
[1] 2 0 3

And here is one more example where the index is not consecutive:

Code
counts = c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2) 
counts[c(1:3, 7:9)] ## [1] 2 0 3 9 0 2
[1] 2 0 3 9 0 2

For the next examples, let’s create a vector of all even numbers between 1 and 100:

Code
x = seq(2, 100, 2) 
x ##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38 ## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76 ## [39]  78  80  82  84  86  88  90  92  94  96  98 100
 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
[20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
[39]  78  80  82  84  86  88  90  92  94  96  98 100

What is the meaning of the numbers in square brackets when printing the vector?

How can we check how many elements does x have? Recall the length function :

Code
length(x) ## [1] 50
[1] 50

Using this knowledge, here are two expression that return the value of the last element in x:

Code
x[50] 
[1] 100
Code
x[length(x)] 
[1] 100

Which of the last two expressions is preferable? Why?

Which index can we use to get back the entire vector? We can use an index that contains all of the vector element indices, starting from 1 and up to length(x):

Code
x[1:length(x)] ##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38 ## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76 ## [39]  78  80  82  84  86  88  90  92  94  96  98 100
 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
[20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
[39]  78  80  82  84  86  88  90  92  94  96  98 100

What index can we use to get the entire vector except for the last two elements?

What index can we use to get a reversed vector?

Note that there is a special function named rev for reversing a vector:

Code
rev(x) ##  [1] 100  98  96  94  92  90  88  86  84  82  80  78  76  74  72  70  68  66  64 ## [20]  62  60  58  56  54  52  50  48  46  44  42  40  38  36  34  32  30  28  26 ## [39]  24  22  20  18  16  14  12  10   8   6   4   2
 [1] 100  98  96  94  92  90  88  86  84  82  80  78  76  74  72  70  68  66  64
[20]  62  60  58  56  54  52  50  48  46  44  42  40  38  36  34  32  30  28  26
[39]  24  22  20  18  16  14  12  10   8   6   4   2

Note that, when requesting an element, or elements, beyond the vector length, we get NA (Not Available) value(s). For example:

Code
x[55] ## [1] NA
[1] NA
Code
x[1:80] ##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38 ## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76 ## [39]  78  80  82  84  86  88  90  92  94  96  98 100  NA  NA  NA  NA  NA  NA  NA ## [58]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA ## [77]  NA  NA  NA  NA
 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
[20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
[39]  78  80  82  84  86  88  90  92  94  96  98 100  NA  NA  NA  NA  NA  NA  NA
[58]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[77]  NA  NA  NA  NA

3.2.9 The recycling rule (assignment)

The recycling rule in R also applies to assignment operations. When assigning a vector of length 1 to a subset of a longer vector, R will recycle the single value to match the length of the subset. This ensures that the assignment can be performed consistently across the subset.For example, here we assign a vector of length 1 (NA) into a subset of length 6 (c(1:3,7:9)). As a result, NA is replicated six times, to match the subset. Here’s an example illustrating this behavior:

Code
counts = c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2) 
counts ##  [1]  2  0  3  1  3  2  9  0  2  1 11  2
 [1]  2  0  3  1  3  2  9  0  2  1 11  2
Code
counts[c(1:3, 7:9)] = NA 
counts ##  [1] NA NA NA  1  3  2 NA NA NA  1 11  2
 [1] NA NA NA  1  3  2 NA NA NA  1 11  2

Here, c(NA,99) is replicated three times, also to match the subset of length 6:

Code
counts[c(1:3, 7:9)] = c(NA, 99) 
counts ##  [1] NA 99 NA  1  3  2 99 NA 99  1 11  2
 [1] NA 99 NA  1  3  2 99 NA 99  1 11  2

3.2.10 Logical vectors

3.2.10.1 Creating logical vectors

The third common type of vectors in R, alongside numeric and character vectors, is the logical vector. A logical vector consists of logical values, which are primarily represented by TRUE and FALSE. These values are used to denote logical conditions or Boolean expressions in R. Here’s an example of a logical vector:

Code
# Create a logical vector
my_logical_vector <- c(TRUE, FALSE, TRUE, FALSE, NA)

# Output the logical vector
my_logical_vector
[1]  TRUE FALSE  TRUE FALSE    NA

n the provided case, the logical vector is composed of five elements: TRUE, FALSE, TRUE, FALSE, and NA (representing missing or undefined values). Logical vectors are commonly used for logical operations, conditional statements, filtering data, and controlling program flow in R. They allow for the representation and manipulation of logical conditions and Boolean values within vectorized operations.

Code
c(TRUE, FALSE, FALSE) ## [1]  TRUE FALSE FALSE
[1]  TRUE FALSE FALSE
Code
rep(TRUE, 7) ## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Logical vectors are typically created by applying conditional operators to numeric or character vectors. These conditional operators evaluate each element of the vector and return a logical value (TRUE or FALSE) based on the specified condition.

Code
x = 1:10 
x 
 [1]  1  2  3  4  5  6  7  8  9 10
Code
x >= 7 
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Note how the recycling rule applies to conditional operators in the above expression.

An important property of logical vectors is, that, when arithmetic operations are applied, the logical vector is automatically converted to a numeric one, where TRUE becomes 1 and FALSE becomes 0. For example:

Code
TRUE + FALSE ## [1] 1
[1] 1
Code
sum(x >= 7) ## [1] 4
[1] 4
Code
mean(x >= 7) ## [1] 0.4
[1] 0.4

::: layout-important What is the meaning of the values 4 and 0.4 in the above example? :::

3.2.10.2 Subsetting with logical vectors

Exactly! In addition to using a numeric vector of indices for subsetting, a logical vector can also be used as an index. When a logical vector is used as an index for subsetting, the resulting subset will include the elements that correspond to the TRUE positions in the index. For example:

Code
counts = c(2, 0, 3, 1, 3)
Code
counts[c(TRUE, FALSE, TRUE, FALSE, FALSE)] ## [1] 2 3
[1] 2 3

Here is another example, where the logical vector of indices is created from the same vector being subsetted:

Code
counts[counts < 3] ## [1] 2 0 1
[1] 2 0 1

In this example, the logical vector counts<3:

Code
counts < 3 ## [1]  TRUE  TRUE FALSE  TRUE FALSE
[1]  TRUE  TRUE FALSE  TRUE FALSE

specifies whether to include each of the elements of counts in the resulting subset (Figure 2.10).

What does the expression counts[counts<3] do, in plain language?

Figure 2.10: Subsetting with a logical vector

Here are some more examples of subsetting with a logical index:

Code
x = 1:10 
x 
 [1]  1  2  3  4  5  6  7  8  9 10
Code
x[x >= 3]         # Elements of 'x' greater or equal than 3 ## [1]  3  4  5  6  7  8  9 10
[1]  3  4  5  6  7  8  9 10
Code
x[x != 2]         # Elements of 'x' not equal to 2  ## [1]  1  3  4  5  6  7  8  9 10
[1]  1  3  4  5  6  7  8  9 10
Code
x[x > 4 | x < 2]  # Elements of 'x' greater than 4 or smaller than 2 ## [1]  1  5  6  7  8  9 10
[1]  1  5  6  7  8  9 10
Code
x[x > 4 & x < 2]  # Elements of 'x' greater than 4 and smaller than 2 ## integer(0)
integer(0)

What does the output integer(0), which we got in the last expression, mean?

The next example is slightly more complex; we select the elements of z whose square is larger than 8:

Code
z = c(5, 2, -3, 8) 
z[z^2 > 8] ## [1]  5 -3  8
[1]  5 -3  8

Let’s go over this step-by-step. First, z^2 gives a vector of squared z values (2 is recycled):

Code
z^2 ## [1] 25  4  9 64
[1] 25  4  9 64

Then, each of the squares is compared to 8 (8 is recycled):

Code
z^2 > 8 ## [1]  TRUE FALSE  TRUE  TRUE
[1]  TRUE FALSE  TRUE  TRUE

Finally, the logical vector z^2>8 is used for subsetting z.

3.2.11 Missing values

Missing values, often represented as NA in R, are used to indicate the absence or unavailability of data for a particular element in a vector. Missing values play a crucial role in handling incomplete or undefined information within datasets. It

  • accepts a vector, of any type, and
  • returns a logical vector, with TRUE in place of NA values and FALSE in place of non-NA values.

For example, suppose we have a vector x where some of the values are missing:

Code
x = c(28, 58, NA, 31, 39, NA, 9) 
x ## [1] 28 58 NA 31 39 NA  9
[1] 28 58 NA 31 39 NA  9

The is.na function can be used to detect which values are missing:

Code
is.na(x) ## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

How can we use the above expression to subset the non-missing values of x?

Many of the functions that summarize vector properties, such as sum and mean, have a parameter called na.rm. The na.rm parameter is used to determine whether NA values are excluded from the calculation. The default is na.rm=FALSE, meaning that NA values are not excluded. For example:

Code
x = c(28, 58, NA, 31, 39, NA, 9) 
mean(x) 
[1] NA

Why do we get NA in the first expression?

What do you think will be the result of length(x)?

How can we replace the NA values in x with the mean of its non-NA values?

3.3 Additional useful functions

3.3.1 any and all

Sometimes we want to figure out whether a logical vector:

  • contains at least one TRUE value; or

  • is entirely composed of TRUE values.

We can use the any and all functions, respectively, to do those things.

The any function returns TRUE if at least one of the input vector values is TRUE, otherwise it returns FALSE. For example, let’s take a numeric vector x:

Code
x = c(2, 6, 2, 3, 0, 1, 6) 
x ## [1] 2 6 2 3 0 1 6
[1] 2 6 2 3 0 1 6

The expression any(x > 5) returns TRUE, which means that the vector x > 5 contains at least one TRUE value, i.e., at least one element of x is greater than 5:

Code
x > 5 
[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

The expression any(x > 88) returns FALSE, which means that the vector x > 88 contains no TRUE values, i.e., none of the elements of x are greater than 88:

Code
x > 88 ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE any(x > 88) ## [1] FALSE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The all function returns TRUE if all of the input vector values are TRUE, otherwise it returns FALSE. For example, the expression all(x > 5) returns FALSE, which means that the vector x > 5 contains at least one FALSE value, i.e., not all elements of x are greater than 5:

Code
x > 5 ## [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE all(x > 5) ## [1] FALSE
[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

The expression all(x > -1) returns TRUE, which means that x > -1 is composed entirely of TRUE values, i.e., all elements of x are greater than -1:

Code
x > -1 
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

In a way, any and all are inverse:

  • any determines if the logical vector contains at least one TRUE value.

  • all determines if the logical vector contains at least one FALSE value.

3.3.2 which

The which function converts a logical vector to a numeric one, with the indices of TRUE values. That way, we can find out the index of values that satisfy a given condition. For example, considering the vector x:

Code
x ## [1] 2 6 2 3 0 1 6
[1] 2 6 2 3 0 1 6

the expression which(x > 2.3) returns the indices of TRUE elements in x > 2.3, i.e., the indices of x elements which are greater than 2.3:

Code
x > 2.3 ## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE which(x > 2.3) ## [1] 2 4 7
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE

3.3.3 which.min and which.max

Related to the which function, there are two additional functions in R: which.min and which.max. These functions return the index of the first occurrence of the minimum or maximum value in a vector, respectively.

Here’s an example illustrating the usage of which.min and which.max functions with a vector x:

Code
x 
[1] 2 6 2 3 0 1 6

using which.min we can find out that the minimal value of x is in the 5th position:

Code
which.min(x) 
[1] 5

while using which.max we can find out that the maximal value of x is in the 2nd position:

Code
which.max(x) 
[1] 2

3.3.4 The order function

The order function returns ordered vector indices, based on the order of vector values. In other words, order gives the index of the smallest value, the index of the second smallest value, etc., up to the index of the largest value. For example, given the vector x:

Code
x ~ [1] 2 6 2 3 0 1 6

order(x) returns the indices 1:length(x), ordered from smallest to largest value:

Code
order(x) ## [1] 5 6 1 3 4 2 7
[1] 5 6 1 3 4 2 7

This result tells us that the 5th element of x is the smallest, the 6th is the second smallest, and so on.

We can also get the reverse order with decreasing=TRUE:

order(x, decreasing = TRUE) ## [1] 2 7 4 1 3 6 5

How can we get a sorted vector of elements from x, as shown below, using the order function?

Code
## [1] 0 1 2 2 3 6 6

3.3.5 paste and paste0

The paste function is used to “paste” text values. Its sep parameter determines the separating character(s), with the default being sep=" " (a space). For example:

Code
paste("There are", "5", "books.") ## [1] "There are 5 books." paste("There are", "5", "books.", sep = "_") ## [1] "There are_5_books."
[1] "There are 5 books."

Non-character vectors are automatically converted to character before pasting:

Code
paste("There are", 80, "books.") ## [1] "There are 80 books."
[1] "There are 80 books."

The recycling rule applies in paste too:

Code
paste("image", 1:5, ".tif", sep = "") ## [1] "image1.tif" "image2.tif" "image3.tif" "image4.tif" "image5.tif"
[1] "image1.tif" "image2.tif" "image3.tif" "image4.tif" "image5.tif"

The paste0 function is a shortcut for paste with sep="":

Code
paste0("image", 1:5, ".tif") ## [1] "image1.tif" "image2.tif" "image3.tif" "image4.tif" "image5.tif"
[1] "image1.tif" "image2.tif" "image3.tif" "image4.tif" "image5.tif"