[1] 1 2 3
3 Vectors
The primary objectives of this chapter revolve around three key areas. Firstly, we aim to equip readers with the skills to effectively work with R code files, allowing them to efficiently manage and execute their code. Secondly, we focus on familiarizing readers with the fundamental data structure in R, namely the vector. By understanding the intricacies and functionalities of vectors, readers will gain a solid foundation for handling and manipulating data in R. Lastly, we delve into the essential concept of subsetting, which plays a pivotal role in data analysis. Through this chapter, readers will acquire the necessary knowledge and techniques to extract and manipulate subsets of data, empowering them to conduct insightful and meaningful analyses in R.
3.0.1 Using code files
In Chapter 2, we typed short and simple expressions into the R console.As we advance in our learning journey, we will transition from typing short and simple expressions directly into the R console to working with longer and more complex code. To facilitate the editing and saving of such code, we store it in dedicated code files. Code files provide a structured and organized approach to managing our R code, allowing us to easily modify, save, and reuse our code for future use. This practice becomes particularly crucial as our projects and analyses become more intricate, enabling us to maintain clarity and efficiency in our coding endeavors..
When it comes to writing computer code, it is crucial to use a plain text editor, such as Notepad++ or RStudio, rather than a word processor like Microsoft Word. The reasons for avoiding word processors when writing code are as follows:
Formatting and Encoding: Word processors often add formatting and special characters that are not compatible with code. These additional formatting elements can introduce errors or cause the code to be misinterpreted by the compiler or interpreter.
Rich Text and Styling: Word processors allow for rich text formatting, including different fonts, sizes, colors, and styles. However, these formatting options are not relevant or recognized by code editors, and they may result in syntax errors or unexpected behavior.
Hidden Characters: Word processors may include hidden characters for formatting purposes, such as line breaks or tab spaces. These characters can interfere with the intended structure and functionality of the code.
File Format: Word processors typically save files in proprietary formats like .doc or .docx, which are not plain text formats. Code files, on the other hand, are commonly saved with plain text extensions like
.R
or.txt
. Using a plain text editor ensures that the code is saved in the appropriate format.Integration with Development Environment: Specialized code editors, such as RStudio (See Section 2.2), offer features and functions specifically designed for coding, such as syntax highlighting, auto-completion, and debugging tools. These features greatly enhance the coding experience and make it easier to write, debug, and manage code.
3.1 RStudio keyboard stortcuts
RStudio provides several handy keyboard shortcuts that streamline the process of editing and executing code files. Here are some useful RStudio keyboard shortcuts:
Ctrl + Enter: Executes the currently selected line or the highlighted code chunk.
Ctrl + Shift + Enter: Executes the current line or selected code chunk and moves the cursor to the next line.
Ctrl + Shift + Up/Down Arrow: Moves the current line or selected lines up or down within the script.
Ctrl + L: Clears the console.
Ctrl + 1/2/3/4: Switches between the different panes in RStudio (Source, Console, Environment, and Plots).
Ctrl + Shift + M: Inserts a new Markdown chunk in an R Markdown document.
Ctrl + Shift + K: Knits the current R Markdown document to the specified output format.
Ctrl + Shift + C: Comments or uncomments the selected lines or the current line.
Ctrl + Shift + R: Runs the current line or selected code as R code (useful for running lines in R Markdown documents).
Ctrl + Shift + T: Surrounds the selected text with a tag in an R Markdown document.
These are just a few examples of the keyboard shortcuts available in RStudio. To explore more shortcuts or customize them according to your preferences, you can go to the “Tools” menu in RStudio, select “Modify Keyboard Shortcuts,” and browse through the available options.
3.2 Vectors
3.2.1 What is a vector?
The vector is indeed the simplest data structure in R and serves as our introductory topic in this book. In R, a vector is an ordered collection of values that share the same data type. It can contain elements of various types, such as numeric, character, logical, or factors. Here are a few examples illustrating the concept of a vector:
Numeric vector: c(1.5, 2.7, 3.9, 4.2)
Character vector: c(“apple”, “banana”, “orange”)
Logical vector: c(TRUE, FALSE, TRUE, FALSE)
Factor vector: factor(c(“low”, “medium”, “high”, “low”))
Each of these examples represents a vector with elements of the same data type. The elements within a vector maintain their order, allowing for easy indexing and manipulation. Understanding vectors is fundamental as they serve as building blocks for more complex data structures and facilitate data analysis and manipulation in R.
3.2.2 The c
function
Vectors can be created with the c
function, which combines the given vectors, in the specified order. For example:
The c()
function in R is not limited to combining individual values but can also be used to combine multiple vectors of different lengths into a new vector. Here’s an example that demonstrates this capability:
Code
[1] 1 2 3 4 5 6 7 8 9
The c()
function in R can indeed be used to combine vectors of different lengths into a new vector. Here’s the corrected example:
Code
[1] 1 2 3 4 5 6 7 8 9
The expression c(x, 84, x, c(-1, -2))
combines the values from multiple vectors into a new vector. However, since the value of x is not provided, I’ll demonstrate the general concept using a placeholder value.
Here is another example of using the c
function, this time to combine four character
values into a vector of length 4.
3.2.3 Vector subsetting (individual elements)
In R, individual elements of a vector can be accessed using the [
operator along with a numeric index. This indexing mechanism allows us to retrieve a specific element from the vector, effectively creating a subset containing only that element. By specifying the desired index within square brackets, we can extract the value associated with that position in the vector. For example, my_vector[3]
would retrieve the third element of the vector my_vector
. This capability provides flexibility in working with vectors, allowing us to selectively access and manipulate individual elements as needed for our data analysis and computations.
Note that numeric indices in R start at 1
10!
Here is another example:
We can also make an assignment into a vector subset, for example to replace an individual element:
Indeed, in the example you provided, we made an assignment into a subset with a single element. However, as we progress in our learning, we will discover that assigning values into subsets can be done for subsets of any length. This flexibility allows us to modify specific elements or even replace entire subsets within a vector. By specifying the desired subset using indexing techniques, we can assign new values to those elements, effectively updating or altering the vector according to our requirements.
This capability becomes particularly valuable when performing data manipulations, transformations, or when updating specific data points within a larger dataset. As we delve deeper into R, we will explore various techniques and approaches for assigning values into subsets of different lengths, empowering us to efficiently manipulate and update our data as needed.
3.2.4 Calling functions on a vector
R provides a wide range of functions that allow us to calculate various properties of vectors. Some commonly used functions include length, min, max, range, mean, and sum. Here’s an example that showcases their usage:
In what way is the range
function different from the other functions shown above?
Contrariwise, there are functions that operate on each vector element, separately, returning a vector of results having the same length as the input:
3.2.5 The recycling rule (arithmetic)
When binary operations, such as arithmetic operators (+, -, *, /), are applied to two vectors in R, the operations are performed element-by-element. This means that each corresponding pair of elements from the two vectors is operated upon independently, resulting in a new vector of the respective results. Here’s an example that demonstrates this behavior:
What happens when the input vector lengths do not match? In such case, the shorter vector gets “recycled”. For example, when one of the vectors is of length 3 and the other vector is of length 6, then the shorter vector (of length 3) is replicated two times, until it matches the longer vector (Figure 2.7). Thus, the expression:
is similar to the expression:
When one of the vectors is of length 1 and the other is of length 4, the shorter vector (of length 1) is replicated 4 times:
When one of the vectors is of length 2 and the other is of length 6, the shorter vector (of length 2) is replicated 3 times:
What happens when the longer vector length is not a multiple of the shorter vector length? When the longer vector length is not a multiple of the shorter vector length, a situation arises where recycling is incomplete. This means that the elements of the shorter vector cannot be evenly distributed or recycled to match the length of the longer vector. As a result, some elements from the shorter vector will be used to fill up the longer vector, but there will be remaining elements in the shorter vector that do not have corresponding positions in the longer vector.
In programming languages like R, this incomplete recycling is indicated by a warning message, which alerts the user to the fact that the two vectors are not compatible for recycling. It is important to ensure that vector lengths are compatible before attempting to recycle elements, as incomplete recycling
3.2.6 Consecutive and repetitive vectors
In addition to the c
function, there are three commonly used methods for creating consecutive or repetitive vectors: the seq
and rep
functions.
The seq
function generates a sequence of numbers that can be either increasing or decreasing. It takes arguments such as the starting point, ending point, and the increment or decrement value. For example, seq(1, 10, 2)
creates a vector with the numbers 1, 3, 5, 7, 9, indicating an increasing sequence with a step of 2.
Generate a decreasing sequence from 10 to 1 with a step of 1
The rep
function, on the other hand, replicates elements of a vector. It can replicate a single element or an entire vector multiple times. By specifying the times
argument, you can control the number of repetitions. For instance, rep(c(1, 2, 3), times = 3)
generates a vector containing three repetitions of the sequence 1, 2, 3: 1, 2, 3, 1, 2, 3, 1, 2, 3.
Replicate a vector multiple times
These functions provide flexible ways to create consecutive or repetitive vectors in programming languages like R, allowing for efficient generation of data for various purposes.
3.2.7 Function calls
By utilizing the seq
function, we will illustrate three characteristics of function calls. Initially, it is possible to exclude parameter names provided that the arguments are provided in the default order. For instance, the subsequent function calls are equivalent as the default order of the first three parameters of the seq
function is from
, to
, and by
:
Furthermore, we can utilize any argument order by explicitly specifying parameter names. This allows for flexibility in function calls. Even if the order of arguments is different, the following three function calls are equivalent due to the usage of parameter names:
Lastly, it is possible to exclude parameters that have a default argument defined in the function. A default argument is a predefined value specified in the function definition. As an illustration, the by
parameter in the seq
function has a default value of 1
. Hence, if the by
argument is omitted in the function call, it will automatically be assigned the default value of 1
:
The order of parameters, their default values (if any), and other relevant information about a specific function can be found in the help file associated with that function. The help file provides comprehensive documentation that describes the function’s usage, arguments, return values, and often includes examples and additional explanations.
It is a valuable resource for understanding and utilizing functions effectively in programming languages like R. To access the help file for a specific function, you can use the ?
operator followed by the function name in the R console or refer to the documentation provided by the programming environment or package documentation.
?seq
3.2.8 Vector subsetting (general)
Vector subsetting refers to the process of selecting specific elements from a vector based on certain criteria or indices. It allows you to extract a subset of data from a vector that meets your desired conditions or requirements. There are different ways to perform vector subsetting in various programming languages, including R.
In R, vector subsetting can be done using square brackets [ ]
and various indexing methods. Here are a few commonly used approaches:
We can also use a vector of length >1 as an index. For example, the following expression returns the first and second elements of x
, since the index is the the vector c(1,2)
(which we create using the :
operator) (Section 2.3.6.2):
Note that the vector of indices can consist of any combination of indices whatsoever. It does not have to be consecutive, and it can even include repetitions:
Here is another example:
And here is one more example where the index is not consecutive:
[1] 2 0 3 9 0 2
For the next examples, let’s create a vector of all even numbers between 1 and 100:
Code
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
[20] 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
[39] 78 80 82 84 86 88 90 92 94 96 98 100
What is the meaning of the numbers in square brackets when printing the vector?
How can we check how many elements does x
have? Recall the length
function :
Using this knowledge, here are two expression that return the value of the last element in x
:
Which of the last two expressions is preferable? Why?
Which index can we use to get back the entire vector? We can use an index that contains all of the vector element indices, starting from 1
and up to length(x)
:
Code
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
[20] 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
[39] 78 80 82 84 86 88 90 92 94 96 98 100
What index can we use to get the entire vector except for the last two elements?
What index can we use to get a reversed vector?
Note that there is a special function named rev
for reversing a vector:
Code
[1] 100 98 96 94 92 90 88 86 84 82 80 78 76 74 72 70 68 66 64
[20] 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26
[39] 24 22 20 18 16 14 12 10 8 6 4 2
Note that, when requesting an element, or elements, beyond the vector length, we get NA
(Not Available) value(s). For example:
Code
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
[20] 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
[39] 78 80 82 84 86 88 90 92 94 96 98 100 NA NA NA NA NA NA NA
[58] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[77] NA NA NA NA
3.2.9 The recycling rule (assignment)
The recycling rule in R also applies to assignment operations. When assigning a vector of length 1 to a subset of a longer vector, R will recycle the single value to match the length of the subset. This ensures that the assignment can be performed consistently across the subset.For example, here we assign a vector of length 1 (NA
) into a subset of length 6 (c(1:3,7:9)
). As a result, NA
is replicated six times, to match the subset. Here’s an example illustrating this behavior:
[1] 2 0 3 1 3 2 9 0 2 1 11 2
[1] NA NA NA 1 3 2 NA NA NA 1 11 2
Here, c(NA,99)
is replicated three times, also to match the subset of length 6:
3.2.10 Logical vectors
3.2.10.1 Creating logical vectors
The third common type of vectors in R, alongside numeric
and character
vectors, is the logical
vector. A logical
vector consists of logical
values, which are primarily represented by TRUE
and FALSE
. These values are used to denote logical conditions or Boolean expressions in R. Here’s an example of a logical
vector:
Code
[1] TRUE FALSE TRUE FALSE NA
n the provided case, the logical vector is composed of five elements: TRUE
, FALSE
, TRUE
, FALSE
, and NA
(representing missing or undefined values). Logical vectors are commonly used for logical operations, conditional statements, filtering data, and controlling program flow in R. They allow for the representation and manipulation of logical conditions and Boolean values within vectorized operations.
Logical vectors are typically created by applying conditional operators to numeric or character vectors. These conditional operators evaluate each element of the vector and return a logical value (TRUE or FALSE) based on the specified condition.
Note how the recycling rule applies to conditional operators in the above expression.
An important property of logical
vectors is, that, when arithmetic operations are applied, the logical
vector is automatically converted to a numeric one, where TRUE
becomes 1
and FALSE
becomes 0
. For example:
::: layout-important What is the meaning of the values 4
and 0.4
in the above example? :::
3.2.10.2 Subsetting with logical vectors
Exactly! In addition to using a numeric vector of indices for subsetting, a logical vector can also be used as an index. When a logical vector is used as an index for subsetting, the resulting subset will include the elements that correspond to the TRUE positions in the index. For example:
Here is another example, where the logical
vector of indices is created from the same vector being subsetted:
In this example, the logical vector counts<3
:
specifies whether to include each of the elements of counts
in the resulting subset (Figure 2.10).
What does the expression
counts[counts<3]
do, in plain language?
Figure 2.10: Subsetting with a logical vector
Here are some more examples of subsetting with a logical
index:
[1] 3 4 5 6 7 8 9 10
[1] 1 5 6 7 8 9 10
What does the output
integer(0)
, which we got in the last expression, mean?
The next example is slightly more complex; we select the elements of z
whose square is larger than 8:
Let’s go over this step-by-step. First, z^2
gives a vector of squared z
values (2
is recycled):
Then, each of the squares is compared to 8 (8
is recycled):
Finally, the logical
vector z^2>8
is used for subsetting z
.
3.2.11 Missing values
Missing values, often represented as NA in R, are used to indicate the absence or unavailability of data for a particular element in a vector. Missing values play a crucial role in handling incomplete or undefined information within datasets. It
- accepts a vector, of any type, and
- returns a
logical
vector, withTRUE
in place ofNA
values andFALSE
in place of non-NA
values.
For example, suppose we have a vector x
where some of the values are missing:
The is.na
function can be used to detect which values are missing:
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
How can we use the above expression to subset the non-missing values of
x
?
Many of the functions that summarize vector properties, such as sum
and mean
, have a parameter called na.rm
. The na.rm
parameter is used to determine whether NA
values are excluded from the calculation. The default is na.rm=FALSE
, meaning that NA
values are not excluded. For example:
Why do we get
NA
in the first expression?
What do you think will be the result of
length(x)
?
How can we replace the
NA
values inx
with the mean of its non-NA
values?
3.3 Additional useful functions
3.3.1 any
and all
Sometimes we want to figure out whether a logical
vector:
contains at least one
TRUE
value; oris entirely composed of
TRUE
values.
We can use the any
and all
functions, respectively, to do those things.
The any
function returns TRUE
if at least one of the input vector values is TRUE
, otherwise it returns FALSE
. For example, let’s take a numeric vector x
:
The expression any(x > 5)
returns TRUE
, which means that the vector x > 5
contains at least one TRUE
value, i.e., at least one element of x
is greater than 5
:
The expression any(x > 88)
returns FALSE
, which means that the vector x > 88
contains no TRUE
values, i.e., none of the elements of x
are greater than 88
:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The all
function returns TRUE
if all of the input vector values are TRUE
, otherwise it returns FALSE
. For example, the expression all(x > 5)
returns FALSE
, which means that the vector x > 5
contains at least one FALSE
value, i.e., not all elements of x
are greater than 5
:
[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE
The expression all(x > -1)
returns TRUE
, which means that x > -1
is composed entirely of TRUE
values, i.e., all elements of x
are greater than -1
:
In a way, any
and all
are inverse:
any
determines if the logical vector contains at least oneTRUE
value.all
determines if the logical vector contains at least oneFALSE
value.
3.3.2 which
The which
function converts a logical
vector to a numeric
one, with the indices of TRUE
values. That way, we can find out the index of values that satisfy a given condition. For example, considering the vector x
:
the expression which(x > 2.3)
returns the indices of TRUE
elements in x > 2.3
, i.e., the indices of x
elements which are greater than 2.3
:
3.3.3 which.min
and which.max
Related to the which function, there are two additional functions in R: which.min and which.max. These functions return the index of the first occurrence of the minimum or maximum value in a vector, respectively.
Here’s an example illustrating the usage of which.min and which.max functions with a vector x:
using which.min
we can find out that the minimal value of x
is in the 5th position:
while using which.max
we can find out that the maximal value of x
is in the 2nd position:
3.3.4 The order
function
The order
function returns ordered vector indices, based on the order of vector values. In other words, order
gives the index of the smallest value, the index of the second smallest value, etc., up to the index of the largest value. For example, given the vector x
:
order(x)
returns the indices 1:length(x)
, ordered from smallest to largest value:
This result tells us that the 5th element of x
is the smallest, the 6th is the second smallest, and so on.
We can also get the reverse order with decreasing=TRUE
:
order(x, decreasing = TRUE) ## [1] 2 7 4 1 3 6 5
How can we get a sorted vector of elements from
x
, as shown below, using theorder
function?
3.3.5 paste
and paste0
The paste
function is used to “paste” text values. Its sep
parameter determines the separating character(s), with the default being sep=" "
(a space). For example:
Code
[1] "There are 5 books."
Non-character vectors are automatically converted to character
before pasting:
The recycling rule applies in paste
too:
Code
[1] "image1.tif" "image2.tif" "image3.tif" "image4.tif" "image5.tif"
The paste0
function is a shortcut for paste
with sep=""
: