R Basics, Functions, and Data Types
In this section, I will introduce you to R Basics, Functions, and Datatypes.
In this part, you will learn to:
Appreciate the rationale for data analysis using R.
Define objects and perform basic arithmetic and logical operations.
Use pre-defined functions to perform operations on objects.
Distinguish between various data types.
Installing R
why Install R and RStudio?
To complete this course, you should install R locally on your computer. We also highly recommend installing RStudio, an integrated development environment (IDE), to edit and test your code.
In order to complete some assignments in the course, you will need your own copy of R. You may also find it helpful to follow along with the course videos in R or RStudio.
Both R and RStudio can be freely downloaded and installed.
Key Points
You need to install R before using RStudio, which is an interactive desktop environment.
Select base subdirectory in CRAN and click download.
Select all default choices in the installation process.
We recommend selecting English for language to help you better follow the course.
You can try using the R console, but for productivity purposes, we can switch to RStudio.
Installing Rstudio
Key Points
You can download the latest version of RStudio at the RStudio website.
The free desktop version is more than enough for this course.
Make sure to choose the version for your own operating system.
Choose “Yes” for all defaults in the installation process.
Using RStudio for the First Time
Key Points
The free desktop version of RStudio can be launched like other applications on your computer.
When you start RStudio for the first time, you will see three panes. The left pane shows you the R console. On the right, the top pane includes three tabs, while the bottom pane shows you five tabs, file, plots, packages, help, and viewer.
You can download a cheat sheet of the most common RStudio commands directly from RStudio by going to “Help -> Cheat Sheets -> RStudio IDE Cheat Sheet.”
Getting Started Using R
Key Points:
R was developed by statisticians and data analysts as an interactive environment for data analysis.
Some of the advantages of R are that:
it is free and open source;
it has the capability to save scripts;
there are numerous resources for learning;
it is easy for developers to share software implementation.
Expressions are evaluated in the R console when you type the expression into the console and hit Return.
A great advantage of R over point and click analysis software is that you can save your work as scripts.
“Base R” is what you get after you first install R. Additional components are available via packages.
Some Addition Notes
In RStudio, you can upload additional functions and datasets in addition to the base R functions and datasets that come with R automatically. A common way to do this is by installing packages, which often contain extra functions and datasets. For this course, there are a few packages you will need to install. You only need to install each individual package once, but after you install a package, there are other steps you have to do whenever you want to use something from that package.
To install a package, you use the code install.packages("package_name", dependencies = TRUE)
.
To load a package, you use the code library(package_name)
.
If you also want to use a dataset from a package you have loaded, then you use the code data(dataset_name)
. To see the dataset, you can take the additional step of View(dataset_name)
.
Installing Packahes
Note
We recommend installing packages through RStudio, rather than through R, and the code provided works in both R and RStudio. Once a package has been installed, it is technically added onto R (even if you use RStudio to install it), which is why packages must be re-installed when R is updated. However, since we use R through RStudio, any packages that are installed can be used in both R and RStudio, regardless of which one was used to install the packages.
key points
- The base version of R is quite minimal, but you can supplement its functions by installing additional packages.
- We will be using tidyverse and dslabs packages for this course.
- Install packages from R console:
install.packages("pkg_name")
- Install packages from RStudio interface: Tools > Install Packages (allows autocomplete)
- Once installed, we can use
library(pkg_name)
to load a package each time we want to use it
Additional Notes
- If you try to load a package with
library(blahblah)
and get a message like Error in library(blahblah) : there is no package called ‘blahblah’, it means you need to install that package first withinstall.packages()
. - On the DataCamp interface we use for some problems in the course, you cannot install additional packages. The problems have been set up with the packages you need to solve them.
- You can add the option
dependencies = TRUE
, which tells R to install the other things that are necessary for the package or packages to run smoothly. Otherwise, you may need to install additional packages to unlock the full functionality of a package. - Throughout the course materials and textbook, package names are in bold.
Code
Running Commands While Editing Scripts
Key Points
RStudio has many useful features as an R editor, including the ability to test code easily as we write scripts and several auto complete features.
Keyboard shortcuts:
- Save a script: Ctrl+S on Windows and Command+S on Mac
- Run an entire script: Ctrl+Shift+Enter on Windows Command+Shift+Return on Mac, or click “Source” on the editor pane
- Run a single line of script: Ctrl+Enter on Windows and Command+Return on Mac while the cursor is pointing to that line, or select the chunk and click “run”
- Open a new script: Ctrl+Shift+N on Windows and Command+Shift+N on Mac
Code
R Basics
Key Points
To define a variable, we may use the assignment symbol,
<-
.There are two ways to see the value stored in a variable:
(1) type the variable name into the console and hit Return;
- use the
print()
function by typingprint(variable_name)
and hitting Return.
- use the
Objects are things that are stored in named containers in R. They can be variables, functions, etc.
The
ls()
function shows the names of the objects saved in your work space.
Code: example to solving the equation \(x^{2} + x - 1 = 0\)
Function
Key Points
In general, to evaluate a function we need to use parentheses. If we type a function without parenthesis, R shows us the code for the function. Most functions also require an argument, that is, something to be written inside the parenthesis.
To access help files, we may use the help function,
help(function_name)
, or write the question mark followed by the function name, ?function_name.The help file shows you the arguments the function is expecting, some of which are required and some are optional. If an argument is optional, a default value is assigned with the equal sign. The
args()
function also shows the arguments a function needs.To specify arguments, we use the equals sign. If no argument name is used, R assumes you’re entering arguments in the order shown in the help file.
Creating and saving a script makes code much easier to execute.
To make your code more readable, use intuitive variable names and include comments (using the “#” symbol) to remind yourself why you wrote a particular line of code.
Data Types
The code data("dataset_name")
and data(dataset_name)
do the same thing. The code will work regardless of whether the quotes are present. It is a bit faster to leave out the quotes (as we do in the Code at the bottom of this page), so that is usually what we recommend, but it is your choice.
Key Points
The function
class()
helps us determine the type of an object.Data frames can be thought of as tables with rows representing observations and columns representing different variables.
To access data from columns of a data frame, we use the dollar sign symbol,
$
, which is called the accessor.A vector is an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector.
We use quotes to distinguish between variable names and character strings.
Factors are useful for storing categorical data, and are more memory efficient than storing characters.
L
to specify integer data.Complex: specify purely imaginary values in R. We use the suffix i
to specify the imaginary part. e.g. 3 + 2i.Character:specify character or string values in a variable. ''
for character variables; ""
for string variables.Logical: is known as boolean data type. It can only have two values: TRUE
and FALSE
Code
# loading the dslabs package and the murders dataset
library(dslabs)
data(murders)
# determining that the murders dataset is of the "data frame" class
class(murders)
# finding out more about the structure of the object
str(murders)
# showing the first 6 lines of the dataset
head(murders)
# using the accessor operator to obtain the population column
murders$population
# displaying the variable names in the murders dataset
names(murders)
# determining how many entries are in a vector
pop <- murders$population
length(pop)
# vectors can be of class numeric and character
class(pop)
class(murders$state)
# logical vectors are either TRUE or FALSE
z <- 3 == 2
z
class(z)
# factors are another type of class
class(murders$region)
# obtaining the levels of a factor
levels(murders$region)
Vectors, Sorting
In this section, I will introduce you to vectors and functions such as sorting.
In Vectors, you will:
Create numeric and character vectors.
Name the columns of a vector.
Generate numeric sequences.
Access specific elements or parts of a vector.
Coerce data into different data types as needed.
In Sorting, you will:
Sort vectors in ascending and descending order.
Extract the indices of the sorted elements from the original vector.
Find the maximum and minimum elements, as well as their indices, in a vector.
Rank the elements of a vector in increasing order.
In Vector Arithmetic, you will:
Perform arithmetic between a vector and a single number.
Perform arithmetic between two vectors of the same length.
Vectors
Key points
The function c(), which stands for concatenate, is useful for creating vectors.
Another useful function for creating vectors is the seq() function, which generates sequences.
Subsetting lets us access specific parts of a vector by using square brackets to access elements of a vector.
Code
# We may create vectors of class numeric or character with the concatenate function
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")
# We can also name the elements of a numeric vector
# Note that the two lines of code below have the same result
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
# We can also name the elements of a numeric vector using the names() function
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country
# Using square brackets is useful for subsetting to access specific elements of a vector
codes[2]
codes[c(1,3)]
codes[1:2]
# If the entries of a vector are named, they may be accessed by referring to their name
codes["canada"]
codes[c("egypt","italy")]
Vector Coercion
Key Point
- In general, coercion is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match the expected. For example, when defining x as
R coerced the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be character strings, “1” and “3”.
The function
as.character()
turns numbers into characters.The function
as.numeric()
turns characters into numbers.In R, missing data is assigned the value
NA
.
Question
- class(3L) is integer ?
- 3L-3 equals 0 ?
Sorting
Original | Sort(按从小到大排列) | Order(Sort对应数字在原来数字排列中的顺序) | Rank(Original原来数字在Sort顺序中的排名) |
31 | 4 | 2 | 3 |
4 | 15 | 3 | 1 |
15 | 31 | 1 | 2 |
92 | 65 | 5 | 5 |
65 | 92 | 4 | 4 |
Key Points
The function
sort()
sorts a vector in increasing order.The function
order()
produces the indices needed to obtain the sorted vector, e.g. a result of 2 3 1 5 4 means the sorted vector will be produced by listing the 2nd, 3rd, 1st, 5th, and then 4th item of the original vector.The function
rank()
gives us the ranks of the items in the original vector.The function
max()
returns the largest value, whilewhich.max()
returns the index of the largest value. The functionsmin()
andwhich.min()
work similarly for minimum values.
Code
library(dslabs)
data(murders)
sort(murders$total)
x <- c(31, 4, 15, 92, 65)
x
sort(x) # puts elements in order
index <- order(x) # returns index that will put x in order
x[index] # rearranging by this index puts elements in order
order(x)
murders$state[1:10]
murders$abb[1:10]
index <- order(murders$total)
murders$abb[index] # order abbreviations by total murders
max(murders$total) # highest number of total murders
i_max <- which.max(murders$total) # index with highest number of murders
murders$state[i_max] # state name with highest number of total murders
x <- c(31, 4, 15, 92, 65)
x
rank(x) # returns ranks (smallest to largest)
Vector Arithmetic
Key Point
- In R, arithmetic operation on vectors occur element-wise
Code
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]
# how to obtain the murder rate
murder_rate <- murders$total / murders$population * 100000
# ordering the states by murder rate, in decreasing order
murders$state[order(murder_rate, decreasing=TRUE)]
Indexing, Data Wrangling, Plots
In this section, I will introduce the R commands and techniques that help you wrangle, analyze, and visualize data.
In Indexing, you will: - Subset a vector based on properties of another vector.
Use multiple logical operators to index vectors.
Extract the indices of vector elements satisfying one or more logical conditions.
Extract the indices of vector elements matching with another vector.
Determine which elements in one vector are present in another vector.
In basic data wrangling, you will:
Wrangle data tables using functions in the dplyr package.
Modify a data table by adding or changing columns.
Subset rows in a data table.
Subset columns in a data table.
Perform a series of operations using the pipe operator.
Create data frames.
In basic plots, you will: - Plot data in scatter plots, box plots, and histograms.
In summarizing with dplyr, you will: - Use summarize() to facilitate summarizing data in dplyr.
Learn about the dot placeholder.
Learn how to group and then summarize in dplyr.
Learn how to sort data tables in dplyr.
In the rest section, you will: - Learn how to subset and summarize data using data.table.
- Learn how to sort data frames using data.table.
Indexing
Key Point
We can use logicals to index vectors.
Using the function sum()on a logical vector returns the number of entries that are true.
The logical operator “&” makes two logicals true only when they are both true.
Code
# defining murder rate as before
murder_rate <- murders$total / murders$population * 100000
# creating a logical vector that specifies if the murder rate in that state is less than or equal to 0.71
index <- murder_rate <= 0.71
# determining which states have murder rates less than or equal to 0.71
murders$state[index]
# calculating how many states have a murder rate less than or equal to 0.71
sum(index)
# creating the two logical vectors representing our conditions
west <- murders$region == "West"
safe <- murder_rate <= 1
# defining an index and identifying states with both conditions true
index <- safe & west
murders$state[index]
Indexing Functions
Key Points
The function
which()
gives us the entries of a logical vector that are true.The function
match()
looks for entries in a vector and returns the index needed to access them.We use the function
%in%
if we want to know whether or not each element of a first vector is in a second vector.
Code
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x) # returns indices that are TRUE
# to determine the murder rate in Massachusetts we may do the following
index <- which(murders$state == "Massachusetts")
index
murder_rate[index]
# to obtain the indices and subsequent murder rates of New York, Florida, Texas, we do:
index <- match(c("New York", "Florida", "Texas"), murders$state)
index
murders$state[index]
murder_rate[index]
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x
# to see if Boston, Dakota, and Washington are states
c("Boston", "Dakota", "Washington") %in% murders$state
Basic Data Wrangling
Key Points
To change a data table by adding a new column, or changing an existing one, we use the
mutate()
function.To filter the data by subsetting rows, we use the function
filter()
.To subset the data by selecting specific columns, we use the
select()
function.We can perform a series of operations by sending the results of one function to another function using the pipe operator,
%>%
.
Creating Data Frames
Note
The default settings in R have changed as of version 4.0, and it is no longer necessary to include the code stringsAsFactors = FALSE
in order to keep strings as characters. Putting the entries in quotes, as in the example, is adequate to keep strings as characters. The stringsAsFactors = FALSE
code is useful in certain other situations, but you do not need to include it when you create data frames in this manner.
Key Points
We can use the
data.frame()
function to create data frames.Formerly, the
data.frame()
function turned characters into factors by default. To avoid this, we could utilize thestringsAsFactors
argument and set it equal to false. As of R 4.0, it is no longer necessary to include thestringsAsFactors
argument, because R no longer turns characters into factors by default.
Code
Basic Plots
Key Points
We can create a simple scatterplot using the function
plot()
.Histograms are graphical summaries that give you a general overview of the types of values you have. In R, they can be produced using the
hist()
function.Boxplots provide a more compact summary of a distribution than a histogram and are more useful for comparing distributions. They can be produced using the
boxplot()
function.
Code
The summarize function
Key Points
Summarizing data is an important part of data analysis.
Some summary ststistics are the mean, median, and standard deviation.
The
summarize()
function from dplyr provides an easy way to compute summary statics.
Code
# minimum, median, and maximum murder rate for the states in the West region
s <- murders %>%
filter(region == "West") %>%
summarize(minimum = min(rate),
median = median(rate),
maximum = max(rate))
s
minimum median maximum
1 0.514592 1.292453 3.629527
[1] 1.292453
[1] 3.629527
[1] 2.779125
# average rate adjusted by population size
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5)
us_murder_rate
rate
1 3.034555
Summarizing with more than one value
Key Points
- The
quantile()
function can be used to return the min, median, and max in a single line of code.
Code
# minimum, median, and maximum murder rate for the states in the West region using quantile
# note that this returns a vector
murders %>%
filter(region == "West") %>%
summarize(range = quantile(rate, c(0, 0.5, 1)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
range
1 0.514592
2 1.292453
3 3.629527
# returning minimum, median, and maximum as a data frame
my_quantile <- function(x){
r <- quantile(x, c(0, 0.5, 1))
data.frame(minimum = r[1], median = r[2], maximum = r[3])
}
murders %>%
filter(region == "West") %>%
summarize(my_quantile(rate))
minimum median maximum
1 0.514592 1.292453 3.629527
Pull to access to columns
Key Points
- The
pull()
function can be used to access values stored in data when using pipes: when a data object is piped that object and its columns can be accessed using thepull()
function.
Code
# average rate adjusted by population size
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5)
us_murder_rate
rate
1 3.034555
[1] "data.frame"
[1] 3.034555
# using pull to save the number directly
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5) %>%
pull(rate)
us_murder_rate
[1] 3.034555
[1] "numeric"
The dot placeholder
Key Points
- The
dot (.)
can be thought of as a placeholder for the data being passed through the pipe.
# average rate adjusted by population size
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5)
us_murder_rate
rate
1 3.034555
# using the dot to access the rate
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5) %>%
.$rate
us_murder_rate
[1] 3.034555
[1] "numeric"
Group then summarize
Key Points
Splitting data into groups and then computing summaries for each group is a common operation in data exploration.
We can use the dplyr
group_by()
function to create a special grouped data frame to facilitate such summaries.
# A tibble: 51 × 6
# Groups: region [4]
state abb region population total rate
<chr> <chr> <fct> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
5 California CA West 37253956 1257 3.37
6 Colorado CO West 5029196 65 1.29
7 Connecticut CT Northeast 3574097 97 2.71
8 Delaware DE South 897934 38 4.23
9 District of Columbia DC South 601723 99 16.5
10 Florida FL South 19687653 669 3.40
# ℹ 41 more rows
# A tibble: 4 × 2
region median
<fct> <dbl>
1 Northeast 1.80
2 South 3.40
3 North Central 1.97
4 West 1.29
Sorting data tables
Key Points
To order an entire table, we can use the dplyr function
arrange()
.We can also use nested sorting to order by additional columns.
The function
head()
returns on the first few lines of a table.The function
top_n()
returns the top n rows of a table.
Code
state abb region population total rate
1 Wyoming WY West 563626 5 0.8871131
2 District of Columbia DC South 601723 99 16.4527532
3 Vermont VT Northeast 625741 2 0.3196211
4 North Dakota ND North Central 672591 4 0.5947151
5 Alaska AK West 710231 19 2.6751860
6 South Dakota SD North Central 814180 8 0.9825837
# order the states by murder rate - the default is ascending order
murders %>% arrange(rate) %>% head()
state abb region population total rate
1 Vermont VT Northeast 625741 2 0.3196211
2 New Hampshire NH Northeast 1316470 5 0.3798036
3 Hawaii HI West 1360301 7 0.5145920
4 North Dakota ND North Central 672591 4 0.5947151
5 Iowa IA North Central 3046355 21 0.6893484
6 Idaho ID West 1567582 12 0.7655102
state abb region population total rate
1 District of Columbia DC South 601723 99 16.452753
2 Louisiana LA South 4533372 351 7.742581
3 Missouri MO North Central 5988927 321 5.359892
4 Maryland MD South 5773552 293 5.074866
5 South Carolina SC South 4625364 207 4.475323
6 Delaware DE South 897934 38 4.231937
# order the states by region and then by murder rate within region
murders %>% arrange(region, rate) %>% head()
state abb region population total rate
1 Vermont VT Northeast 625741 2 0.3196211
2 New Hampshire NH Northeast 1316470 5 0.3798036
3 Maine ME Northeast 1328361 11 0.8280881
4 Rhode Island RI Northeast 1052567 16 1.5200933
5 Massachusetts MA Northeast 6547629 118 1.8021791
6 New York NY Northeast 19378102 517 2.6679599
state abb region population total rate
1 Arizona AZ West 6392017 232 3.629527
2 Delaware DE South 897934 38 4.231937
3 District of Columbia DC South 601723 99 16.452753
4 Georgia GA South 9920000 376 3.790323
5 Louisiana LA South 4533372 351 7.742581
6 Maryland MD South 5773552 293 5.074866
7 Michigan MI North Central 9883640 413 4.178622
8 Mississippi MS South 2967297 120 4.044085
9 Missouri MO North Central 5988927 321 5.359892
10 South Carolina SC South 4625364 207 4.475323
# return the top 10 states ranked by murder rate, sorted by murder rate
murders %>% arrange(desc(rate)) %>% top_n(10)
Selecting by rate
state abb region population total rate
1 District of Columbia DC South 601723 99 16.452753
2 Louisiana LA South 4533372 351 7.742581
3 Missouri MO North Central 5988927 321 5.359892
4 Maryland MD South 5773552 293 5.074866
5 South Carolina SC South 4625364 207 4.475323
6 Delaware DE South 897934 38 4.231937
7 Michigan MI North Central 9883640 413 4.178622
8 Mississippi MS South 2967297 120 4.044085
9 Georgia GA South 9920000 376 3.790323
10 Arizona AZ West 6392017 232 3.629527
Introduction to data.table
Key Points
In this course, we often use tidyverse packages to illustrate because these packages tend to have code that is very readable for beginners.
There are other approaches to wrangling and analyzing data in R that are faster and better at handling large objects, such as the data.table package.
Selecting in data.table uses notation similar to that used with matrices.
To add a column in data.table, you can use the := function.
Because the data.table package is designed to avoid wasting memory, when you make a copy of a table, it does not create a new object. The := function changes by reference. If you want to make an actual copy, you need to use the copy() function.
Side note: the R language has a new, built-in pipe operator as of version 4.1: |>. This works similarly to the pipe %>% you are already familiar with. You can read more about the |> pipe here External link.
Code
# install the data.table package before you use it!
install.packages("data.table")
# load data.table package
library(data.table)
# load other packages and datasets
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
# convert the data frame into a data.table object
murders <- setDT(murders)
# selecting in dplyr
select(murders, state, region)
# selecting in data.table - 2 methods
murders[, c("state", "region")] |> head()
murders[, .(state, region)] |> head()
# adding or changing a column in dplyr
murders <- mutate(murders, rate = total / population * 10^5)
# adding or changing a column in data.table
murders[, rate := total / population * 100000]
head(murders)
murders[, ":="(rate = total / population * 100000, rank = rank(population))]
# y is referring to x and := changes by reference
x <- data.table(a = 1)
y <- x
x[,a := 2]
y
y[,a := 1]
x
# use copy to make an actual copy
x <- data.table(a = 1)
y <- copy(x)
x[,a := 2]
y
Subsetting with data.table
Key Points
Subsetting in data.table uses notation similar to that used with matrices.
Code
# load packages and prepare the data
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
library(data.table)
murders <- setDT(murders)
murders <- mutate(murders, rate = total / population * 10^5)
murders[, rate := total / population * 100000]
# subsetting in dplyr
filter(murders, rate <= 0.7)
# subsetting in data.table
murders[rate <= 0.7]
# combining filter and select in data.table
murders[rate <= 0.7, .(state, rate)]
# combining filter and select in dplyr
murders %>% filter(rate <= 0.7) %>% select(state, rate)
Summarizing with data.table
Key Points
In data.table we can call functions inside
.()
and they will be applied to rows.The
group_by
followed by summarize in dplyr is performed in one line in data.table using the by argument.
Code
# load packages and prepare the data - heights dataset
library(tidyverse)
library(dplyr)
library(dslabs)
data(heights)
heights <- setDT(heights)
# summarizing in dplyr
s <- heights %>%
summarize(average = mean(height), standard_deviation = sd(height))
# summarizing in data.table
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
# subsetting and then summarizing in dplyr
s <- heights %>%
filter(sex == "Female") %>%
summarize(average = mean(height), standard_deviation = sd(height))
# subsetting and then summarizing in data.table
s <- heights[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]
# previously defined function
median_min_max <- function(x){
qs <- quantile(x, c(0.5, 0, 1))
data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}
# multiple summaries in data.table
heights[, .(median_min_max(height))]
# grouping then summarizing in data.table
heights[, .(average = mean(height), standard_deviation = sd(height)), by = sex]
Sorting data frames
Key Points
To order rows in a data frame using data.table, we can use the same approach we used for filtering.
The default sort is an ascending order, but we can also sort tables in descending order.
We can also perform nested sorting by including multiple variables in the desired sort order.
Code
# load packages and datasets and prepare the data
library(tidyverse)
library(dplyr)
library(data.table)
library(dslabs)
data(murders)
murders <- setDT(murders)
murders[, rate := total / population * 100000]
# order by population
murders[order(population)] |> head()
# order by population in descending order
murders[order(population, decreasing = TRUE)]
# order by region and then murder rate
murders[order(region, rate)]
Programming Basics
In this section, I will introduce you to general programming features like ‘if-else’ and ‘for loop’ commands so that you can write your own functions to perform various operations on datasets.
In programming basics, you will: - Understand some of the programming capabilities of R.
In basic condationals, you will: - Use basic conditional expressions to perform different operations. Check if any or all elements of a logical vector are TRUE.
In function, you will: - Define and call functions to perform various operations.
- Pass arguments to functions, and return variables/objects from functions.
In loops, you will: - Use for-loops to perform repeated operations.
- Articulate in-built functions of R that you could try for yourself.
Programming Basics
Basic Condationals
Key Points
The most common conditional expression in programming is an if-else statement, which has the form “if [condition], perform [expression], else perform [alternative expression]”.
The
ifelse()
function works similarly to an if-else statement, but it is particularly useful since it works on vectors by examining each element of the vector and returning a corresponding answer accordingly.The
any()
function takes a vector of logicals and returns true if any of the entries are true.The
all()
function takes a vector of logicals and returns true if all of the entries are true.
Code
# an example showing the general structure of an if-else statement
a <- 0
if(a!=0){
print(1/a)
} else{
print("No reciprocal for 0.")
}
# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
print(murders$state[ind])
} else{
print("No state has murder rate that low")
}
# changing the condition to < 0.25 changes the result
if(murder_rate[ind] < 0.25){
print(murders$state[ind])
} else{
print("No state has a murder rate that low.")
}
# the ifelse() function works similarly to an if-else conditional
a <- 0
ifelse(a > 0, 1/a, NA)
# the ifelse() function is particularly useful on vectors
a <- c(0,1,2,-4,5)
result <- ifelse(a > 0, 1/a, NA)
# the ifelse() function is also helpful for replacing missing values
data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
# the any() and all() functions evaluate logical vectors
z <- c(TRUE, TRUE, FALSE)
any(z)
all(z)
Functions
Key Points
The R function called function() tells R you are about to define a new function.
Functions are objects, so must be assigned a variable name with the arrow operator.
The general way to define functions is:
- decide the function name, which will be an object,
- type function() with your function’s arguments in parentheses, - (3) write all the operations inside brackets.
Variables defined inside a function are not saved in the workspace.
Code
# example of defining a function to compute the average of a vector x
avg <- function(x){
s <- sum(x)
n <- length(x)
s/n
}
# we see that the above function and the pre-built R mean() function are identical
x <- 1:100
identical(mean(x), avg(x))
# variables inside a function are not defined in the workspace
s <- 3
avg(1:10)
s
# the general form of a function
my_function <- function(VARIABLE_NAME){
perform operations on VARIABLE_NAME and calculate VALUE
VALUE
}
# functions can have multiple arguments as well as default values
avg <- function(x, arithmetic = TRUE){
n <- length(x)
ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
For Loops
Key Points
For-loops perform the same task over and over while changing the variable. They let us define the range that our variable takes, and then changes the value with each loop and evaluates the expression every time inside the loop.
The general form of a for-loop is: “For i in [some range], do operations”. This i changes across the range of values and the operations assume i is a value you’re interested in computing on.
At the end of the loop, the value of i is the last value of the range.
Code
# creating a function that computes the sum of integers 1 through n
compute_s_n <- function(n){
x <- 1:n
sum(x)
}
# a very simple for-loop
for(i in 1:5){
print(i)
}
# a for-loop for our summation
m <- 25
s_n <- vector(length = m) # create an empty vector
for(n in 1:m){
s_n[n] <- compute_s_n(n)
}
# creating a plot for our summation function
n <- 1:m
plot(n, s_n)
# a table of values comparing our function to the summation formula
head(data.frame(s_n = s_n, formula = n*(n+1)/2))
# overlaying our function with the summation formula
plot(n, s_n)
lines(n, n*(n+1)/2)
Citation
@online{semba2024,
author = {Semba, Masumbuko},
title = {The Basics of {R} Programming},
date = {2024-01-24},
url = {https://lugoga.github.io/kitaa/posts/rbasics/},
langid = {en}
}