Chapter 4 Manipulating Data with dplyr
Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together. This leads to difficult-to-read nested functions and/or choppy code. R Studio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:
- More efficient code
- Easier to remember syntax
- Easier to read syntax
The dplyr
package provides a set of functions for efficiently manipulating datasets in R writen by Wickham et al. (2018). The package make it easy to transform
and summarise
tabular data with rows and columns. The dplyr
packages contains set of functions—verbs that perfom most common data manipulation tasks like
When working with data you must:
- Figure out what you want to do.
- Describe those tasks in the form of a computer program.
- Execute the program.
The dplyr package makes these steps fast and easy:
- By constraining your options, it helps you think about your data manipulation challenges.
- It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
- It uses efficient backends, so you spend less time waiting for the computer.
4.1 Why use dplyr?
- Great for data exploration and manipulation
- Intuitive to write and easy to read, especially when using the chaining syntax
- Fast on data frame—tabular dataset
4.2 dplyr functionality
- Five basic verbs:
select()
to select columns based on their namesfilter()
to rows in data framearrange()
to re-order or arrange the rows in ascending or descending ordermutate()
to create new columns—add new variablesummarise()
to make a summary of variable(s)group_by()
to group observationsample_n()
andrename()
to make random sample from the data set
The group_by()
function perform other common task which are related to the split-apply-combine concept. The dplyr
package comes with the pipe operateor %>%
from the magrittr
package. The pipe operator is very useful for combining several functions in a chain.
4.3 filter: Keep rows matching criteria
- Base R approach to filtering forces you to repeat the data frame’s name
- dplyr approach is simpler to write and read
- Command structure (for all dplyr verbs):
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
- Note: dplyr generally does not preserve row names
require(dplyr)
require(readr)
require(lubridate)
require(readxl)
require(haven)
require(ggplot2)
require(kableExtra)
flights = read_csv("flights.csv") %>%
select(-X1)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_integer(),
## carrier = col_character(),
## tailnum = col_character(),
## origin = col_character(),
## dest = col_character(),
## time_hour = col_character()
## )
## See spec(...) for full column specifications.
# base R approach to view all flights on January 1
flights[flights$month==1 & flights$day==1, ]
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, month==1, day==1)
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>, time_hour <chr>
# use pipe for OR condition
4.4 select: Pick columns by name
- Base R approach is awkward to type and to read
- dplyr approach uses similar syntax to
filter
- Like a SELECT in SQL
# base R approach to select DepTime, and ArrTime
flights[, c("dep_time", "arr_time")]
# dplyr approach
select(flights, dep_time, arr_time)
## # A tibble: 336,776 x 2
## dep_time arr_time
## <int> <int>
## 1 517 830
## 2 533 850
## 3 542 923
## 4 544 1004
## 5 554 812
## 6 554 740
## 7 555 913
## 8 557 709
## 9 557 838
## 10 558 753
## # ... with 336,766 more rows
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, year:day, contains("taxi"), contains("delay"))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <int> <int>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
4.5 “Chaining” or “Pipelining”
Bache and Wickham (2014) developed a magrittr
package, which has changed the way coding is done in R. It introduce the pipe operator %>%
widely known as THEN
. Usual way to perform multiple operations in R was through nesting. However, magrittr
package introduced a natural order by using the %>%
chain multiple operations
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, carrier, dep_delay), dep_delay > 60)
# chaining method
flights %>%
select(carrier, dep_delay) %>%
filter(dep_delay > 60)
## # A tibble: 26,581 x 2
## carrier dep_delay
## <chr> <int>
## 1 MQ 101
## 2 AA 71
## 3 MQ 853
## 4 UA 144
## 5 UA 134
## 6 EV 96
## 7 MQ 71
## 8 B6 77
## 9 EV 70
## 10 EV 115
## # ... with 26,571 more rows
- Chaining increases readability significantly when there are many commands
- Operator is automatically imported from the magrittr package
- Can be used to replace nesting in R commands outside of dplyr.
For example, we can create two vectors and calculate Euclidian distance between them using the mathematical equation (4.1)
\[ \begin{equation} \theta\: = \: \sqrt {\sum (x_1 - x_2)^2} \tag{4.1} \end{equation} \]
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
## [1] 2.236068
4.6 Pipping with
4.6.1 Choosing columns: select()
, rename()
Often you work with large datasets with many columns but only a few are actually of interest to you. select()
allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
# besides just using select() to pick columns...
flights %>%
select(carrier, flight)
## # A tibble: 336,776 x 2
## carrier flight
## <chr> <int>
## 1 UA 1545
## 2 UA 1714
## 3 AA 1141
## 4 B6 725
## 5 DL 461
## 6 UA 1696
## 7 B6 507
## 8 EV 5708
## 9 B6 79
## 10 AA 301
## # ... with 336,766 more rows
# ...you can use the minus sign to hide columns
flights %>%
select(-month, -day)
## # A tibble: 336,776 x 17
## year dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <int>
## 1 2013 517 515 2 830 819
## 2 2013 533 529 4 850 830
## 3 2013 542 540 2 923 850
## 4 2013 544 545 -1 1004 1022
## 5 2013 554 600 -6 812 837
## 6 2013 554 558 -4 740 728
## 7 2013 555 600 -5 913 854
## 8 2013 557 600 -3 709 723
## 9 2013 557 600 -3 838 846
## 10 2013 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <int>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <int>, distance <int>, hour <int>, minute <int>,
## # time_hour <chr>
# hide a range of columns
flights %>%
select(-(dep_time:arr_delay))
# hide any column with a matching name
flights %>%
select(-contains("time"))
## # A tibble: 336,776 x 13
## year month day dep_delay arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr>
## 1 2013 1 1 2 11 UA 1545 N14228 EWR
## 2 2013 1 1 4 20 UA 1714 N24211 LGA
## 3 2013 1 1 2 33 AA 1141 N619AA JFK
## 4 2013 1 1 -1 -18 B6 725 N804JB JFK
## 5 2013 1 1 -6 -25 DL 461 N668DN LGA
## 6 2013 1 1 -4 12 UA 1696 N39463 EWR
## 7 2013 1 1 -5 19 B6 507 N516JB EWR
## 8 2013 1 1 -3 -14 EV 5708 N829AS LGA
## 9 2013 1 1 -3 -8 B6 79 N593JB JFK
## 10 2013 1 1 -2 8 AA 301 N3ALAA LGA
## # ... with 336,766 more rows, and 4 more variables: dest <chr>,
## # distance <int>, hour <int>, minute <int>
# pick columns using a character vector of column names
cols <- c("carrier", "flight", "tailnum")
flights %>%
select(one_of(cols))
## # A tibble: 336,776 x 3
## carrier flight tailnum
## <chr> <int> <chr>
## 1 UA 1545 N14228
## 2 UA 1714 N24211
## 3 AA 1141 N619AA
## 4 B6 725 N804JB
## 5 DL 461 N668DN
## 6 UA 1696 N39463
## 7 B6 507 N516JB
## 8 EV 5708 N829AS
## 9 B6 79 N593JB
## 10 AA 301 N3ALAA
## # ... with 336,766 more rows
# select() can be used to rename columns, though all columns not mentioned are dropped
flights %>%
select(tail = tailnum)
## # A tibble: 336,776 x 1
## tail
## <chr>
## 1 N14228
## 2 N24211
## 3 N619AA
## 4 N804JB
## 5 N668DN
## 6 N39463
## 7 N516JB
## 8 N829AS
## 9 N593JB
## 10 N3ALAA
## # ... with 336,766 more rows
# rename() does the same thing, except all columns not mentioned are kept
flights %>%
rename(tail = tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <int>, carrier <chr>, flight <int>, tail <chr>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>, time_hour <chr>
4.6.2 Choosing rows: filter, between, slice, sample_n, top_n, distinct
filter()
allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE. For example, we can select all flights on departed between 6:00 and 6:05 in the morning with:
flights %>%
filter(dep_time >= 600, dep_time <= 605)
## # A tibble: 2,460 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 600 600 0 851
## 2 2013 1 1 600 600 0 837
## 3 2013 1 1 601 600 1 844
## 4 2013 1 1 602 610 -8 812
## 5 2013 1 1 602 605 -3 821
## 6 2013 1 2 600 600 0 814
## 7 2013 1 2 600 605 -5 751
## 8 2013 1 2 600 600 0 819
## 9 2013 1 2 600 600 0 846
## 10 2013 1 2 600 600 0 737
## # ... with 2,450 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>, time_hour <chr>
# between() is a concise alternative for determing if numeric values fall in a range
flights %>%
filter(between(dep_time, 600, 605))
# side note: is.na() can also be useful when filtering
flights %>%
filter(!is.na(dep_time))
## # A tibble: 328,521 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 328,511 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>, time_hour <chr>
4.6.3 Adding new variables: mutate, transmute, add_rownames
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate()
:
# mutate() creates a new variable (and keeps all existing variables)
flights %>% mutate(speed = distance/air_time*60)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>, time_hour <chr>, speed <dbl>
# transmute() only keeps the new variables
flights %>% transmute(speed = distance/air_time*60)
## # A tibble: 336,776 x 1
## speed
## <dbl>
## 1 370.
## 2 374.
## 3 408.
## 4 517.
## 5 394.
## 6 288.
## 7 404.
## 8 259.
## 9 405.
## 10 319.
## # ... with 336,766 more rows
# example data frame with row names
mtcars %>% head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# add_rownames() turns row names into an explicit variable
mtcars %>% add_rownames("model") %>% head()
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## # A tibble: 6 x 12
## model mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda~ 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsu~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Horne~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Horne~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valia~ 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# side note: dplyr no longer prints row names (ever) for local data frames
mtcars %>% tbl_df()
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
4.6.4 Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup
The verb summarise()
collapses a data frame to a single row.It’s not that useful until chained with the group_by()
verb below.
# summarise() can be used to count the number of rows in each group
flights %>% group_by(month) %>% summarise(cnt = n())
## # A tibble: 12 x 2
## month cnt
## <int> <int>
## 1 1 27004
## 2 2 24951
## 3 3 28834
## 4 4 28330
## 5 5 28796
## 6 6 28243
## 7 7 29425
## 8 8 29327
## 9 9 27574
## 10 10 28889
## 11 11 27268
## 12 12 28135
# tally() and count() can do this more concisely
flights %>% group_by(month) %>% tally()
# you can sort by the count
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
## # A tibble: 12 x 2
## month cnt
## <int> <int>
## 1 7 29425
## 2 8 29327
## 3 10 28889
## 4 3 28834
## 5 5 28796
## 6 4 28330
## 7 6 28243
## 8 12 28135
## 9 9 27574
## 10 11 27268
## 11 1 27004
## 12 2 24951
# you can sum over a specific variable instead of simply counting rows
flights %>% group_by(month) %>% summarise(dist = sum(distance))
## # A tibble: 12 x 2
## month dist
## <int> <int>
## 1 1 27188805
## 2 2 24975509
## 3 3 29179636
## 4 4 29427294
## 5 5 29974128
## 6 6 29856388
## 7 7 31149199
## 8 8 31149334
## 9 9 28711426
## 10 10 30012086
## 11 11 28639718
## 12 12 29954084
# tally() and count() have a wt parameter for this purpose
flights %>% group_by(month) %>% tally(wt = distance)
flights %>% count(month, wt = distance)
# group_size() returns the counts as a vector
flights %>% group_by(month) %>% group_size()
## [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135
# n_groups() simply reports the number of groups
flights %>% group_by(month) %>% n_groups()
## [1] 12
# group by two variables, summarise, arrange (output is possibly confusing)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
# A tibble: 365 x 3
# Groups: month [12]
month day cnt
<int> <int> <int>
1 11 27 1014
2 7 11 1006
3 7 8 1004
4 7 10 1004
5 12 2 1004
6 7 18 1003
7 7 25 1003
8 7 12 1002
9 7 9 1001
10 7 17 1001
11 7 31 1001
12 8 7 1001
13 8 8 1001
14 8 12 1001
15 7 22 1000
16 7 24 1000
17 8 1 1000
18 8 5 1000
19 8 15 1000
20 11 21 1000
21 7 15 999
22 7 19 999
23 7 26 999
24 7 29 999
25 8 2 999
26 8 9 999
27 11 22 999
28 8 16 998
29 7 23 997
30 7 30 997
31 8 14 997
32 7 16 996
33 8 6 996
34 8 19 996
35 9 13 996
36 9 26 996
37 9 27 996
38 4 15 995
39 6 20 995
40 6 26 995
# ... with 325 more rows
# ungroup() before arranging to arrange across all groups
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
# A tibble: 365 x 3
month day cnt
<int> <int> <int>
1 11 27 1014
2 7 11 1006
3 7 8 1004
4 7 10 1004
5 12 2 1004
6 7 18 1003
7 7 25 1003
8 7 12 1002
9 7 9 1001
10 7 17 1001
# ... with 355 more rows
References
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.