Chapter 4 Manipulating Data with dplyr

Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together. This leads to difficult-to-read nested functions and/or choppy code. R Studio is driving a lot of new packages to collate data management tasks and better integrate them with other analysis activities. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways, which leads to:

  • More efficient code
  • Easier to remember syntax
  • Easier to read syntax

The dplyr package provides a set of functions for efficiently manipulating datasets in R writen by Wickham et al. (2018). The package make it easy to transform and summarise tabular data with rows and columns. The dplyr packages contains set of functions—verbs that perfom most common data manipulation tasks like

When working with data you must:

  • Figure out what you want to do.
  • Describe those tasks in the form of a computer program.
  • Execute the program.

The dplyr package makes these steps fast and easy:

  • By constraining your options, it helps you think about your data manipulation challenges.
  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
  • It uses efficient backends, so you spend less time waiting for the computer.

4.1 Why use dplyr?

  1. Great for data exploration and manipulation
  2. Intuitive to write and easy to read, especially when using the chaining syntax
  3. Fast on data frame—tabular dataset

4.2 dplyr functionality

  • Five basic verbs:
  • select() to select columns based on their names
  • filter() to rows in data frame
  • arrange() to re-order or arrange the rows in ascending or descending order
  • mutate() to create new columns—add new variable
  • summarise() to make a summary of variable(s)
  • group_by() to group observation
  • sample_n() and rename()to make random sample from the data set

The group_by() function perform other common task which are related to the split-apply-combine concept. The dplyr package comes with the pipe operateor %>% from the magrittr package. The pipe operator is very useful for combining several functions in a chain.

4.3 filter: Keep rows matching criteria

  • Base R approach to filtering forces you to repeat the data frame’s name
  • dplyr approach is simpler to write and read
  • Command structure (for all dplyr verbs):
    • first argument is a data frame
    • return value is a data frame
    • nothing is modified in place
  • Note: dplyr generally does not preserve row names
require(dplyr)
require(readr)
require(lubridate)
require(readxl)
require(haven)
require(ggplot2)
require(kableExtra)
flights = read_csv("flights.csv") %>%
  select(-X1)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   carrier = col_character(),
##   tailnum = col_character(),
##   origin = col_character(),
##   dest = col_character(),
##   time_hour = col_character()
## )
## See spec(...) for full column specifications.
# base R approach to view all flights on January 1
flights[flights$month==1 & flights$day==1, ]
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, month==1, day==1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <chr>
# use pipe for OR condition

4.4 select: Pick columns by name

  • Base R approach is awkward to type and to read
  • dplyr approach uses similar syntax to filter
  • Like a SELECT in SQL
# base R approach to select DepTime, and ArrTime
flights[, c("dep_time", "arr_time")]
# dplyr approach
select(flights, dep_time, arr_time)
## # A tibble: 336,776 x 2
##    dep_time arr_time
##       <int>    <int>
##  1      517      830
##  2      533      850
##  3      542      923
##  4      544     1004
##  5      554      812
##  6      554      740
##  7      555      913
##  8      557      709
##  9      557      838
## 10      558      753
## # ... with 336,766 more rows
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, year:day, contains("taxi"), contains("delay"))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <int>     <int>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ... with 336,766 more rows

4.5 “Chaining” or “Pipelining”

Bache and Wickham (2014) developed a magrittr package, which has changed the way coding is done in R. It introduce the pipe operator %>% widely known as THEN. Usual way to perform multiple operations in R was through nesting. However, magrittr package introduced a natural order by using the %>% chain multiple operations

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, carrier, dep_delay), dep_delay > 60)
# chaining method
flights %>%
    select(carrier, dep_delay) %>%
    filter(dep_delay > 60)
## # A tibble: 26,581 x 2
##    carrier dep_delay
##    <chr>       <int>
##  1 MQ            101
##  2 AA             71
##  3 MQ            853
##  4 UA            144
##  5 UA            134
##  6 EV             96
##  7 MQ             71
##  8 B6             77
##  9 EV             70
## 10 EV            115
## # ... with 26,571 more rows
  • Chaining increases readability significantly when there are many commands
  • Operator is automatically imported from the magrittr package
  • Can be used to replace nesting in R commands outside of dplyr.

For example, we can create two vectors and calculate Euclidian distance between them using the mathematical equation (4.1)

\[ \begin{equation} \theta\: = \: \sqrt {\sum (x_1 - x_2)^2} \tag{4.1} \end{equation} \]

x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
## [1] 2.236068

4.6 Pipping with

4.6.1 Choosing columns: select(), rename()

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

# besides just using select() to pick columns...
flights %>% 
  select(carrier, flight)
## # A tibble: 336,776 x 2
##    carrier flight
##    <chr>    <int>
##  1 UA        1545
##  2 UA        1714
##  3 AA        1141
##  4 B6         725
##  5 DL         461
##  6 UA        1696
##  7 B6         507
##  8 EV        5708
##  9 B6          79
## 10 AA         301
## # ... with 336,766 more rows
# ...you can use the minus sign to hide columns
flights %>% 
  select(-month, -day)
## # A tibble: 336,776 x 17
##     year dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int>    <int>          <int>     <int>    <int>          <int>
##  1  2013      517            515         2      830            819
##  2  2013      533            529         4      850            830
##  3  2013      542            540         2      923            850
##  4  2013      544            545        -1     1004           1022
##  5  2013      554            600        -6      812            837
##  6  2013      554            558        -4      740            728
##  7  2013      555            600        -5      913            854
##  8  2013      557            600        -3      709            723
##  9  2013      557            600        -3      838            846
## 10  2013      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <int>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   time_hour <chr>
# hide a range of columns
flights %>% 
  select(-(dep_time:arr_delay))
# hide any column with a matching name
flights %>% 
  select(-contains("time"))
## # A tibble: 336,776 x 13
##     year month   day dep_delay arr_delay carrier flight tailnum origin
##    <int> <int> <int>     <int>     <int> <chr>    <int> <chr>   <chr> 
##  1  2013     1     1         2        11 UA        1545 N14228  EWR   
##  2  2013     1     1         4        20 UA        1714 N24211  LGA   
##  3  2013     1     1         2        33 AA        1141 N619AA  JFK   
##  4  2013     1     1        -1       -18 B6         725 N804JB  JFK   
##  5  2013     1     1        -6       -25 DL         461 N668DN  LGA   
##  6  2013     1     1        -4        12 UA        1696 N39463  EWR   
##  7  2013     1     1        -5        19 B6         507 N516JB  EWR   
##  8  2013     1     1        -3       -14 EV        5708 N829AS  LGA   
##  9  2013     1     1        -3        -8 B6          79 N593JB  JFK   
## 10  2013     1     1        -2         8 AA         301 N3ALAA  LGA   
## # ... with 336,766 more rows, and 4 more variables: dest <chr>,
## #   distance <int>, hour <int>, minute <int>
# pick columns using a character vector of column names
cols <- c("carrier", "flight", "tailnum")
flights %>% 
  select(one_of(cols))
## # A tibble: 336,776 x 3
##    carrier flight tailnum
##    <chr>    <int> <chr>  
##  1 UA        1545 N14228 
##  2 UA        1714 N24211 
##  3 AA        1141 N619AA 
##  4 B6         725 N804JB 
##  5 DL         461 N668DN 
##  6 UA        1696 N39463 
##  7 B6         507 N516JB 
##  8 EV        5708 N829AS 
##  9 B6          79 N593JB 
## 10 AA         301 N3ALAA 
## # ... with 336,766 more rows
# select() can be used to rename columns, though all columns not mentioned are dropped
flights %>% 
  select(tail = tailnum)
## # A tibble: 336,776 x 1
##    tail  
##    <chr> 
##  1 N14228
##  2 N24211
##  3 N619AA
##  4 N804JB
##  5 N668DN
##  6 N39463
##  7 N516JB
##  8 N829AS
##  9 N593JB
## 10 N3ALAA
## # ... with 336,766 more rows
# rename() does the same thing, except all columns not mentioned are kept
flights %>% 
  rename(tail = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tail <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <chr>

4.6.2 Choosing rows: filter, between, slice, sample_n, top_n, distinct

filter() allows you to select a subset of rows in a data frame. Like all single verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE. For example, we can select all flights on departed between 6:00 and 6:05 in the morning with:

flights %>% 
  filter(dep_time >= 600, dep_time <= 605)
## # A tibble: 2,460 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      600            600         0      851
##  2  2013     1     1      600            600         0      837
##  3  2013     1     1      601            600         1      844
##  4  2013     1     1      602            610        -8      812
##  5  2013     1     1      602            605        -3      821
##  6  2013     1     2      600            600         0      814
##  7  2013     1     2      600            605        -5      751
##  8  2013     1     2      600            600         0      819
##  9  2013     1     2      600            600         0      846
## 10  2013     1     2      600            600         0      737
## # ... with 2,450 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <chr>
# between() is a concise alternative for determing if numeric values fall in a range
flights %>% 
  filter(between(dep_time, 600, 605))
# side note: is.na() can also be useful when filtering
flights %>% 
  filter(!is.na(dep_time))
## # A tibble: 328,521 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 328,511 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <chr>

4.6.3 Adding new variables: mutate, transmute, add_rownames

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

# mutate() creates a new variable (and keeps all existing variables)
flights %>% mutate(speed = distance/air_time*60)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <int>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>, time_hour <chr>, speed <dbl>
# transmute() only keeps the new variables
flights %>% transmute(speed = distance/air_time*60)
## # A tibble: 336,776 x 1
##    speed
##    <dbl>
##  1  370.
##  2  374.
##  3  408.
##  4  517.
##  5  394.
##  6  288.
##  7  404.
##  8  259.
##  9  405.
## 10  319.
## # ... with 336,766 more rows
# example data frame with row names
mtcars %>% head()
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# add_rownames() turns row names into an explicit variable
mtcars %>% add_rownames("model") %>% head()
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## # A tibble: 6 x 12
##   model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda~  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda~  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsu~  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Horne~  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Horne~  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valia~  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# side note: dplyr no longer prints row names (ever) for local data frames
mtcars %>% tbl_df()
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

4.6.4 Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup

The verb summarise() collapses a data frame to a single row.It’s not that useful until chained with the group_by() verb below.

# summarise() can be used to count the number of rows in each group
flights %>% group_by(month) %>% summarise(cnt = n())
## # A tibble: 12 x 2
##    month   cnt
##    <int> <int>
##  1     1 27004
##  2     2 24951
##  3     3 28834
##  4     4 28330
##  5     5 28796
##  6     6 28243
##  7     7 29425
##  8     8 29327
##  9     9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135
# tally() and count() can do this more concisely
flights %>% group_by(month) %>% tally()
# you can sort by the count
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
## # A tibble: 12 x 2
##    month   cnt
##    <int> <int>
##  1     7 29425
##  2     8 29327
##  3    10 28889
##  4     3 28834
##  5     5 28796
##  6     4 28330
##  7     6 28243
##  8    12 28135
##  9     9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951
# you can sum over a specific variable instead of simply counting rows
flights %>% group_by(month) %>% summarise(dist = sum(distance))
## # A tibble: 12 x 2
##    month     dist
##    <int>    <int>
##  1     1 27188805
##  2     2 24975509
##  3     3 29179636
##  4     4 29427294
##  5     5 29974128
##  6     6 29856388
##  7     7 31149199
##  8     8 31149334
##  9     9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084
# tally() and count() have a wt parameter for this purpose
flights %>% group_by(month) %>% tally(wt = distance)
flights %>% count(month, wt = distance)
# group_size() returns the counts as a vector
flights %>% group_by(month) %>% group_size()
##  [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135
# n_groups() simply reports the number of groups
flights %>% group_by(month) %>% n_groups()
## [1] 12
# group by two variables, summarise, arrange (output is possibly confusing)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
# A tibble: 365 x 3
# Groups:   month [12]
   month   day   cnt
   <int> <int> <int>
 1    11    27  1014
 2     7    11  1006
 3     7     8  1004
 4     7    10  1004
 5    12     2  1004
 6     7    18  1003
 7     7    25  1003
 8     7    12  1002
 9     7     9  1001
10     7    17  1001
11     7    31  1001
12     8     7  1001
13     8     8  1001
14     8    12  1001
15     7    22  1000
16     7    24  1000
17     8     1  1000
18     8     5  1000
19     8    15  1000
20    11    21  1000
21     7    15   999
22     7    19   999
23     7    26   999
24     7    29   999
25     8     2   999
26     8     9   999
27    11    22   999
28     8    16   998
29     7    23   997
30     7    30   997
31     8    14   997
32     7    16   996
33     8     6   996
34     8    19   996
35     9    13   996
36     9    26   996
37     9    27   996
38     4    15   995
39     6    20   995
40     6    26   995
# ... with 325 more rows
# ungroup() before arranging to arrange across all groups
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
# A tibble: 365 x 3
   month   day   cnt
   <int> <int> <int>
 1    11    27  1014
 2     7    11  1006
 3     7     8  1004
 4     7    10  1004
 5    12     2  1004
 6     7    18  1003
 7     7    25  1003
 8     7    12  1002
 9     7     9  1001
10     7    17  1001
# ... with 355 more rows

References

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.