Subsetting to remove columns from a data frame in R

12. August 2016 101, how to, r 0
Subsetting to remove columns from a data frame in R

So you’ve got a data frame in R and you need to remove one or more columns.  It’s an easy fix in R.  Below, I’ll show you two ways to remove a column from a data set: the base R way and the dplyr way.

As an example, I’ve created a simple data frame with three columns, but then I accidentally run dat$C = "oops", and instantly regret doing so.  [Yes, since this is such a simple data frame, I could easily rerun the code to recreate the data frame, but assume your script is much more complex and involved than this.]

x = seq(0,10)
y = seq(20,30)
dat = data.frame("A" = x, "B" = y, "L" = letters[1:11])
dat$C = "oops"


Data Frame:
   A  B L   C
1  0  20 a  oops
2  1  21 b  oops
3  2  22 c  oops
4  3  23 d  oops
5  4  24 e  oops
6  5  25 f  oops
7  6  26 g  oops
8  7  27 h  oops
9  8  28 i  oops
10 9  29 j  oops
11 10 30 k  oops

Subsetting the base R way

The base R way to remove a column is to subset the data frame and then assign it back. Subsetting is easy to learn once you understand how it works.  You’ll need to tell R what dataset you want to subset and then give R a list of arguments.  There is a built in function to subset data using the subset() function, but it’s not my preference as it takes a lot more typing.  Below are some examples of subsetting the dat data frame from the code above using a method I learned on my R learning journey.

Subset Example Expected Results
dat[ , ] returns all of the rows in the data frame
dat[1:5, ] returns rows 1 through 5 and all of the columns
dat[ , c(1,4)] returns all rows, but only shows the columns at index 1 and 4
dat[ , 5] returns all rows, but only shows column 5
dat[ , -4] returns all of the rows in the data frame and every column except for the one at index 4
dat[ , c(-2,-4)] returns all of the rows in the data frame and every column except for index 2 and index 4
dat[dat$L == 'a' , 1:2] returns only those rows where the data in column equals 'a' AND only shows columns 1 and 2

Using an example from above, we’ll subset the data frame, currently stored in dat, and remove column C (index 4), and store the result back to the dat variable.

dat = dat[,c(-4)]


   A  B L 
1  0  20 a
2  1  21 b
3  2  22 c
4  3  23 d
5  4  24 e
6  5  25 f
7  6  26 g
8  7  27 h
9  8  28 i
10 9  29 j
11 10 30 k

Subsetting with dplyr

It’s no secret, I am a huge fan of the dplyr package.  dplyr uses five primary verbs (more in the link) to manipulate data: select, filter, mutate, group_by, summarize–each one strung together by the pipe (%>%). There’s even a keyboard shortcut that makes entering a pipe super fast: cmd/ctrl + shift + m.

x = referencedData %>% select() %>% filter() %>% mutate() %>% group_by() %>% summarize()

The select function will subset a data frame based on the unquoted column name or index.  It’s very similar to the base r subset() function, but I think it’s more user friendly.  Another beautiful thing about dplyr is that you do not have to type the exact variable reference if you were using the other verbs.  If you were going to be subsetting the data on multiple columns, the base R way is to provide an exact reference to the column (e.g., dat$A = 1, dat$L = ‘a’).  In dplyr, you’ve already told the code from where all of the columns will be pulled.

Below is the dplyr way of subsetting the data.  The two code chunks below tell dplyr and R to take the data frame in variable, dat, select columns 1-4 (or A, B, and L), and assign the result back to the dat variable.

So, using dat = dat %>% select(1:4) or dat = dat %>% select(A,B,L) will give you the same result

dat = dat %>% select(A, B, L)


   A  B  L 
1  0  20 a
2  1  21 b
3  2  22 c
4  3  23 d
5  4  24 e
6  5  25 f
7  6  26 g
8  7  27 h
9  8  28 i
10 9  29 j
11 10 30 k


Either way you choose to remove a column, you’ll end up with the same result.  If you’re interested in learning more about the dplyr package, I would recommend this tutorial.

Leave a Reply