Subset Vectors in R
By: Karthik Janar in data-science Tutorials on 2018-05-01
In this tutorial, we'll see how to extract elements from a vector based on conditions that we specify. For example, we may only be interested in the first 20 elements of a vector, or only the elements that are not NA, or only those that are positive or correspond to a specific variable of interest. By the end of this tutorial, you'll know how to handle each of these scenarios.
First, create a vector called x that contains a random ordering of 20 values: 10 numbers drawn from a standard normal distribution and 10 NAs.
y <- rnorm(10)
z <- rep(NA, 10)
x <- sample(c(y,z),20)
x
## [1] -1.2731728 NA 0.6789480 NA -0.4966300 NA
## [7] NA NA -0.4829580 NA -0.4601567 -0.4970408
## [13] -0.3120564 -0.2704296 NA NA NA NA
## [19] -0.5934239 -0.6427310
The way you tell R that you want to select some particular elements (i.e. a "subset") from a vector is by placing an "index vector" in square brackets immediately following the name of the vector.
For a simple example, try x[1:10] to view the first ten elements of x.
x[1:10]
## [1] -1.273173 NA 0.678948 NA -0.496630 NA NA
## [8] NA -0.482958 NA
Index vectors come in four different flavors: logical vectors, vectors of positive integers, vectors of negative integers, and vectors of character strings. We'll cover each of them in this tutorial.
Let's start by indexing with logical vectors. One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data). Recall that is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x.
What do you think x[is.na(x)] will give you? It returns a vector containing only the NAs in x.
x[is.na(x)]
## [1] NA NA NA NA NA NA NA NA NA NA
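To see why this works, it can help to look at the logical index vector itself. The sketch below uses a small hypothetical vector v (not the x from above) and a hypothetical name mask, so the result is reproducible.

```r
# v is a small made-up vector used only for illustration
v <- c(1.5, NA, -0.3, NA)

# is.na() returns one logical value per element of v
mask <- is.na(v)
mask            # TRUE marks the positions that hold NA

# Only the elements where mask is TRUE are kept
v[mask]

# sum() counts the TRUEs, i.e. the number of missing values
sum(mask)
```

Storing the logical vector in a variable first is only a readability choice; v[is.na(v)] does the same thing in one step.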
The ! operator gives us the negation of a logical expression, so !is.na(x) can be read as "is not NA". Therefore, if we want to create a vector called y that contains all of the non-NA values from x, we can use y <- x[!is.na(x)]. Note that this overwrites the y we created earlier.
y <- x[!is.na(x)]
y
## [1] -1.2731728 0.6789480 -0.4966300 -0.4829580 -0.4601567 -0.4970408
## [7] -0.3120564 -0.2704296 -0.5934239 -0.6427310
Now that we've isolated the non-missing values of x and put them in y, we can subset y as we please. Recall that the expression y > 0 gives us a vector of logical values the same length as y, with TRUEs corresponding to values of y that are greater than zero and FALSEs corresponding to values that are less than or equal to zero.
Type y[y > 0] to see that we get all of the positive elements of y, which are also the positive elements of our original vector x.
y[y>0]
## [1] 0.678948
You might wonder why we didn't just start with x[x > 0] to isolate the positive elements of x. Try that now to see why.
x[x>0]
## [1] NA 0.678948 NA NA NA NA NA
## [8] NA NA NA NA
Since NA is not a value, but rather a placeholder for an unknown quantity, the expression NA > 0 evaluates to NA. Hence we get a bunch of NAs mixed in with our positive numbers when we do this.
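The behavior described here can be checked on a small example. This is a minimal sketch with a hypothetical vector v and a hypothetical name cmp; it shows that an NA in the logical index produces an NA in the result.

```r
# v is a made-up vector with one missing value
v <- c(-1, NA, 2)

# Comparing NA with anything gives NA, not TRUE or FALSE
na_cmp <- NA > 0
na_cmp

# So the comparison yields FALSE, NA, TRUE
cmp <- v > 0
cmp

# An NA in the index vector yields an NA element in the subset
v[cmp]
```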
Combining our knowledge of logical operators with our new knowledge of subsetting, we can get the same result directly from x with x[!is.na(x) & x > 0].
x[!is.na(x) & x>0]
## [1] 0.678948
In this case, we request only values of x that are both non-missing AND greater than zero.
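As an aside that goes beyond the original walkthrough, base R's which() offers another way to express the same request. It returns the positions of the TRUE values and drops NAs, so the explicit !is.na() check is not needed. The vector v and the names a and b below are hypothetical.

```r
# v is a made-up vector with one missing value
v <- c(-1, NA, 2, 0.5)

# Approach from the text: non-missing AND greater than zero
a <- v[!is.na(v) & v > 0]

# Alternative: which() returns the integer positions of TRUE and ignores NA
b <- v[which(v > 0)]

# Both approaches select the same elements
identical(a, b)
```

Which form to prefer is a matter of taste; the logical version stays closer to the four index types covered in this tutorial.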
Earlier we saw how to subset just the first ten values of x using x[1:10]. In this case, we're providing a vector of positive integers inside of the square brackets, which tells R to return only the elements of x numbered 1 through 10.
Many programming languages use what's called "zero-based indexing", which means that the first element of a vector is considered element 0. R uses "one-based indexing", which (you guessed it!) means the first element of a vector is considered element 1.
Can you figure out how we'd subset the 3rd, 5th, and 7th elements of x? Hint: use the c() function to specify the element numbers as a numeric vector.
x[c(3,5,7)]
## [1] 0.678948 -0.496630 NA
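Two further properties of positive integer indexes are worth knowing: the result follows the order of the index vector, and an index may be repeated. The short sketch below illustrates both on a hypothetical vector v.

```r
# v is a made-up vector used only for illustration
v <- c(10, 20, 30, 40)

# Elements come back in the order requested, not in their original order
reordered <- v[c(3, 1)]
reordered       # 30 10

# Repeating an index repeats the element
repeated <- v[c(2, 2, 2)]
repeated        # 20 20 20
```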
When using integer vectors to subset x, it's important to stick with the set of indexes {1, 2, ..., 20}, since x only has 20 elements. What happens if we ask for the zeroth element of x (i.e. x[0])?
x[0]
## numeric(0)
As you might expect, we get nothing useful: numeric(0) is an empty numeric vector. Unfortunately, R doesn't prevent us from doing this. What if we ask for the 3000th element of x?
x[3000]
## [1] NA
Again, nothing useful, but R doesn't prevent us from asking for it. This should be a cautionary note. You should always make sure that what you are asking for is within the bounds of the vector you're working with.
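One way to follow this advice is to check requested positions against length() before subsetting. The sketch below is only one possible approach; the vector v and the names idx and valid are hypothetical.

```r
# v is a made-up vector; idx holds some requested positions
v <- c(10, 20, 30)
idx <- c(0, 2, 3000)

# Without a check, the 0 is silently dropped and 3000 yields NA
unchecked <- v[idx]
unchecked       # 20 NA

# Keep only positions between 1 and length(v)
valid <- idx >= 1 & idx <= length(v)
checked <- v[idx[valid]]
checked         # 20
```

Depending on the situation, it may be more appropriate to stop with an error when any index is out of bounds than to drop it quietly.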
What if we're interested in all elements of x EXCEPT the 2nd and 10th? It would be pretty tedious to construct a vector containing all numbers 1 through 20 EXCEPT 2 and 10.
R accepts negative integer indexes. Whereas x[c(2, 10)] gives us ONLY the 2nd and 10th elements of x, x[c(-2, -10)] gives us all elements of x EXCEPT for the 2nd and 10th elements.
x[c(-2,-10)]
## [1] -1.2731728 0.6789480 NA -0.4966300 NA NA
## [7] NA -0.4829580 -0.4601567 -0.4970408 -0.3120564 -0.2704296
## [13] NA NA NA NA -0.5934239 -0.6427310
A shorthand way of specifying multiple negative numbers is to put the negative sign out in front of the vector of positive numbers. Type x[-c(2, 10)] to get the exact same result.
x[-c(2,10)]
## [1] -1.2731728 0.6789480 NA -0.4966300 NA NA
## [7] NA -0.4829580 -0.4601567 -0.4970408 -0.3120564 -0.2704296
## [13] NA NA NA NA -0.5934239 -0.6427310
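The equivalence of the two forms can be confirmed with identical(). The sketch below also shows a restriction not mentioned above: R raises an error when positive and negative indexes are mixed in the same index vector. The vector v and the names same and mixed are hypothetical.

```r
# v is a made-up vector used only for illustration
v <- c(10, 20, 30, 40, 50)

# Both spellings drop the 2nd and 4th elements
same <- identical(v[c(-2, -4)], v[-c(2, 4)])
same            # TRUE
v[-c(2, 4)]     # 10 30 50

# Mixing positive and negative indexes is an error;
# tryCatch() turns it into a marker string so the script keeps running
mixed <- tryCatch(v[c(-1, 2)], error = function(e) "error")
mixed           # "error"
```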
So far, we've covered three types of index vectors: logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of "named" elements.
Create a numeric vector with three named elements using vect <- c(foo = 11, bar = 2, norf = NA). When we print vect to the console, you'll see that each element has a name.
vect <- c(foo = 11, bar = 2, norf = NA)
vect
## foo bar norf
## 11 2 NA
We can also get the names of vect by passing vect as an argument to the names() function.
names(vect)
## [1] "foo" "bar" "norf"
Alternatively, we can create an unnamed vector vect2 with c(11, 2, NA). Then, we can add the names attribute to vect2 after the fact with names(vect2) <- c("foo", "bar", "norf").
vect2 <- c(11,2,NA)
names(vect2) <- c("foo","bar","norf")
Now, let's check that vect and vect2 are the same by passing them as arguments to the identical() function.
identical(vect,vect2)
## [1] TRUE
You can see that vect and vect2 are identical named vectors.
Now, back to the matter of subsetting a vector by named elements. To get the second element of vect by name, use vect["bar"].
vect["bar"]
## bar
## 2
Likewise, we can specify a vector of names with vect[c("foo", "bar")].
vect[c("foo", "bar")]
## foo bar
## 11 2
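A few edge cases of name-based subsetting are sketched below on a hypothetical vector scores (same values as vect). Asking for a name that does not exist returns NA rather than an error, and because names cannot be negated with a minus sign, one way to exclude an element by name is a logical comparison on names().

```r
# scores is a made-up named vector with the same values as vect
scores <- c(foo = 11, bar = 2, norf = NA)

# The result follows the order of the names requested
swapped <- scores[c("bar", "foo")]
swapped

# A name that is not present yields NA instead of an error
missing_name <- scores["baz"]
missing_name

# Exclude an element by name using a logical index built from names()
without_bar <- scores[names(scores) != "bar"]
without_bar
```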
Now you know all four methods of subsetting data from vectors. Different approaches are best in different scenarios and when in doubt, try it out!