x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
## a b c d e
## 5.4 6.2 7.1 4.8 7.5
atomic vectors
x[1]
## a
## 5.4
x[4]
## d
## 4.8
x[c(1, 3)]
## a c
## 5.4 7.1
x[1:4]
## a b c d
## 5.4 6.2 7.1 4.8
The : operator creates a sequence of numbers from the left element to the right:
1:4
## [1] 1 2 3 4
c(1, 2, 3, 4)
## [1] 1 2 3 4
x[c(1,1,3)]
## a a c
## 5.4 5.4 7.1
x[6]
## <NA>
## NA
x[0]
## named numeric(0)
NOTE about vector numbering: R indexing starts at 1, whereas in many other programming languages (e.g. C and Python) indexing starts at 0.
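A minimal sketch of the indexing rules shown so far. The vector v and its names are hypothetical, chosen only for illustration:

```r
# Hypothetical example vector (not part of the lesson data)
v <- c(first = 10, second = 20, third = 30)

first_elem <- v[1]   # position 1 is the first element in R
empty      <- v[0]   # position 0 selects nothing: a zero-length vector
beyond     <- v[5]   # a position past the end gives NA, not an error

first_elem
length(empty)
is.na(beyond)
```

Note that asking for a position past the end does not stop the script, so a typo in an index can go unnoticed until the NA shows up later.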
x[-2]
## a c d e
## 5.4 7.1 4.8 7.5
x[c(-1,-5)] #or x[-c(1,5)]
## b c d
## 6.2 7.1 4.8
TIP for Order of Operations: a common trip-up for novices occurs when trying to skip slices of a vector. Most people first try to negate a sequence like so:
x[-1:3]
This returns a cryptic error:
Error in x[-1:3]: only 0's may be mixed with negative subscripts
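Remember the order of operations: the : operator is really a function. It takes -1 as its first argument and 3 as its second, so it generates the sequence of numbers c(-1, 0, 1, 2, 3).
Wrap the sequence in parentheses so that the - operator applies to the result:
x[-(1:3)]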
## d e
## 4.8 7.5
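The precedence issue can be checked directly, without subsetting anything. A small sketch, where v is a hypothetical copy of the lesson vector so that x itself is not modified:

```r
# Unary minus is applied before ':', so this is (-1):3
neg_seq <- -1:3

# Parentheses force the sequence to be built first, then negated
skip_idx <- -(1:3)

# Hypothetical copy of the lesson vector
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
remaining <- v[skip_idx]   # drops positions 1, 2 and 3

neg_seq
skip_idx
remaining
```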
x <- x[-4]
x
## a b c e
## 5.4 6.2 7.1 7.5
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-1
x[c("a","c")]
## a c
## 5.4 7.1
* usually a more reliable way to subset objects: positions change more often than names do
* unfortunately, we can't skip or remove elements as easily
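A short sketch of why names tend to be more robust than positions. The vector scores is hypothetical and used only for illustration:

```r
# Hypothetical named vector
scores <- c(a = 5.4, b = 6.2, c = 7.1)

# Reorder the vector: positions change, names travel with the values
shuffled <- scores[c("c", "a", "b")]

by_name     <- shuffled["a"]  # still the value originally named "a"
by_position <- shuffled[1]    # now a different element than scores[1]

by_name
by_position
```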
x[c(FALSE,FALSE,TRUE,FALSE,TRUE)] # x now has only 4 elements, so the fifth TRUE returns NA
## c <NA>
## 7.1 NA
Comparison operators (>, <, ==) return logical vectors, so we can use them to subset vectors:
x[x > 7]
## c e
## 7.1 7.5
x > 7 generates a logical vector, here c(FALSE, FALSE, TRUE, TRUE) because d was dropped from x above, and [ then selects the elements of x corresponding to the TRUE values.
We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):
x[names(x) == "a"]
## a
## 5.4
** TIP: Subsetting through other logical operators **
There are many situations in which you will wish to combine multiple conditions. To do so, several logical operators exist in R:
* | (logical OR): returns TRUE if either the left or the right is TRUE.
* & (logical AND): returns TRUE if both the left and the right are TRUE.
* ! (logical NOT): converts TRUE to FALSE and FALSE to TRUE.
* & and | compare the individual elements of two vectors, and recycling rules apply. && and || instead expect single TRUE/FALSE values (R 4.3.0 and later raise an error for longer vectors), so they belong in if() conditions rather than in subsetting.
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-1
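The element-wise operators from the tip above can be sketched on a small vector. Here v is a hypothetical copy of the original five-element lesson vector:

```r
# Hypothetical copy of the lesson vector
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

between <- v[v > 5 & v < 7]   # AND: both conditions must hold
extreme <- v[v < 5 | v > 7]   # OR: either condition may hold
not_big <- v[!(v > 7)]        # NOT: flips the logical vector

between
extreme
not_big
```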
** TIP: non-unique names **
* Be aware that it is possible for multiple elements in a vector to have the same name.
* e.g. in a data frame, columns can have the same name (although R tries to avoid this), but row names must be unique.
For example:
x <- 1:3
x
## [1] 1 2 3
names(x) <- c('a','a','a')
x
## a a a
## 1 2 3
x['a'] # only returns first value
## a
## 1
x[names(x) == 'a'] # returns all three values
## a a a
## 1 2 3
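One possible way to check for repeated names before relying on name-based subsetting is anyDuplicated() from base R. A sketch with a hypothetical vector v:

```r
# Hypothetical vector with a repeated name
v <- c(a = 1, a = 2, b = 3)

# anyDuplicated() returns 0 when all names are unique,
# otherwise the position of the first duplicate
has_dupes <- anyDuplicated(names(v)) > 0

first_only <- v["a"]               # only the first element named "a"
all_match  <- v[names(v) == "a"]   # every element named "a"

has_dupes
first_only
all_match
```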
x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
x[-"a"] # error: R cannot negate a character string
We can use the != (not-equals) operator to construct a logical vector that will do what we want:
x[names(x) != "a"]
## b c d e
## 6.2 7.1 4.8 7.5
Skipping several named elements is harder. Suppose we want to drop the "a" and "c" elements, so we try this:
x[names(x)!=c("a","c")]
## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length
## b c d e
## 6.2 7.1 4.8 7.5
R did something, but it gave us a warning that we ought to pay attention to - and it apparently gave us the wrong answer (the “c” element is still included in the vector)!
So what does != actually do in this case? That’s an excellent question.
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/
names(x) != c("a", "c")
## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length
## [1] FALSE TRUE TRUE TRUE TRUE
Why does R give TRUE as the third element of this vector, when names(x)[3] != "c" is obviously false?
When you use !=, R tries to compare each element of the left argument with the corresponding element of its right argument.
What happens when you compare vectors of different lengths? Show graphic Example from lesson
** Explain Graphics **
* In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a").
* Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE.
* Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message.
* If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do).
* This recycling rule can introduce hard-to-find and subtle bugs!
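The silent six-element case described above can be reproduced with a sketch like the following. The vector nm is a hypothetical stand-in for names(x):

```r
# Hypothetical set of six names: 6 is a multiple of 2,
# so R recycles c("a", "c") three times without any warning
nm <- c("a", "b", "c", "d", "e", "f")

# Effectively compares nm against c("a", "c", "a", "c", "a", "c")
res <- nm != c("a", "c")

res
nm[res]   # "c" is still present, even though we meant to drop it
```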
** how to handle recycling **
* The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator.
help("%in%") # or ?"%in%"
The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”.
Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:
x[! names(x) %in% c("a","c")]
## b d e
## 6.2 4.8 7.5
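A sketch of the same exclusion written out step by step, next to an alternative based on setdiff() from base R. The names v, drop, kept and alt are hypothetical:

```r
# Hypothetical copy of the lesson vector
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
drop <- c("a", "c")

# %in% tests each name against the whole 'drop' set, so no recycling surprises
kept <- v[!(names(v) %in% drop)]

# Alternative: compute the names to keep, then subset by name
# (assumes the names are unique)
alt <- v[setdiff(names(v), drop)]

kept
identical(kept, alt)
```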
R recycling rule example (same as the lesson graphic)
names(x) == c('a', 'c') # gives a warning
# == works slightly differently from %in%: it compares each element of its left argument to the corresponding element of its right argument.
# R recycles the shorter vector in an equality comparison
c("a", "b", "c", "d", "e") # names of x
   |    |    |    |    |   # The elements == is comparing
c("a", "c")
c("a", "b", "c", "d", "e") # names of x
   |    |    |    |    |   # The elements == is comparing
c("a", "c", "a", "c", "a") # the shorter vector after recycling
Handling special values
At some point you will encounter functions in R which cannot handle missing, infinite, or undefined data.
Special functions to deal with this:
* is.na will return all positions in a vector, matrix, or data.frame containing NA.
* is.nan and is.infinite will do the same for NaN and Inf.
* is.finite will return all positions in a vector, matrix, or data.frame that do not contain NA, NaN or Inf.
* na.omit will filter out all missing values from a vector.
Factor subsetting
Factor subsetting works the same way as vector subsetting.
f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]
## [1] a a
## Levels: a b c d
f[f %in% c("b", "c")]
## [1] b c c
## Levels: a b c d
f[1:3]
## [1] a a b
## Levels: a b c d
f[-3]
## [1] a a c c d
## Levels: a b c d
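As the outputs above suggest, subsetting a factor keeps all of its levels, even those that no longer occur. A small sketch, using droplevels() from base R to remove unused levels (f2 and sub are hypothetical names):

```r
# Same values as the lesson factor, under a hypothetical name
f2 <- factor(c("a", "a", "b", "c", "c", "d"))

sub <- f2[f2 != "d"]          # no "d" values remain...
levels(sub)                   # ...but "d" is still a level

trimmed <- droplevels(sub)    # drop levels that are no longer used
levels(trimmed)
```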
Matrix subsetting
Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to the columns:
set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]
## [,1] [,2]
## [1,] 1.12493092 -0.8356286
## [2,] -0.04493361 1.5952808
m[, c(3,4)] #column
## [,1] [,2]
## [1,] -0.62124058 0.82122120
## [2,] -2.21469989 0.59390132
## [3,] 1.12493092 0.91897737
## [4,] -0.04493361 0.78213630
## [5,] -0.01619026 0.07456498
## [6,] 0.94383621 -1.98935170
m[3,] #if grabbing 1 row, R will convert to vector
## [1] -0.8356286 0.5757814 1.1249309 0.9189774
To keep the output as a matrix, specify a third argument, drop = FALSE:
m[3, , drop=FALSE]
## [,1] [,2] [,3] [,4]
## [1,] -0.8356286 0.5757814 1.124931 0.9189774
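The effect of drop = FALSE is easiest to see by inspecting dimensions. A sketch with a hypothetical integer matrix mm, so the values do not depend on the random seed:

```r
# Hypothetical 6 x 4 matrix
mm <- matrix(1:24, nrow = 6, ncol = 4)

row_vec <- mm[3, ]                # simplified to a plain vector
row_mat <- mm[3, , drop = FALSE]  # stays a 1 x 4 matrix

dim(row_vec)   # NULL: vectors have no dim attribute
dim(row_mat)
```

Keeping the matrix shape matters when later code calls functions such as nrow() or ncol() on the result.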
m[, c(3,6)] # throws a "subscript out of bounds" error: m has only 4 columns
m[5] # a matrix is a vector underneath, so a single index also works
## [1] 0.3295078
This usually isn’t useful, and is often confusing to read.
It is worth noting, though, that matrices are stored in column-major format by default, i.e. the elements are arranged column-wise:
matrix(1:6, nrow=2, ncol=3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
To populate the matrix by row, use byrow=TRUE:
matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
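A short sketch connecting single-index subsetting to the column-major layout. The names by_col and by_row are hypothetical:

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill by row

# A single index walks down the columns first,
# so element 3 is row 1 of column 2
by_col[3] == by_col[1, 2]

# Flattening the default layout gives back the original sequence
identical(as.vector(by_col), 1:6)

# Filling by row is the transpose of filling a 3 x 2 matrix by column
identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2)))
```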
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-4
Three functions to subset lists are [, [[, and $.
Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.
xlist <- list(a = "Global Policy", b = 1:10, data = head(iris))
xlist[1]
## $a
## [1] "Global Policy"
This is a list with 1 element.
We can subset elements the same way as for atomic vectors, using [:
xlist[1:2]
## $a
## [1] "Global Policy"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
To extract an individual element of a list, use the double-square-bracket function, [[:
xlist[[1]]
## [1] "Global Policy"
notice the result is a vector, not a list
You can’t extract more than one element at once:
xlist[[1:2]] # error
Nor can you use it to skip elements:
xlist[[-1]] # error
But you can use names to both subset and extract elements:
xlist[['a']]
## [1] "Global Policy"
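The difference between [ and [[ can be sketched by checking the class of each result. The list lst is a hypothetical, shortened version of xlist:

```r
# Hypothetical two-element list
lst <- list(a = "Global Policy", b = 1:10)

sub_list <- lst["a"]     # [ keeps the list wrapper
element  <- lst[["a"]]   # [[ pulls out the contents

class(sub_list)
class(element)

# [ also accepts negative indices to skip elements; [[ does not
length(lst[-1])
```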
$ is a shorthand way of extracting elements by name:
xlist$data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
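One difference worth knowing, sketched under the assumption of default R options: on lists, $ allows partial matching of names, whereas [[ matches names exactly. The list lst2 and its contents are hypothetical:

```r
# Hypothetical list with an element called "data"
lst2 <- list(a = "Global Policy", data = data.frame(n = 1:3))

# Both forms extract the same element when the full name is given
identical(lst2$data, lst2[["data"]])

# $ accepts an unambiguous prefix of the name...
partial <- lst2$dat

# ...while [[ requires the exact name and returns NULL otherwise
exact <- lst2[["dat"]]

identical(partial, lst2$data)
is.null(exact)
```

Spelling names out in full avoids relying on partial matching, which can break when a new element with a similar name is added.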
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-5
Data frames are lists underneath, so similar rules apply. [ with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:
gapminder <- read.csv(file="data/gapminder-FiveYearData.csv")
head(gapminder[3])
## pop
## 1 8425333
## 2 9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372
Similarly, [[ will act to extract a single column:
head(gapminder[["lifeExp"]])
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
$ provides a convenient shorthand to extract columns by name:
head(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977
With two arguments, [ behaves the same way as for matrices:
gapminder[1:3,]
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
gapminder[3,] # a single row stays a data frame because its columns have mixed types
## country year pop continent lifeExp gdpPercap
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
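Since the gapminder file may not be at hand, the data frame rules above can be sketched on a small, hypothetical data frame df with made-up values:

```r
# Hypothetical data frame (values are invented for illustration)
df <- data.frame(country = c("A", "B", "C"),
                 year    = c(1952, 1957, 1962),
                 pop     = c(1, 2, 3))

class(df[3])                  # one argument: list-style, returns a data frame
class(df[["pop"]])            # [[ extracts the column as a vector
class(df$year)                # $ does the same by name
class(df[2, ])                # a single row stays a data frame
class(df[, 3])                # a single column is simplified to a vector
class(df[, 3, drop = FALSE])  # unless drop = FALSE is given
```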
http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-7