Subsetting Data

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Accessing elements using their indices

x[1]
##   a 
## 5.4
x[4]
##   d 
## 4.8
x[c(1, 3)]
##   a   c 
## 5.4 7.1
x[1:4]
##   a   b   c   d 
## 5.4 6.2 7.1 4.8
1:4
## [1] 1 2 3 4
c(1, 2, 3, 4)
## [1] 1 2 3 4
x[c(1,1,3)]
##   a   a   c 
## 5.4 5.4 7.1
x[6]
## <NA> 
##   NA
x[0]
## named numeric(0)

NOTE about vector numbering:
* R indexing starts at 1, whereas in many programming languages (such as C and Python) indexing starts at 0.

Skipping and removing elements

x[-2]
##   a   c   d   e 
## 5.4 7.1 4.8 7.5
x[c(-1,-5)] #or x[-c(1,5)]
##   b   c   d 
## 6.2 7.1 4.8

TIP for Order of Operations:
* A common trip-up for novices occurs when trying to skip slices of a vector.
* Most people first try to negate a sequence like so:

x[-1:3]

This returns a cryptic error:

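Error in x[-1:3]: only 0's may be mixed with negative subscripts

The unary minus is applied before the : operator, so -1:3 is the sequence -1, 0, 1, 2, 3 rather than the negation of 1:3. Wrap the sequence in parentheses so that the minus applies to the whole result:

x[-(1:3)]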
##   d   e 
## 4.8 7.5
x <- x[-4] # to remove an element for good, assign the result back to x
x
##   a   b   c   e 
## 5.4 6.2 7.1 7.5
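
To see why the parentheses matter, it can help to print the two index vectors on their own. This is a small sketch; the variable names mixed and negated are made up for illustration.

```r
mixed <- -1:3        # parsed as (-1):3, so it mixes negative, zero and positive indices
mixed
## [1] -1  0  1  2  3
negated <- -(1:3)    # the whole sequence is negated
negated
## [1] -1 -2 -3
```

Only the second vector is a valid "skip these elements" index.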

Challenge 1

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-1

Subsetting by name

x[c("a","c")]
##   a   c 
## 5.4 7.1
* usually a more reliable way to subset objects, since positions change more often than names
* unfortunately, elements can't be skipped or removed by name as easily
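
A minimal sketch of why names tend to be more robust than positions, using a made-up vector y: after sorting, position 1 points at a different element, while the name still finds the same one.

```r
y <- c(a = 5.4, b = 6.2, c = 7.1)          # hypothetical example vector
y_sorted <- sort(y, decreasing = TRUE)     # reordering keeps the names attached
y_sorted[1]                                # position 1 now holds a different element
##   c 
## 7.1
y_sorted["a"]                              # the name still finds the same element
##   a 
## 5.4
```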

Subsetting through other logical operations

x[c(FALSE,FALSE,TRUE,FALSE,TRUE)] # x now has only 4 elements (d was removed), so the 5th TRUE returns NA
##    c <NA> 
##  7.1   NA
x[x > 7]
##   c   e 
## 7.1 7.5
x[names(x) == "a"]
##   a 
## 5.4

**TIP: Subsetting through other logical operators** There are many situations in which you will wish to combine multiple conditions. Several logical operators exist in R for this: & (and), | (or), and ! (not), along with the functions any() and all().
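
As a quick illustration of combining conditions, here is a sketch on a vector v that holds the same values x started with (v is a name chosen for this example only).

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
v[v > 5 & v < 7]     # & : both conditions must hold
##   a   b 
## 5.4 6.2
v[v < 5 | v > 7]     # | : either condition may hold
##   c   d   e 
## 7.1 4.8 7.5
v[!(v > 5)]          # ! : negates a condition
##   d 
## 4.8
any(v > 7)           # is at least one element TRUE?
## [1] TRUE
all(v > 5)           # are all elements TRUE?
## [1] FALSE
```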

Challenge 3

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-1

**TIP: non-unique names**
* Be aware that it is possible for multiple elements in a vector to have the same name.
* The same is true of the columns of a data frame, although R tries to avoid this; row names must be unique.

For example:

x <- 1:3
x
## [1] 1 2 3
names(x) <- c('a','a','a')
x
## a a a 
## 1 2 3
x['a'] # only returns first value
## a 
## 1
x[names(x) == 'a'] # returns all three values
## a a a 
## 1 2 3
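
The data frame case can be sketched the same way. By default data.frame() repairs duplicated column names; passing check.names = FALSE keeps them. The objects df_checked and df_dupes are made up for this illustration.

```r
df_checked <- data.frame(a = 1, a = 2)                      # default: check.names = TRUE
names(df_checked)                                           # the duplicate is renamed
## [1] "a"   "a.1"
df_dupes <- data.frame(a = 1, a = 2, check.names = FALSE)   # duplicates are kept
names(df_dupes)
## [1] "a" "a"
df_dupes$a                                                  # like x['a'], only the first match
## [1] 1
```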

Skipping named elements

x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
x[-"a"] # error: the minus sign cannot be applied to a character string
x[names(x) != "a"]
##   b   c   d   e 
## 6.2 7.1 4.8 7.5
x[names(x)!=c("a","c")] # c is not removed: see Recycling below
## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length
##   b   c   d   e 
## 6.2 7.1 4.8 7.5

Challenge 2

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/

Recycling

names(x) != c("a", "c")
## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length
## [1] FALSE  TRUE  TRUE  TRUE  TRUE

**Explain Graphics**
* In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a").

**How to handle recycling**
* The way to get R to do what we really want (match each element of the left argument against all of the elements of the right argument) is to use the %in% operator.

x[! names(x) %in% c("a","c")]
##   b   d   e 
## 6.2 4.8 7.5
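
A small sketch of what %in% computes, on a vector v with the same names and values as x (v and nm are names chosen for this example). Note that ! is applied after %in%, so the parentheses are optional and are shown only for readability.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
nm <- names(v)
nm %in% c("a", "c")          # is each name found anywhere in the set?
## [1]  TRUE FALSE  TRUE FALSE FALSE
!(nm %in% c("a", "c"))       # flip it to keep everything else
## [1] FALSE  TRUE FALSE  TRUE  TRUE
v[!(nm %in% c("a", "c"))]
##   b   d   e 
## 6.2 4.8 7.5
```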

R recycling rule example, same as the lesson graphics

names(x) == c('a', 'c') # gives a warning
# == works slightly differently from %in%: it compares each element of its left argument to the corresponding element of its right argument.
# R recycles the shorter vector in an equality comparison.

c("a", "b", "c", "d", "e")  # names of x
   |    |    |    |    |    # The elements == is comparing
c("a", "c")

c("a", "b", "c", "d", "e")  # names of x
   |    |    |    |    |
c("a", "c", "a", "c", "a")  # The elements == is comparing
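
The recycling step can be made explicit with rep_len(), which builds the stretched vector by hand. This is only a sketch of the idea on a five-name vector; nm and recycled are illustrative names.

```r
nm <- c("a", "b", "c", "d", "e")
recycled <- rep_len(c("a", "c"), length(nm))   # what the shorter vector is stretched to
recycled
## [1] "a" "c" "a" "c" "a"
nm == recycled                                 # element-by-element comparison, no warning
## [1]  TRUE FALSE FALSE FALSE FALSE
```

The result matches nm == c("a", "c"), which additionally warns because 5 is not a multiple of 2.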

Handling special values

At some point you will encounter functions in R which cannot handle missing, infinite, or undefined data.

R provides special functions to deal with this: is.na, is.nan, is.infinite, is.finite, and na.omit.
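
A short sketch of the test functions in action, on a made-up vector vals that contains each kind of special value:

```r
vals <- c(1.5, NA, Inf, NaN, -Inf, 3)   # hypothetical data
is.na(vals)                             # TRUE for both NA and NaN
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
is.nan(vals)                            # TRUE for NaN only
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
is.infinite(vals)                       # TRUE for Inf and -Inf
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
is.finite(vals)                         # TRUE for ordinary numbers only
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
mean(vals[is.finite(vals)])             # subset first, then summarise
## [1] 2.25
```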

Factor subsetting

Factor subsetting works the same way as vector subsetting.

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]
## [1] a a
## Levels: a b c d
f[f %in% c("b", "c")]
## [1] b c c
## Levels: a b c d
f[1:3]
## [1] a a b
## Levels: a b c d
f[-3]
## [1] a a c c d
## Levels: a b c d
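
Note that the f[-3] result above still lists b as a level even though no b remains. If that is unwanted, base R's droplevels() can remove unused levels; a minimal sketch (kept is an illustrative name):

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))
kept <- f[-3]                  # the only "b" is removed, but its level stays
levels(kept)
## [1] "a" "b" "c" "d"
levels(droplevels(kept))       # drop levels with no remaining observations
## [1] "a" "c" "d"
```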

Matrix subsetting

Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:

set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]
##             [,1]       [,2]
## [1,]  1.12493092 -0.8356286
## [2,] -0.04493361  1.5952808
m[, c(3,4)] # leave the row argument blank to get all rows of columns 3 and 4
##             [,1]        [,2]
## [1,] -0.62124058  0.82122120
## [2,] -2.21469989  0.59390132
## [3,]  1.12493092  0.91897737
## [4,] -0.04493361  0.78213630
## [5,] -0.01619026  0.07456498
## [6,]  0.94383621 -1.98935170
m[3,] # when grabbing a single row or column, R converts the result to a vector
## [1] -0.8356286  0.5757814  1.1249309  0.9189774
m[3, , drop=FALSE] # to keep the result as a matrix, specify a third argument: drop = FALSE
##            [,1]      [,2]     [,3]      [,4]
## [1,] -0.8356286 0.5757814 1.124931 0.9189774
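m[, c(3,6)] # error (subscript out of bounds): m has only 4 columns
m[5] # not very useful: with a single index, the matrix is treated as one long vector, column by column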
## [1] 0.3295078
matrix(1:6, nrow=2, ncol=3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(1:6, nrow=2, ncol=3, byrow=TRUE) # to populate the matrix by row, use byrow=TRUE
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
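
To connect single-index subsetting with the column-by-column layout, here is a small sketch on a 2 x 3 matrix (m2 is a name chosen for this example):

```r
m2 <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m2[4]                                   # fourth element in column order
## [1] 4
m2[2, 2]                                # the same element by row and column
## [1] 4
dim(m2[1, , drop = FALSE])              # drop = FALSE keeps a 1 x 3 matrix
## [1] 1 3
is.null(dim(m2[1, ]))                   # without it, the result is a plain vector
## [1] TRUE
```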

Challenge 4

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-4

List subsetting

xlist <- list(a = "Global Policy", b = 1:10, data = head(iris))
xlist[1]
## $a
## [1] "Global Policy"
xlist[1:2]
## $a
## [1] "Global Policy"
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10
xlist[[1]]
## [1] "Global Policy"
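xlist[[1:2]] # error: [[ extracts a single element, not several at once
xlist[[-1]] # error: [[ cannot be used to skip elements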
xlist[['a']]
## [1] "Global Policy"
xlist$data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
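
The difference between [ and [[ is easiest to see by checking the class of the result. A minimal sketch on a trimmed-down list (lst is an illustrative name, not part of the lesson):

```r
lst <- list(a = "Global Policy", b = 1:10)
class(lst[2])        # [ returns a (shorter) list
## [1] "list"
class(lst[[2]])      # [[ returns the element itself
## [1] "integer"
lst[["b"]][3]        # chain subsetting to reach inside an element
## [1] 3
lst$b[3]             # the same with the $ shorthand
## [1] 3
```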

Challenge 5

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-5

Data frames

gapminder <- read.csv(file="data/gapminder-FiveYearData.csv")
head(gapminder[3])
##        pop
## 1  8425333
## 2  9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372

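Data frames are lists of columns underneath, so [ with a single argument returns a data frame containing the selected columns, as above. Similarly, [[ will act to extract a single column: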

head(gapminder[["lifeExp"]]) # [[ will act to extract a single column
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
head(gapminder$year) #$ provides shorthand to extract columns by name
## [1] 1952 1957 1962 1967 1972 1977
gapminder[1:3,]
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
gapminder[3,] # a single row is still a data frame, because its columns have mixed types
##       country year      pop continent lifeExp gdpPercap
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
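
The same rules can be checked without the gapminder file. The sketch below uses the built-in iris data set (already used in the list example) as a stand-in, and inspects the class of each kind of result.

```r
class(iris[3])                     # one-argument [ returns a data frame
## [1] "data.frame"
class(iris[["Species"]])           # [[ returns the column itself
## [1] "factor"
class(iris[3, ])                   # a single row is still a data frame
## [1] "data.frame"
class(iris[, 3])                   # a single column is simplified to a vector
## [1] "numeric"
class(iris[, 3, drop = FALSE])     # unless drop = FALSE is given
## [1] "data.frame"
```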

Challenge 7

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-7