Subsetting Data

To be able to subset vectors, factors, matrices, lists, and data frames
To be able to extract individual and multiple elements: by index, by name, and using comparison operations.
To be able to skip and remove elements from various data structures.

Subsetting Data

There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.
Let’s start with simple numeric vectors:

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x

##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Simple vectors that cannot be simplified further, containing character strings, numbers, or logical values, are called atomic vectors.

Accessing elements using their indices

To extract elements of a vector, we can give their corresponding index, starting from one:

x[1]

##   a 
## 5.4

x[4]

##   d 
## 4.8

It may not look like it, but the square brackets operator is a function.
For vectors, it means: “get me the nth element”.
We can ask for multiple elements at once:

x[c(1, 3)]

##   a   c 
## 5.4 7.1
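
Because [ is an ordinary function, it can also be called by name. The following is a minimal sketch of that idea; the backtick-quoted call form is standard R but is not used elsewhere in these notes, and the names first and pair are introduced only for this illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Calling the subsetting function explicitly; equivalent to x[1]
first <- `[`(x, 1)

# Equivalent to x[c(1, 3)]
pair <- `[`(x, c(1, 3))

identical(first, x[1])         # TRUE
identical(pair, x[c(1, 3)])    # TRUE
```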

Or slices of the vector:

x[1:4]

##   a   b   c   d 
## 5.4 6.2 7.1 4.8

The : operator creates a sequence of numbers from the left element to the right:

1:4

## [1] 1 2 3 4

c(1, 2, 3, 4)

## [1] 1 2 3 4

We can ask for the same element multiple times:

x[c(1,1,3)]

##   a   a   c 
## 5.4 5.4 7.1

If we ask for an index beyond the length of the vector, R will return a missing value:

x[6]

## <NA> 
##   NA

This is a vector of length one containing an NA, whose name is also NA.
If we ask for the 0th element, we get an empty vector:

x[0]

## named numeric(0)

NOTE about vector numbering: R indexing starts at 1, whereas many other programming languages (such as C and Python) start indexing at 0.

Skipping and removing elements

If we use a negative number as the index of a vector, R will return every element except for the one specified:

x[-2]

##   a   c   d   e 
## 5.4 7.1 4.8 7.5

We can skip multiple elements:

x[c(-1,-5)] #or x[-c(1,5)]

##   b   c   d 
## 6.2 7.1 4.8

TIP for order of operations: A common trip-up for novices occurs when trying to skip slices of a vector. Most people first try to negate a sequence like so:

x[-1:3]

This returns a cryptic error:

Error in x[-1:3]: only 0's may be mixed with negative subscripts

Remember the order of operations.
: is really a function.
It takes its first argument as -1 and its second as 3, so it generates the sequence of numbers c(-1, 0, 1, 2, 3).
The correct solution is to wrap that function call in parentheses, so that the - operator applies to the result:

x[-(1:3)]

##   d   e 
## 4.8 7.5
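
The order-of-operations point above can be checked by looking at the index vectors themselves. This is a small sketch on a fresh copy of x; the names wrong_idx and right_idx are introduced only for this illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

wrong_idx <- -1:3       # parsed as (-1):3, giving -1 0 1 2 3
right_idx <- -(1:3)     # negates the whole sequence, giving -1 -2 -3

wrong_idx
right_idx

# Negating the parenthesised sequence is the same as listing each negative index
identical(x[right_idx], x[c(-1, -2, -3)])   # TRUE
```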

To remove elements from a vector, we need to assign the result back into the variable:

x <- x[-4]
x

##   a   b   c   e 
## 5.4 6.2 7.1 7.5

Challenge 1

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-1

Subsetting by name

We can extract elements by using their names, instead of extracting by index:

x[c("a","c")]

##   a   c 
## 5.4 7.1

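* This is usually a more reliable way to subset objects: the position of an element can change as we chain subsetting operations together, but its name stays the same.
* Unfortunately, we can't skip or remove elements as easily this way.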
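
To illustrate why names tend to be more reliable than positions, the sketch below removes an element and then asks for "the third element" both ways. The vectors v and w are hypothetical names used only here, so that the x used in the rest of these notes is left untouched.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)
w <- v[-1]     # drop the first element; everything shifts one place left

v[3]           # c = 7.1
w[3]           # d = 4.8: same index, different element
w["c"]         # c = 7.1: the name still finds the same element
```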

Subsetting through other logical operations

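We can also use any logical vector to subset. Note that x now has only four elements (a, b, c, e), because we removed "d" above, so the logical vector needs four values. If the logical vector were longer than x, each extra TRUE would select a nonexistent element and return NA.

x[c(FALSE, FALSE, TRUE, TRUE)]

##   c   e 
## 7.1 7.5

Comparison operators (e.g. >, <, ==) evaluate to logical vectors, so we can use them to subset vectors.
This example statement gives the same result as the previous one:

x[x > 7]

##   c   e 
## 7.1 7.5

Breaking it down, the statement first evaluates x > 7, generating the logical vector c(FALSE, FALSE, TRUE, TRUE), and then selects the elements of x corresponding to the TRUE values.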
In the next example, we use == to mimic the previous method of indexing by name.
Remember that you have to use == rather than = for comparisons:

x[names(x) == "a"]

##   a 
## 5.4

TIP: Combining logical conditions. There are many situations in which you will wish to combine multiple conditions. To do so, several logical operators exist in R:

| logical OR: returns TRUE if either the left or right is TRUE.
& logical AND: returns TRUE if both the left and right are TRUE.
! logical NOT: converts TRUE to FALSE and FALSE to TRUE.
& and | compare the individual elements of two vectors. Recycling rules also apply here.
&& and || expect a single (length-one) value on each side and are meant for control flow such as if statements; R 4.3.0 and later raise an error if either side is longer.
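
As a brief sketch of how these operators combine conditions when subsetting, the example below uses a hypothetical vector v (a fresh five-element copy of the data, so the x used elsewhere in these notes is not changed):

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

both    <- v[v > 5 & v < 7]    # both conditions must hold: a, b
either  <- v[v < 5 | v > 7]    # either condition may hold: c, d, e
negated <- v[!(v > 7)]         # everything that is not above 7: a, b, d

both
either
negated
```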

Challenge 3

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-1

TIP: Non-unique names. Be aware that it is possible for multiple elements in a vector to have the same name. For a data frame, columns can have the same name (although R tries to avoid this), but row names must be unique.

For example:

x <- 1:3
x

## [1] 1 2 3

names(x) <- c('a','a','a')
x

## a a a 
## 1 2 3

x['a'] # only returns first value

## a 
## 1

x[names(x) == 'a'] # returns all three values

## a a a 
## 1 2 3

Skipping named elements

Skipping or removing named elements is a little harder.
If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:

x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
x[-"a"]

However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:

x[names(x) != "a"]

##   b   c   d   e 
## 6.2 7.1 4.8 7.5

Skipping multiple named indices is a little bit harder still
Suppose we want to drop the "a" and "c" elements, so we try this:

x[names(x)!=c("a","c")]

## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length

##   b   c   d   e 
## 6.2 7.1 4.8 7.5

R did something, but it gave us a warning that we ought to pay attention to - and it apparently gave us the wrong answer (the “c” element is still included in the vector)!
So what does != actually do in this case? That’s an excellent question.

Challenge 2

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/

Recycling

Let’s take a look at the comparison component of this code:

names(x) != c("a", "c")

## Warning in names(x) != c("a", "c"): longer object length is not a multiple
## of shorter object length

## [1] FALSE  TRUE  TRUE  TRUE  TRUE

Why does R give TRUE as the third element of this vector, when names(x)[3] != "c" is obviously false?
When you use !=, R tries to compare each element of the left argument with the corresponding element of its right argument.
What happens when you compare vectors of different lengths? (Show the recycling graphic from the lesson.)

Explaining the graphic: when one vector is shorter than the other, it gets recycled. In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a").

Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE.
Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message.
If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do).
This recycling rule can introduce subtle and hard-to-find bugs!
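
The "unlucky" six-element case can be sketched as follows. The vector y and its values are hypothetical, chosen only so that its length is a multiple of two:

```r
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# c("a", "c") is recycled three times to a, c, a, c, a, c, so no warning is issued
keep <- names(y) != c("a", "c")
keep

# "c" is still present, because it was compared against the recycled "a"
y[keep]
```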

How to handle recycling: The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator.

Getting help for operators:
- help("%in%") or ?"%in%"
The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”.
Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

x[! names(x) %in% c("a","c")]

##   b   d   e 
## 6.2 4.8 7.5

R recycling rule example (same as the lesson graphic):

names(x) == c('a', 'c') # gives a warning

== works slightly differently than %in%: it compares each element of its left argument to the corresponding element of its right argument.
When one vector is shorter than the other, R recycles the shorter one:

c("a", "b", "c", "d", "e")  # names of x
   |    |    |    |    |    # The elements == is comparing
c("a", "c")

c("a", "b", "c", "d", "e")  # names of x
   |    |    |    |    |    # The elements == is comparing
c("a", "c", "a", "c", "a")  # c("a", "c") recycled to length 5
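
The difference between position-by-position comparison and membership testing can be sketched side by side. The names nms, by_position, and by_membership are introduced only for this illustration, and suppressWarnings is used just to keep the recycling warning out of the output:

```r
nms <- c("a", "b", "c", "d", "e")

# Position-by-position comparison against the recycled a, c, a, c, a
by_position <- suppressWarnings(nms == c("a", "c"))
by_position      # TRUE FALSE FALSE FALSE FALSE

# Membership test: is each name found anywhere in the right-hand set?
by_membership <- nms %in% c("a", "c")
by_membership    # TRUE FALSE  TRUE FALSE FALSE
```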

Handling special values

At some point you will encounter functions in R which cannot handle missing, infinite, or undefined data.

There are a number of special functions you can use to filter out this data:

is.na flags (returns TRUE for) all positions in a vector, matrix, or data.frame containing NA; it also flags NaN.
Likewise, is.nan and is.infinite do the same for NaN and Inf.
is.finite flags all positions in a vector, matrix, or data.frame that do not contain NA, NaN or Inf.
na.omit filters out all missing values from a vector.
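
A short sketch of these functions on a small hypothetical vector (vals is not part of the lesson data):

```r
vals <- c(1, NA, 3, NaN, Inf)

is.na(vals)         # FALSE  TRUE FALSE  TRUE FALSE (NaN is also flagged)
is.nan(vals)        # FALSE FALSE FALSE  TRUE FALSE
is.infinite(vals)   # FALSE FALSE FALSE FALSE  TRUE
is.finite(vals)     #  TRUE FALSE  TRUE FALSE FALSE

# Two ways to drop the missing values; Inf is kept in both cases
kept_logical <- vals[!is.na(vals)]
kept_omit    <- as.vector(na.omit(vals))   # as.vector drops the bookkeeping attributes
kept_logical
```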

Factor subsetting

Factor subsetting works the same way as vector subsetting.

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]

## [1] a a
## Levels: a b c d

f[f %in% c("b", "c")]

## [1] b c c
## Levels: a b c d

f[1:3]

## [1] a a b
## Levels: a b c d

An important note is that skipping elements will not remove the level even if no more of that category exists in the factor:

f[-3]

## [1] a a c c d
## Levels: a b c d
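
If the unused level does need to go, base R offers droplevels() and the drop argument of [ for factors. Neither appears in the original notes, so treat this as an optional sketch; g is a name introduced only here:

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))
g <- f[-3]                     # the only "b" is gone...

levels(g)                      # ...but "b" is still listed as a level
levels(droplevels(g))          # "a" "c" "d"
levels(f[-3, drop = TRUE])     # same result, dropping while subsetting
```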

Matrix subsetting

Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:

set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]

##             [,1]       [,2]
## [1,]  1.12493092 -0.8356286
## [2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the rows or columns respectively:

m[, c(3,4)] # all rows, columns 3 and 4

##             [,1]        [,2]
## [1,] -0.62124058  0.82122120
## [2,] -2.21469989  0.59390132
## [3,]  1.12493092  0.91897737
## [4,] -0.04493361  0.78213630
## [5,] -0.01619026  0.07456498
## [6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the result to a vector:

m[3,] #if grabbing 1 row, R will convert to vector

## [1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a third argument; drop = FALSE:

m[3, , drop=FALSE] # keep as a matrix specify a third argument; drop = FALSE:

##            [,1]      [,2]     [,3]      [,4]
## [1,] -0.8356286 0.5757814 1.124931 0.9189774
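
One way to see what drop = FALSE changes is to inspect the dimensions of each result. A minimal sketch, rebuilding the same matrix m so that it runs on its own:

```r
set.seed(1)
m <- matrix(rnorm(6 * 4), ncol = 4, nrow = 6)

dim(m[3, ])                  # NULL: the row was simplified to a plain vector
dim(m[3, , drop = FALSE])    # 1 4: still a matrix with one row
dim(m[, 2, drop = FALSE])    # 6 1: the same works for a single column
```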

Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:

m[, c(3,6)] #will throw error if out of range

Because matrices are really vectors, you can also subset them using a single index:

m[5] #not very useful

## [1] 0.3295078

This usually isn’t useful, and is often confusing to read.
However, it is useful to note that matrices are laid out in column-major format by default; that is, the elements of the vector are arranged column-wise:

matrix(1:6, nrow=2, ncol=3)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

If you wish to populate the matrix by row, use byrow=TRUE:

matrix(1:6, nrow=2, ncol=3, byrow=TRUE) #populate by row use  byrow=TRUE

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
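
Column-major layout also explains what a single index means for a matrix: element (i, j) sits at position (j - 1) * nrow + i of the underlying vector. A small sketch, with cm, i, j, and linear as names introduced only for this illustration:

```r
cm <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column

i <- 1
j <- 3
linear <- (j - 1) * nrow(cm) + i        # position of row 1, column 3 in the vector

linear
cm[linear]     # single-index lookup
cm[i, j]       # the same element, addressed by row and column
```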

Matrices can also be subsetted using their rownames and column names instead of their row and column indices.
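
A brief sketch of name-based matrix subsetting follows. The matrix named_m and its row and column names are hypothetical and exist only for this example:

```r
named_m <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))

by_name <- named_m["r2", c("A", "C")]   # row "r2", columns "A" and "C"
by_name

named_m[, "B", drop = FALSE]            # one named column, kept as a matrix
```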

Challenge 4

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting#challenge-4

List subsetting

Now we’ll introduce some new subsetting operators. There are three functions used to subset lists: [, [[, and $.
Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

xlist <- list(a = "Global Policy", b = 1:10, data = head(iris))
xlist[1]

## $a
## [1] "Global Policy"

This returns a list with one element.
We can subset elements of a list the same way as atomic vectors, using [:

xlist[1:2]

## $a
## [1] "Global Policy"
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the double-square-bracket function, [[:

xlist[[1]]

## [1] "Global Policy"

Notice that the result is now a vector, not a list.
You can’t extract more than one element at once (this call raises an error):

xlist[[1:2]]

Nor can you use it to skip elements (this also raises an error):

xlist[[-1]]

But you can use names to both subset and extract elements:

xlist[['a']]

## [1] "Global Policy"

The $ function is a shorthand way of extracting elements by name:

xlist$data

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
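
The three list operators can be compared by checking what each one returns. This sketch rebuilds the same xlist so it runs on its own; nm is a hypothetical variable used to show one difference between [[ and $:

```r
xlist <- list(a = "Global Policy", b = 1:10, data = head(iris))

class(xlist["b"])                   # "list": [ keeps the list wrapper
class(xlist[["b"]])                 # "integer": [[ returns the element itself
identical(xlist$b, xlist[["b"]])    # TRUE

names(xlist[-1])                    # "b" "data": skipping works with [

# [[ accepts a name stored in a variable; $ takes the name literally
nm <- "data"
is.data.frame(xlist[[nm]])          # TRUE
is.null(xlist$nm)                   # TRUE: there is no element called "nm"
```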

Challenge 5

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-5

Data frames

Remember data frames are lists under the hood, so similar methods apply.
[ with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:

gapminder <- read.csv(file="data/gapminder-FiveYearData.csv")
head(gapminder[3])

##        pop
## 1  8425333
## 2  9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372

Similarly, [[ will act to extract a single column:

head(gapminder[["lifeExp"]]) # [[ will act to extract a single column

## [1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns by name:

head(gapminder$year) #$ provides shorthand to extract columns by name

## [1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for matrices:

gapminder[1:3,]

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because the elements are mixed types):

gapminder[3,] #is a data frame b/c of the mixed types

##       country year      pop continent lifeExp gdpPercap
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
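
Since the gapminder CSV may not be at hand, the following sketch uses the built-in iris data (already used for xlist above) to show how the result type depends on the operator, and how a logical row condition combines with column names. The names iris_head and big are introduced only for this illustration:

```r
iris_head <- head(iris)

class(iris_head[1])                    # "data.frame": list-style [ keeps the frame
class(iris_head[[1]])                  # "numeric": [[ extracts the column
class(iris_head[, 1])                  # "numeric": a single column is simplified
class(iris_head[, 1, drop = FALSE])    # "data.frame": drop = FALSE prevents that

# Rows selected by a logical condition, columns selected by name
big <- iris_head[iris_head$Sepal.Length > 5, c("Sepal.Length", "Species")]
big
```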

Challenge 7

http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting##challenge-7

Reid Otsuji adapted from Tim Dennis

01/12/2018