Intro to R

55 min (40min teaching, 15min exercises) ** should break 1/2 thru shoot to finish at 3:15-25**

Data Structures

Objectives

How to read data into R?
What are the basic data types?
How to represent categorical information in R?
R deals with tabular data very well.
Let’s create a small data set for cats saved as a csv file.
You can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

Load csv in R

Read the file in:

cats <- read.csv(file = "data/feline-data.csv")
cats

read.csv used for reading in tablular data stored in a text file, comma seperated values
tabs and Commas are the most common punctuation characters used to seperate data points in csv files
Tabs also common, use read.delim
If not tabs or commas, you can use the more general read.table
we can begin exploring the dataset using the $ operator
$ operator pulls out columns by specifying them

cats$weight

cats$coat

We can do operations on columns:

## say we discovered that the scale weighs two Kg light:
cats$weight + 2

paste("My cat is", cats$coat)

but what about

cats$weight + cats$coat

understanding what happened here is important and key to sucessfully analyzing data in R

Data Types

In the last example, if you guessed that 2.1 plus "black" is nonsense, you are right
important concept in programming language is data types
we can ask what type of data something is in R for example:

typeof(cats$weight)

There are 5 main data types:

double - double precision float number (used for math)
integer - used for IDs or Values
complex -
logical
character

typeof(3.14)

## [1] "double"

typeof(1L) # the L suffix forces the number to be an integer, since by default R uses float numbers, R stores as integer

## [1] "integer"

note the L suffix for indicating an number is an iteger

typeof(1+1i) #complex numbers with real and imaginary parts

## [1] "complex"

typeof(TRUE) # TRUE or FALSE

## [1] "logical"

typeof('banana')

## [1] "character"

no matter how complicated our analysis is all data in R is interepeted as of one of these basic types
as an example, a user added details of another cat edit data/feline-data and add new tabby data line tabby, 2.3 or 2.4,1 re-save as data/feline-data.csv

file.show("data/feline-data.csv")

load the new file & check what type of data we find in weight column

cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight)

oh no, our weights aren’t the double type anymore!
if we try to do the same math we did on them before we get unexpected results

cats$weight + 2

what happened?

R reads a csv into a table, it insists that everything in the table is the same basic type
if it can’t understand in the column as a double then nobody in the column gets to be a double
the table that R loaded our cats data into is called a data.frame and our first example of something called a data structure
a structure that R knows how to build out of basic data types.
we can see that it is a data.frame by calling the class function on it:

class(cats)

in order to use our data in R, we need to understand what the basic data sructures are & how they behave
for now let’s remove that extra line from out cats data and reload the data
open the file and delete the bottom line

before:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

after:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion

To better understand data structure behavior, let’s look at another of the data structures: the vector.

my_vector <- vector(length = 3)
my_vector

## [1] FALSE FALSE FALSE

a vector is an ordered list of things, with a special condition that

everything in the vector must be the same bacis data type

if you dont choose a datatype, it’ll default to logcial
you can declare an empty vector of whatever type you like

another_vector <- vector(mode='character', length = 3)
another_vector

## [1] "" "" ""

check the vector with str

str(another_vector)

##  chr [1:3] "" "" ""

the somewhat cryptic output from the str command indicates the basic data type found in this vector which is achr, character
also indicates the indexes of the vector [1:3]
and a few examples of what’s in the vector - empty strings here

for example using the cats dataset:

str(cats$weight)

cats$weight is a vector, too– the columns of data we load into R data.frames are all vectors
and that’s the root of why R forces everything in a coloumn to be the same basic type

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

you can make vectors with explicit contents with thecombine or c function

concat_vector <- c(2,6,3)
concat_vector

## [1] 2 6 3

given what we’ve learned so far, what would this produce?:

quiz_vector <- c(2,6,'3')

str(quiz_vector)

##  chr [1:3] "2" "6" "3"

what happened above was type conversion/coercion and is the source of surprises and the reason we need to be aware of basic data types and how R deals with them and how R interprets data types
When R encounters a mix of types (here in cats data, numeric and character) to be combined in to a vector, it will force them to be the same type

coercion <- c('a', TRUE)
str(coercion)

##  chr [1:2] "a" "TRUE"

another_coercion_vector <- c(0, TRUE)
another_coercion_vector

## [1] 0 1

the coercion rules go: logical -> integer -> numeric -> complex ->character
Where -> can be read as are transformed into
You can try to force coercion against this flow using the as. functions

character_vector_example <- c('0','2','4')
character_vector_example

## [1] "0" "2" "4"

character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric

## [1] 0 2 4

numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical

## [1] FALSE  TRUE  TRUE

as you can see surprising things can happen when R forces one basic data type into another
Nitty gritty of type coercion aside the point is:
if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame
make sure everything is the same type in your vectors and your columns of data.frames or you will get bad surprises
but coercion can be useful!
ex. in our cat’s data likes_string is number, but we know that the 1s and 0s actually represent TRUE and FALSE
we should use logical datatype here - whcih has 2 states - TRUE or FALSE
we can coerce this column by using as.logical function

cats$likes_string <- as.numeric(cats$likes_string)
str(cats$likes_string)

cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
str(cats$likes_string)

the combine/concatenate c() function, will also append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector

## [1] "a" "b"

example, Add a character

concat_example <- c(ab_vector, 'SWC')
concat_example

## [1] "a"   "b"   "SWC"

you can also make a series of numbers

mySeries <- 1:10
mySeries

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(10)

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1,10, by=0.1)

##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

we can ask a few questions of vectors

sequence_example <- seq(10)
head(sequence_example, n=2)

## [1] 1 2

tail(sequence_example, n=4)

## [1]  7  8  9 10

length(sequence_example)

## [1] 10

class(sequence_example)

## [1] "integer"

typeof(sequence_example)

## [1] "integer"

finally, you can give names to elements in your vector

names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example

## a b c d 
## 5 6 7 8

names(names_example)

## [1] "a" "b" "c" "d"

Challenge 1

Data Frames

we said that columns in data.frames were vectors

str(cats$weight)

str(cats$likes_string)

str(cats$coat)

Factors

another important data structure is a factor
factors appear to be character data, but are used to represent categorical information
Example: a vector of strings labelling cat colorations for all cats in our study:

coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats

str(coats)

we can turn a vector into a factor like so

CATegories <- factor(coats)

class(CATegories)

str(CATegories)

now R has noticed that there are three possible categories in our data
but prints out a bunch of numbers instead
R has replaced our human-readable categories with human readable numbered indices
this is necessary as many statistical calculations utilise such numerical representations for categorical data:

typeof(coats)

typeof(CATegories)

Challenge 2

catsstring <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)

str(catsstring$coat)

catsstring <- read.csv(file="data/feline-data.csv", colClasses=c(NA, NA, "character"))
str(catsstring$coat)

in modelling functions, it’s important to know what the baseline levels are
this is assumed to be the first factor, but by default factors are labelled in aphabetical
you can change this by specifying levels:

mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)

##  Factor w/ 2 levels "control","case": 2 1 1 2

in this case we explicitly told R that ‘control’ should be represented by 1 and case by 2
this designation can be very important for interpreting the results of stats model

Lists

another data structure you’ll want to learn is a list
a list is simpler in some ways than other types because you can put anything in it

list_example <- list(1, "a", TRUE, 1+4i)
list_example

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i

another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list

## $title
## [1] "Research Bazaar"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE

we can now understand something surprising in our cats data.frame what happens if we:

typeof(cats)

we see that data.frames look like lists under the hood
this is because a data.frame is really a list of vectors and factors
in order to hold those columns that are a mix of vectors and factors, the data.frame needs something a bit more flexible than a vector to put all those columns together into a table
data.frame is a special ist in which all vectors must have same length
in our cats exampel we have an integer, a double and logical variable

cats$coat

cats[,1]

typeof(cats[,1])

str(cats[,1])

each row is an observation of different variables, itself a data.frame and thus can be composed of element of different types

cats[1,]

typeof(cats[1,])

str(cats[1,])

Challenge 3

Matrices

Last but not least is the matrix
we can declare a matrix of zeros:

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0

And similar to other data structures, we can ask things about our matrix:

class(matrix_example)

## [1] "matrix"

typeof(matrix_example)

## [1] "double"

str(matrix_example)

##  num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

dim(matrix_example)

## [1] 3 6

nrow(matrix_example)

## [1] 3

Challege 4

matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)

## [1] 18

Challenge 5

x <- matrix(1:50, ncol=5, nrow=10)

x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row