Intro to R

Data Structures

How to read data into R?
What are the basic data types?
How to represent categorical information in R?
R deals with tabular data very well.
Let’s create a csv file of cats.
You can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

Read the file in:

cats <- read.csv(file = "data/feline-data.csv")

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'data/feline-
## data.csv'

cats

##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1

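read.csv is used for reading in tabular data stored in a text file where the columns are separated by commas
Tabs are also a common separator; for tab-separated files use read.delim
If the separator is neither tabs nor commas, you can use the more general read.table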
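
A minimal sketch of read.table and read.delim, assuming a small made-up table written to a temporary file (the file and its contents are hypothetical, not part of the lesson data):

```r
# Hypothetical data written to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("coat;weight", "calico;2.1", "black;5.0"), tmp)

# read.table is the general reader: state the header and separator explicitly
semi <- read.table(tmp, header = TRUE, sep = ";")

# read.delim is read.table with tab-separated defaults and header = TRUE
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tmp)
tabs <- read.delim(tmp)

semi$weight
tabs$weight

# Remove the temporary file
unlink(tmp)
```
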
we can begin exploring the data by pulling out columns with the $ operator

cats$weight

## [1] 2.1 5.0 3.2

cats$coat

## [1] calico black  tabby 
## Levels: black calico tabby

We can do operations on columns:

## say we discovered that the scale weighs two kg light:
cats$weight + 2

## [1] 4.1 7.0 5.2

paste("My cat is", cats$coat)

## [1] "My cat is calico" "My cat is black"  "My cat is tabby"

but what about

cats$weight + cats$coat

## Warning in Ops.factor(cats$weight, cats$coat): '+' not meaningful for
## factors

## [1] NA NA NA

understanding what happened here is important

Data Types

If you guessed that 2.1 plus “black” is nonsense, you are right
an important concept in programming languages is data types
we can ask what type of data something is:

typeof(cats$weight)

## [1] "double"

5 main types:

double
integer
complex
logical
character

typeof(3.14)

## [1] "double"

typeof(1L)

## [1] "integer"

typeof(1+1i)

## [1] "complex"

typeof(TRUE)

## [1] "logical"

typeof('banana')

## [1] "character"

note the L suffix, which indicates that a number is an integer
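
A short sketch of what the L suffix changes, using only base R; the variable names are illustrative:

```r
# A bare number is stored as a double; the L suffix asks for an integer
plain_type  <- typeof(1)     # "double"
suffix_type <- typeof(1L)    # "integer"

# is.integer() checks the storage type, not whether the value is whole
whole_but_double <- is.integer(1)    # FALSE
really_integer   <- is.integer(1L)   # TRUE

# Arithmetic that mixes the two types gives a double
mixed_type <- typeof(1L + 1)    # "double"
int_type   <- typeof(1L + 1L)   # "integer"
```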

no matter how complicated our analysis is, all data in R is interpreted as one of these basic types
a user has added details of another cat in a new file

file.show("data/feline-data_v2.csv")

load the new file & check what type of data we find in weight

cats <- read.csv(file="data/feline-data_v2.csv")

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'data/feline-
## data_v2.csv'

typeof(cats$weight)

## [1] "integer"

oh no, our weights aren’t the double type anymore!
if we try to do the same math we did on them before, we get unexpected results

cats$weight + 2

## Warning in Ops.factor(cats$weight, 2): '+' not meaningful for factors

## [1] NA NA NA NA

what happened?

when R reads a csv into a table, it insists that everything in a column is the same basic type
if it can’t understand everything in the column as a double, then nobody in the column gets to be a double
the table that R loaded our cats data into is called a data.frame, and it is our first example of something called a data structure:
a structure that R knows how to build out of basic data types
we can see that it is a data.frame by calling the class function on it:

class(cats)

## [1] "data.frame"
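
The column-wide typing described above can be reproduced without a file. This is a sketch using a made-up two-row table passed through the text argument of read.csv. stringsAsFactors is set explicitly because its default changed in R 4.0.0; with stringsAsFactors = TRUE (the older default) the column would be a factor, as in the output above, rather than character.

```r
# Hypothetical two-row table supplied inline via the text argument
csv_text <- "coat,weight\ncalico,2.1\ntabby,2.3 or 2.4"

# Set stringsAsFactors explicitly so the result does not depend on the R version
messy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(messy$weight)   # "character": one bad entry changes the whole column

# Forcing the column back to numeric turns the unreadable entry into NA
weights <- suppressWarnings(as.numeric(messy$weight))
weights               # 2.1 NA
```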

in order to use our data in R, we need to understand what the basic data structures are & how they behave
for now let’s remove that extra line from our cats data and reload the data
open the file and delete the bottom line

before:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

after:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

cats <- read.csv(file="data/feline-data.csv")

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on 'data/feline-
## data.csv'

Vectors and Type Coercion

my_vector <- vector(length = 3)
my_vector

## [1] FALSE FALSE FALSE

a vector is an ordered list of things, with the special condition that everything in the vector must be the same basic data type
if you don’t choose a data type, it’ll default to logical
you can declare an empty vector of whatever type you like

another_vector <- vector(mode='character', length = 3)
another_vector

## [1] "" "" ""
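
Besides vector(), base R has shorthand constructors named after each type. A brief sketch (the same_* names are illustrative):

```r
# Shorthand constructors for empty vectors of a given type and length
logical(3)     # FALSE FALSE FALSE
numeric(3)     # 0 0 0
character(3)   # "" "" ""

# These match what vector() returns for the same mode
same_chr <- identical(character(3), vector(mode = "character", length = 3))
same_num <- identical(numeric(3), vector(mode = "numeric", length = 3))
same_lgl <- identical(logical(3), vector(length = 3))
```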

check the vector with str

str(another_vector)

##  chr [1:3] "" "" ""

somewhat cryptic, but it indicates that this is a vector of type chr (character)
it also shows the index range of the vector, [1:3]
and a few examples of what’s in the vector, empty strings here

str(cats$weight)

##  num [1:3] 2.1 5 3.2

this is a vector, too: the columns of data we load into R data.frames are all vectors
and that’s the root of why R forces everything in a column to be the same type

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

you can make vectors with explicit contents with the c (concatenate) function

concat_vector <- c(2,6,3)
concat_vector

## [1] 2 6 3

given what we’ve learned so far, what would this produce:

quiz_vector <- c(2,6,'3')

str(quiz_vector)

##  chr [1:3] "2" "6" "3"

what happened above is called type coercion; it is the source of many surprises and the reason we need to be aware of the basic data types and how R deals with them
When R encounters a mix of types (here numeric and character) to be combined into a vector, it will force them all to be the same type

coercion <- c('a', TRUE)
str(coercion)

##  chr [1:2] "a" "TRUE"

another_coercion_vector <- c(0, TRUE)
another_coercion_vector

## [1] 0 1

the coercion rules go: logical -> integer -> numeric -> complex -> character
where -> can be read as "are transformed into"
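
The direction of coercion can be checked by mixing neighbouring types in pairs and asking typeof() which one wins. A small sketch with illustrative variable names; note that typeof() reports numeric values as "double":

```r
# Each pair mixes two neighbouring types; typeof() shows which one wins
lgl_int <- typeof(c(TRUE, 1L))     # "integer"
int_dbl <- typeof(c(1L, 2.5))      # "double"
dbl_cpl <- typeof(c(2.5, 1+1i))    # "complex"
cpl_chr <- typeof(c(1+1i, "a"))    # "character"
```
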
You can try to force coercion against this flow using the as. functions

character_vector_example <- c('0','2','4')
character_vector_example

## [1] "0" "2" "4"

character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric

## [1] 0 2 4

numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical

## [1] FALSE  TRUE  TRUE

as you can see, surprising things can happen when R forces one basic data type into another
the nitty-gritty of type coercion aside, the point is:
if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame
make sure everything is the same type in your vectors and in the columns of your data.frames, or you will get nasty surprises
but coercion can also be useful!
e.g. in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE
we should use the logical data type here, which has two states: TRUE or FALSE
we can coerce this column by using as.logical

cats$likes_string

## [1] 1 0 1

cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string

## [1]  TRUE FALSE  TRUE

the c (concatenate) function will also append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector

## [1] "a" "b"

concat_example <- c(ab_vector, 'SWC')
concat_example

## [1] "a"   "b"   "SWC"

you can also make a series of numbers

mySeries <- 1:10
mySeries

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(10)

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1,10, by=0.1)

##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0
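
A few related ways to build sequences in base R, sketched with illustrative variable names:

```r
# seq_len(n) counts from 1 to n and returns an empty vector when n is 0
first_three <- seq_len(3)            # 1 2 3
empty_len   <- length(seq_len(0))    # 0, whereas 1:0 would give c(1, 0)

# length.out fixes the number of points instead of the step size
four_points <- seq(1, 10, length.out = 4)
four_points                          # 1 4 7 10

# seq_along() gives the indices of an existing vector
idx <- seq_along(c("a", "b", "SWC"))
idx                                  # 1 2 3
```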

we can ask a few questions of vectors

sequence_example <- seq(10)
head(sequence_example, n=2)

## [1] 1 2

tail(sequence_example, n=4)

## [1]  7  8  9 10

length(sequence_example)

## [1] 10

class(sequence_example)

## [1] "integer"

typeof(sequence_example)

## [1] "integer"

finally, you can give names to elements in your vector

names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example

## a b c d 
## 5 6 7 8

names(names_example)

## [1] "a" "b" "c" "d"
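
Names can also be used to pick elements out of the vector. A short sketch that rebuilds names_example so it runs on its own (b_value and picked are illustrative names):

```r
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")

# Single brackets keep the name; double brackets return the bare value
names_example["b"]
b_value <- names_example[["b"]]   # 6

# Several names can be requested at once, in any order
picked <- names_example[c("d", "a")]
picked
```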

Challenge 1

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1#challenge-1

Data Frames

we said that columns in data.frames were vectors

str(cats$weight)

##  num [1:3] 2.1 5 3.2

str(cats$likes_string)

##  logi [1:3] TRUE FALSE TRUE

str(cats$coat)

##  Factor w/ 3 levels "black","calico",..: 2 1 3

Factors

another important data structure is a factor
factors appear to be character data, but are used to represent categorical information
EX: a vector of strings labelling cat colorations for all cats in our study:

coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats

## [1] "tabby"         "tortoiseshell" "tortoiseshell" "black"        
## [5] "tabby"

str(coats)

##  chr [1:5] "tabby" "tortoiseshell" "tortoiseshell" "black" ...

we can turn a vector into a factor like so

CATegories <- factor(coats)
class(CATegories)

## [1] "factor"

str(CATegories)

##  Factor w/ 3 levels "black","tabby",..: 2 3 3 1 2

now R has noticed that there are three possible categories in our data
but it also prints out a bunch of numbers
R has replaced our human-readable categories with numbered indices under the hood

typeof(coats)

## [1] "character"

typeof(CATegories)

## [1] "integer"
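
The integer codes and their labels can be inspected directly. The sketch below rebuilds the factor so it runs on its own, and adds a made-up num_factor to show a common pitfall when a factor holds number-like strings:

```r
coats <- c("tabby", "tortoiseshell", "tortoiseshell", "black", "tabby")
CATegories <- factor(coats)

levels(CATegories)                 # "black" "tabby" "tortoiseshell"
codes <- as.integer(CATegories)
codes                              # 2 3 3 1 2

# as.character() recovers the original labels
labels_back <- as.character(CATegories)

# Pitfall: a factor of number-like strings converts to its codes, not its values
num_factor <- factor(c("10", "20", "10"))
as_codes  <- as.integer(num_factor)                  # 1 2 1
as_values <- as.numeric(as.character(num_factor))    # 10 20 10
```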

Challenge 2

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1#challenge-2

in modelling functions, it’s important to know what the baseline level is
this is assumed to be the first level of the factor, but by default levels are ordered alphabetically
you can change this by specifying the levels:

mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)

##  Factor w/ 2 levels "control","case": 2 1 1 2

in this case we explicitly told R that ‘control’ should be represented by 1 and ‘case’ by 2
this designation can be very important for interpreting the results of statistical models
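
An alternative sketch for setting the baseline: relevel() from the stats package moves one level to the front without listing all of them. The variable names are illustrative.

```r
mydata <- c("case", "control", "control", "case")

# Default ordering is alphabetical, so "case" would be the first level
default_levels <- levels(factor(mydata))             # "case" "control"

# relevel() makes the chosen level the first (baseline) level
releveled <- stats::relevel(factor(mydata), ref = "control")
new_levels <- levels(releveled)                      # "control" "case"
new_codes  <- as.integer(releveled)                  # 2 1 1 2
```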

Lists

another data structure you’ll want to learn is the list
it is simpler in some ways than the other types, because you can put anything you want in it

list_example <- list(1, "a", TRUE, 1+4i)
list_example

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i

another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list

## $title
## [1] "Research Bazaar"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE
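
Elements of a list can be pulled out by position or by name. A brief sketch that rebuilds another_list so it runs on its own:

```r
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE)

# Single brackets return a smaller list
sub_list <- another_list["numbers"]
class(sub_list)                        # "list"

# Double brackets or $ return the element itself
numbers <- another_list[["numbers"]]
class(numbers)                         # "integer"

title <- another_list$title
title                                  # "Research Bazaar"
```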

we can now understand something surprising about our data.frame; what happens if we run:

typeof(cats)

## [1] "list"

we see that data.frames look like lists under the hood
a data.frame is really a list of vectors and factors
in order to hold columns that are a mix of vectors and factors, the data.frame needs something more flexible than a vector to put all those columns together into a table
a data.frame is a special list in which all the vectors must have the same length
in our cats example we have a factor (stored as integer), a double and a logical variable

cats$coat

## [1] calico black  tabby 
## Levels: black calico tabby

cats[,1]

## [1] calico black  tabby 
## Levels: black calico tabby

typeof(cats[,1])

## [1] "integer"

str(cats[,1])

##  Factor w/ 3 levels "black","calico",..: 2 1 3

each row is an observation of different variables; it is itself a data.frame and thus can be composed of elements of different types

cats[1,]

##     coat weight likes_string
## 1 calico    2.1         TRUE

typeof(cats[1,])

## [1] "list"

str(cats[1,])

## 'data.frame':    1 obs. of  3 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2
##  $ weight      : num 2.1
##  $ likes_string: logi TRUE
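
The view of a data.frame as a list of equal-length vectors can be sketched with a hand-built table. The toy data.frame below is hypothetical and only mirrors the cats data:

```r
# Hypothetical columns mirroring the cats data
toy <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE)
)

is.list(toy)        # TRUE
length(toy)         # 3: one list element per column

# [[ ]] extracts a column as a plain vector, like any list element
toy_weight <- toy[["weight"]]

# Columns whose lengths cannot be reconciled are rejected with an error
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
bad                 # "error"
```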

Challenge 3

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1#challenge-3

Matrices

Last but not least - matrix
we can declare a matrix of zeros:

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0

And similar to other data structures, we can ask things about our matrix:

class(matrix_example)

## [1] "matrix"

typeof(matrix_example)

## [1] "double"

str(matrix_example)

##  num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

dim(matrix_example)

## [1] 3 6

nrow(matrix_example)

## [1] 3
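
A matrix does not have to be filled with a single value. The sketch below, with illustrative names, shows that values fill column by column unless byrow = TRUE is given:

```r
# Values fill column by column by default
by_col <- matrix(1:6, nrow = 2)
by_col

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row

col_entry <- by_col[1, 2]   # 3
row_entry <- by_row[1, 2]   # 2

# The number of columns is worked out from the data and nrow
n_cols <- ncol(by_col)      # 3
```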

Challenge 4

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1#challenge-4

Challenge 5

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1#challenge-5