55 min (40 min teaching, 15 min exercises) ** should break halfway through; shoot to finish at 3:15-25 **

Data Structures

Objectives
  • How to read data into R?
  • What are the basic data types?
  • How to represent categorical information in R?

  • R deals with tabular data very well.
  • Let’s create a small data set for cats saved as a csv file.
  • You can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
Load csv in R
  • Read the file in:
cats <- read.csv(file = "data/feline-data.csv")
cats
  • read.csv is used for reading in tabular data stored in a text file as comma-separated values
  • commas and tabs are the most common characters used to separate data points in text files
  • for tab-separated files, use read.delim
  • if the separator is neither a comma nor a tab, use the more general read.table
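A minimal sketch of the three readers mentioned above. It writes temporary files so it does not touch data/feline-data.csv; the names cats_demo, csv_file, tab_file and semi_file, and the semicolon-separated variant, are made up for illustration.

```r
# Write the same small table with three different separators (temporary files).
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(1, 0, 1))
csv_file  <- tempfile(fileext = ".csv")
tab_file  <- tempfile(fileext = ".tsv")
semi_file <- tempfile(fileext = ".txt")
write.table(cats_demo, csv_file,  sep = ",",  row.names = FALSE, quote = FALSE)
write.table(cats_demo, tab_file,  sep = "\t", row.names = FALSE, quote = FALSE)
write.table(cats_demo, semi_file, sep = ";",  row.names = FALSE, quote = FALSE)

from_csv  <- read.csv(csv_file)                               # comma-separated
from_tab  <- read.delim(tab_file)                             # tab-separated
from_semi <- read.table(semi_file, sep = ";", header = TRUE)  # any other separator
```

All three calls should give back the same three-row table; only the separator differs.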

  • we can begin exploring the dataset using the $ operator
  • $ operator pulls out columns by specifying them

cats$weight
cats$coat
  • We can do operations on columns:
## say we discovered that the scale weighs two kg light:
cats$weight + 2
paste("My cat is", cats$coat)

but what about

cats$weight + cats$coat
  • understanding what happened here is key to successfully analyzing data in R
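A small sketch of what goes wrong, using a hand-built stand-in for the cats data (cats_demo is a made-up name, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version). The error is caught so the script keeps running.

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        stringsAsFactors = FALSE)

# Adding a numeric column to a character column is an error; catch it and report.
result <- tryCatch(cats_demo$weight + cats_demo$coat,
                   error = function(e) "error: cannot add numbers to text")
result
```

If the coat column is stored as a factor instead (the read.csv default before R 4.0.0), the same addition gives a warning and NA values rather than an error.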

Data Types

typeof(cats$weight)

There are 5 main data types:

typeof(3.14)
## [1] "double"
typeof(1L) # the L suffix forces the number to be stored as an integer; by default R stores numbers as doubles
## [1] "integer"

note the L suffix, which indicates that a number is an integer

typeof(1+1i) #complex numbers with real and imaginary parts
## [1] "complex"
typeof(TRUE) # TRUE or FALSE
## [1] "logical"
typeof('banana')
## [1] "character"
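
The five types above can also be checked in one pass. A minimal sketch (values and types are made-up names):

```r
# One example value per basic type, kept in a list so nothing gets coerced.
values <- list(3.14, 1L, 1+1i, TRUE, "banana")

# vapply() applies typeof() to each element and returns a character vector.
types <- vapply(values, typeof, character(1))
types
```
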
file.show("data/feline-data.csv")
cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight)
cats$weight + 2

what happened?

class(cats)

before:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

after:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file="data/feline-data.csv")
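
A sketch of the "before" situation, written to a temporary file so the real data file is left alone (bad_file and bad_cats are made-up names; stringsAsFactors = FALSE is set explicitly so the result is the same across R versions):

```r
bad_file <- tempfile(fileext = ".csv")
writeLines(c("coat,weight,likes_string",
             "calico,2.1,1",
             "black,5.0,0",
             "tabby,3.2,1",
             "tabby,2.3 or 2.4,1"), bad_file)

# One non-numeric entry is enough to turn the whole column into text.
bad_cats <- read.csv(bad_file, stringsAsFactors = FALSE)
typeof(bad_cats$weight)
## [1] "character"
is.numeric(bad_cats$weight)
## [1] FALSE
```
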

Vectors and Type Coercion

To better understand data structure behavior, let’s look at another of the data structures: the vector.

my_vector <- vector(length = 3)
my_vector
## [1] FALSE FALSE FALSE

everything in the vector must be the same basic data type

another_vector <- vector(mode='character', length = 3)
another_vector
## [1] "" "" ""
str(another_vector)
##  chr [1:3] "" "" ""

for example using the cats dataset:

str(cats$weight)
Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

  • you can make vectors with explicit contents with the combine function, c()
concat_vector <- c(2,6,3)
concat_vector
## [1] 2 6 3
  • given what we’ve learned so far, what would this produce?:
quiz_vector <- c(2,6,'3')
str(quiz_vector)
##  chr [1:3] "2" "6" "3"
  • what happened above is called type coercion; it is a common source of surprises, and the reason we need to be aware of the basic data types and how R interprets them
  • when R encounters a mix of types to be combined into a single vector (here, numeric and character), it forces them all to be the same type
coercion <- c('a', TRUE)
str(coercion)
##  chr [1:2] "a" "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
## [1] 0 1
  • the coercion rules go: logical -> integer -> numeric -> complex -> character
  • Where -> can be read as are transformed into
  • You can try to force coercion against this flow using the as. functions
character_vector_example <- c('0','2','4')
character_vector_example
## [1] "0" "2" "4"
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
## [1] 0 2 4
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
## [1] FALSE  TRUE  TRUE
  • as you can see surprising things can happen when R forces one basic data type into another
  • Nitty-gritty of type coercion aside, the point is:
  • if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame
  • make sure everything is the same type in your vectors and your columns of data.frames or you will get bad surprises
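
The coercion flow described above can be checked one step at a time. A minimal sketch (hierarchy and against_flow are made-up names):

```r
# Each pair mixes two neighbouring types; the result takes the type further along the flow.
hierarchy <- c(typeof(c(TRUE, 1L)),    # logical + integer   -> integer
               typeof(c(1L, 2.5)),     # integer + double    -> double (numeric)
               typeof(c(2.5, 1+0i)),   # double  + complex   -> complex
               typeof(c(1+0i, "a")))   # complex + character -> character
hierarchy

# Going against the flow can fail: text that is not a number becomes NA
# (R also issues a warning, suppressed here to keep the output short).
against_flow <- suppressWarnings(as.numeric(c("1", "two", "3")))
against_flow
## [1]  1 NA  3
```
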

  • but coercion can be useful!
  • e.g. in our cats data likes_string is a number, but we know that the 1s and 0s actually represent TRUE and FALSE
  • we should use the logical data type here, which has 2 states: TRUE or FALSE
  • we can coerce this column by using as.logical function

cats$likes_string <- as.numeric(cats$likes_string)
str(cats$likes_string)
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
str(cats$likes_string)
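
A standalone sketch of the same conversion on a plain vector, plus one reason the logical type is handy (likes_string and likes_logical here are stand-ins for the column, not the real data):

```r
likes_string <- c(1, 0, 1)                  # same values as the likes_string column
likes_logical <- as.logical(likes_string)   # 0 becomes FALSE, anything else TRUE
likes_logical
## [1]  TRUE FALSE  TRUE

# TRUE counts as 1 in arithmetic, so sum() counts the cats that like string.
sum(likes_logical)
## [1] 2
```
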
  • the combine/concatenate function c() will also append things to an existing vector:
ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
  • example, Add a character
concat_example <- c(ab_vector, 'SWC')
concat_example
## [1] "a"   "b"   "SWC"
  • you can also make a series of numbers
mySeries <- 1:10
mySeries
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10, by=0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0
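
A short sketch of two related ways to build a series, in case the step size is not what you know in advance (tenths and quarters are made-up names):

```r
tenths <- seq(1, 10, by = 0.1)          # fixed step size
length(tenths)
## [1] 91

quarters <- seq(0, 1, length.out = 5)   # fixed number of points instead
quarters
## [1] 0.00 0.25 0.50 0.75 1.00

seq_len(3)                              # 1 up to n, as integers
## [1] 1 2 3
```
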
  • we can ask a few questions of vectors
sequence_example <- seq(10)
head(sequence_example, n=2)
## [1] 1 2
tail(sequence_example, n=4)
## [1]  7  8  9 10
length(sequence_example)
## [1] 10
class(sequence_example)
## [1] "integer"
typeof(sequence_example)
## [1] "integer"
  • finally, you can give names to elements in your vector
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example
## a b c d 
## 5 6 7 8
names(names_example)
## [1] "a" "b" "c" "d"
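
Once elements have names, they can be pulled out by name instead of by position. A minimal sketch reusing names_example from above:

```r
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")

names_example["c"]                    # keeps the name
## c 
## 7
unname(names_example[c("a", "d")])    # several names at once, names dropped
## [1] 5 8
```
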

Challenge 1

Data Frames

str(cats$weight)
str(cats$likes_string)
str(cats$coat)

Factors

coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
str(coats)
CATegories <- factor(coats)
class(CATegories)
str(CATegories)
typeof(coats)
typeof(CATegories)
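
A sketch of what is inside the factor: a set of labels (the levels, sorted alphabetically by default) and one integer code per element pointing at a label.

```r
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
CATegories <- factor(coats)

levels(CATegories)        # the distinct labels
## [1] "black"         "tabby"         "tortoiseshell"
as.integer(CATegories)    # the underlying codes, which is why typeof() says "integer"
## [1] 2 3 3 1 2
```
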

Challenge 2

catsstring <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)

str(catsstring$coat)
catsstring <- read.csv(file="data/feline-data.csv", colClasses=c("character", NA, NA)) # coat is the first column
str(catsstring$coat)
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
##  Factor w/ 2 levels "control","case": 2 1 1 2
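
To see what the levels argument changes, compare the default ordering with the explicit one. A minimal sketch (default_order and explicit_order are made-up names):

```r
mydata <- c("case", "control", "control", "case")
default_order  <- factor(mydata)                                 # alphabetical levels
explicit_order <- factor(mydata, levels = c("control", "case"))  # control first

levels(default_order)
## [1] "case"    "control"
levels(explicit_order)
## [1] "control" "case"

# Same labels, different underlying codes.
as.integer(default_order)
## [1] 1 2 2 1
as.integer(explicit_order)
## [1] 2 1 1 2
```
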

Lists

list_example <- list(1, "a", TRUE, 1+4i)
list_example
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Research Bazaar"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE
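
A short sketch of getting things back out of a named list: $ and [[ ]] return the element itself, while single brackets return a smaller list.

```r
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE)

another_list$title                # the element itself
## [1] "Research Bazaar"
another_list[["numbers"]][3]      # the element, then index into it
## [1] 3
typeof(another_list["title"])     # single brackets: still a list
## [1] "list"
```
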
typeof(cats)
cats$coat
cats[,1]
typeof(cats[,1])
str(cats[,1])
cats[1,]
typeof(cats[1,])
str(cats[1,])
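
The commands above can be tried on a hand-built stand-in for the cats data (cats_demo is a made-up name; stringsAsFactors = FALSE is set so the result is the same across R versions). The point: a data frame is a list of columns, so a column is a vector but a row is still a list.

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)

typeof(cats_demo)          # a data frame is a list of columns
## [1] "list"
typeof(cats_demo[, 1])     # one column: a plain vector
## [1] "character"
typeof(cats_demo[1, ])     # one row: mixed types, so it stays a list
## [1] "list"
class(cats_demo[1, ])      # more precisely, a one-row data frame
## [1] "data.frame"
```
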

Challenge 3

Matrices

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
class(matrix_example)
## [1] "matrix" "array"
## (R versions before 4.0.0 print only "matrix")
typeof(matrix_example)
## [1] "double"
str(matrix_example)
##  num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
dim(matrix_example)
## [1] 3 6
nrow(matrix_example)
## [1] 3

Challenge 4

matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)
## [1] 18
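
Why 18: a matrix is a vector with dimensions added, so its length is the total number of elements. A minimal check:

```r
matrix_example <- matrix(0, ncol = 6, nrow = 3)

# length() counts every cell: rows times columns.
length(matrix_example) == nrow(matrix_example) * ncol(matrix_example)
## [1] TRUE
prod(dim(matrix_example))
## [1] 18
```
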

Challenge 5

x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
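
To see the difference between the two fills, compare the first row of each. A minimal sketch (by_column and by_row are made-up names):

```r
by_column <- matrix(1:50, ncol = 5, nrow = 10)                 # default: fill down the columns
by_row    <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE)   # fill across the rows

by_column[1, ]
## [1]  1 11 21 31 41
by_row[1, ]
## [1] 1 2 3 4 5
```
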