55 min (40 min teaching, 15 min exercises) ** should break halfway through; shoot to finish at 3:15-3:25 **
How to represent categorical information in R?
You can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file = "data/feline-data.csv")
cats
read.csv is used for reading in tabular data stored in a text file as comma-separated values
tabs and commas are the most common punctuation characters used to separate data points in csv files
for tab-separated files, use read.delim
if not tabs or commas, you can use the more general read.table
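A minimal sketch of how the three readers relate, assuming the cats table is supplied inline through the text argument instead of a file on disk. The names csv_text, tsv_text, from_csv, from_delim and from_table are made up for this illustration.

```r
# Illustrative only: the cats table as an inline string rather than a file
csv_text <- "coat,weight,likes_string\ncalico,2.1,1\nblack,5.0,0\ntabby,3.2,1"
# The same table with tabs instead of commas
tsv_text <- gsub(",", "\t", csv_text, fixed = TRUE)

from_csv   <- read.csv(text = csv_text)                              # comma-separated
from_delim <- read.delim(text = tsv_text)                            # tab-separated
from_table <- read.table(text = csv_text, header = TRUE, sep = ",")  # separator given explicitly

# All three calls should describe the same 3 x 3 table
dim(from_csv)
all.equal(from_csv, from_delim)
all.equal(from_csv, from_table)
```

read.csv and read.delim are convenience wrappers around read.table with the header and separator already chosen.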
$ operator: pulls out columns by specifying them by name
cats$weight
cats$coat
## say we discovered that the scale weighs two kg light:
cats$weight + 2
paste("My cat is", cats$coat)
but what about
cats$weight + cats$coat
2.1 plus "black" is nonsense, you are right
data types
typeof(cats$weight)
There are 5 main data types:
double - double-precision floating-point number (used for math)
integer - used for IDs or values
complex - numbers with real and imaginary parts
logical - TRUE or FALSE
character - text
typeof(3.14)
## [1] "double"
typeof(1L) # the L suffix forces the number to be an integer, since by default R stores numbers as doubles
## [1] "integer"
note the L suffix, which indicates that a number is an integer
typeof(1+1i) #complex numbers with real and imaginary parts
## [1] "complex"
typeof(TRUE) # TRUE or FALSE
## [1] "logical"
typeof('banana')
## [1] "character"
no matter how complicated our analysis is, all data in R is interpreted as one of these basic types
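A short sketch of how the integer and double types interact in arithmetic, using only base R. Nothing here comes from the cats data; the literals are chosen for illustration.

```r
typeof(1)         # "double": a plain number is stored as a double
typeof(1L)        # "integer": the L suffix asks for an integer
1 == 1L           # TRUE: the values compare equal
identical(1, 1L)  # FALSE: the underlying types differ
typeof(1L + 1L)   # "integer"
typeof(1L + 1)    # "double": mixing in a double promotes the result
typeof(5L / 2L)   # "double": division with / returns a double
```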
as an example, suppose a user added details of another cat:
edit data/feline-data.csv and add a new tabby data line:
tabby,2.3 or 2.4,1
re-save as data/feline-data.csv
file.show("data/feline-data.csv")
reload the data and check the type of the weight column
cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight)
cats$weight + 2
what happened?
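when R reads a csv file into a table, everything in a column must be the same basic type: if it can't read everything in the column as a double then nobody in the column gets to be a double
the table R loaded our cats data into is a data.frame, our first example of a data structure: a structure that R knows how to build out of basic data types.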
we can see that it is a data.frame by calling the class function on it:
class(cats)
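The effect of the extra line can also be reproduced without touching the file. This is a sketch under assumptions: the table is passed inline through the text argument, stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version, and bad_csv, bad_cats and weights_as_numbers are hypothetical names.

```r
# Illustrative only: the edited file contents as an inline string
bad_csv <- "coat,weight,likes_string\ncalico,2.1,1\nblack,5.0,0\ntabby,3.2,1\ntabby,2.3 or 2.4,1"
bad_cats <- read.csv(text = bad_csv, stringsAsFactors = FALSE)

class(bad_cats)                # "data.frame"
typeof(bad_cats$weight)        # "character": one unreadable entry changes the whole column
typeof(bad_cats$likes_string)  # "integer": the other columns are unaffected

# Converting back to numbers exposes the offending row as NA
weights_as_numbers <- suppressWarnings(as.numeric(bad_cats$weight))
weights_as_numbers
which(is.na(weights_as_numbers))  # 4
```

Looking for NA after as.numeric is one quick way to locate the entries that forced the column away from double.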
for now let’s remove that extra line from our cats data and reload the data
open the file and delete the bottom line
before:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1
after:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file="data/feline-data.csv")
To better understand data structure behavior, let’s look at another of the data structures: the vector.
my_vector <- vector(length = 3)
my_vector
## [1] FALSE FALSE FALSE
a vector is an ordered list of things, with the special condition that everything in the vector must be the same basic data type
if you don't choose a type, the default is logical
declare an empty vector of whatever type you like
another_vector <- vector(mode='character', length = 3)
another_vector
## [1] "" "" ""
str(another_vector)
## chr [1:3] "" "" ""
the str command indicates the basic data type found in this vector, which is chr, character
for example using the cats dataset:
str(cats$weight)
the columns of data.frames are all vectors
By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.
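A small sketch of the single-type rule in action, using vector() with a few different modes. numeric_vector is a hypothetical name for this illustration.

```r
vector(mode = "logical", length = 2)    # FALSE FALSE
vector(mode = "numeric", length = 2)    # 0 0
vector(mode = "character", length = 2)  # "" ""

numeric_vector <- vector(mode = "numeric", length = 3)
typeof(numeric_vector)   # "double"
numeric_vector[2] <- 6
typeof(numeric_vector)   # still "double"

# Assigning a character value converts the whole vector, not just one slot
numeric_vector[3] <- "3"
typeof(numeric_vector)   # "character"
numeric_vector           # "0" "6" "3"
```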
combine or c function: makes a vector with explicit contents
concat_vector <- c(2,6,3)
concat_vector
## [1] 2 6 3
quiz_vector <- c(2,6,'3')
str(quiz_vector)
## chr [1:3] "2" "6" "3"
coercion <- c('a', TRUE)
str(coercion)
## chr [1:2] "a" "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
## [1] 0 1
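coercion rules: logical -> integer -> numeric -> complex -> character, where -> can be read as "are transformed into"
you can try to force coercion against this flow using the as. functions
character_vector_example <- c('0','2','4')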
character_vector_example
## [1] "0" "2" "4"
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
## [1] 0 2 4
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
## [1] FALSE TRUE TRUE
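if your data doesn't look like what you thought it would, type coercion may well be to blame
make sure everything is the same type in your vectors and your columns of data.frames, or you will get bad surprises
in the cats data, likes_string is stored as 1 and 0 but really represents TRUE and FALSE
we can coerce this column by using the as.logical function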
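A sketch of the implicit coercion order in base R (logical, then integer, then double, then complex, then character), followed by what happens when an as. function cannot make sense of a value. The variable names are hypothetical.

```r
# Combining two types always yields the more general one
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(2.5, 1+0i))    # "complex"
typeof(c(1+0i, "a"))    # "character"

# Forcing coercion the other way gives NA when a value cannot be interpreted
# (as.numeric would also emit a warning here, silenced for the sketch)
forced_numbers  <- suppressWarnings(as.numeric(c("1", "two", "3")))
forced_logicals <- as.logical(c("TRUE", "T", "yes"))
forced_numbers    # 1 NA 3
forced_logicals   # TRUE TRUE NA
```

An unexpected NA after an as. call is usually a sign that some entry was not in the form you assumed.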
cats$likes_string <- as.numeric(cats$likes_string)
str(cats$likes_string)
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
str(cats$likes_string)
the c() function will also append things to an existing vector:
ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
concat_example <- c(ab_vector, 'SWC')
concat_example
## [1] "a" "b" "SWC"
mySeries <- 1:10
mySeries
## [1] 1 2 3 4 5 6 7 8 9 10
seq(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10, by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
sequence_example <- seq(10)
head(sequence_example, n=2)
## [1] 1 2
tail(sequence_example, n=4)
## [1] 7 8 9 10
length(sequence_example)
## [1] 10
class(sequence_example)
## [1] "integer"
typeof(sequence_example)
## [1] "integer"
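A brief sketch contrasting the two kinds of sequence used above: seq(10) is stored as integers, while a sequence with a fractional step is stored as doubles. int_seq and dbl_seq are hypothetical names.

```r
int_seq <- seq(10)
dbl_seq <- seq(1, 10, by = 0.1)

typeof(int_seq)   # "integer"
typeof(dbl_seq)   # "double"
class(dbl_seq)    # "numeric"
length(dbl_seq)   # 91
head(dbl_seq, n = 3)
tail(dbl_seq, n = 1)
```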
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example
## a b c d
## 5 6 7 8
names(names_example)
## [1] "a" "b" "c" "d"
Challenge 1
we said that columns in data.frames were vectors
str(cats$weight)
str(cats$likes_string)
str(cats$coat)
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
str(coats)
CATegories <- factor(coats)
class(CATegories)
str(CATegories)
R has replaced our human-readable categories with numbered indices
typeof(coats)
typeof(CATegories)
Challenge 2
catsstring <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)
str(catsstring$coat)
catsstring <- read.csv(file="data/feline-data.csv", colClasses=c("character", NA, NA)) # coat is the first column
str(catsstring$coat)
by default factors are labelled in alphabetical order; use the levels argument to change this
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
## Factor w/ 2 levels "control","case": 2 1 1 2
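A sketch of what a factor stores, reusing the coats values from earlier in these notes. coat_factor and reordered are hypothetical names, and the level order in reordered is an arbitrary choice for illustration.

```r
coats <- c("tabby", "tortoiseshell", "tortoiseshell", "black", "tabby")

coat_factor <- factor(coats)
levels(coat_factor)       # "black" "tabby" "tortoiseshell" (alphabetical by default)
as.integer(coat_factor)   # 2 3 3 1 2: the index of each value in levels()
typeof(coat_factor)       # "integer"

# Supplying levels explicitly changes the numbering, not the labels
reordered <- factor(coats, levels = c("tabby", "tortoiseshell", "black"))
as.integer(reordered)     # 1 2 2 3 1
as.character(reordered)   # the original strings are still recoverable
```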
list
list_example <- list(1, "a", TRUE, 1+4i)
list_example
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Research Bazaar"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
data.frame: what happens if we:
typeof(cats)
data.frames look like lists under the hood
a data.frame is a special list in which all vectors must have the same length
in our cats example we have an integer, a double and a logical variable
cats$coat
cats[,1]
typeof(cats[,1])
str(cats[,1])
each row is an observation of different variables, itself a data.frame, and thus can be composed of elements of different types
cats[1,]
typeof(cats[1,])
str(cats[1,])
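A sketch of the list-like behaviour of a data.frame, assuming a small stand-in table built with data.frame() rather than read from the csv file. cats_demo and mismatch are hypothetical names, and stringsAsFactors = FALSE is set so the result does not depend on the R version.

```r
# Illustrative stand-in for the cats table
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

typeof(cats_demo)          # "list"
length(cats_demo)          # 3: one list element per column
sapply(cats_demo, typeof)  # each column has its own single type
class(cats_demo[1, ])      # "data.frame": a row can mix types
typeof(cats_demo[, 2])     # "double": a column is a plain vector

# Columns of incompatible lengths are rejected
mismatch <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) e)
inherits(mismatch, "error")  # TRUE
```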
Challenge 3
matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
class(matrix_example)
## [1] "matrix"
typeof(matrix_example)
## [1] "double"
str(matrix_example)
## num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
dim(matrix_example)
## [1] 3 6
nrow(matrix_example)
## [1] 3
Challenge 4
matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)
## [1] 18
Challenge 5
x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
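A smaller sketch of the same idea, easier to check by eye than the 10 x 5 case. by_col and by_row are hypothetical names.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill down each column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill across each row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Either way the matrix is one vector with dimensions attached
length(by_col) == nrow(by_col) * ncol(by_col)  # TRUE
typeof(by_col)                                 # "integer"

# Filling by row matches transposing a column-filled matrix of the flipped shape
all(by_row == t(matrix(1:6, nrow = 3, ncol = 2)))  # TRUE
```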