55 min (40 min teaching, 15 min exercises) ** should break halfway through; shoot to finish at 3:15-3:25 **
How to represent categorical information in R?
You can create data/feline-data.csv
using a text editor (Nano), or within RStudio with the File -> New File -> Text File
menu item.
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file = "data/feline-data.csv")
cats
read.csv is used for reading in tabular data stored in a text file of comma-separated values (csv).
Tabs and commas are the most common punctuation characters used to separate data points in text files; for tab-separated files, use read.delim.
If the separator is neither a tab nor a comma, you can use the more general read.table.
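As a minimal sketch of the more general case, the example below writes a small semicolon-separated file to a temporary location and reads it back with read.table. The temporary file, the semicolon separator and the name cats_semicolon are assumptions made for illustration; they are not part of the lesson data.

```r
# Write a small semicolon-separated file to a temporary location
# (hypothetical file, used only for this illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c("coat;weight;likes_string",
             "calico;2.1;1",
             "black;5.0;0",
             "tabby;3.2;1"), tmp)

# Unlike read.csv, read.table must be told about the header row
# and the separator explicitly.
cats_semicolon <- read.table(file = tmp, header = TRUE, sep = ";")
cats_semicolon
```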
The $ operator pulls out columns by name:
cats$weight
cats$coat
## Say we discovered that the scale weighs two kg light:
cats$weight + 2
paste("My cat is", cats$coat)
But what about:
cats$weight + cats$coat
If you guessed that 2.1 plus "black" is nonsense, you are right. To understand what happened, we need to look at data types:
typeof(cats$weight)
There are 5 main data types:
- double: double-precision floating-point number (used for math)
- integer: used for IDs or values
- complex
- logical
- character
typeof(3.14)
## [1] "double"
typeof(1L) # the L suffix forces the number to be an integer, since by default R stores numbers as doubles
## [1] "integer"
Note the L suffix, which indicates that a number is an integer.
typeof(1+1i) #complex numbers with real and imaginary parts
## [1] "complex"
typeof(TRUE) # TRUE or FALSE
## [1] "logical"
typeof('banana')
## [1] "character"
No matter how complicated our analysis is, all data in R is interpreted as one of these basic types.
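A quick way to check all five at once is sketched below. It collects one value of each type in a list (a structure covered later in these notes) so that the values keep their own types. The names examples and basic_types are chosen here for illustration.

```r
# One example value of each basic type, kept in a list so that
# they are not coerced to a common type.
examples <- list(3.14, 1L, 1+1i, TRUE, 'banana')

# sapply applies typeof to each element and returns a character vector.
basic_types <- sapply(examples, typeof)
basic_types
```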
As an example, suppose a user has added details of another cat. Edit data/feline-data.csv and add a new line for a tabby:
tabby,2.3 or 2.4,1
Then re-save the file as data/feline-data.csv.
file.show("data/feline-data.csv")
Reload the data and check the type of the weight column:
cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight)
cats$weight + 2
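What happened?
When R reads a csv file, it insists that everything in a column be the same basic type; if it cannot interpret every entry in the column as a double, then nobody in the column gets to be a double.
The table that R loaded our cats data into is a data.frame, our first example of a data structure: a structure that R knows how to build out of basic data types.
We can see that it is a data.frame by calling the class function on it: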
class(cats)
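The effect of one unreadable entry on a whole column can be reproduced without editing the file. This is a sketch that passes the same rows to read.csv as text. The names csv_text and cats_demo are made up for this example, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# The same table as text, with one weight entry that cannot be
# read as a number.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

cats_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The whole weight column is stored as text, not as double.
typeof(cats_demo$weight)

# The result is still a data.frame.
class(cats_demo)
```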
For now, let’s remove that extra line from our cats data and reload the data.
Open the file and delete the bottom line.
before:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1
after:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file="data/feline-data.csv")
To better understand data structure behavior, let’s look at another of the data structures: the vector.
my_vector <- vector(length = 3)
my_vector
## [1] FALSE FALSE FALSE
A vector is an ordered list of things, with the special condition that everything in the vector must be the same basic data type.
If you do not choose the data type, it defaults to logical; or you can declare an empty vector of whatever type you like:
another_vector <- vector(mode='character', length = 3)
another_vector
## [1] "" "" ""
str(another_vector)
## chr [1:3] "" "" ""
The str command indicates the basic data type found in this vector, in this case chr (character), along with its indices ([1:3]) and its first few values.
For example, using the cats dataset:
str(cats$weight)
The columns of data we load into R data.frames are all vectors, which is why R forces everything in a column to be the same basic data type.
By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.
You can also make vectors with explicit contents using the combine or c function:
concat_vector <- c(2,6,3)
concat_vector
## [1] 2 6 3
quiz_vector <- c(2,6,'3')
str(quiz_vector)
## chr [1:3] "2" "6" "3"
coercion <- c('a', TRUE)
str(coercion)
## chr [1:2] "a" "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
## [1] 0 1
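The coercion rules go: logical -> integer -> numeric -> complex -> character, where -> can be read as "are transformed into".
You can try to force coercion against this flow using the as. functions:
character_vector_example <- c('0','2','4')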
character_vector_example
## [1] "0" "2" "4"
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
## [1] 0 2 4
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
## [1] FALSE TRUE TRUE
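Coercion against the flow does not always succeed. The sketch below uses a made-up vector, mixed_text, in which one entry is not a number. as.numeric returns NA for that entry and R emits a warning, which is silenced here with suppressWarnings only to keep the output short.

```r
# Text that does not look like a number cannot be coerced:
# as.numeric returns NA for it and R emits a warning.
mixed_text <- c('0', '2', 'four')
coerced <- suppressWarnings(as.numeric(mixed_text))
coerced

# is.na marks the entries that failed to convert.
is.na(coerced)
```

Checking for NA after a conversion is one way to find the entries that caused a column to be read as text.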
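If your data does not look like what you expected, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get bad surprises.
Coercion can also be useful. In our cats data, likes_string is stored as a number, but the 1s and 0s actually represent TRUE and FALSE.
We can coerce this column to logical by using the as.logical function: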
cats$likes_string <- as.numeric(cats$likes_string)
str(cats$likes_string)
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
str(cats$likes_string)
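Logical values coerce back to 1 and 0 in arithmetic, which gives a quick way to count them. This is a small sketch on a stand-in vector rather than the cats file; likes_demo is a hypothetical name.

```r
# Stand-in for the likes_string column after coercion to logical.
likes_demo <- as.logical(c(1, 0, 1))

# TRUE counts as 1 and FALSE as 0, so sum gives the number of TRUEs
# and mean gives the proportion of TRUEs.
sum(likes_demo)
mean(likes_demo)
```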
The combine function, c(), will also append things to an existing vector:
ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
concat_example <- c(ab_vector, 'SWC')
concat_example
## [1] "a" "b" "SWC"
mySeries <- 1:10
mySeries
## [1] 1 2 3 4 5 6 7 8 9 10
seq(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10, by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
sequence_example <- seq(10)
head(sequence_example, n=2)
## [1] 1 2
tail(sequence_example, n=4)
## [1] 7 8 9 10
length(sequence_example)
## [1] 10
class(sequence_example)
## [1] "integer"
typeof(sequence_example)
## [1] "integer"
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example
## a b c d
## 5 6 7 8
names(names_example)
## [1] "a" "b" "c" "d"
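Names can also be used to pick out elements. The lines below are a brief sketch of that; indexing by name is not covered in these notes, so treat it as a preview rather than part of the lesson.

```r
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")

# Select a single element by its name; the name is kept in the result.
names_example["b"]

# Select several elements at once with a character vector of names.
names_example[c("a", "d")]
```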
Challenge 1
We said that columns in data.frames were vectors:
str(cats$weight)
str(cats$likes_string)
str(cats$coat)
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
str(coats)
CATegories <- factor(coats)
class(CATegories)
str(CATegories)
R has replaced our human-readable categories with numbered indices under the hood:
typeof(coats)
typeof(CATegories)
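To see the two parts of a factor separately, the sketch below pulls out the labels with levels and the stored indices with as.integer, then puts them back together. It repeats the coats vector so that it runs on its own.

```r
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
CATegories <- factor(coats)

# The distinct categories, sorted alphabetically by default.
levels(CATegories)

# The integer index stored for each element, pointing into the levels.
as.integer(CATegories)

# Looking the indices up in the levels recovers the original strings.
levels(CATegories)[as.integer(CATegories)]
```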
Challenge 2
catsstring <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)
str(catsstring$coat)
catsstring <- read.csv(file="data/feline-data.csv", colClasses=c("character", NA, NA)) # coat is the first column
str(catsstring$coat)
By default, factor levels are labelled in alphabetical order. You can change this by specifying the levels:
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
## Factor w/ 2 levels "control","case": 2 1 1 2
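For comparison, this sketch shows the level order with and without the levels argument. The names default_levels and explicit_levels are introduced here for illustration.

```r
mydata <- c("case", "control", "control", "case")

# Without the levels argument, levels are sorted alphabetically,
# so "case" comes first.
default_levels <- levels(factor(mydata))
default_levels

# With explicit levels, "control" becomes the first level.
explicit_levels <- levels(factor(mydata, levels = c("control", "case")))
explicit_levels
```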
Another data structure is the list. You can put anything you want in a list:
list_example <- list(1, "a", TRUE, 1+4i)
list_example
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Research Bazaar"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
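Elements of a named list can be pulled out by name. A minimal sketch, reusing another_list from above; element_types is a name made up for this example.

```r
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE)

# The $ operator works on named lists just as it does on data.frames.
another_list$title

# Double square brackets with a name do the same thing.
another_list[["numbers"]]

# Each element keeps its own basic type.
element_types <- sapply(another_list, typeof)
element_types
```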
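We can now understand something surprising about our data.frame. What happens if we run:
typeof(cats)
We see that data.frames look like lists under the hood. A data.frame is a special list in which all the vectors must have the same length.
In our cats example we have an integer (the coat factor), a double and a logical variable. (From R 4.0, read.csv keeps coat as a character vector unless stringsAsFactors=TRUE is set.)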
cats$coat
cats[,1]
typeof(cats[,1])
str(cats[,1])
Each row is an observation of different variables, itself a data.frame, and thus can be composed of elements of different types:
cats[1,]
typeof(cats[1,])
str(cats[1,])
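The list-like nature of a data.frame can be checked directly. The sketch below builds a small stand-in table with data.frame() instead of reading the csv file, so it runs on its own. The name cats_demo and the explicit stringsAsFactors = FALSE are choices made for this example.

```r
# Stand-in for the cats table, built in code rather than read from file.
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)

# A data.frame is a list with one element per column...
is.list(cats_demo)
length(cats_demo)

# ...and every column has the same length (the number of rows).
sapply(cats_demo, length)

# A column is a plain vector; a row is still a list (a one-row data.frame).
typeof(cats_demo[, 1])
typeof(cats_demo[1, ])
```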
Challenge 3
matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
class(matrix_example)
## [1] "matrix" "array"
## (R versions before 4.0 print only "matrix")
typeof(matrix_example)
## [1] "double"
str(matrix_example)
## num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
dim(matrix_example)
## [1] 3 6
nrow(matrix_example)
## [1] 3
Challenge 4
matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)
## [1] 18
Challenge 5
x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
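The difference between the two fills is easiest to see in the first row. A short sketch, using the hypothetical names by_column and by_row so that both matrices can be compared side by side:

```r
# Default: values fill the matrix column by column.
by_column <- matrix(1:50, ncol = 5, nrow = 10)

# byrow = TRUE: values fill the matrix row by row.
by_row <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE)

# First row of each: 1 11 21 31 41 versus 1 2 3 4 5.
by_column[1, ]
by_row[1, ]
```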