Data Structures Lesson Notes

Teaching: 40 min

Exercises: 15 min

Questions

  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?

Objectives

  • To be aware of the different types of data.
  • To begin exploring data frames, and understand how it’s related to * vectors, factors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.

Begin Lesson here

  • One of R’s most powerful features is its ability to deal with tabular data
  • start by making a toy dataset in your data/ directory, called feline-data.csv:

create data/feline-data.csv using:

  • a text editor (Nano) or
  • within RStudio with the File -> New File -> Text File menu item.

add the following data to the file:

In [ ]:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

make sure the data file is saved in the home/data folder

In [1]:
cats  <- read.csv(file = "~/Desktop/gps_r/feline-data.csv")
In [2]:
cats
coatweightlikes_string
1calico2.1 1
2black5 0
3tabby3.2 1
  • he read.csv function is used for reading in tabular data stored in a text file where the columns of data are delimited by commas (csv = comma separated values).

  • Tabs are also commonly used to separated columns - if your data are in this format you can use the function read.delim.

  • There is also the general read.table function that is used if the columns in your data are delimited by a character other that commas or tabs.

With data loaded, we can now explore our data set, pull out columns and specify the musing the $ operator:

In [3]:
cats$weight
  1. 2.1
  2. 5
  3. 3.2
In [5]:
cats$coat
  1. calico
  2. black
  3. tabby

We can do other operations on the columns:

In [6]:
## say we discovered that the scale weights two Kg light:
cats$weight + 2
  1. 4.1
  2. 7
  3. 5.2
In [7]:
paste("my cat is", cats$coat)
  1. 'my cat is calico'
  2. 'my cat is black'
  3. 'my cat is tabby'

But what about if we type this:

In [8]:
cats$weight + cats$coat
Warning message in Ops.factor(cats$weight, cats$coat):
“‘+’ not meaningful for factors”
  1. NA
  2. NA
  3. NA
  • We get a warning message

understanding what happned here is key to successfully analyzing data in R

Data Types

The last command we ran returned an error because

  • 2.1 + "black" is nonsense
  • If you guess this is the reason, you are right
  • you have some intuition for an important concept in programming called
          data types
  • for exxample we can ask what type of data something is:
In [9]:
typeof(cats$weight)
'double'

There are 5 main data types:

  • double
  • integer
  • complex
  • logical
  • character

Here are a few more examples of checking the data type:

In [10]:
typeof(1L)
'integer'
In [12]:
typeof(1+1i)
'complex'
In [13]:
typeof(TRUE)
'logical'
In [14]:
typeof('banana')
'character'
  • Note the L suffix to insist that a number is an integer.
  • No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.

For the next example:

  • a user has added details fo another cat.
  • this information is in the file data/feline-data_v2.csv

create feline-data_v2.csv text file in RStudio

  • in file browser view feline-data.csv text file
  • in menu select text file
  • copy and paste data from feline-data.csv to new file
  • add the new data: tabby,2.3 or 2.4,1
  • save new file as feline-data_v2.csv

Load the new cats data like before, and check what type of data we find in the weight column:

In [15]:
cats  <- read.csv(file="~/Desktop/gps_r/feline-data_v2.csv")
typeof(cats$weight)
'integer'
  • Oh no, our weights aren’t the double type anymore!

  • If we try to do the same math we did on them before, we run into trouble:

In [16]:
cats$weight + 2
Warning message in Ops.factor(cats$weight, 2):
“‘+’ not meaningful for factors”
  1. NA
  2. NA
  3. NA
  4. NA

what happened?

  • R reads a csv into a table, it insists that everything in a column be the same basic type

  • if it can’t understand everything in the column as a double, then nobody in the column gets to be a double

  • the table that R loaded our cats data into is called a data.frame and our first example of something called a data structure

  • a data structure is a structure that R knows how to build out of basic data types.

  • we can see that it is a data.frame by calling the class function on it:

In [17]:
class(cats)
'data.frame'

Now, in order to successfully use our data in R, we need to understand what the basic data structures are, and how they behave.

For now, let’s remove that extra line from our cats data and reload it, while we investigate this behavior further:

in RStudio reload the feline-data.csv

In [18]:
cats  <- read.csv(file="~/Desktop/gps_r/feline-data.csv")

Vectors and Type Coercion

To better understand this behavior, let’s meet another of the data structures: the vector.

In [19]:
my_vector <- vector(length = 3)
my_vector
  1. FALSE
  2. FALSE
  3. FALSE

A vector in R is essentially an ordered list of things, with the special condition:

  • that everything in the vector must be the same basic data type.

  • If you don’t choose the datatype, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

In [20]:
another_vector <- vector(mode='character', length=3)
another_vector
  1. ''
  2. ''
  3. ''

You can check if something is a vector:

In [22]:
str(another_vector)
 chr [1:3] "" "" ""

The cryptic output from this command indicates the basic data type found in this vector -

  • in this case chr, character;
  • an indication of the number of things in the vector - actually, the indexes of the vector, in this case [1:3];
  • and a few examples of what’s actually in the vector - in this case empty character strings.

If we similarly do:

In [23]:
str(cats$weight)
 num [1:3] 2.1 5 3.2

we see that that’s a vector, too - the columns of data we load into R data.frames are all vectors,

  • and that’s the root of why R forces everything in a column to be the same basic data type.

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data;

  • if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time.

  • This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R

You can also make vectors with explicit contents with the combine function:

In [24]:
combine_vector <- c(2,6,3)
combine_vector
  1. 2
  2. 6
  3. 3

Thinking about what we have covered so far, what do you thing the following will produce?

In [25]:
quiz_vector <- c(2,6,'3')
In [26]:
str(quiz_vector)
 chr [1:3] "2" "6" "3"
  • This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them.

  • When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type.

Now Consider these examples:

In [28]:
coercion_vector <- c('a', TRUE)
coercion_vector
  1. 'a'
  2. 'TRUE'
In [29]:
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
  1. 0
  2. 1
  • The coercion rules go: ** logical -> integer -> numeric -> complex -> character,

  • where -> can be read as and transformed into.

  • You can try to force coercion against this flow using the as. functions:
In [30]:
character_vector_example <- c('0','2','4')
character_vector_example
  1. '0'
  2. '2'
  3. '4'
In [31]:
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
  1. 0
  2. 2
  3. 4
In [32]:
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
  1. FALSE
  2. TRUE
  3. TRUE
  • As you can see, some surprising things can happen when R forces one basic data type into another!

  • Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame;

  • make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!

  • But coercion can also be very useful!

  • For example, in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them).

  • We should use the logical datatype here, which has two states: TRUE or FALSE, which is exactly what our data represents.

  • We can ‘coerce’ this column to be logical by using the as.logical function:

In [33]:
cats$likes_string
  1. 1
  2. 0
  3. 1
In [34]:
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
  1. TRUE
  2. FALSE
  3. TRUE

Combine c() or concatenate will also append things to an existing vector:

In [35]:
ab_vector <- c('a', 'b')
ab_vector
  1. 'a'
  2. 'b'
In [36]:
combine_example <- c(ab_vector, 'SWC')
combine_example
  1. 'a'
  2. 'b'
  3. 'SWC'

You can also make series of numbers:

In [37]:
mySeries <- 1:10
mySeries
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
In [38]:
seq(10)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
In [40]:
seq(1,10, by=0.1)
  1. 1
  2. 1.1
  3. 1.2
  4. 1.3
  5. 1.4
  6. 1.5
  7. 1.6
  8. 1.7
  9. 1.8
  10. 1.9
  11. 2
  12. 2.1
  13. 2.2
  14. 2.3
  15. 2.4
  16. 2.5
  17. 2.6
  18. 2.7
  19. 2.8
  20. 2.9
  21. 3
  22. 3.1
  23. 3.2
  24. 3.3
  25. 3.4
  26. 3.5
  27. 3.6
  28. 3.7
  29. 3.8
  30. 3.9
  31. 4
  32. 4.1
  33. 4.2
  34. 4.3
  35. 4.4
  36. 4.5
  37. 4.6
  38. 4.7
  39. 4.8
  40. 4.9
  41. 5
  42. 5.1
  43. 5.2
  44. 5.3
  45. 5.4
  46. 5.5
  47. 5.6
  48. 5.7
  49. 5.8
  50. 5.9
  51. 6
  52. 6.1
  53. 6.2
  54. 6.3
  55. 6.4
  56. 6.5
  57. 6.6
  58. 6.7
  59. 6.8
  60. 6.9
  61. 7
  62. 7.1
  63. 7.2
  64. 7.3
  65. 7.4
  66. 7.5
  67. 7.6
  68. 7.7
  69. 7.8
  70. 7.9
  71. 8
  72. 8.1
  73. 8.2
  74. 8.3
  75. 8.4
  76. 8.5
  77. 8.6
  78. 8.7
  79. 8.8
  80. 8.9
  81. 9
  82. 9.1
  83. 9.2
  84. 9.3
  85. 9.4
  86. 9.5
  87. 9.6
  88. 9.7
  89. 9.8
  90. 9.9
  91. 10

We can ask a few questions about vectors:

In [41]:
sequence_example <- seq(10)
head(sequence_example, n=2)
  1. 1
  2. 2
In [42]:
tail(sequence_example, n=4)
  1. 7
  2. 8
  3. 9
  4. 10
In [43]:
length(sequence_example)
10
In [44]:
class(sequence_example)
'integer'
In [45]:
typeof(sequence_example)
'integer'

Finally, you can give names to elements in your vector:

In [46]:
names_example <- 5:8
names(names_example) <- c("a", "b", "c", "d")
names_example
a
5
b
6
c
7
d
8
In [47]:
names(names_example)
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'

Challenge or Discussion

Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z

(hint: there is a built in vector called LETTERS)

In [48]:
x <- 1:26
x <- x * 2
names(x) <- LETTERS

Data Frames

Now we are going to briefly cover data frames. Previously, We said that columns in data.frames were vectors:

In [50]:
str(cats$weight)
 num [1:3] 2.1 5 3.2
In [51]:
str(cats$likes_string)
 logi [1:3] TRUE FALSE TRUE

These make sense. But what about:

In [52]:
str(cats$coat)
 Factor w/ 3 levels "black","calico",..: 2 1 3

Factors

  • Another important data structure is called a factor.

  • Factors usually look like character data, but are typically used to represent categorical information.

  • For example, let’s make a vector of strings labelling cat colorations for all the cats in our study:

In [53]:
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
  1. 'tabby'
  2. 'tortoiseshell'
  3. 'tortoiseshell'
  4. 'black'
  5. 'tabby'
In [54]:
str(coats)
 chr [1:5] "tabby" "tortoiseshell" "tortoiseshell" "black" ...

We can turn a vector into a factor like so:

In [55]:
CATegories <- factor(coats)
class(CATegories)
'factor'
In [56]:
str(CATegories)
 Factor w/ 3 levels "black","tabby",..: 2 3 3 1 2
  • Now R has noticed that there are three possible categories in our data - but it also did something surprising;

  • instead of printing out the strings we gave it, we got a bunch of numbers instead.

  • R has replaced our human-readable categories with numbered indices under the hood:

In [57]:
typeof(coats)
'character'
In [58]:
typeof(CATegories)
'integer'

Challenge 2

Is there a factor in our cats data.frame? what is its name? Try using ?read.csv to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the factor in cats is actually a character vector when loaded in this way.

One solution is use the argument stringAsFactors:

In [61]:
cats <- read.csv(file="~/Desktop/gps_r/feline-data.csv", stringsAsFactors=FALSE)
str(cats$coat)
 chr [1:3] "calico" "black" "tabby"

Another solution is use the argument colClasses that allow finer control.

In [81]:
cats <- read.csv(file="~/Desktop/gps_r/feline-data.csv", colClasses=c(NA, NA, "character"))
str(cats$coat)
 Factor w/ 3 levels "black","calico",..: 2 1 3

Note: new students find the help files difficult to understand; make sure to let them know that this is typical, and encourage them to take their best guess based on semantic meaning, even if they aren’t sure.

  • In modelling functions, it’s important to know what the baseline levels are.

  • This is assumed to be the first factor, but by default factors are labelled in alphabetical order.

  • You can change this by specifying the levels:

In [63]:
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
 Factor w/ 2 levels "control","case": 2 1 1 2
  • In this case, we’ve explicitly told R that “control” should represented by 1, and “case” by 2.

  • This designation can be very important for interpreting the results of statistical models!

Lists

  • Another data structure you’ll want in your bag of tricks is the list.

  • A list is simpler in some ways than the other types, because you can put anything you want in it:

In [64]:
list_example <- list(1, "a", TRUE, 1+4i)
list_example
  1. 1
  2. 'a'
  3. TRUE
  4. 1+4i
In [65]:
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list
$title
'Research Bazaar'
$numbers
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
$data
TRUE
  • We can now understand something a bit surprising in our data.frame

  • what happens if we run:

In [66]:
typeof(cats)
'list'
  • We see that data.frames look like lists ‘under the hood’

    • this is because a data.frame is really a list of vectors and factors, as they have to be
    • in order to hold those columns that are a mix of vectors and factors, the data.frame needs something a bit more flexible than a vector to put all the columns together into a familiar table.
  • In other words, a data.frame is a special list in which all the vectors must have the same length.

  • In our cats example, we have an integer, a double and a logical variable.

  • As we have seen already, each column of data.frame is a vector.

In [67]:
cats$coat
  1. calico
  2. black
  3. tabby
In [68]:
cats[,1]
  1. calico
  2. black
  3. tabby
In [79]:
typeof(cats[,1])
'integer'
In [78]:
str(cats[,1])
 Factor w/ 3 levels "black","calico",..: 2 1 3

Each row is an observation of different variables, itself a data.frame, and thus can be composed of element of different types.

  • run this code if likes_string is wrong
In [84]:
cats <- read.csv(file="~/Desktop/gps_r/feline-data.csv")
In [87]:
cats$likes_string <- as.logical(cats$likes_string)
In [91]:
cats$likes_string
  1. TRUE
  2. FALSE
  3. TRUE

continue lesson

In [92]:
cats[1,]
coatweightlikes_string
1calico2.1 1
In [93]:
str(cats[1,])
'data.frame':	1 obs. of  3 variables:
 $ coat        : Factor w/ 3 levels "black","calico",..: 2
 $ weight      : num 2.1
 $ likes_string: logi TRUE

Challenge 3

There are several subtly different ways to call variables, observations and elements from data.frames:

  • cats[1]
  • cats[[1]]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function typeof() to examine what is returned in each case.

Matricies

  • Last but not least is the matrix.

  • We can declare a matrix full of zeros:

In [94]:
matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
000000
000000
000000

And similar to other data structures, we can ask things about our matrix:

In [95]:
class(matrix_example)
'matrix'
In [96]:
typeof(matrix_example)
'double'
In [97]:
str(matrix_example)
 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
In [98]:
dim(matrix_example)
  1. 3
  2. 6
In [99]:
nrow(matrix_example)
3
In [100]:
ncol(matrix_example)
6

Challenge 6

Challenge 7

In [ ]:

Next Lesson is Data Frames