Lesson objective:

  • Be able to add and remove rows and columns.
  • Be able to remove rows with NA values.
  • Be able to append two data frames.
  • Be able to articulate what a factor is and how to convert between factor and character.
  • Be able to find basic properties of a data frame, including size, class or type of the columns, names, and first few rows.

How can I manipulate a dataframe?

Now we will learn a few more things about working with data frames.

Adding columns and rows in a data.frame

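reload cats csv if needed to reset cats variable (in R 4.0.0 and later, read.csv only reads coat as a factor, as in the output below, if you pass stringsAsFactors = TRUE)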

cats <- read.csv(file="feline-data.csv")
age <- c(2,3,5,12)
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1

cbind(cats, age)

cats <- cbind(cats, age)
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

nrow(cats)
## [1] 3
length(age)
## [1] 4
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
age <- c(4,5,8)
cats <- cbind(cats, age)
cats
##     coat weight likes_string age
## 1 calico    2.1            1   4
## 2  black    5.0            0   5
## 3  tabby    3.2            1   8
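
The length check above can also be written into the code, so that a mismatch is caught before cbind is called. This is only a minimal sketch: the cats data frame here is a hand-built stand-in (the lesson reads it from feline-data.csv), and the error message wording is made up for illustration.

```r
# Stand-in for the cats data frame; the lesson loads it from feline-data.csv
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
age <- c(4, 5, 8)

# Only bind the new column when it has exactly one value per row
if (length(age) == nrow(cats)) {
  cats <- cbind(cats, age)
} else {
  stop("age has ", length(age), " values but cats has ", nrow(cats), " rows")
}
cats
```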

Now how about adding rows? We saw last time that each row of a data.frame is a list:

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "tortoiseshell"): invalid
## factor level, NA generated

Factors

  • another thing to look out for: when R creates a factor, it only allows the values that were present when our data was loaded (‘black’, ‘calico’, ‘tabby’)
  • anything that doesn’t fit into one of its categories is rejected and becomes NA
  • to add the new row, we need to explicitly add ‘tortoiseshell’ as a level in the factor
levels(cats$coat)
## [1] "black"  "calico" "tabby"
levels(cats$coat) <- c(levels(cats$coat), 'tortoiseshell')
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats
##            coat weight likes_string age
## 1        calico    2.1            1   4
## 2         black    5.0            0   5
## 3         tabby    3.2            1   8
## 4          <NA>    3.3            1   9
## 5 tortoiseshell    3.3            1   9
  • alternatively, we can change a factor column to a character vector
  • we lose the handy categories of the factor, but can subsequently add any word we want to the column
str(cats)
## 'data.frame':    5 obs. of  4 variables:
##  $ coat        : Factor w/ 4 levels "black","calico",..: 2 1 3 NA 4
##  $ weight      : num  2.1 5 3.2 3.3 3.3
##  $ likes_string: int  1 0 1 1 1
##  $ age         : num  4 5 8 9 9
cats$coat <- as.character(cats$coat)
str(cats)
## 'data.frame':    5 obs. of  4 variables:
##  $ coat        : chr  "calico" "black" "tabby" NA ...
##  $ weight      : num  2.1 5 3.2 3.3 3.3
##  $ likes_string: int  1 0 1 1 1
##  $ age         : num  4 5 8 9 9
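
The conversion works in both directions. The sketch below uses a small hypothetical coat vector rather than the cats data frame, and shows a factor being turned into a character vector, extended with a new value, and turned back into a factor with its levels rebuilt.

```r
# Hypothetical coat column, stored as a factor
coat <- factor(c("calico", "black", "tabby"))
levels(coat)

# Factor -> character: the vector can now hold any string
coat_chr <- as.character(coat)
coat_chr <- c(coat_chr, "tortoiseshell")

# Character -> factor: levels are rebuilt from the values that are present
coat_fct <- factor(coat_chr)
levels(coat_fct)
```

Converting back with factor() is one option when the categories are needed again, for example for grouping or plotting.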

Removing rows

cats
##            coat weight likes_string age
## 1        calico    2.1            1   4
## 2         black    5.0            0   5
## 3         tabby    3.2            1   8
## 4          <NA>    3.3            1   9
## 5 tortoiseshell    3.3            1   9
cats[-4,]
##            coat weight likes_string age
## 1        calico    2.1            1   4
## 2         black    5.0            0   5
## 3         tabby    3.2            1   8
## 5 tortoiseshell    3.3            1   9
na.omit(cats)
##            coat weight likes_string age
## 1        calico    2.1            1   4
## 2         black    5.0            0   5
## 3         tabby    3.2            1   8
## 5 tortoiseshell    3.3            1   9
cats <- na.omit(cats)
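
na.omit drops every row that has an NA in any column. When only one column matters, a logical index built with is.na is an alternative. A minimal sketch, assuming a hand-built stand-in for cats with just two columns:

```r
# Stand-in data with one missing coat value
cats <- data.frame(coat = c("calico", "black", "tabby", NA, "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3, 3.3),
                   stringsAsFactors = FALSE)

# TRUE for rows with no NA in any column (the rows na.omit would keep)
complete.cases(cats)

# Keep only the rows where coat specifically is not NA
cats_clean <- cats[!is.na(cats$coat), ]
nrow(cats_clean)
```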

Appending to a data frame

cats <- rbind(cats, cats)
cats
##             coat weight likes_string age
## 1         calico    2.1            1   4
## 2          black    5.0            0   5
## 3          tabby    3.2            1   8
## 5  tortoiseshell    3.3            1   9
## 11        calico    2.1            1   4
## 21         black    5.0            0   5
## 31         tabby    3.2            1   8
## 51 tortoiseshell    3.3            1   9
rownames(cats) <- NULL
cats
##            coat weight likes_string age
## 1        calico    2.1            1   4
## 2         black    5.0            0   5
## 3         tabby    3.2            1   8
## 4 tortoiseshell    3.3            1   9
## 5        calico    2.1            1   4
## 6         black    5.0            0   5
## 7         tabby    3.2            1   8
## 8 tortoiseshell    3.3            1   9
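
The same pattern applies to two different data frames, as long as they share column names. The sketch below uses two made-up data frames (cats_a and cats_b are hypothetical names, not part of the lesson data).

```r
# Two made-up data frames with the same columns
cats_a <- data.frame(coat = c("calico", "black"), weight = c(2.1, 5.0),
                     stringsAsFactors = FALSE)
cats_b <- data.frame(coat = "tabby", weight = 3.2, stringsAsFactors = FALSE)

# rbind matches columns by name, so both frames need the same column names
all_cats <- rbind(cats_a, cats_b)
rownames(all_cats) <- NULL   # renumber the rows sequentially
all_cats
```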

Challenge 1

Realistic example

So far:

Gapminder data info:

  • For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007
  • The main data frame gapminder has 1704 rows and 6 variables:
  • country - factor with 142 levels
  • continent - factor with 5 levels
  • year - ranges from 1952 to 2007 in increments of 5 years
  • lifeExp - life expectancy at birth, in years
  • pop - population
  • gdpPercap - GDP per capita (US$, inflation-adjusted)
# the file path will be different on your computer; use the path where the .csv file is saved
gapminder <- read.csv("gapminder-FiveYearData.csv")

Loading data tips

  • mention tab-separated values files (.tsv)
  • to specify a tab separator use sep = "\t" in read.csv, or use read.delim()
  • files can be downloaded from the internet using download.file, and read.csv can then be used to read the downloaded file:
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", destfile = "data/gapminder-FiveYearData.csv")
gapminder <- read.csv("data/gapminder-FiveYearData.csv")
  • alternatively, you can read files directly into R from the internet by replacing the file path with a web address in read.csv
  • note that when you do this, no local copy is saved
gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv")
  • you can read directly from Excel spreadsheets without converting them to plain text by using the readxl package
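
The two ways of reading a tab-separated file can be tried without downloading anything. This is a sketch only: it writes a tiny made-up file to a temporary location and reads it back both ways.

```r
# Write a small tab-separated file to a temporary location (made-up data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# Option 1: read.csv with an explicit tab separator
cats_a <- read.csv(tsv_path, sep = "\t")

# Option 2: read.delim, which uses a tab separator by default
cats_b <- read.delim(tsv_path)

all.equal(cats_a, cats_b)
```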

Working with the gapminder dataset

  • let’s investigate gapminder
  • the first thing we should always do is check out what the data looks like with str
str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$lifeExp)
## [1] "double"
typeof(gapminder$country)
## [1] "integer"
str(gapminder$country)
##  Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
  • we can also interrogate the data.frame for info about its dimensions
  • remember that str(gapminder) said there were 1704 observations of 6 variables in gapminder
  • what do you think the following will produce?
length(gapminder)
## [1] 6
  • a fair guess would say that the length of a data.frame would be the number of rows it has (1704)
  • that is not the case; remember, a data.frame is a list of vectors and factors:
typeof(gapminder)
## [1] "list"
  • when length gave us 6 it’s because gapminder is built out of a list of 6 columns
  • to get number of rows and columns in our data set try:
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6
  • or both at once
dim(gapminder)
## [1] 1704    6
  • we will also want to know the titles of all the columns, so we can refer to them by name later:
colnames(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
  • it is important to ask if the structure R is reporting matches our intuition or expectations
  • do the basic data types reported make sense?
  • if not we need to sort out problems now before they turn into negative surprises down the road
  • remember how R interprets data, and the importance of strict consistency in how we record our data
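
One way to check the reported types against expectations is to list the class of every column and stop early if one is not what the analysis needs. A minimal sketch, assuming a tiny made-up stand-in (gap is a hypothetical name) instead of the full gapminder file:

```r
# Made-up stand-in for a couple of gapminder-like rows
gap <- data.frame(country = factor(c("Afghanistan", "Albania")),
                  year = c(1952L, 1952L),
                  lifeExp = c(28.8, 55.2))

# Class of every column at once
col_classes <- sapply(gap, class)
col_classes

# Fail early if a column is not the type the analysis expects
stopifnot(is.numeric(gap$lifeExp), is.integer(gap$year))
```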

  • once we are happy that the data types and structures seem reasonable, it’s time to start digging into our data properly

head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
  • to make sure our analysis is reproducible we should put the code into a script file so we can come back later

Challenge 2

Challenge 3

Cheat Sheets