Lesson objectives:
* Be able to add and remove rows and columns.
* Be able to remove rows with NA values.
* Be able to append two data frames.
* Be able to articulate what a factor is and how to convert between factor and character.
* Be able to find basic properties of a data frame, including size, class or type of the columns, names, and first few rows.
How can I manipulate a data frame?
data.frame
- the table created by loading data from a CSV file
Now we will learn a few more things about working with data frames.
Reload the cats CSV if needed to reset the cats variable:
cats <- read.csv(file="feline-data.csv")
age <- c(2,3,5,12)
cats
## coat weight likes_string
## 1 calico 2.1 1
## 2 black 5.0 0
## 3 tabby 3.2 1
cbind(cats, age)
cats <- cbind(cats, age)
* produces an error: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
nrow(cats)
## [1] 3
length(age)
## [1] 4
For cbind to work, the new column needs one element per row: nrow(cats) == length(age).
cats
## coat weight likes_string
## 1 calico 2.1 1
## 2 black 5.0 0
## 3 tabby 3.2 1
age <- c(4,5,8)
cats <- cbind(cats, age)
cats
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
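The length check that explains the cbind error can be done before binding. The following is a minimal sketch, not part of the lesson: add_column is a hypothetical helper name (it does not exist in base R), and the cats data is typed inline as a stand-in for feline-data.csv.

```r
# Inline stand-in for feline-data.csv so the example is self-contained
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Hypothetical helper: bind a new column only when there is one value per row
add_column <- function(df, values, name) {
  if (length(values) != nrow(df)) {
    stop("need ", nrow(df), " values but got ", length(values))
  }
  out <- cbind(df, values)
  names(out)[ncol(out)] <- name   # rename the column cbind just added
  out
}

cats <- add_column(cats, c(4, 5, 8), "age")
```

Calling add_column(cats, c(2, 3, 5, 12), "age") stops with a message that names both lengths, instead of the more cryptic data.frame error.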
Now how about adding rows? We saw last time that the rows of a data.frame are lists:
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "tortoiseshell"): invalid
## factor level, NA generated
levels(cats$coat)
## [1] "black" "calico" "tabby"
levels(cats$coat) <- c(levels(cats$coat), 'tortoiseshell')
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 4 <NA> 3.3 1 9
## 5 tortoiseshell 3.3 1 9
str(cats)
## 'data.frame': 5 obs. of 4 variables:
## $ coat : Factor w/ 4 levels "black","calico",..: 2 1 3 NA 4
## $ weight : num 2.1 5 3.2 3.3 3.3
## $ likes_string: int 1 0 1 1 1
## $ age : num 4 5 8 9 9
cats$coat <- as.character(cats$coat)
str(cats)
## 'data.frame': 5 obs. of 4 variables:
## $ coat : chr "calico" "black" "tabby" NA ...
## $ weight : num 2.1 5 3.2 3.3 3.3
## $ likes_string: int 1 0 1 1 1
## $ age : num 4 5 8 9 9
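To see the factor and character conversion in isolation, here is a small sketch on a standalone vector. The variable names (coat, coat_chr, coat_fct) are chosen for illustration only.

```r
# A factor stores integer codes plus a fixed set of allowed labels (levels)
coat <- factor(c("calico", "black", "tabby"))
levels(coat)        # levels are sorted alphabetically by default
as.integer(coat)    # the underlying codes, one per element

# Factor -> character: the labels become plain strings
coat_chr <- as.character(coat)

# A character vector accepts any new value without a warning
coat_chr <- c(coat_chr, "tortoiseshell")

# Character -> factor: levels are rebuilt from the values present
coat_fct <- factor(coat_chr)
nlevels(coat_fct)
```

Converting to character before adding a new category avoids the "invalid factor level" warning, at the cost of losing the fixed set of levels until the column is converted back.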
cats
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 4 <NA> 3.3 1 9
## 5 tortoiseshell 3.3 1 9
cats[-4,]
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 5 tortoiseshell 3.3 1 9
Notice the comma with nothing after it: it means we drop the whole row and keep all columns.
Note we can remove both new rows at once by putting the row numbers inside of a vector: cats[c(-4,-5),]
Alternatively, we can drop all rows with NA values:
na.omit(cats)
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 5 tortoiseshell 3.3 1 9
cats <- na.omit(cats)
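Before dropping rows it can help to see which ones contain NA. A minimal sketch using base R's complete.cases, with a cut-down inline copy of the cats data (only two columns, for brevity); the variable names are illustrative.

```r
# Cut-down stand-in for the cats data, with one missing coat value
cats_demo <- data.frame(coat = c("calico", "black", "tabby", NA, "tortoiseshell"),
                        weight = c(2.1, 5.0, 3.2, 3.3, 3.3),
                        stringsAsFactors = FALSE)

# TRUE for every row that contains at least one NA
has_na <- !complete.cases(cats_demo)
which(has_na)

# Two ways to keep only the complete rows
by_index <- cats_demo[!has_na, ]
by_omit  <- na.omit(cats_demo)
```

Both results keep the same rows and the original row names; na.omit additionally records which rows were removed in an na.action attribute.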
Columns are vectors or factors, and rows are lists.
cats <- rbind(cats, cats)
cats
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 5 tortoiseshell 3.3 1 9
## 11 calico 2.1 1 4
## 21 black 5.0 0 5
## 31 tabby 3.2 1 8
## 51 tortoiseshell 3.3 1 9
rownames(cats) <- NULL
cats
## coat weight likes_string age
## 1 calico 2.1 1 4
## 2 black 5.0 0 5
## 3 tabby 3.2 1 8
## 4 tortoiseshell 3.3 1 9
## 5 calico 2.1 1 4
## 6 black 5.0 0 5
## 7 tabby 3.2 1 8
## 8 tortoiseshell 3.3 1 9
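The same rbind call also appends two different data frames, as long as their column names match. A small sketch with made-up frames named first and second (both names are illustrative):

```r
first  <- data.frame(coat = c("calico", "black"), weight = c(2.1, 5.0),
                     stringsAsFactors = FALSE)
second <- data.frame(coat = "tabby", weight = 3.2,
                     stringsAsFactors = FALSE)

# rbind matches columns by name, so both frames need the same column names
combined <- rbind(first, second)

# Reset row names to 1..n after combining
rownames(combined) <- NULL
```

Character columns are used here so the example does not depend on how factor levels are merged.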
Challenge 1
So far: data.frames with our cat data
Now: the gapminder dataset, with these columns:
country
- factor with 142 levels
continent
- factor with 5 levels
year
- ranges from 1952 to 2007 in increments of 5 years
lifeExp
- life expectancy at birth, in years
pop
- population
gdpPercap
- GDP per capita (US$, inflation-adjusted)
gapminder <- read.csv("gapminder-FiveYearData.csv") # The file path will be different on your computer. Use the path where the .csv file is saved.
read.delim()
- reads tab-separated files
download.file
- downloads a file from the Internet; read.csv can then be executed to read the downloaded file, such as:
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", destfile = "data/gapminder-FiveYearData.csv")
gapminder <- read.csv("data/gapminder-FiveYearData.csv")
read.csv
- can also read a file directly from a web address:
gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv")
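For tab-separated data, read.delim() behaves like read.csv with a tab as the separator. The sketch below avoids any file or network access by passing the data through the text argument; the two-row table is a made-up stand-in for a .tsv file.

```r
# Inline tab-separated text standing in for a .tsv file
tsv_text <- "coat\tweight\ncalico\t2.1\nblack\t5.0"

# read.delim uses sep = "\t" by default
from_delim <- read.delim(text = tsv_text)

# read.csv reads the same data when the separator is given explicitly
from_csv <- read.csv(text = tsv_text, sep = "\t")
```

With a real file, the text argument would be replaced by the file path or web address.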
str
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$lifeExp)
## [1] "double"
typeof(gapminder$country)
## [1] "integer"
str(gapminder$country)
## Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
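typeof reports "integer" for a factor because typeof describes how the values are stored, while class describes how R treats the object. A short sketch on a made-up factor (the name continent_demo is illustrative):

```r
continent_demo <- factor(c("Asia", "Europe", "Asia", "Africa"))

typeof(continent_demo)   # storage type of the underlying codes
class(continent_demo)    # how R treats the object

# Mapping the integer codes back through the levels recovers the labels
labels_back <- levels(continent_demo)[as.integer(continent_demo)]
```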
length(gapminder)
## [1] 6
typeof(gapminder)
## [1] "list"
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6
dim(gapminder)
## [1] 1704 6
colnames(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
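The per-column checks can be run for every column at once with sapply. This sketch uses a two-row stand-in built from the first rows of the head(gapminder) output rather than the full file. It sets stringsAsFactors = TRUE explicitly, because from R 4.0.0 onward data.frame and read.csv no longer create factors by default, so the Factor columns shown in this lesson assume an older R version or that setting.

```r
# Two-row stand-in for gapminder, with text stored as a factor
gap_demo <- data.frame(country = c("Afghanistan", "Afghanistan"),
                       year = c(1952L, 1957L),
                       lifeExp = c(28.801, 30.332),
                       stringsAsFactors = TRUE)

dim(gap_demo)                           # rows, then columns
col_classes <- sapply(gap_demo, class)  # class of every column
col_types   <- sapply(gap_demo, typeof) # storage type of every column
```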
Remember how R interprets data, and the importance of strict consistency in how we record our data.
Once we are happy that the data types and structures seem reasonable, it's time to start digging into our data properly.
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
Challenge 2
Challenge 3