The first part of this notebook is based on Karthik Ram’s ggplot2 lecture (CC-BY 2.0).

#install.packages('ggplot2')
#library(ggplot2)

GOALS: Students should be able to use ggplot2 to generate publication-quality graphics, and to understand and apply the basics of the grammar of graphics.

DataViz

Terminology:

First Plots with GGPLOT

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

library(ggplot2)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We can use the data() function to list the built-in data sets available in R.

data()
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

Basic structure

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
myplot + geom_point()

Increase size of points

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 3)

Make it colorful

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3)

Differentiate points by shape

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3)

Exercise 1

# Make a small sample of the diamonds dataset
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]

Then generate this plot below. (open 09-plot-ggplot2-ex-1-1.png)

ggplot(d2, aes(carat, price, color = color)) + geom_point() + theme_gray()

Switch to Gapminder Data

#gapminder <- read.csv("https://goo.gl/BtBnPg", header = TRUE)
gapminder <- read.csv('data/gapminder-FiveYearData.csv', header = TRUE)
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

NOTE:

  • First we call the ggplot function. Any arguments we give it are global options: they apply to all layers of the plot.
  • We passed two arguments to ggplot:
      • data: the data frame to plot
      • an aes function, which tells ggplot how variables map to aesthetic properties, here the x and y locations
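
To make "global options apply to all layers" concrete, here is a minimal sketch. It uses the built-in iris data instead of gapminder so that it runs without the CSV file, and the object names p_global and p_layer are just illustrative.

```r
library(ggplot2)

# Global mapping: x, y and color are inherited by every layer,
# so geom_smooth() fits one line per species.
p_global <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer-level mapping: color is known only to geom_point(),
# so geom_smooth() fits a single line through all points.
p_layer <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_layer
```

Comparing the two plots side by side shows which aesthetics each layer actually received.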

On its own, the ggplot call isn’t enough to draw the data.

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap))
## Draws only an empty panel with axes (older ggplot2 versions raised an error here).

We need to tell ggplot how to present the variables by adding a geom layer. In the example above we used geom_point to create a scatter plot.

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point()
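
As a small sketch of what "adding a layer" means, the plot object can be inspected before and after a geom is added. This again assumes the built-in iris data rather than gapminder, and the names base and scatter are made up for the example.

```r
library(ggplot2)

# A plot with data and mappings but no geom yet.
base <- ggplot(iris, aes(Sepal.Length, Sepal.Width))
length(base$layers)      # number of layers before adding a geom

# Adding a geom appends one layer to the stored plot.
scatter <- base + geom_point()
length(scatter$layers)   # number of layers after adding geom_point()
```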

Box plots

See ?geom_boxplot for the list of options

library(MASS)
head(birthwt)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot()

Histograms

See ?geom_histogram for the list of options

h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 30, colour = "black")

h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 8, fill = "steelblue", colour = "black")

Line plots

Download data

download.file('https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv', 'data/climate.csv')
#climate <- read.csv(text=RCurl::getURL("https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv"))
climate <- read.csv("data/climate.csv", header = TRUE)
ggplot(climate, aes(Year, Anomaly10y)) +
  geom_line()

We can also plot confidence regions

ggplot(climate, aes(Year, Anomaly10y)) +
  geom_ribbon(aes(ymin = Anomaly10y - Unc10y, ymax = Anomaly10y + Unc10y),
              fill = "blue", alpha = .1) +
  geom_line(color = "steelblue")

Exercise 2

Modify the previous plot so that it shows three lines (the anomaly plus its upper and lower uncertainty bounds) instead of one line with a confidence band.

cplot <- ggplot(climate, aes(Year, Anomaly10y))
cplot <- cplot + geom_line(size = 0.7, color = "black")
cplot <- cplot + geom_line(aes(Year, Anomaly10y + Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot <- cplot + geom_line(aes(Year, Anomaly10y - Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot + theme_gray()

#theme_classic()
#theme_bw()
#theme_minimal()

Gapminder line graph

A scatter plot is not the best way to visualize change over time. Let’s use a line plot instead.

# group = country draws one line per country
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()

ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() + geom_point()

# color mapped only in geom_line(): the points stay black
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country)) +
  geom_line(aes(color=continent)) + geom_point()
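
The effect of grouping is easier to see on a tiny data set. The sketch below uses a made-up toy data frame (hypothetical countries "A" and "B" with invented values), not the gapminder file.

```r
library(ggplot2)

# Hypothetical toy data: two made-up countries observed in three years.
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(2000, 2005, 2010), times = 2),
  value   = c(1, 2, 3, 3, 2, 1)
)

# Without grouping, geom_line() connects all six rows as one zigzag path.
one_line <- ggplot(toy, aes(year, value)) + geom_line()

# With group = country, each country gets its own line.
two_lines <- ggplot(toy, aes(year, value, group = country)) + geom_line()

one_line
two_lines
```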

Bar Plots

ggplot(iris, aes(Species, Sepal.Length)) +
  geom_bar(stat = "identity")

library(tidyr)
#df <- melt(iris, id.vars = "Species")
df <- gather(iris, variable, value, -Species )
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity")

The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat = "count" (this was stat = "bin" in ggplot2 versions before 2.0). This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat = "identity" and map a value to the y aesthetic.
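
The two meanings of bar height can be compared directly. This is a minimal sketch on the built-in iris data. The per-species mean is just one possible value to plot, and the names counts, means and values are illustrative.

```r
library(ggplot2)

# Default: geom_bar() counts the rows in each group (no y aesthetic).
counts <- ggplot(iris, aes(Species)) + geom_bar()

# Heights taken from the data: compute one value per group first,
# then plot that value as-is.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
values <- ggplot(means, aes(Species, Sepal.Length)) +
  geom_bar(stat = "identity")   # geom_col() is the equivalent shorthand

counts
values
```

Note that with stat = "identity" on the raw iris rows, as in the first bar plot above, the bars are stacked sums of all 50 values per species rather than means.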

Dplyr and Tidyr

These two packages are the Swiss army knives of R.

  • dplyr: filter, select, mutate
  • tidyr: gather, spread, separate

Let’s look at iris again.

iris[1:2, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
df <- gather(iris, variable, value, -Species ) 
df[1:2, ]
  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")
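
For orientation, here is a short sketch of the dplyr verbs listed above, next to the gather() call already used in this section. It assumes dplyr and tidyr are installed. The derived column Sepal.Ratio and the object names are invented for the example.

```r
library(dplyr)
library(tidyr)

# dplyr: keep some rows, keep some columns, add a derived column.
setosa <- iris %>%
  filter(Species == "setosa") %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)

# tidyr: reshape from wide to long, one row per measurement.
long <- gather(iris, variable, value, -Species)

head(setosa, 2)
dim(long)   # 150 flowers x 4 measurements = 600 rows, 3 columns
```

If MASS is attached after dplyr, its select() masks the dplyr one. Writing dplyr::select() avoids the clash.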

Exercise 3

Using the d2 dataset you created earlier, generate the plot below. Take a quick look at the data first to see whether it needs to be binned.

d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
ggplot(d2, aes(clarity, fill = cut)) +
  geom_bar(position = "dodge")

Exercise 4

clim <- read.csv('data/climate.csv', header = TRUE)
clim$sign <- ifelse(clim$Anomaly10y<0, FALSE, TRUE)
# or as simple as
# clim$sign <- clim$Anomaly10y >= 0
ggplot(clim, aes(Year, Anomaly10y)) + geom_bar(stat = "identity", aes(fill = sign)) + theme_gray()

Density Plots

ggplot(faithful, aes(waiting)) + geom_density()

ggplot(faithful, aes(waiting)) +
  geom_density(fill = "blue", alpha = 0.1)

ggplot(faithful, aes(waiting)) +
  geom_line(stat = "density")

Colors

aes(color = variable)          # map a data variable to color (inside aes)
geom_point(color = "black")    # set one fixed color for a layer (outside aes)
# Or add it as a scale
scale_fill_manual(values = c("color1", "color2"))
library(RColorBrewer) 
display.brewer.all() 
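
The difference between mapping a color and setting one is a common stumbling block, so here is a minimal sketch on the iris data. The names mapped and fixed are illustrative.

```r
library(ggplot2)

# Mapping: color depends on a data column, so a legend is drawn
# and the actual colors come from a scale.
mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species))

# Setting: every point gets the same fixed color and no legend is needed.
fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "black")

mapped
fixed
```

Writing aes(color = "black") would instead treat "black" as a one-level data value, and the scale would pick the color.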

Using a color brewer palette

#df  <- melt(iris, id.vars = "Species")
ggplot(df, aes(Species, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1")

Manual color scale

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .) +
  scale_color_manual(values = c("red", "green", "blue"))

Transformations and statistics

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) +
  geom_point()

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10()

ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10() + geom_smooth(method="lm")

pwd <- ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5)

There are two ways to specify an aesthetic:

  1. Set it to a fixed value by passing it as an argument to the geom. Here size is set this way in geom_smooth.
  2. Use the aes function to define a mapping between data variables and their visual representation, as in the earlier examples.
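
One detail of the log scale used in this section: scale_y_log10() transforms the data before any statistics are computed, so geom_smooth fits its line to log10(y). The sketch below checks this on a made-up toy data frame (the values are invented) by inspecting the data ggplot2 builds for the point layer.

```r
library(ggplot2)

# Hypothetical toy data spanning three orders of magnitude.
toy <- data.frame(x = 1:4, y = c(1, 12, 90, 1100))

p <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = "lm", se = FALSE)

# The built layer data holds the transformed values that the stat sees.
built_y <- layer_data(p, 1)$y
all.equal(built_y, log10(toy$y))
```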

With iris data - smooth

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3) +
  geom_smooth(method = "lm")

Within facet

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3) +
  geom_smooth(method = "lm") +
  facet_grid(. ~ Species)

Multi-panel figures: faceting

starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country)

Modifying text

ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") +
  scale_colour_discrete(name="Continent") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

http://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2#challenge-5

With iris along columns (one panel per row)

#str(iris)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .)

And along rows (one panel per column)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(. ~ Species)

Or wrap your panels

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap( ~ Species)
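
The formula in facet_grid reads as rows ~ columns. A sketch with two faceting variables makes this easier to see. It assumes the built-in mtcars data, which is not used elsewhere in this notebook, and the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# rows ~ columns: cyl (3 levels) down the rows, am (2 levels) across the columns.
grid <- p + facet_grid(cyl ~ am)

# facet_wrap() takes one variable and wraps the panels into a chosen number of columns.
wrap <- p + facet_wrap(~ cyl, ncol = 2)

grid
wrap
```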

Themes

Themes are a great way to define custom plots.

+ theme()  # see ?theme for more options

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme(legend.key = element_rect(fill = NA),
        legend.position = "bottom",
        strip.background = element_rect(fill = NA),
        axis.title.y = element_text(angle = 0))
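
If the same theme settings are needed on several plots, they can be stored once and reused. This is a sketch of that idea. The name my_theme is hypothetical, and theme_bw() is just one possible starting point.

```r
library(ggplot2)

# A reusable theme object: a complete theme plus a few overrides.
my_theme <- theme_bw() +
  theme(legend.position = "bottom",
        strip.background = element_rect(fill = NA),
        axis.title.y = element_text(angle = 0))

# The same object can be added to any plot.
p1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() + my_theme
p2 <- ggplot(faithful, aes(waiting)) +
  geom_histogram(binwidth = 8) + my_theme

p1
p2
```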

#install.packages('ggthemes')
library(ggthemes)

Then add one of these themes to your plot

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme_solarized() +
  theme(legend.key = element_rect(fill = NA),
        legend.position = "bottom",
        strip.background = element_rect(fill = NA),
        axis.title.y = element_text(angle = 0))

How to save your plots

# Save the most recently displayed plot
ggsave("~/path/to/figure/filename.png")
# Save a specific plot object (here called plot1)
ggsave("~/path/to/figure/filename.png", plot = plot1)
# Set the size (in inches by default)
ggsave("/path/to/figure/filename.png", width = 6, height = 4)
# The file extension selects the output format
ggsave("/path/to/figure/filename.eps")
ggsave("/path/to/figure/filename.jpg")
ggsave("/path/to/figure/filename.pdf")
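
The paths above are placeholders. A version that can be run as-is is sketched below. It writes to R's temporary directory, and the file name iris-scatter.png is made up for the example.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Hypothetical output file inside the session's temporary directory.
out <- file.path(tempdir(), "iris-scatter.png")

# Passing the plot explicitly avoids relying on "the last plot displayed".
ggsave(filename = out, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out)
```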

Resources:

This is just a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!