The first part of this notebook is based on Karthik Ram’s ggplot2 lecture (CC BY 2.0).
#install.packages('ggplot2')
#library(ggplot2)
GOALS: Students should be able to use ggplot2 to generate publication-quality graphics, and to understand and use the basics of the grammar of graphics.
ggplot2 is built on the grammar of graphics.
Working with ggplot2 means thinking about a figure in layers, as in ArcGIS or programs like Photoshop.
Some of the available geoms: geom_point(), geom_bar(), geom_density(), geom_line(), geom_area()
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
library(ggplot2)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
We can use the data() function to list the built-in data sets available in R.
data()
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Basic structure
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
myplot + geom_point()
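The data and the aesthetic mappings are specified inside the ggplot function; geoms are then added to it as layers.
Increase size of points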
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 3)
Make it colorful
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 3)
Differentiate points by shape
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3)
# Make a small sample of the diamonds dataset
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
Then generate this plot below. (open 09-plot-ggplot2-ex-1-1.png)
ggplot(d2, aes(carat, price, color = color)) + geom_point() + theme_gray()
Switch to Gapminder Data
#gapminder <- read.csv("https://goo.gl/BtBnPg", header = T)
gapminder <- read.csv('data/gapminder-FiveYearData.csv', header=T)
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
NOTE:
First we call the ggplot function. Any arguments we provide to the ggplot function are considered global options: they apply to all layers.
We passed two arguments to ggplot: the data, and a call to the aes function, which tells ggplot how variables map to aesthetic properties, here the x and y locations.
Alone, the ggplot call isn’t enough to draw the data.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap))
## If run, current ggplot2 versions draw empty axes with no data; older versions raised an error.
We need to tell ggplot how we want to present the variables by specifying a geom layer. In the earlier example we used geom_point to create a scatter plot.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point()
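To see why a geom is needed, it can help to inspect the plot object itself. The sketch below uses the built-in iris data (so it runs without the gapminder file) and counts the layers before and after adding geom_point. The names base_plot and scatter are illustrative only.

```r
library(ggplot2)

# Global options only: data and aesthetic mappings, but no geom yet
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
n_layers_before <- length(base_plot$layers)

# Adding a geom creates the first layer
scatter <- base_plot + geom_point()
n_layers_after <- length(scatter$layers)

n_layers_before
n_layers_after
```

A plot object with no layers has nothing to draw, which is why the geom is required.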
See ?geom_boxplot for a list of options
library(MASS)
head(birthwt)
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
91 0 21 124 3 0 0 0 0 0 2622
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot()
See ?geom_histogram for a list of options
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 30, colour = "black")
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 8, fill = "steelblue", colour = "black")
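The effect of binwidth can also be checked numerically. As a rough sketch, layer_data() returns the data frame that a layer actually draws, which for a histogram is one row per bin. The names bins_wide and bins_narrow are illustrative.

```r
library(ggplot2)

h <- ggplot(faithful, aes(x = waiting))

# One row per histogram bin for each choice of binwidth
bins_wide   <- layer_data(h + geom_histogram(binwidth = 30), 1)
bins_narrow <- layer_data(h + geom_histogram(binwidth = 8), 1)

nrow(bins_wide)    # few, wide bins
nrow(bins_narrow)  # more, narrower bins

# Every observation falls into exactly one bin
sum(bins_narrow$count) == nrow(faithful)
```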
Download data
download.file('https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv', 'data/climate.csv')
#climate <- read.csv(text=RCurl::getURL("https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv"))
Anomaly10y is a 10-year running average of the deviation (in Celsius) from the average 1950–1980 temperature, and Unc10y is the 95% confidence interval.
climate <- read.csv("data/climate.csv", header = T)
ggplot(climate, aes(Year, Anomaly10y)) +
geom_line()
We can also plot confidence regions
We’ll set ymax and ymin to Anomaly10y plus or minus Unc10y (Figure 4-25):
ggplot(climate, aes(Year, Anomaly10y)) +
geom_ribbon(aes(ymin = Anomaly10y - Unc10y, ymax = Anomaly10y + Unc10y),
fill = "blue", alpha = .1) +
geom_line(color = "steelblue")
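The band limits can also be computed as explicit columns before plotting, which makes it easy to check them. The sketch below does not read climate.csv: toy_climate is a hypothetical stand-in with the same column names and invented values, and lower and upper are illustrative column names.

```r
library(ggplot2)

# Hypothetical stand-in for climate.csv: same column names, invented values
toy_climate <- data.frame(
  Year       = 2000:2004,
  Anomaly10y = c(0.40, 0.45, 0.50, 0.52, 0.55),
  Unc10y     = c(0.03, 0.03, 0.02, 0.02, 0.02)
)

# Compute the band limits explicitly instead of inside aes()
toy_climate$lower <- toy_climate$Anomaly10y - toy_climate$Unc10y
toy_climate$upper <- toy_climate$Anomaly10y + toy_climate$Unc10y

ribbon_plot <- ggplot(toy_climate, aes(Year, Anomaly10y)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "blue", alpha = 0.1) +
  geom_line(color = "steelblue")

ribbon_plot
```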
Modify the previous plot and change it such that there are three lines instead of one with a confidence band.
cplot <- ggplot(climate, aes(Year, Anomaly10y))
cplot <- cplot + geom_line(size = 0.7, color = "black")
cplot <- cplot + geom_line(aes(Year, Anomaly10y + Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot <- cplot + geom_line(aes(Year, Anomaly10y - Unc10y), linetype = "dashed", size = 0.7, color = "red")
cplot + theme_gray()
#theme_classic
#theme_bw()
#theme_minimal()
A scatter plot is not the best way to visualize change over time. Let’s use a line plot instead.
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
geom_line()
We added a group aesthetic to get one line per country, and colored the lines by continent.
What if we want to visualize both lines and points on the plot?
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country, color=continent)) +
geom_line() + geom_point()
ggplot(data = gapminder, aes(x=year, y=lifeExp, group=country)) +
geom_line(aes(color=continent)) + geom_point()
In the last example, the color aesthetic mapping has been moved from the global plot options in ggplot to the geom_line layer, so it no longer applies to the points.
ggplot(iris, aes(Species, Sepal.Length)) +
geom_bar(stat = "identity")
library(tidyr)
#df <- melt(iris, id.vars = "Species")
df <- gather(iris, variable, value, -Species )
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity")
The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat = "count". This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat = "identity" and map a value to the y aesthetic.
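The two cases described above can be sketched side by side with iris. This is a minimal illustration, not part of the original lecture: count_plot relies on geom_bar's default stat, while value_plot plots one pre-computed mean per species. The names count_plot, value_plot, and mean_sepal are illustrative.

```r
library(ggplot2)

# Counts: no y aesthetic, the stat tallies the rows in each group
count_plot <- ggplot(iris, aes(Species)) + geom_bar()
bar_counts <- layer_data(count_plot, 1)$count

# Values: pre-compute one value per group and map it to y
mean_sepal <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
value_plot <- ggplot(mean_sepal, aes(Species, Sepal.Length)) +
  geom_bar(stat = "identity")

bar_counts  # 50 flowers per species
```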
dplyr and tidyr: these two packages are the Swiss army knives of R.
dplyr
* filter
* select
* mutate
tidyr
* gather
* spread
* separate
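As a quick, hedged illustration of several of the verbs listed above, the sketch below applies them to iris. The result names (setosa, sepals, ratios, long, long_split) and the new column names (Sepal.Ratio, part, measure) are made up for this example.

```r
library(dplyr)
library(tidyr)

# dplyr: keep rows, keep columns, add a column
setosa <- filter(iris, Species == "setosa")
sepals <- select(iris, Species, Sepal.Length, Sepal.Width)
ratios <- mutate(iris, Sepal.Ratio = Sepal.Length / Sepal.Width)

# tidyr: wide to long, then split the variable name into two columns
long <- gather(iris, variable, value, -Species)
long_split <- separate(long, variable, into = c("part", "measure"), sep = "\\.")

dim(long)         # 600 rows: 150 flowers x 4 measurements
head(long_split)
```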
Let’s look at iris again.
iris[1:2, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
df <- gather(iris, variable, value, -Species )
df[1:2, ]
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge")
Using the d2 dataset you created earlier, generate the plot below. Take a quick look at the data first to see whether it needs to be binned.
d2 <- diamonds[sample(1:dim(diamonds)[1], 1000), ]
ggplot(d2, aes(clarity, fill = cut)) +
geom_bar(position = "dodge")
Use the ifelse function to create clim$sign, a logical column that is TRUE when Anomaly10y is not negative.
clim <- read.csv('data/climate.csv', header = TRUE)
clim$sign <- ifelse(clim$Anomaly10y<0, FALSE, TRUE)
# or as simple as
# clim$sign <- clim$Anomaly10y >= 0
ggplot(clim, aes(Year, Anomaly10y)) + geom_bar(stat = "identity", aes(fill = sign)) + theme_gray()
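The ifelse call and a plain comparison can be checked against each other on a small vector. The anomaly values below are invented for illustration and do not come from climate.csv.

```r
# Hypothetical anomaly values, invented for illustration
anomaly <- c(-0.3, -0.1, 0, 0.2, 0.5)

# ifelse() version: FALSE for negative values, TRUE otherwise
sign_ifelse <- ifelse(anomaly < 0, FALSE, TRUE)

# A comparison already returns a logical vector, so this is equivalent
sign_direct <- anomaly >= 0

identical(sign_ifelse, sign_direct)
```

Note that the direct form needs >= 0 rather than < 0 to give the same TRUE/FALSE values as the ifelse call.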
ggplot(faithful, aes(waiting)) + geom_density()
ggplot(faithful, aes(waiting)) +
geom_density(fill = "blue", alpha = 0.1)
ggplot(faithful, aes(waiting)) +
geom_line(stat = "density")
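aes(color = variable) # mapping: color follows a variable, inside aes()
color = "black" # setting: one fixed color, passed to the geom outside aes()
# Or add it as a scale
scale_fill_manual(values = c("color1", "color2"))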
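The difference between mapping a color and setting one can be seen in the data each layer ends up drawing. A minimal sketch with iris follows; the names mapped and fixed are illustrative.

```r
library(ggplot2)

# Mapping: color follows a variable, so it goes inside aes()
mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species))

# Setting: one fixed color for the whole layer, outside aes()
fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "black")

# layer_data() shows the colors each layer ends up drawing
n_mapped_colors <- length(unique(layer_data(mapped, 1)$colour))
fixed_colors <- unique(layer_data(fixed, 1)$colour)

n_mapped_colors  # one color per species
fixed_colors
```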
library(RColorBrewer)
display.brewer.all()
#df <- melt(iris, id.vars = "Species")
ggplot(df, aes(Species, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set1")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(Species ~ .) +
scale_color_manual(values = c("red", "green", "blue"))
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) +
geom_point()
We can change the scale of units on the y axis using the scale functions.
We can also modify the transparency of the points using alpha, which is helpful when you have a large amount of data which is very clustered.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10()
The scale_y_log10 function applied a log10 transformation to the values of the gdpPercap column before rendering them on the plot. This makes it easier to visualize the spread of data on the y-axis.
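To confirm that the scale transforms the values before they are drawn, one can compare the layer data with log10 of the input. The toy data frame below is hypothetical (invented values spanning several orders of magnitude), so the sketch runs without the gapminder file.

```r
library(ggplot2)

# Hypothetical data spanning several orders of magnitude (invented values)
toy <- data.frame(x = 1:4, y = c(10, 100, 1000, 10000))

log_plot <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_y_log10()

# The layer data holds the transformed values: log10(y), not y
drawn_y <- layer_data(log_plot, 1)$y
drawn_y
```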
We can fit a simple relationship to the data by adding another layer, geom_smooth:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm")
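The line drawn by geom_smooth(method = "lm") is an ordinary linear regression, which can be reproduced with lm(). The sketch below is an illustration under that assumption and uses iris rather than gapminder so that it runs without the CSV file. The names smooth_plot, fit, line_data, and manual are illustrative.

```r
library(ggplot2)

smooth_plot <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The same straight-line fit done directly with lm()
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)
coef(fit)

# The smooth layer (layer 2) stores the fitted line on a grid of x values;
# predicting at those x values with lm() should give the same y values
line_data <- layer_data(smooth_plot, 2)
manual <- predict(fit, newdata = data.frame(Sepal.Length = line_data$x))
same_line <- isTRUE(all.equal(unname(manual), line_data$y))
same_line
```

With a log-scaled axis, as in the gapminder plot, the fit is made on the transformed values, so the matching lm() call would use log10 of the y variable.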
We can make the line thicker by setting the size aesthetic in the geom_smooth layer:
pwd <- ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5)
Here the size aesthetic is set by passing it as an argument to geom_smooth. Previously we used the aes function to define a mapping between data variables and their visual representation.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3) +
geom_smooth(method = "lm")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(aes(shape = Species), size = 3) +
geom_smooth(method = "lm") +
facet_grid(. ~ Species)
starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country)
The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~).
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") +
scale_colour_discrete(name="Continent") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
http://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2#challenge-5
#str(iris)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(Species ~ .)
### And along columns
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_grid(. ~ Species)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
facet_wrap( ~ Species)
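Faceting draws one panel per unique value of the faceting variable. As a small check, the drawn data carries a PANEL column that records which panel each point belongs to. The names faceted and panels are illustrative.

```r
library(ggplot2)

faceted <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~ Species)

# Each row of the drawn data carries the panel it belongs to
panels <- layer_data(faceted, 1)$PANEL
table(panels)

# One panel per species
length(unique(panels)) == length(unique(iris$Species))
```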
Themes are a great way to define custom plots.
+theme() #### see ?theme() for more options
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 1.2, shape = 16) +
facet_wrap( ~ Species) +
theme(legend.key = element_rect(fill = NA),
legend.position = "bottom",
strip.background = element_rect(fill = NA),
axis.title.y = element_text(angle = 0))
#install.packages('ggthemes')
library(ggthemes)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 1.2, shape = 16) +
facet_wrap( ~ Species) +
theme_solarized() +
theme(legend.key = element_rect(fill = NA),
legend.position = "bottom",
strip.background = element_rect(fill = NA),
axis.title.y = element_text(angle = 0))
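# Save the most recently displayed plot
ggsave("~/path/to/figure/filename.png")
# Save a specific plot object (plot1 stands for a plot you created earlier)
ggsave("~/path/to/figure/filename.png", plot = plot1)
# Set the dimensions (in inches by default)
ggsave("/path/to/figure/filename.png", width = 6, height = 4)
# The file extension determines the output format
ggsave("/path/to/figure/filename.eps")
ggsave("/path/to/figure/filename.jpg")
ggsave("/path/to/figure/filename.pdf")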
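A runnable variant of the calls above, sketched with a temporary file so that it does not depend on a real path. The object plot1 and the file name out_file are created here for illustration only.

```r
library(ggplot2)

plot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Write to a temporary file so the example does not depend on a real path
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = plot1, width = 6, height = 4)

file.exists(out_file)
```

Passing the plot explicitly avoids relying on whichever plot happened to be displayed last.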
This is just a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!