setwd("~/Desktop/workshops/intro-r/")
If you don’t have the data from yesterday’s workshop, load it.
metadata <- read.csv("http://www.datacarpentry.org/R-genomics/data/Ecoli_metadata.csv")
head(metadata)
> # sample generation clade strain cit run genome_size
> # 1 REL606 0 <NA> REL606 unknown 4.62
> # 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
> # 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
> # 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
> # 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
> # 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
genome_size
vector in our dataset.genome_size <- metadata$genome_size
plot(genome_size)
plot(genome_size, pch = 8)
plot(genome_size, pch = 8, main = "Scatter plot of genome sizes")
hist(genome_size)
boxplot(genome_size ~ cit, metadata)
boxplot(genome_size ~ cit, metadata, col = c("pink", "purple", "darkgrey"),
main = "average expression difference between celltypes", ylab = "Expression")
ggplot2
that adds functionality to the basic plots seen above.ggplot
is best used on data in the data.frame form# install.packages('ggplot2')
library(ggplot2)
ggplot
function is used to initialize the basic graph structure, then we add to itggplot(metadata)
geom_point
, for scatter plots, dot plots)geom_lines
, for time series, etc.)geom_boxplot
)ggplot(metadata) + geom_point()
aes()
functiongeom_point
requires mappings for x and y.ggplot(metadata) + geom_point(aes(x = sample, y = genome_size))
theme
layer.theme
system handles non-data plot elements such as:
ggplot(metadata) + geom_point(aes(x = sample, y = genome_size, color = generation,
shape = cit), size = rel(3)) + theme(axis.text.x = element_text(angle = 45,
hjust = 1))
geom_bar
which requires a statistical transformation.ggplot(metadata) + geom_bar(aes(x = genome_size))
Let’s specify a binwidth.
ggplot(metadata) + geom_bar(aes(x = genome_size), stat = "bin", binwidth = 0.05)
Adjust the binwidth for this graph and see how it changes the plot.
ggplot(metadata) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) +
ggtitle("Boxplot of genome size by citrate mutant type") + xlab("citrate mutant") +
ylab("genome size") + theme(panel.grid.major = element_line(size = 0.5,
color = "grey"), axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
pdf("figure/boxplot.pdf")
ggplot(metadata) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) +
ggtitle("Boxplot of genome size by citrate mutant type") + xlab("citrate mutant") +
ylab("genome size") + theme(panel.grid.major = element_line(size = 0.5,
color = "grey"), axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
dev.off()
ggplot2
is diff. from R’s base graphics b/c it is built on top of a grammar inspired by Leland Wilkinson’s Grammar of Graphics (Springer, 2005).
ggplot2
plots is built by adding layers to a plot that maps aesthetic properties of geomtetric objects to dataLet’s start by loading our recombination on human genetic diversity data Dataset_S1.txt. If we downloaded to our machine we can load from file:
d <- read.csv("data/Dataset_S1.txt")
We can also read from the web.
d <- read.csv("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt")
diversity
along chromosome in the diversity
column in our d
dataframe.cent
variable we added on Monday that has TRUE/FALSE values indicating whether the current window is fully within a centromeric region using comparison and logical operations# fixing the colnames to 'percent.GC' from X.GC
colnames(d)[12] <- "percent.GC"
# diversity estimate scaled
d$diversity <- d$Pi/(10 * 1000) # rescale, removing 10x and making per bp
summary(d$diversity)
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 0.0000000 0.0005577 0.0010420 0.0012390 0.0016880 0.0265300
# position -midpoint b/t start and end
d$position <- (d$end + d$start)/2
# add logical for whether window is in a centromeric region
d$cent <- d$start >= 25800000 & d$end <= 29700000
We are ready to plot:
ggplot(d) + geom_point(aes(x = position, y = diversity))
We use + xlab('label')
and + ylab('label')
for adding lables to the axes. Add a x label of “Chromosome position (basepairs)” to the x axes and “Nucleotide diversity” to the y.
ggplot(d) + geom_point(aes(x = position, y = diversity)) + xlab("chromosome position (basepairs)") +
ylab("nucleotide diversity")
scale_x_continuous(limits=c(start, end))
where start and end are the start and end of the axes (and scale_y_continuous() for the y axis).You can change an axis to a log10-scale using the functions scale_x_log10() and scale_y_log10(). ggplot2 has numerous other scale options for discrete scales, other axes transforms, and color scales; see http://docs.ggplot2.org for more detail.
This creates the same mapping as above:
ggplot(d, aes(x = position, y = diversity)) + geom_point()
ggplot2
makes it easy to find out what’s going on with EDAcent
- this indicates whether the window falls in the centromeric region of this chromosomeggplot(d) + geom_point(aes(x = position, y = diversity, color = cent))
ggplot2
well chosen defaults that allows you to quickly create/adjust plots without fussing & then go back to change if needed
ggplot(d) + geom_point(aes(x = position, y = diversity), alpha = 0.01)
* Notice we set the alpha
outside the aes()
function * This is b/c we aren’t mapping to a variable in the data frame, but giving it a fixed value for all data points * Beyond showing lack of diversity est. in the centromeric windows, nothing really shows * Part of problem still overplotting * Bigger issue is that windows span 63 megabases and difficult to detect regional patterns with data at this scale
geom_density()
calculates densityggplot(d) + geom_density(aes(x = diversity), fill = "black")
ggplot2
uses a line to draw densities, fill=black
fill in the lines so clearercolor
to a discrete-valued column in our dataframegeom_density
will create separate density plots, group data by column mapped to colorggplot(d) + geom_density(aes(x = diversity, fill = cent), alpha = 0.4)
ggplot2
’s smoothing functionswith high dimension datasets viz is the easiest way to spot issues
b/c we use same x and y for smooth and scatter we can put all in ggplot()
ggplot(d, aes(x = depth, y = total.SNPs)) + geom_point() + geom_smooth()
ggplot2
uses generalized additive models (GAM) to fit this smoothed curve for datasets with >1000 rowsmethod
argument in geom_smooth()ggplot2
also adds a confidence intervals around smoothing curve (this can be turned off)ggplot(d, aes(x = percent.GC, y = depth)) + geom_point() + geom_smooth()
Let’s use the motifs data and do a bit of cleaning on it first. Both of these datasets were created using the GenomicRanges tools we will learn about in Chapter 9, from tracks downloaded directly from the UCSC Genome Browser.
mtfs <- read.delim("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_recombrates.txt",
header = TRUE)
rpts <- read.delim("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_repeats.txt",
header = TRUE)
Our goal is to merge the column name in the rpts dataframe into the mtfs column, so we know which repeat each motif is contained in (if any). The link between these two datasets are the positions of each motif, identified by the chromosome and motif start position columns chr and motif_start.
# concatenating columns chr+motif_start into a single key string column can
# simplify merging
mtfs$pos <- paste(mtfs$chr, mtfs$motif_start, sep = "-")
rpts$pos <- paste(rpts$chr, rpts$motif_start, sep = "-")
# use match() to find where each of the mtfs$pos keys occur in the rpts$pos.
# We’ll create this indexing vector first before doing the merge
i <- match(mtfs$pos, rpts$pos)
# using this indexing vector we select out elements of rpts$name and merge
# these into mtfs:
mtfs$repeat_name <- rpts$name[i]
# same effect as above using match()’s results to i and use this directly
mtfs$repeat_name <- rpts$name[match(mtfs$pos, rpts$pos)]
Now let’s plot the relationship b/t recombination rate and distance to a motif using the mtfs
from above.
p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1)
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
print(p)
geom_smooth()
standard error estimates are turned off and smoothing span
adjusted and method set to loess
Use help in R (or Rstudio) ?geom_smooth
and try other parameters for span
& method
.
mtfs
mtfs$motif
column contains two variations of the sequence motif.unique(mtfs$motif)
> # [1] CCTCCCTGACCAC CCTCCCTAGCCAC
> # Levels: CCTCCCTAGCCAC CCTCCCTGACCAC
ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1) + geom_smooth(aes(color = motif),
method = "loess", se = FALSE, span = 1/10)
aes(color=motif)
in the geom_smoothggplot2
is via faceting or coditioning plotting (subplots)p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
p <- p + facet_wrap(~motif)
print(p)
ggplot2
has two facet methods: facet_wrap()
and facet_grid()
facet_wrap()
takes a factor column and creates a panel for each level and wraps horizontallyfacet_grid()
allows finder control of facets by allowing you to specify the columns to use for vertical and horz facets:p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/16)
p <- p + facet_grid(repeat_name ~ motif)
print(p)
* shows some of the same patterns. * Motif CCTCCCTAGCCACon the THE1B repeat background has strong effect as does CCTCCCTGACCAC on the L2 repeat background. * you can get a sense of the data tha goes in this plot with:
table(mtfs$repeat_name, mtfs$motif, useNA = "ifany")
> #
> # CCTCCCTAGCCAC CCTCCCTGACCAC
> # L2 457 4110
> # THE1B 4651 0
> # <NA> 2908 7924
~
used with facet_wrap()
and facet_grid
is how we specify model formulas in Rhelp(formula)
for morescales = "free_x"
or scales = "free_y"
or scales = "free"
for bothp <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
p <- p + facet_wrap(~motif, scales = "free_y")
print(p)
Use facets to look at the data but group by chromosome
facet_wrap(~chr)