The data we’ll use comes from the article “The Influence of Recombination on Human Genetic Diversity” (Spencer et al. 2006; see the reference below). The data is contained in Dataset_S1.txt
and is found on GitHub as part of the example code for the O’Reilly book Bioinformatics Data Skills. The following tutorial is adapted from that book and from Data Carpentry lessons.
Before we download the data we need to find out our working directory and set it to the appropriate folder.
getwd() # tells us our current working directory
> # [1] "/Users/jtdennis/Desktop/workshops/intro-r"
Now set it to where you want to do your work, which will also have a folder for your data.
setwd("~/Desktop/workshops/intro-r/") #sets working directory
# list.files() #helpful command to list the files in a directory
list.dirs() # will list directories
> # [1] "." "./data"
> # [3] "./intro-to-r_cache" "./intro-to-r_cache/html"
Let’s now download the data and place it in our data
folder. Note: we could read the file directly from the web, but we first want to inspect the file before we try to read it.
download.file("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt",
"data/Dataset_S1.txt")
We can confirm it downloaded into our folder by listing files in that folder.
list.files("data/")
> # [1] "Dataset_S1.txt"
Before we attempt to load the Dataset_S1.txt
data set, we want to inspect part of it. One option is to drop into a terminal and use head -n 10
to inspect the file:
head -n 10 data/Dataset_S1.txt
> # start,end,total SNPs,total Bases,depth,unique SNPs,dhSNPs,reference Bases,Theta,Pi,Heterozygosity,%GC,Recombination,Divergence,Constraint,SNPs
> # 55001,56000,0,1894,3.41,0,0,556,0,0,0,54.8096,0.009601574,0.003006012,0,0
> # 56001,57000,5,6683,6.68,2,2,1000,8.007,10.354,7.481,42.4424,0.009601574,0.01801802,0,0
> # 57001,58000,1,9063,9.06,1,0,1000,3.51,1.986,1.103,37.2372,0.009601574,0.007007007,0,0
> # 58001,59000,7,10256,10.26,3,2,1000,9.929,9.556,6.582,38.038,0.009601574,0.01201201,0,0
> # 59001,60000,4,8057,8.06,4,0,1000,12.915,8.506,4.965,41.3413,0.009601574,0.02402402,0,0
> # 60001,61000,6,7051,7.05,2,1,1000,7.817,9.121,8.864,36.1361,0.009601574,0.01601602,0,0
> # 61001,62000,7,6950,6.95,2,1,1000,8.603,8.062,10.431,36.5365,0.009601574,0.01201201,0,0
> # 62001,63000,1,8834,8.83,1,0,1000,4.01,2.065,1.132,35.5355,0.009601574,0.01501502,0,0
> # 63001,64000,1,9629,9.63,1,0,1000,3.369,1.879,1.039,43.5435,0.009601574,0.009009009,58,1
This prints the first 10 lines of the file. Alternately, we can open the file in a text editor (not Word or another rich-text word processor, but something like Notepad, Sublime Text, or TextMate). From the above output we can see the file is a comma-separated values (CSV) file with a header row. We might want to use tail to see whether the end of the file is as tidy as the head.
tail -n 10 data/Dataset_S1.txt
> # 63682001,63683000,4,7501,7.5,2,1,1000,8.849,8.478,5.332,48.8488,0.000650676,0.01453488,0,0
> # 63683001,63684000,3,7982,7.98,3,0,1000,12.698,6.833,3.759,43.7437,0.000650676,0.01453488,0,0
> # 63684001,63685000,0,3525,3.53,0,0,1000,0,0,0,47.5475,0.000650676,0.01453488,0,0
> # 63685001,63686000,0,2203,3.17,0,0,696,0,0,0,71.5716,0.000650676,0.01453488,0,0
> # 63686001,63687000,3,6994,6.99,3,0,1000,10.714,7.376,4.29,49.6496,0.000650676,0.01453488,0,0
> # 63687001,63688000,4,6979,7,2,1,997,7.101,7.396,5.731,43.2432,0.000650676,0.01453488,0,0
> # 63688001,63689000,2,4698,4.7,2,0,1000,7.266,6.355,4.257,40.1401,0.000650676,0.01453488,0,0
> # 63689001,63690000,7,10363,10.36,6,1,1000,21.745,12.105,6.755,43.1431,0.000650676,0.01453488,0,0
> # 63690001,63691000,5,9885,9.89,1,1,1000,3.685,5.901,5.058,34.2342,0.000650676,0.01453488,0,0
> # 63691001,63692000,1,3309,6.29,1,0,526,3.022,3.022,3.022,35.6732,0.000650676,0.01453488,0,0
Looks good and tidy. We can now read the data into R.
Note: If your data is untidy (not ready for analysis for various reasons), you will need to reshape or clean your data first. Read Hadley Wickham’s article Tidy Data for more information on how to do this in R.
Note: If your data is too big to read comfortably, try the readr
or data.table
packages, both of which improve read speed. Finally, if the data is still too big, you’ll want to keep the bulk of the data out of memory and use a database like SQLite
(via the RSQLite
package).
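Before reaching for those packages, one base-R strategy worth knowing is declaring column classes up front with colClasses, so read.csv() doesn’t have to infer them, and capping rows with nrows while prototyping. A small self-contained sketch (it writes a tiny hypothetical CSV to a temp file just to be runnable; with real data you’d point read.csv() at your file):

```r
# Hypothetical toy CSV written to a temp file just to make this runnable.
tmp <- tempfile(fileext = ".csv")
writeLines(c("start,end,depth",
             "55001,56000,3.41",
             "56001,57000,6.68"), tmp)

# colClasses skips type inference; nrows limits how much is read.
d_small <- read.csv(tmp,
                    colClasses = c("integer", "integer", "numeric"),
                    nrows = 2)
str(d_small)
```

Declared classes also catch malformed values early: read.csv() errors out instead of silently coercing a whole column to character.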
d <- read.csv("data/Dataset_S1.txt")
In R, we can do this using the built-in head()
function. If you are familiar with UNIX,
the head()
function emulates the behavior that head
provides on the command line, but on R objects. There is also a tail()
function in R for inspecting the last rows. Let’s use head()
to look at the data frame:
head(d)
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 1 55001 56000 0 1894 3.41 0 0
> # 2 56001 57000 5 6683 6.68 2 2
> # 3 57001 58000 1 9063 9.06 1 0
> # 4 58001 59000 7 10256 10.26 3 2
> # reference.Bases Theta Pi Heterozygosity X.GC Recombination
> # 1 556 0.000 0.000 0.000 54.8096 0.009601574
> # 2 1000 8.007 10.354 7.481 42.4424 0.009601574
> # 3 1000 3.510 1.986 1.103 37.2372 0.009601574
> # 4 1000 9.929 9.556 6.582 38.0380 0.009601574
> # Divergence Constraint SNPs
> # 1 0.003006012 0 0
> # 2 0.018018020 0 0
> # 3 0.007007007 0 0
> # 4 0.012012010 0 0
> # [ reached getOption("max.print") -- omitted 2 rows ]
Run ?read.csv
to see its arguments. A few notes on data frames: the data.frame
is the de facto data structure for most tabular data and is what we use for statistics and plotting. A data.frame
is a collection of vectors of identical length. A data.frame
created by read.csv()
or read.table()
converts (i.e., coerces) columns that contain characters (text) into the factor data type; we’ll return to this later. We can check the structure of d with str():
str(d)
> # 'data.frame': 59140 obs. of 16 variables:
> # $ start : int 55001 56001 57001 58001 59001 60001 61001 62001 63001 64001 ...
> # $ end : int 56000 57000 58000 59000 60000 61000 62000 63000 64000 65000 ...
> # $ total.SNPs : int 0 5 1 7 4 6 7 1 1 3 ...
> # $ total.Bases : int 1894 6683 9063 10256 8057 7051 6950 8834 9629 7999 ...
> # $ depth : num 3.41 6.68 9.06 10.26 8.06 ...
> # $ unique.SNPs : int 0 2 1 3 4 2 2 1 1 1 ...
> # $ dhSNPs : int 0 2 0 2 0 1 1 0 0 1 ...
> # $ reference.Bases: int 556 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
> # $ Theta : num 0 8.01 3.51 9.93 12.91 ...
> # $ Pi : num 0 10.35 1.99 9.56 8.51 ...
> # $ Heterozygosity : num 0 7.48 1.1 6.58 4.96 ...
> # $ X.GC : num 54.8 42.4 37.2 38 41.3 ...
> # $ Recombination : num 0.0096 0.0096 0.0096 0.0096 0.0096 ...
> # $ Divergence : num 0.00301 0.01802 0.00701 0.01201 0.02402 ...
> # $ Constraint : int 0 0 0 0 0 0 0 0 58 0 ...
> # $ SNPs : int 0 0 0 0 0 0 0 0 1 1 ...
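That factor coercion can be seen, and avoided, with the stringsAsFactors argument. (In R versions before 4.0 the default was TRUE; it flipped to FALSE in R 4.0.) A toy sketch on a hypothetical CSV:

```r
# Toy CSV with a text column (hypothetical data, written to a temp file).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,a", "2,b"), tmp)

# Keep text as character (the default in R >= 4.0):
d_chr <- read.csv(tmp, stringsAsFactors = FALSE)
class(d_chr$label)   # "character"

# Explicitly request the old factor behavior:
d_fct <- read.csv(tmp, stringsAsFactors = TRUE)
class(d_fct$label)   # "factor"
```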
head()
and str()
are helpful for checking the contents of a data frame, but there are more functions that help tell you what’s in one:
dim()
gives the dimensions, nrow()
and ncol()
the number of rows and columns; head()
and tail()
show the first and last rows; names()
returns column names and rownames()
returns row names; str()
shows the structure of the object; and summary()
gives summary statistics for each column. Based on these functions for assessing data structure, can you answer questions about the data, for example, what is the class of the column depth
? We have Dataset_S1.txt as the data frame d
with columns (variables) and rows. Let’s step through finding out more about the dimensions, rows, columns, and row and column names of a dataframe.
head(d, n = 3)
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 1 55001 56000 0 1894 3.41 0 0
> # 2 56001 57000 5 6683 6.68 2 2
> # 3 57001 58000 1 9063 9.06 1 0
> # reference.Bases Theta Pi Heterozygosity X.GC Recombination
> # 1 556 0.000 0.000 0.000 54.8096 0.009601574
> # 2 1000 8.007 10.354 7.481 42.4424 0.009601574
> # 3 1000 3.510 1.986 1.103 37.2372 0.009601574
> # Divergence Constraint SNPs
> # 1 0.003006012 0 0
> # 2 0.018018020 0 0
> # 3 0.007007007 0 0
nrow(d)
> # [1] 59140
ncol(d)
> # [1] 16
dim(d)
> # [1] 59140 16
Column and row names can be accessed with colnames()
and row.names()
.
colnames(d)
> # [1] "start" "end" "total.SNPs"
> # [4] "total.Bases" "depth" "unique.SNPs"
> # [7] "dhSNPs" "reference.Bases" "Theta"
> # [10] "Pi" "Heterozygosity" "X.GC"
> # [13] "Recombination" "Divergence" "Constraint"
> # [16] "SNPs"
Let’s rename the column X.GC
to something better. First, check the current name:
colnames(d)[12]
> # [1] "X.GC"
colnames(d)[12] <- "percent.GC"
colnames(d)[12]
> # [1] "percent.GC"
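Hard-coding column position 12 is fragile if the column order ever changes. A sketch of renaming by matching the old name instead, shown on a small hypothetical data frame standing in for d:

```r
# Hypothetical two-column data frame standing in for d.
df <- data.frame(X.GC = c(54.8, 42.4), depth = c(3.41, 6.68))

# Rename by matching the old name rather than by position:
colnames(df)[colnames(df) == "X.GC"] <- "percent.GC"
colnames(df)   # "percent.GC" "depth"
```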
We could also set row names using the row.names() replacement function (row names must be unique).
We can access a single column using the $
operator.
d$depth
> # [1] 3.41 6.68 9.06 10.26 8.06 7.05 6.95 8.83 9.63 8.00 8.38
> # [12] 8.43 10.84 5.38 8.01 8.60 8.78 5.28 6.65 7.85 8.63 9.02
> # [23] 8.59 8.85 7.00 6.23 8.38 10.31 10.15 11.48 15.12 7.98 8.81
> # [34] 9.40 12.09 10.11 6.63 8.37 8.79 9.53 7.63 9.67 7.83 9.74
> # [45] 8.49 8.39 11.14 8.89 7.02 8.65 6.32 4.20 1.64 3.38 1.83
> # [56] 4.34 5.52 10.54 7.00 10.50 9.08 8.91 7.37 8.65 10.09 10.41
> # [67] 10.36 13.26 14.32 8.76 6.83 9.43 9.19 8.39 7.73
> # [ reached getOption("max.print") -- omitted 59065 entries ]
d$depth returns the depth
column as a vector, which we can pass to functions like mean()
or summary():
mean(d$depth)
> # [1] 8.183938
summary(d$depth)
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 1.000 6.970 8.170 8.184 9.400 21.910
The $
operator is a shortcut for the more general bracket operator, df[row, col]
, which is used to access rows, columns, and cells. To return all rows for a set of columns, we omit the row index:
d[, 1:2]
> # start end
> # 1 55001 56000
> # 2 56001 57000
> # 3 57001 58000
> # 4 58001 59000
> # 5 59001 60000
> # 6 60001 61000
> # 7 61001 62000
> # 8 62001 63000
> # 9 63001 64000
> # 10 64001 65000
> # 11 65001 66000
> # 12 66001 67000
> # 13 67001 68000
> # 14 68001 69000
> # 15 69001 70000
> # 16 70001 71000
> # 17 71001 72000
> # 18 72001 73000
> # 19 73001 74000
> # 20 74001 75000
> # 21 75001 76000
> # 22 76001 77000
> # 23 77001 78000
> # 24 78001 79000
> # 25 79001 80000
> # 26 80001 81000
> # 27 81001 82000
> # 28 82001 83000
> # 29 83001 84000
> # 30 84001 85000
> # 31 85001 86000
> # 32 86001 87000
> # 33 87001 88000
> # 34 88001 89000
> # 35 89001 90000
> # 36 90001 91000
> # 37 91001 92000
> # [ reached getOption("max.print") -- omitted 59103 rows ]
d[, c("start", "end")]
> # start end
> # 1 55001 56000
> # 2 56001 57000
> # 3 57001 58000
> # 4 58001 59000
> # 5 59001 60000
> # 6 60001 61000
> # 7 61001 62000
> # 8 62001 63000
> # 9 63001 64000
> # 10 64001 65000
> # 11 65001 66000
> # 12 66001 67000
> # 13 67001 68000
> # 14 68001 69000
> # 15 69001 70000
> # 16 70001 71000
> # 17 71001 72000
> # 18 72001 73000
> # 19 73001 74000
> # 20 74001 75000
> # 21 75001 76000
> # 22 76001 77000
> # 23 77001 78000
> # 24 78001 79000
> # 25 79001 80000
> # 26 80001 81000
> # 27 81001 82000
> # 28 82001 83000
> # 29 83001 84000
> # 30 84001 85000
> # 31 85001 86000
> # 32 86001 87000
> # 33 87001 88000
> # 34 88001 89000
> # 35 89001 90000
> # 36 90001 91000
> # 37 91001 92000
> # [ reached getOption("max.print") -- omitted 59103 rows ]
To get the first row’s start
and end
positions, we’d use:
d[1, c("start", "end")]
> # start end
> # 1 55001 56000
Selecting a single column normally drops the result down to a vector; to keep it as a data frame, set the drop
argument to FALSE:
d[, "start", drop = FALSE]
> # start
> # 1 55001
> # 2 56001
> # 3 57001
> # 4 58001
> # 5 59001
> # 6 60001
> # 7 61001
> # 8 62001
> # 9 63001
> # 10 64001
> # 11 65001
> # 12 66001
> # 13 67001
> # 14 68001
> # 15 69001
> # 16 70001
> # 17 71001
> # 18 72001
> # 19 73001
> # 20 74001
> # 21 75001
> # 22 76001
> # 23 77001
> # 24 78001
> # 25 79001
> # 26 80001
> # 27 81001
> # 28 82001
> # 29 83001
> # 30 84001
> # 31 85001
> # 32 86001
> # 33 87001
> # 34 88001
> # 35 89001
> # 36 90001
> # 37 91001
> # 38 92001
> # 39 93001
> # 40 94001
> # 41 95001
> # 42 96001
> # 43 97001
> # 44 98001
> # 45 99001
> # 46 100001
> # 47 101001
> # 48 102001
> # 49 103001
> # 50 104001
> # 51 105001
> # 52 106001
> # 53 107001
> # 54 108001
> # 55 109001
> # 56 110001
> # 57 111001
> # 58 112001
> # 59 113001
> # 60 114001
> # 61 115001
> # 62 116001
> # 63 117001
> # 64 118001
> # 65 119001
> # 66 120001
> # 67 121001
> # 68 122001
> # 69 123001
> # 70 124001
> # 71 125001
> # 72 126001
> # 73 127001
> # 74 128001
> # 75 129001
> # [ reached getOption("max.print") -- omitted 59065 rows ]
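The effect of drop is easier to see on a tiny hypothetical data frame:

```r
df <- data.frame(a = 1:3, b = 4:6)

class(df[, "a"])                 # "integer": single column collapses to a vector
class(df[, "a", drop = FALSE])   # "data.frame": structure preserved
```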
d$cent <- d$start >= 25800000 & d$end <= 29700000
The &
operator works element-wise on the two logical vectors created by d$start >= 25800000 and d$end <= 29700000, returning TRUE where both are true. We can tabulate the result:
table(d$cent)
> #
> # FALSE TRUE
> # 58455 685
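To see the element-wise behavior of & on a toy pair of logical vectors:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)
a & b          # TRUE FALSE FALSE: TRUE only where both inputs are TRUE
table(a & b)
```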
Because TRUE is treated as 1 and FALSE as 0, we can use sum()
to count up the TRUEs:
sum(d$cent)
> # [1] 685
Lastly, note that according to the supplementary material of this paper, the diversity estimates (columns Theta, Pi, and Heterozygosity) are all scaled up by 10x in the dataset (see supplementary Text S1 for more details). We’ll use the nucleotide diversity column Pi later in this chapter in plots, and it would be useful to have this scaled as per basepair nucleotide diversity (so as to make the scale more intuitive). We can create a new rescaled column called diversity with:
d$diversity <- d$Pi/(10 * 1000) # rescale, removing 10x and making per bp
summary(d$diversity)
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 0.0000000 0.0005577 0.0010420 0.0012390 0.0016880 0.0265300
Average nucleotide diversity per basepair in this data is around 0.001 (0.12%), roughly what we’d expect from other estimates of human diversity (Hernandez et al., 2012, Perry et al., 2012).
Let’s also summarize the total.SNPs column:
summary(d$total.SNPs)
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 0.000 3.000 7.000 8.906 12.000 93.000
Suppose we want to find windows with 85 or more SNPs. First, create a logical vector:
d$total.SNPs >= 85
> # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [ reached getOption("max.print") -- omitted 59065 entries ]
We can use this logical vector to extract the matching rows:
d[d$total.SNPs >= 85, ]
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 2567 2621001 2622000 93 11337 11.34 13 10
> # 12968 13023001 13024000 88 11784 11.78 11 1
> # 43165 47356001 47357000 87 12505 12.50 9 7
> # reference.Bases Theta Pi Heterozygosity percent.GC
> # 2567 1000 43.420 50.926 81.589 43.9439
> # 12968 1000 33.413 19.030 74.838 28.8288
> # 43165 1000 29.621 27.108 69.573 46.7467
> # Recombination Divergence Constraint SNPs cent diversity
> # 2567 0.000706536 0.01701702 0 1 FALSE 0.0050926
> # 12968 0.000082600 0.01401401 0 1 FALSE 0.0019030
> # 43165 0.000500577 0.02002002 0 7 FALSE 0.0027108
Logical conditions can be combined; here are windows with Pi greater than 16 and GC content above 80%:
d[d$Pi > 16 & d$percent.GC > 80, ]
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 58550 63097001 63098000 5 947 2.39 2 1
> # 58641 63188001 63189000 2 1623 3.21 2 0
> # 58642 63189001 63190000 5 1395 1.89 3 2
> # reference.Bases Theta Pi Heterozygosity percent.GC
> # 58550 397 37.544 41.172 52.784 82.0821
> # 58641 506 16.436 16.436 12.327 82.3824
> # 58642 738 35.052 41.099 35.842 80.5806
> # Recombination Divergence Constraint SNPs cent diversity
> # 58550 0.000781326 0.03826531 226 1 FALSE 0.0041172
> # 58641 0.000347382 0.01678657 148 0 FALSE 0.0016436
> # 58642 0.000347382 0.01793722 0 0 FALSE 0.0041099
We can also choose which columns to return, and in what order:
d[d$Pi > 16 & d$percent.GC > 80, c("start", "end", "depth", "Pi")]
> # start end depth Pi
> # 58550 63097001 63098000 2.39 41.172
> # 58641 63188001 63189000 3.21 16.436
> # 58642 63189001 63190000 1.89 41.099
d[d$Pi > 16 & d$percent.GC > 80, c("start", "end", "Pi", "depth")]
> # start end Pi depth
> # 58550 63097001 63098000 41.172 2.39
> # 58641 63188001 63189000 16.436 3.21
> # 58642 63189001 63190000 41.099 1.89
The same bracket subsetting works on single columns extracted with $; compare sequencing depth in windows with at least 80% GC to the rest:
summary(d$depth[d$percent.GC >= 80])
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 1.05 1.89 2.14 2.24 2.78 3.37
summary(d$depth[d$percent.GC < 80])
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 1.000 6.970 8.170 8.185 9.400 21.910
Fairly large difference, but consider how many windows this includes: there are only nine windows with GC content over 80%.
sum(d$percent.GC >= 80)
> # [1] 9
Since d$cent is a logical vector, we can subset with it directly, and use the
! operator to take its complement:
summary(d$Pi[d$cent])
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 0.00 7.95 16.08 20.41 27.36 194.40
summary(d$Pi[!d$cent])
> # Min. 1st Qu. Median Mean 3rd Qu. Max.
> # 0.000 5.557 10.370 12.290 16.790 265.300
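A minimal sketch of subsetting with a logical vector and its complement, on a toy numeric vector:

```r
x <- c(5, 10, 15, 20)
keep <- c(TRUE, FALSE, TRUE, FALSE)
x[keep]    # elements where keep is TRUE: 5 15
x[!keep]   # the complement: 10 20
```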
Comparison operators applied to a column return a logical vector:
d$Pi > 3
> # [1] FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
> # [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
> # [23] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> # [34] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
> # [45] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> # [56] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
> # [67] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
> # [ reached getOption("max.print") -- omitted 59065 entries ]
which() returns the indices where a logical vector is TRUE:
which(d$Pi > 3)
> # [1] 2 4 5 6 7 10 12 13 14 15 16 17 18 20 21 22 23 24 25 28 29 30 31
> # [24] 32 33 34 35 37 38 39 41 42 43 44 45 47 48 49 50 51 52 53 54 55 56 57
> # [47] 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84
> # [70] 85 86 87 88 89 90
> # [ reached getOption("max.print") -- omitted 51319 entries ]
which() also has two related functions that return the index of the first minimum or maximum element of a vector: which.min() and which.max(). For example:
d[which.min(d$total.Bases), ]
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 25689 25785001 25786000 0 110 1 0 0
> # reference.Bases Theta Pi Heterozygosity percent.GC Recombination
> # 25689 110 0 0 0 38.8388 4.63e-05
> # Divergence Constraint SNPs cent diversity
> # 25689 0.04946043 0 0 FALSE 0
d[which.max(d$depth), ]
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 8718 8773001 8774000 58 21914 21.91 7 4
> # reference.Bases Theta Pi Heterozygosity percent.GC Recombination
> # 8718 1000 17.676 14.199 26.581 39.3393 0.001990459
> # Divergence Constraint SNPs cent diversity
> # 8718 0.01601602 0 1 FALSE 0.0014199
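The difference between which() and which.min()/which.max() is easy to see on a toy vector; note that all of them return indices, not values, and which.max() returns only the first maximum:

```r
v <- c(2, 9, 1, 9)
which(v > 5)    # indices of all matches: 2 4
which.min(v)    # index of the minimum: 3
which.max(v)    # index of the FIRST maximum: 2
```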
subset() takes two arguments: the dataframe to operate on, and then conditions to include a row. With subset(), d[d$Pi > 16 & d$percent.GC > 80, ] can be expressed as:
subset(d, Pi > 16 & percent.GC > 80)
> # start end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 58550 63097001 63098000 5 947 2.39 2 1
> # 58641 63188001 63189000 2 1623 3.21 2 0
> # 58642 63189001 63190000 5 1395 1.89 3 2
> # reference.Bases Theta Pi Heterozygosity percent.GC
> # 58550 397 37.544 41.172 52.784 82.0821
> # 58641 506 16.436 16.436 12.327 82.3824
> # 58642 738 35.052 41.099 35.842 80.5806
> # Recombination Divergence Constraint SNPs cent diversity
> # 58550 0.000781326 0.03826531 226 1 FALSE 0.0041172
> # 58641 0.000347382 0.01678657 148 0 FALSE 0.0016436
> # 58642 0.000347382 0.01793722 0 0 FALSE 0.0041099
Optionally, a third argument can be supplied to specify which columns (and in what order) to include:
subset(d, Pi > 16 & percent.GC > 80, c(start, end, Pi, percent.GC, depth))
> # start end Pi percent.GC depth
> # 58550 63097001 63098000 41.172 82.0821 2.39
> # 58641 63188001 63189000 16.436 82.3824 3.21
> # 58642 63189001 63190000 41.099 80.5806 1.89
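A toy sketch of subset() with all three arguments, on a small hypothetical data frame:

```r
df <- data.frame(x = 1:5, y = 6:10)

# Rows where x > 3, returning columns y then x (select reorders too):
subset(df, x > 3, select = c(y, x))
```

Note that inside subset() the conditions refer to bare column names (x, y), not df$x, because subset() evaluates them within the data frame.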
# install.packages('ggplot2')
library(ggplot2)
We’ll use ggplot2 to create a scatterplot of nucleotide diversity along the chromosome, using the diversity column in our d dataframe. Because our data is window-based, we’ll first add a column called position to our dataframe that is the midpoint of each window:
d$position <- (d$end + d$start)/2
ggplot(d) + geom_point(aes(x = position, y = diversity))
Spencer, Chris C. A., Panos Deloukas, Sarah Hunt, et al. 2006. “The Influence of Recombination on Human Genetic Diversity.” PLoS Genetics 2 (9): 1–11. Public Library of Science. doi:10.1371/journal.pgen.0020148.