Working with Data

The data we’ll use comes from the article “The Influence of Recombination on Human Genetic Diversity” (Spencer et al. 2006; see References below). The data is contained in Dataset_S1.txt, available on GitHub as part of the example code for the O’Reilly book Bioinformatics Data Skills. The following tutorial is adapted from that book and from Data Carpentry lessons.

Loading Data into R

Before we download the data, we need to check our current working directory and set it to the appropriate folder.

getwd()  # tells us our current working directory
> # [1] "/Users/jtdennis/Desktop/workshops/intro-r"

Now set it to where you want to do your work, which will also have a folder for your data.

setwd("~/Desktop/workshops/intro-r/")  #sets working directory
# list.files() #helpful command to list the files in a directory
list.dirs()  # will list directories 
> # [1] "."                       "./data"                 
> # [3] "./intro-to-r_cache"      "./intro-to-r_cache/html"

Let’s now download the data and place it in our data folder. Note: we could read the file directly from the web, but we first want to inspect it before we try to read it.

download.file("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt", 
    "data/Dataset_S1.txt")

We can confirm it downloaded into our folder by listing files in that folder.

list.files("data/")
> # [1] "Dataset_S1.txt"

Before we attempt to load Dataset_S1.txt, we want to inspect a part of it. One option is to drop into a terminal and use head -n 10 to print the file’s first lines:

head -n 10 data/Dataset_S1.txt
> # start,end,total SNPs,total Bases,depth,unique SNPs,dhSNPs,reference Bases,Theta,Pi,Heterozygosity,%GC,Recombination,Divergence,Constraint,SNPs
> # 55001,56000,0,1894,3.41,0,0,556,0,0,0,54.8096,0.009601574,0.003006012,0,0
> # 56001,57000,5,6683,6.68,2,2,1000,8.007,10.354,7.481,42.4424,0.009601574,0.01801802,0,0
> # 57001,58000,1,9063,9.06,1,0,1000,3.51,1.986,1.103,37.2372,0.009601574,0.007007007,0,0
> # 58001,59000,7,10256,10.26,3,2,1000,9.929,9.556,6.582,38.038,0.009601574,0.01201201,0,0
> # 59001,60000,4,8057,8.06,4,0,1000,12.915,8.506,4.965,41.3413,0.009601574,0.02402402,0,0
> # 60001,61000,6,7051,7.05,2,1,1000,7.817,9.121,8.864,36.1361,0.009601574,0.01601602,0,0
> # 61001,62000,7,6950,6.95,2,1,1000,8.603,8.062,10.431,36.5365,0.009601574,0.01201201,0,0
> # 62001,63000,1,8834,8.83,1,0,1000,4.01,2.065,1.132,35.5355,0.009601574,0.01501502,0,0
> # 63001,64000,1,9629,9.63,1,0,1000,3.369,1.879,1.039,43.5435,0.009601574,0.009009009,58,1

This prints out the first 10 lines of the file. Alternately, we can open the file in a text editor (not Word or another rich-markup word processor, but something like Notepad, Sublime Text, or TextMate). From the output above we can see the file is a comma-separated values (CSV) file with a header row. We might also want to use tail to check whether the end of the file is as tidy as the head.

tail -n 10 data/Dataset_S1.txt
> # 63682001,63683000,4,7501,7.5,2,1,1000,8.849,8.478,5.332,48.8488,0.000650676,0.01453488,0,0
> # 63683001,63684000,3,7982,7.98,3,0,1000,12.698,6.833,3.759,43.7437,0.000650676,0.01453488,0,0
> # 63684001,63685000,0,3525,3.53,0,0,1000,0,0,0,47.5475,0.000650676,0.01453488,0,0
> # 63685001,63686000,0,2203,3.17,0,0,696,0,0,0,71.5716,0.000650676,0.01453488,0,0
> # 63686001,63687000,3,6994,6.99,3,0,1000,10.714,7.376,4.29,49.6496,0.000650676,0.01453488,0,0
> # 63687001,63688000,4,6979,7,2,1,997,7.101,7.396,5.731,43.2432,0.000650676,0.01453488,0,0
> # 63688001,63689000,2,4698,4.7,2,0,1000,7.266,6.355,4.257,40.1401,0.000650676,0.01453488,0,0
> # 63689001,63690000,7,10363,10.36,6,1,1000,21.745,12.105,6.755,43.1431,0.000650676,0.01453488,0,0
> # 63690001,63691000,5,9885,9.89,1,1,1000,3.685,5.901,5.058,34.2342,0.000650676,0.01453488,0,0
> # 63691001,63692000,1,3309,6.29,1,0,526,3.022,3.022,3.022,35.6732,0.000650676,0.01453488,0,0

Looks good and tidy. We can now read the data into R.

Note: If your data is untidy (not ready for analysis for various reasons), you will need to reshape or clean your data first. Read Hadley Wickham’s article “Tidy Data” (Wickham 2014) for more information on how to do this in R.

Note: If your data is too big to read into memory, try a few of the strategies below.

  1. Chunk up your data (e.g., work with one chromosome at a time).
  2. Try working with a random subset instead of the whole dataset.
  3. There are a few tricks in R to make the read.csv() or read.delim() functions load large data files faster:
    • Set each column’s data type explicitly with the colClasses argument.
    • R can skip columns entirely if you set their values in colClasses to “NULL” (in quotes).
    • Setting nrows in read.delim() to the number of rows in your file can also help.

If these don’t work, try the readr or data.table packages, both of which greatly improve read speed. Finally, if your data is still too big, you’ll want to keep the bulk of it out of memory by using a database such as SQLite (via the RSQLite package).
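The colClasses tricks above can be sketched as follows. To keep the sketch self-contained it reads a small in-memory string via textConnection(); for the real file you would pass "data/Dataset_S1.txt" instead, and the column types here are illustrative:

```r
# Toy CSV standing in for a large file (illustrative values only).
csv <- "start,end,depth\n55001,56000,3.41\n56001,57000,6.68"

# Declaring each column's type with colClasses skips R's type-guessing pass.
d_typed <- read.csv(textConnection(csv),
                    colClasses = c("integer", "integer", "numeric"),
                    nrows = 2)  # nrows = number of data rows, if known

# Setting a column's class to "NULL" (in quotes) skips that column entirely.
d_skip <- read.csv(textConnection(csv),
                   colClasses = c("integer", "integer", "NULL"))
```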

d <- read.csv("data/Dataset_S1.txt")

In R, we can inspect the first rows of this data frame using the built-in head() function. If you are familiar with Unix, head() emulates the functionality that head provides on the command line, but operates on R objects. There is also a tail() in R for inspecting the last rows. Let’s use head() to look at the data:

head(d)
> #   start   end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 1 55001 56000          0        1894  3.41           0      0
> # 2 56001 57000          5        6683  6.68           2      2
> # 3 57001 58000          1        9063  9.06           1      0
> # 4 58001 59000          7       10256 10.26           3      2
> #   reference.Bases  Theta     Pi Heterozygosity    X.GC Recombination
> # 1             556  0.000  0.000          0.000 54.8096   0.009601574
> # 2            1000  8.007 10.354          7.481 42.4424   0.009601574
> # 3            1000  3.510  1.986          1.103 37.2372   0.009601574
> # 4            1000  9.929  9.556          6.582 38.0380   0.009601574
> #    Divergence Constraint SNPs
> # 1 0.003006012          0    0
> # 2 0.018018020          0    0
> # 3 0.007007007          0    0
> # 4 0.012012010          0    0
> #  [ reached getOption("max.print") -- omitted 2 rows ]
  • Note that the header argument defaults to TRUE.
  • Data is often recorded by humans in wide format; for genomics work you will often need to reshape wide data into long format.
  • Run ?read.csv to see all of the function’s arguments.
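As a minimal sketch of wide-to-long reshaping, base R’s stack() collapses a set of same-type columns into a value/identifier pair (the genotype table here is hypothetical, not from Dataset_S1):

```r
# Wide format: one row per sample, one column per site (hypothetical data).
wide <- data.frame(site1 = c(0, 1), site2 = c(2, 0))

# Long format: one row per (site, value) pair. stack() names the columns
# "values" (the measurements) and "ind" (a factor naming the source column).
long <- stack(wide)
long
```

For more involved reshaping, see base R’s reshape() or the tidyr package.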

Data Frames

  • data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.
  • A data.frame is a collection of vectors of identical lengths.
    • Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors).
    • The str() function is useful to inspect the data types of the columns.
  • A data.frame is created by read.csv() or read.table().
  • Historically, data.frame converted (= coerced) columns that contain characters (i.e., text) into the factor data type; since R 4.0.0 this is no longer the default. We’ll return to this later.
str(d)
> # 'data.frame':   59140 obs. of  16 variables:
> #  $ start          : int  55001 56001 57001 58001 59001 60001 61001 62001 63001 64001 ...
> #  $ end            : int  56000 57000 58000 59000 60000 61000 62000 63000 64000 65000 ...
> #  $ total.SNPs     : int  0 5 1 7 4 6 7 1 1 3 ...
> #  $ total.Bases    : int  1894 6683 9063 10256 8057 7051 6950 8834 9629 7999 ...
> #  $ depth          : num  3.41 6.68 9.06 10.26 8.06 ...
> #  $ unique.SNPs    : int  0 2 1 3 4 2 2 1 1 1 ...
> #  $ dhSNPs         : int  0 2 0 2 0 1 1 0 0 1 ...
> #  $ reference.Bases: int  556 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
> #  $ Theta          : num  0 8.01 3.51 9.93 12.91 ...
> #  $ Pi             : num  0 10.35 1.99 9.56 8.51 ...
> #  $ Heterozygosity : num  0 7.48 1.1 6.58 4.96 ...
> #  $ X.GC           : num  54.8 42.4 37.2 38 41.3 ...
> #  $ Recombination  : num  0.0096 0.0096 0.0096 0.0096 0.0096 ...
> #  $ Divergence     : num  0.00301 0.01802 0.00701 0.01201 0.02402 ...
> #  $ Constraint     : int  0 0 0 0 0 0 0 0 58 0 ...
> #  $ SNPs           : int  0 0 0 0 0 0 0 0 1 1 ...
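The character-to-factor coercion mentioned above can be seen on a toy example (since R 4.0.0 the stringsAsFactors default is FALSE, so here we set it explicitly):

```r
# Coercion on: the character column becomes a factor
# (the pre-R 4.0.0 default behavior).
df1 <- data.frame(base = c("A", "C", "G"), stringsAsFactors = TRUE)
class(df1$base)  # "factor"

# Coercion off: the column stays character (the modern default).
df2 <- data.frame(base = c("A", "C", "G"), stringsAsFactors = FALSE)
class(df2$base)  # "character"
```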

Inspecting Data Frames

head() and str() are helpful for checking the contents of a data frame, but there are more functions that help reveal what’s inside one.

  • Size: dim(), nrow(), ncol()
  • Content: head(), tail()
  • Names: names() returns column names; rownames() returns row names
  • Summary: str() shows the structure of the object; summary() gives summary statistics for each column
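The functions above can be sketched on a toy data frame (hypothetical values, not Dataset_S1):

```r
df <- data.frame(id = 1:3, depth = c(3.4, 6.7, 9.1))

dim(df)       # rows and columns: 3 2
nrow(df)      # number of rows: 3
ncol(df)      # number of columns: 2
names(df)     # column names: "id" "depth"
rownames(df)  # row names (default): "1" "2" "3"
str(df)       # structure: the type of each column
summary(df)   # summary statistics for each column
```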

Challenge

Based on the given table of functions to assess data structure, can you answer the following questions?

  1. What is the class of the object d?
  2. How many rows and how many columns are in this object?
  3. What is the mean of the depth column?

Exploring and Transforming Dataframes

We have Dataset_S1.txt loaded as the dataframe d, with columns (variables) and rows (observations). Let’s step through finding out more about the dimensions, rows, columns, and row and column names of a dataframe.

  • Each column of a dataframe is a vector and has a single type.
  • But a dataframe can contain many columns of different types: it is a heterogeneous collection of vectors.
  • Let’s look at the dataframe:
head(d, n = 3)
> #   start   end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 1 55001 56000          0        1894  3.41           0      0
> # 2 56001 57000          5        6683  6.68           2      2
> # 3 57001 58000          1        9063  9.06           1      0
> #   reference.Bases Theta     Pi Heterozygosity    X.GC Recombination
> # 1             556 0.000  0.000          0.000 54.8096   0.009601574
> # 2            1000 8.007 10.354          7.481 42.4424   0.009601574
> # 3            1000 3.510  1.986          1.103 37.2372   0.009601574
> #    Divergence Constraint SNPs
> # 1 0.003006012          0    0
> # 2 0.018018020          0    0
> # 3 0.007007007          0    0
  • Find the dimensions of the dataset:
nrow(d)
> # [1] 59140
ncol(d)
> # [1] 16
dim(d)
> # [1] 59140    16
  • Print the column and row names of this dataframe using colnames() and row.names():
colnames(d)
> #  [1] "start"           "end"             "total.SNPs"     
> #  [4] "total.Bases"     "depth"           "unique.SNPs"    
> #  [7] "dhSNPs"          "reference.Bases" "Theta"          
> # [10] "Pi"              "Heterozygosity"  "X.GC"           
> # [13] "Recombination"   "Divergence"      "Constraint"     
> # [16] "SNPs"
  • R’s read.csv() renamed some columns for us: spaces became periods, and %GC became X.GC (column names can’t begin with %).
  • Let’s change X.GC to something better:
colnames(d)[12]
> # [1] "X.GC"
colnames(d)[12] <- "percent.GC"
colnames(d)[12]
> # [1] "percent.GC"
  • We could also set row names using row.names() (row names must be unique).
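A minimal sketch of setting row names (toy data; R raises an error if the names are not unique):

```r
df <- data.frame(depth = c(3.41, 6.68))
row.names(df) <- c("window1", "window2")

# Rows can then be accessed by name:
df["window2", "depth"]  # 6.68
```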

  • We can access a single column using the $ operator.

d$depth
> #  [1]  3.41  6.68  9.06 10.26  8.06  7.05  6.95  8.83  9.63  8.00  8.38
> # [12]  8.43 10.84  5.38  8.01  8.60  8.78  5.28  6.65  7.85  8.63  9.02
> # [23]  8.59  8.85  7.00  6.23  8.38 10.31 10.15 11.48 15.12  7.98  8.81
> # [34]  9.40 12.09 10.11  6.63  8.37  8.79  9.53  7.63  9.67  7.83  9.74
> # [45]  8.49  8.39 11.14  8.89  7.02  8.65  6.32  4.20  1.64  3.38  1.83
> # [56]  4.34  5.52 10.54  7.00 10.50  9.08  8.91  7.37  8.65 10.09 10.41
> # [67] 10.36 13.26 14.32  8.76  6.83  9.43  9.19  8.39  7.73
> #  [ reached getOption("max.print") -- omitted 59065 entries ]
  • This returns the depth column as a vector.
  • We can pass that vector to mean() or summary():
mean(d$depth)
> # [1] 8.183938
summary(d$depth)
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #   1.000   6.970   8.170   8.184   9.400  21.910
  • $ is a shortcut for the more general bracket operator, which can access rows, columns, and individual cells.
  • Remember that when accessing elements of a vector we used vec[2].
  • Since dataframes are two-dimensional, we use two indexes separated by a comma:
  • df[row, col]

  • To return all rows but just certain columns, we can omit the row index.

d[, 1:2]
> #          start      end
> # 1        55001    56000
> # 2        56001    57000
> # 3        57001    58000
> # 4        58001    59000
> # 5        59001    60000
> # 6        60001    61000
> # 7        61001    62000
> # 8        62001    63000
> # 9        63001    64000
> # 10       64001    65000
> # 11       65001    66000
> # 12       66001    67000
> # 13       67001    68000
> # 14       68001    69000
> # 15       69001    70000
> # 16       70001    71000
> # 17       71001    72000
> # 18       72001    73000
> # 19       73001    74000
> # 20       74001    75000
> # 21       75001    76000
> # 22       76001    77000
> # 23       77001    78000
> # 24       78001    79000
> # 25       79001    80000
> # 26       80001    81000
> # 27       81001    82000
> # 28       82001    83000
> # 29       83001    84000
> # 30       84001    85000
> # 31       85001    86000
> # 32       86001    87000
> # 33       87001    88000
> # 34       88001    89000
> # 35       89001    90000
> # 36       90001    91000
> # 37       91001    92000
> #  [ reached getOption("max.print") -- omitted 59103 rows ]
  • We can also use column names
d[, c("start", "end")]
> #          start      end
> # 1        55001    56000
> # 2        56001    57000
> # 3        57001    58000
> # 4        58001    59000
> # 5        59001    60000
> # 6        60001    61000
> # 7        61001    62000
> # 8        62001    63000
> # 9        63001    64000
> # 10       64001    65000
> # 11       65001    66000
> # 12       66001    67000
> # 13       67001    68000
> # 14       68001    69000
> # 15       69001    70000
> # 16       70001    71000
> # 17       71001    72000
> # 18       72001    73000
> # 19       73001    74000
> # 20       74001    75000
> # 21       75001    76000
> # 22       76001    77000
> # 23       77001    78000
> # 24       78001    79000
> # 25       79001    80000
> # 26       80001    81000
> # 27       81001    82000
> # 28       82001    83000
> # 29       83001    84000
> # 30       84001    85000
> # 31       85001    86000
> # 32       86001    87000
> # 33       87001    88000
> # 34       88001    89000
> # 35       89001    90000
> # 36       90001    91000
> # 37       91001    92000
> #  [ reached getOption("max.print") -- omitted 59103 rows ]
  • If we want just the first row of the start and end positions, we’d use:
d[1, c("start", "end")]
> #   start   end
> # 1 55001 56000
  • When accessing a single column from a dataframe, R returns a vector.
  • Turn this behavior off by setting the drop argument to FALSE:
d[, "start", drop = FALSE]
> #          start
> # 1        55001
> # 2        56001
> # 3        57001
> # 4        58001
> # 5        59001
> # 6        60001
> # 7        61001
> # 8        62001
> # 9        63001
> # 10       64001
> # 11       65001
> # 12       66001
> # 13       67001
> # 14       68001
> # 15       69001
> # 16       70001
> # 17       71001
> # 18       72001
> # 19       73001
> # 20       74001
> # 21       75001
> # 22       76001
> # 23       77001
> # 24       78001
> # 25       79001
> # 26       80001
> # 27       81001
> # 28       82001
> # 29       83001
> # 30       84001
> # 31       85001
> # 32       86001
> # 33       87001
> # 34       88001
> # 35       89001
> # 36       90001
> # 37       91001
> # 38       92001
> # 39       93001
> # 40       94001
> # 41       95001
> # 42       96001
> # 43       97001
> # 44       98001
> # 45       99001
> # 46      100001
> # 47      101001
> # 48      102001
> # 49      103001
> # 50      104001
> # 51      105001
> # 52      106001
> # 53      107001
> # 54      108001
> # 55      109001
> # 56      110001
> # 57      111001
> # 58      112001
> # 59      113001
> # 60      114001
> # 61      115001
> # 62      116001
> # 63      117001
> # 64      118001
> # 65      119001
> # 66      120001
> # 67      121001
> # 68      122001
> # 69      123001
> # 70      124001
> # 71      125001
> # 72      126001
> # 73      127001
> # 74      128001
> # 75      129001
> #  [ reached getOption("max.print") -- omitted 59065 rows ]
Let’s create a logical column marking whether each window falls within this chromosome’s centromere region (positions 25,800,000 to 29,700,000):

d$cent <- d$start >= 25800000 & d$end <= 29700000
  • Note the single ampersand (&), which is the vectorized version of logical AND.
  • & operates element-wise on the two logical vectors created by d$start >= 25800000 and d$end <= 29700000, returning TRUE where both are TRUE.
  • To tally we could use table():
table(d$cent)
> # 
> # FALSE  TRUE 
> # 58455   685
  • We can also use sum() to count up the TRUE values:
sum(d$cent)
> # [1] 685

Lastly, note that according to the supplementary material of this paper, the diversity estimates (the columns Theta, Pi, and Heterozygosity) are all scaled up by 10x in the dataset (see supplementary Text S1 for more details). We’ll use the nucleotide diversity column Pi later in this chapter in plots, and it would be useful to have it rescaled to per-basepair nucleotide diversity (to make the scale more intuitive). We can create a new rescaled column called diversity with:

d$diversity <- d$Pi/(10 * 1000)  # rescale, removing 10x and making per bp
summary(d$diversity)
> #      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
> # 0.0000000 0.0005577 0.0010420 0.0012390 0.0016880 0.0265300

Average nucleotide diversity per basepair in this data is around 0.001 (0.12%), roughly what we’d expect from other estimates of human diversity (Hernandez et al., 2012; Perry et al., 2012).

Subsetting Dataframes

  • Let’s look at the total number of SNPs per window.
  • We see that this varies considerably:
summary(d$total.SNPs)
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #   0.000   3.000   7.000   8.906  12.000  93.000
  • The data is right-skewed: the 3rd quartile is 12 SNPs, but the maximum is 93 SNPs.
  • We should investigate these outliers.
  • We’ll subset to select the rows that have 85 or more SNPs.
  • First, create a logical vector indicating whether each observation (row) has 85 or more SNPs:
d$total.SNPs >= 85
> #  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> #  [ reached getOption("max.print") -- omitted 59065 entries ]
  • We can use this logical vector to subset rows directly:
d[d$total.SNPs >= 85, ]
> #          start      end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 2567   2621001  2622000         93       11337 11.34          13     10
> # 12968 13023001 13024000         88       11784 11.78          11      1
> # 43165 47356001 47357000         87       12505 12.50           9      7
> #       reference.Bases  Theta     Pi Heterozygosity percent.GC
> # 2567             1000 43.420 50.926         81.589    43.9439
> # 12968            1000 33.413 19.030         74.838    28.8288
> # 43165            1000 29.621 27.108         69.573    46.7467
> #       Recombination Divergence Constraint SNPs  cent diversity
> # 2567    0.000706536 0.01701702          0    1 FALSE 0.0050926
> # 12968   0.000082600 0.01401401          0    1 FALSE 0.0019030
> # 43165   0.000500577 0.02002002          0    7 FALSE 0.0027108
  • The subset gives us the windows with 85 or more SNPs.
  • We can also combine queries; e.g., let’s see all windows where Pi (nucleotide diversity) is greater than 16 and percent GC is greater than 80:
d[d$Pi > 16 & d$percent.GC > 80, ]
> #          start      end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 58550 63097001 63098000          5         947  2.39           2      1
> # 58641 63188001 63189000          2        1623  3.21           2      0
> # 58642 63189001 63190000          5        1395  1.89           3      2
> #       reference.Bases  Theta     Pi Heterozygosity percent.GC
> # 58550             397 37.544 41.172         52.784    82.0821
> # 58641             506 16.436 16.436         12.327    82.3824
> # 58642             738 35.052 41.099         35.842    80.5806
> #       Recombination Divergence Constraint SNPs  cent diversity
> # 58550   0.000781326 0.03826531        226    1 FALSE 0.0041172
> # 58641   0.000347382 0.01678657        148    0 FALSE 0.0016436
> # 58642   0.000347382 0.01793722          0    0 FALSE 0.0041099
  • Above, we are extracting all columns.
  • Let’s return only a few:
d[d$Pi > 16 & d$percent.GC > 80, c("start", "end", "depth", "Pi")]
> #          start      end depth     Pi
> # 58550 63097001 63098000  2.39 41.172
> # 58641 63188001 63189000  3.21 16.436
> # 58642 63189001 63190000  1.89 41.099
  • We could reorder columns as well:
d[d$Pi > 16 & d$percent.GC > 80, c("start", "end", "Pi", "depth")]
> #          start      end     Pi depth
> # 58550 63097001 63098000 41.172  2.39
> # 58641 63188001 63189000 16.436  3.21
> # 58642 63189001 63190000 41.099  1.89
  • Subsetting columns can be a good way to summarize data.
  • For example, average depth in a window (the depth column) differs between very high GC content windows (80% or more) and all other windows:
summary(d$depth[d$percent.GC >= 80])
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #    1.05    1.89    2.14    2.24    2.78    3.37
summary(d$depth[d$percent.GC < 80])
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #   1.000   6.970   8.170   8.185   9.400  21.910

That’s a fairly large difference, but consider how many windows it includes: there are only nine windows with GC content of 80% or more.

sum(d$percent.GC >= 80)
> # [1] 9
  • Let’s look at Pi for windows that fall in the centromere and those that do not.
  • d$cent is a logical vector, so we can subset with it directly and use the ! operator to get its complement:
summary(d$Pi[d$cent])
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #    0.00    7.95   16.08   20.41   27.36  194.40
summary(d$Pi[!d$cent])
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> #   0.000   5.557  10.370  12.290  16.790 265.300
  • The centromere does appear to have higher nucleotide diversity (Pi) than other regions in this data.
  • Rather than subsetting directly, we can also get the positions of a logical vector’s TRUE values with which(). First, the logical vector itself:
d$Pi > 3
> #  [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
> # [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
> # [23]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> # [34]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
> # [45]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> # [56]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
> # [67]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
> #  [ reached getOption("max.print") -- omitted 59065 entries ]
which(d$Pi > 3)
> #  [1]  2  4  5  6  7 10 12 13 14 15 16 17 18 20 21 22 23 24 25 28 29 30 31
> # [24] 32 33 34 35 37 38 39 41 42 43 44 45 47 48 49 50 51 52 53 54 55 56 57
> # [47] 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84
> # [70] 85 86 87 88 89 90
> #  [ reached getOption("max.print") -- omitted 51319 entries ]

which() also has two related functions that return the index of the first minimum or maximum element of a vector: which.min() and which.max(). For example:

d[which.min(d$total.Bases), ]
> #          start      end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 25689 25785001 25786000          0         110     1           0      0
> #       reference.Bases Theta Pi Heterozygosity percent.GC Recombination
> # 25689             110     0  0              0    38.8388      4.63e-05
> #       Divergence Constraint SNPs  cent diversity
> # 25689 0.04946043          0    0 FALSE         0
d[which.max(d$depth), ]
> #        start     end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 8718 8773001 8774000         58       21914 21.91           7      4
> #      reference.Bases  Theta     Pi Heterozygosity percent.GC Recombination
> # 8718            1000 17.676 14.199         26.581    39.3393   0.001990459
> #      Divergence Constraint SNPs  cent diversity
> # 8718 0.01601602          0    1 FALSE 0.0014199

subset() takes two arguments: the dataframe to operate on, and the conditions a row must satisfy to be included. With subset(), d[d$Pi > 16 & d$percent.GC > 80, ] can be expressed as:

subset(d, Pi > 16 & percent.GC > 80)
> #          start      end total.SNPs total.Bases depth unique.SNPs dhSNPs
> # 58550 63097001 63098000          5         947  2.39           2      1
> # 58641 63188001 63189000          2        1623  3.21           2      0
> # 58642 63189001 63190000          5        1395  1.89           3      2
> #       reference.Bases  Theta     Pi Heterozygosity percent.GC
> # 58550             397 37.544 41.172         52.784    82.0821
> # 58641             506 16.436 16.436         12.327    82.3824
> # 58642             738 35.052 41.099         35.842    80.5806
> #       Recombination Divergence Constraint SNPs  cent diversity
> # 58550   0.000781326 0.03826531        226    1 FALSE 0.0041172
> # 58641   0.000347382 0.01678657        148    0 FALSE 0.0016436
> # 58642   0.000347382 0.01793722          0    0 FALSE 0.0041099

Optionally, a third argument can be supplied to specify which columns (and in what order) to include:

subset(d, Pi > 16 & percent.GC > 80, c(start, end, Pi, percent.GC, depth))
> #          start      end     Pi percent.GC depth
> # 58550 63097001 63098000 41.172    82.0821  2.39
> # 58641 63188001 63189000 16.436    82.3824  3.21
> # 58642 63189001 63190000 41.099    80.5806  1.89

Challenge

# install.packages('ggplot2')
library(ggplot2)

Use ggplot2 to create a scatterplot of nucleotide diversity along the chromosome, using the diversity column in our d dataframe. Because our data is window-based, we’ll first add a column called position that is the midpoint of each window:

d$position <- (d$end + d$start)/2
ggplot(d) + geom_point(aes(x = position, y = diversity))

References

Spencer, Chris C. A., Panos Deloukas, Sarah Hunt, et al. 2006. “The Influence of Recombination on Human Genetic Diversity.” PLoS Genet 2 (9). Public Library of Science: 1–11. doi:10.1371/journal.pgen.0020148.