NCBI Intro

NCBI has a lot of data in it. As of today, it has:

All records can be cross-referenced with the 1.3 million species in the NCBI taxonomy or 25.2 thousand disease-associated records in OMIM.

rentrez package

rentrez provides functions that work with the NCBI Eutils API to search, download data from, and otherwise interact with NCBI databases.

library(devtools)
install_github("ropensci/rentrez")

rentrez Uses the EUtils API

Gettting started with rentrez

Install

  • Install from the R Cran repository:
install.packages('rentrez')
  • From the development branch on GitHub:
install_github("ropensci/rentrez")
  • library() tells our R environment to load the package for use.
library(rentrez)

Getting started

entrez_dbs()
#  [1] "pubmed"          "protein"         "nuccore"        
#  [4] "ipg"             "nucleotide"      "nucgss"         
#  [7] "nucest"          "structure"       "sparcle"        
# [10] "genome"          "annotinfo"       "assembly"       
# [13] "bioproject"      "biosample"       "blastdbinfo"    
# [16] "books"           "cdd"             "clinvar"        
# [19] "clone"           "gap"             "gapplus"        
# [22] "grasp"           "dbvar"           "gene"           
# [25] "gds"             "geoprofiles"     "homologene"     
# [28] "medgen"          "mesh"            "ncbisearch"     
# [31] "nlmcatalog"      "omim"            "orgtrack"       
# [34] "pmc"             "popset"          "probe"          
# [37] "proteinclusters" "pcassay"         "biosystems"     
# [40] "pccompound"      "pcsubstance"     "pubmedhealth"   
# [43] "seqannot"        "snp"             "sra"            
# [46] "taxonomy"        "biocollections"  "unigene"        
# [49] "gencoll"         "gtr"

Helper Functions that help you learn about NCBI databases

Function name Return
entrez_db_summary() Brief description of what the database is
entrez_db_searchable() Set of search terms that can used with this database
entrez_db_links() Set of databases that might contain linked records
entrez_db_summary('dbvar')
#  DbName: dbvar
#  MenuName: dbVar
#  Description: dbVar records
#  DbBuild: Build170930-2230.1
#  Count: 6621056
#  LastUpdate: 2017/10/01 01:32
#entrez_db_summary('snp')
entrez_db_searchable("sra")
# Searchable fields for database 'sra'
#   ALL      All terms from all searchable fields 
#   UID      Unique number assigned to publication 
#   FILT     Limits the records 
#   ACCN     Accession number of sequence 
#   TITL     Words in definition line 
#   PROP     Classification by source qualifiers and molecule type 
#   WORD     Free text associated with record 
#   ORGN     Scientific and common names of organism, and all higher levels of taxonomy 
#   AUTH     Author(s) of publication 
#   PDAT     Date sequence added to GenBank 
#   MDAT     Date of last update 
#   GPRJ     BioProject 
#   BSPL     BioSample 
#   PLAT     Platform 
#   STRA     Strategy 
#   SRC      Source 
#   SEL      Selection 
#   LAY      Layout 
#   RLEN     Percent of aligned reads 
#   ACS      Access is public or controlled 
#   ALN      Percent of aligned reads 
#   MBS      Size in megabases

Using search field operators

entrez_search(db="sra",
              term="Tetrahymena thermophila[ORGN]",
              retmax=0)
# Entrez search result with 253 hits (object contains 0 IDs and no web_history object)
#  Search term (as translated):  "Tetrahymena thermophila"[Organism]
entrez_search(db="sra",
              term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
              retmax=0)
# Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
#  Search term (as translated):  "Tetrahymena thermophila"[Organism] AND 2013[PDAT] ...
entrez_search(db="sra",
              term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
              retmax=0)
# Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
#  Search term (as translated):  ("Tetrahymena thermophila"[Organism] OR "Tetrahyme ...
entrez_db_searchable("sra")
# Searchable fields for database 'sra'
#   ALL      All terms from all searchable fields 
#   UID      Unique number assigned to publication 
#   FILT     Limits the records 
#   ACCN     Accession number of sequence 
#   TITL     Words in definition line 
#   PROP     Classification by source qualifiers and molecule type 
#   WORD     Free text associated with record 
#   ORGN     Scientific and common names of organism, and all higher levels of taxonomy 
#   AUTH     Author(s) of publication 
#   PDAT     Date sequence added to GenBank 
#   MDAT     Date of last update 
#   GPRJ     BioProject 
#   BSPL     BioSample 
#   PLAT     Platform 
#   STRA     Strategy 
#   SRC      Source 
#   SEL      Selection 
#   LAY      Layout 
#   RLEN     Percent of aligned reads 
#   ACS      Access is public or controlled 
#   ALN      Percent of aligned reads 
#   MBS      Size in megabases

Precise queries using MeSH terms

entrez_search(db   = "pubmed",
              term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
# Entrez search result with 12 hits (object contains 12 IDs and no web_history object)
#  Search term (as translated):  "malaria, vivax"[MeSH Terms] AND "folic acid antag ...

Challenge (5min)

* Section 2 - Search and plot trendy topics in genetics example

original lesson ended here - go to lesson 2 to show plotting if less time

Narrowing our focus

  • If we know beforehand what sort of links we’d like to find, we can to use the db argument to narrow the focus of a call to entrez_link().

  • For instance, say we are interested in knowing about all of the RNA transcripts associated with the Amyloid Beta Precursor gene in humans.

  • Transcript sequences are stored in the nucleotide database (referred to as nuccore in EUtils), so to find transcripts associated with a given gene we need to set dbfrom=gene and db=nuccore.

nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
nuc_links
# elink object with contents:
#  $links: IDs for linked records from NCBI
# 
nuc_links$links
# elink result with information from 5 databases:
# [1] gene_nuccore            gene_nuccore_mgc        gene_nuccore_pos       
# [4] gene_nuccore_refseqgene gene_nuccore_refseqrna
  • The object we get back contains links to the nucleotide database generally, but also to special subsets of that database like refseq.

  • We can take advantage of this narrower set of links to find IDs that match unique transcripts from our gene of interest.

nuc_links$links$gene_nuccore_refseqrna
#  [1] "324021747" "324021746" "324021739" "324021737" "324021735"
#  [6] "228008405" "228008404" "228008403" "228008402" "228008401"
  • We can use these ids in calls to entrez_fetch() or entrez_summary() to learn more about the transcripts they represent.

Using more than one ID

  • It is possible to pass more than one ID to entrez_link().

  • By default, doing so will give you a single elink object containing the complete set of links for all of the IDs that you specified.

  • So, if you were looking for protein IDs related to specific genes you could do:

all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
all_links_together
# elink object with contents:
#  $links: IDs for linked records from NCBI
# 
all_links_together$links$gene_protein
#  [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
#  [6] "1034661992" "558472750"  "545685826"  "194394158"  "166221824" 
# [11] "154936864"  "148697547"  "148697546"  "122346659"  "119602646" 
# [16] "119602645"  "119602644"  "119602643"  "119602642"  "81899807"  
# [21] "74215266"   "74186774"   "37787317"   "37787309"   "37787307"  
# [26] "37787305"   "37589273"   "33991172"   "31982089"   "26339824"  
# [31] "26329351"   "21619615"   "10834676"
  • Although this behaviour might sometimes be useful, it means we’ve lost track of which protein ID is linked to which gene ID.

  • To retain that information we can set by_id to TRUE. This gives us a list of elink objects, each once containing links from a single gene ID:

all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id = TRUE)
all_links_sep
# List of 2 elink objects,each containing
#   $links: IDs for linked records from NCBI
# 
lapply(all_links_sep, function(x) x$links$gene_protein)
# [[1]]
#  [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
#  [6] "1034661992" "558472750"  "545685826"  "194394158"  "166221824" 
# [11] "154936864"  "122346659"  "119602646"  "119602645"  "119602644" 
# [16] "119602643"  "119602642"  "37787309"   "37787307"   "37787305"  
# [21] "33991172"   "21619615"   "10834676"  
# 
# [[2]]
#  [1] "148697547" "148697546" "81899807"  "74215266"  "74186774" 
#  [6] "37787317"  "37589273"  "31982089"  "26339824"  "26329351"

Getting summary data: entrez_summary()

  • Having found the unique IDs for some records via entrez_search or entrez_link(), you are probably going to want to learn something about them.

  • The Eutils API has two ways to get information about a record.

  • entrez_fetch() returns ‘full’ records in varying formats and entrez_summary() returns less information about each record, but in relatively simple format.

  • Very often the summary records have the information you are after, so rentrez provides functions to parse and summarise summary records.

The summary record

  • entrez_summary() takes a vector of unique IDs for the samples you want to get summary information from.

  • Let’s start by finding out something about the paper describing [Taxize](https://github.com/ropensci/taxize), using its PubMed ID:

taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
# esummary result with 42 items:
#  [1] uid               pubdate           epubdate         
#  [4] source            authors           lastauthor       
#  [7] title             sorttitle         volume           
# [10] issue             pages             lang             
# [13] nlmuniqueid       issn              essn             
# [16] pubtype           recordstatus      pubstatus        
# [19] articleids        history           references       
# [22] attributes        pmcrefcount       fulljournalname  
# [25] elocationid       doctype           srccontriblist   
# [28] booktitle         medium            edition          
# [31] publisherlocation publishername     srcdate          
# [34] reportnumber      availablefromurl  locationlabel    
# [37] doccontriblist    docdate           bookname         
# [40] chapter           sortpubdate       sortfirstauthor
  • Once again, the object returned by entrez_summary behaves like a list, so you can extract elements using $.

  • For instance, we could convert our PubMed ID to another article identifier…

taxize_summ$articleids
#       idtype idtypen                           value
# 1     pubmed       1                        24555091
# 2        doi       3 10.12688/f1000research.2-191.v2
# 3        pmc       8                      PMC3901538
# 4        rid       8                        24563765
# 5        eid       8                        24555091
# 6    version       8                               2
# 7 version-id       8                               2
# 8      pmcid       5             pmc-id: PMC3901538;
  • …or see how many times the article has been cited in PubMed Central papers
taxize_summ$pmcrefcount
# [1] 13

Dealing with many records

  • If you give entrez_summary() a vector with more than one ID you’ll get a list of summary records back.

  • Let’s get those Plasmodium vivax papers we found in the entrez_search() section back, and fetch some summary data on each paper:

vivax_search <- entrez_search(db = "pubmed", 
                              term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
  • rentrez provides a helper function, extract_from_esummary() that takes one or more elements from every summary record in one of these lists.

  • Here it is working with one…

extract_from_esummary(multi_summs, "fulljournalname")
#                                                                                                                 24861816 
# "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases" 
#                                                                                                                 24145518 
#                                                                                  "Antimicrobial agents and chemotherapy" 
#                                                                                                                 24007534 
#                                                                                                        "Malaria journal" 
#                                                                                                                 23230341 
#                                                                                     "The Korean journal of parasitology" 
#                                                                                                                 23043980 
#                                                                                              "Experimental parasitology" 
#                                                                                                                 20810806 
#                                                                  "The American journal of tropical medicine and hygiene" 
#                                                                                                                 20412783 
#                                                                                                           "Acta tropica" 
#                                                                                                                 19597012 
#                                                                                          "Clinical microbiology reviews" 
#                                                                                                                 17556611 
#                                                                  "The American journal of tropical medicine and hygiene" 
#                                                                                                                 17519409 
#                                                                                                                   "JAMA" 
#                                                                                                                 17368986 
#                                                                                                 "Trends in parasitology" 
#                                                                                                                 12374849 
#                                        "Proceedings of the National Academy of Sciences of the United States of America"
  • and several elements (using knitr package used for dynamic report generation to display output in R)
date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
knitr::kable(head(t(date_and_cite)), row.names = FALSE)
pubdate pmcrefcount title
2014 Aug Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan.
2014 5 Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India.
2013 Sep 5 2 Prevalence and patterns of antifolate and chloroquine drug resistance markers in Plasmodium vivax across Pakistan.
2012 Dec 13 Prevalence of drug resistance-associated gene mutations in Plasmodium vivax in Central China.
2012 Dec 7 Novel mutations in the antifolate drug resistance marker genes among Plasmodium vivax isolates exhibiting severe manifestations.
2010 Sep 17 Mutations in the antifolate-resistance-associated genes dihydrofolate reductase and dihydropteroate synthase in Plasmodium vivax isolates from malaria-endemic countries.

Fetching full records: entrez_fetch()

  • As useful as the summary records are, sometimes they just don’t have the information that you need.

  • If you want a complete representation of a record you can use entrez_fetch, using the argument rettype to specify the format you’d like the record in.

Fetch DNA sequences in fasta format

gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transcripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transcripts)
# [1] "1039766414" "1039766413" "1039766411" "1039766410" "1039766409"
# [6] "563317856"
all_recs <- entrez_fetch(db="nuccore", id=linked_transcripts, rettype = "fasta")
class(all_recs)
# [1] "character"
nchar(all_recs)
# [1] 55183
cat(strwrap(substr(all_recs, 1, 500)), sep = "\n")
# >XM_006538500.2 PREDICTED: Mus musculus alkaline phosphatase,
# liver/bone/kidney (Alpl), transcript variant X5, mRNA
# GCGCCCGTGGCTTGCGCGACTCCCACGCGCGCGCTCCGCCGGTCCCGCAGTGACTGTCCCAGCCACGGTG
# GGGACACGTGGAAGGTCAGGCTCCCTGGGGACCCACGACCTCCCGCTCCGGACTCCGCGCGCATCTCTTG
# TGGCCTGGCAGGATGATGGACGTGGCGCCCGCTGAGCCGCTACCCAGGACCTCACCCTCGTGCTAAGCAC
# CTGCTCCCGGTGCCCACGCGCCTCCGTAGTCCACAGCTGCGCCCTTCGTGGTCCCTTGGCACTCTGTCCC
# GTTGGTGTCTAAAGTAGTTGGGGAGCAGCAGGAAGAAGGCACGTGCTGCGATCTTTGGCGGGAGAGATCG
# GAGACCGCGTGCTAGTGTCTGTCTGAGAG
write(all_recs, file = "my_transcripts.fasta")

End Lesson examples.

Topics not covered in this lesson:

  • Fetch a parsed XML document
  • Using NCBI’s Web History features

Where to learn more???