NCBI hosts a large amount of data across many databases.
All records can be cross-referenced with the 1.3 million species in the NCBI taxonomy or 25.2 thousand disease-associated records in OMIM.
`rentrez` provides functions that work with the NCBI Eutils API to search, download data from, and otherwise interact with NCBI databases.
# Stable release from CRAN
install.packages('rentrez')
# Or the development version from GitHub (requires the devtools package)
library(devtools)
install_github("ropensci/rentrez")
`library()` tells our R environment to load the package for use.
library(rentrez)
Use `entrez_dbs()` to get a list of the databases we can search.
entrez_dbs()
# [1] "pubmed" "protein" "nuccore"
# [4] "ipg" "nucleotide" "nucgss"
# [7] "nucest" "structure" "sparcle"
# [10] "genome" "annotinfo" "assembly"
# [13] "bioproject" "biosample" "blastdbinfo"
# [16] "books" "cdd" "clinvar"
# [19] "clone" "gap" "gapplus"
# [22] "grasp" "dbvar" "gene"
# [25] "gds" "geoprofiles" "homologene"
# [28] "medgen" "mesh" "ncbisearch"
# [31] "nlmcatalog" "omim" "orgtrack"
# [34] "pmc" "popset" "probe"
# [37] "proteinclusters" "pcassay" "biosystems"
# [40] "pccompound" "pcsubstance" "pubmedhealth"
# [43] "seqannot" "snp" "sra"
# [46] "taxonomy" "biocollections" "unigene"
# [49] "gencoll" "gtr"
Helper Functions that help you learn about NCBI databases
Function name | Return |
---|---|
`entrez_db_summary()` | Brief description of what the database is |
`entrez_db_searchable()` | Set of search terms that can be used with this database |
`entrez_db_links()` | Set of databases that might contain linked records |
entrez_db_summary('dbvar')
# DbName: dbvar
# MenuName: dbVar
# Description: dbVar records
# DbBuild: Build170930-2230.1
# Count: 6621056
# LastUpdate: 2017/10/01 01:32
#entrez_db_summary('snp')
Use `entrez_db_searchable()` to see which search fields and qualifiers are allowed.
entrez_db_searchable("sra")
# Searchable fields for database 'sra'
# ALL All terms from all searchable fields
# UID Unique number assigned to publication
# FILT Limits the records
# ACCN Accession number of sequence
# TITL Words in definition line
# PROP Classification by source qualifiers and molecule type
# WORD Free text associated with record
# ORGN Scientific and common names of organism, and all higher levels of taxonomy
# AUTH Author(s) of publication
# PDAT Date sequence added to GenBank
# MDAT Date of last update
# GPRJ BioProject
# BSPL BioSample
# PLAT Platform
# STRA Strategy
# SRC Source
# SEL Selection
# LAY Layout
# RLEN Percent of aligned reads
# ACS Access is public or controlled
# ALN Percent of aligned reads
# MBS Size in megabases
entrez_search()
Often the first thing you will want to do with `rentrez` is search a given NCBI database to find records that match some keywords. You can use `entrez_search()` to do this. In the simplest case you just need to provide a database name (`db`) and a search term (`term`), so let’s search PubMed for articles about Natural Language Processing:
r_search <- entrez_search(db="pubmed", term="Natural Language Processing")
The object returned by a search acts like a `list`, and you can get a summary of its contents by printing it.
r_search
# Entrez search result with 4811 hits (object contains 20 IDs and no web_history object)
# Search term (as translated): "natural language processing"[MeSH Terms] OR ("nat ...
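Although the search matched 4811 records, only 20 IDs were returned, because the argument `retmax`, which controls the maximum number of returned values, has a default value of 20. You can access the IDs as a vector using the `$` operator:
r_search$ids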
# [1] "28971429" "28967001" "28964503" "28962645" "28954419" "28952936"
# [7] "28950906" "28945588" "28940969" "28938912" "28936186" "28935617"
# [13] "28933506" "28932767" "28924815" "28924564" "28919830" "28916254"
# [19] "28906424" "28905434"
To return more IDs, set `retmax` to a larger value:
another_r_search <- entrez_search(db="pubmed", term="Natural Language Processing", retmax=40)
another_r_search
# Entrez search result with 4811 hits (object contains 40 IDs and no web_history object)
# Search term (as translated): "natural language processing"[MeSH Terms] OR ("nat ...
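Because `retmax` caps the number of IDs without warning, it can help to check whether a result is complete before using it. The following is a minimal sketch that assumes the search result is a list with `ids` and `count` elements; `is_truncated()` is a hypothetical helper (not part of `rentrez`), and the list below is stand-in data so the example runs offline.

```r
# Hypothetical helper (not part of rentrez): TRUE when the search matched
# more records than the number of IDs actually returned.
is_truncated <- function(search_result) {
  length(search_result$ids) < as.integer(search_result$count)
}

# Stand-in for a search result: 20 IDs returned out of 4811 hits.
mock_search <- list(ids = as.character(seq_len(20)), count = 4811)
truncated <- is_truncated(mock_search)
truncated
```

With a live result, `is_truncated(r_search)` would indicate whether a larger `retmax` is needed to retrieve every matching ID.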
Use a DOI to return the PMID of an article using entrez_search
Use an article DOI: Tamoxifen and contralateral breast cancer in BRCA1 and BRCA2 carriers: an update. DOI of 10.1002/ijc.21536.
wcancer_paper <- entrez_search(db="pubmed", term="10.1002/ijc.21536[doi]")
wcancer_paper$ids
# [1] "16331614"
Get some summary info
wcan_summary <- entrez_summary(db="pubmed", wcancer_paper$ids)
wcan_summary$title
# [1] "Tamoxifen and contralateral breast cancer in BRCA1 and BRCA2 carriers: an update."
wcan_summary$authors
# name authtype clusterid
# 1 Gronwald J Author
# 2 Tung N Author
# 3 Foulkes WD Author
# 4 Offit K Author
# 5 Gershoni R Author
# 6 Daly M Author
# 7 Kim-Sing C Author
# 8 Olsson H Author
# 9 Ainsworth P Author
# 10 Eisen A Author
# 11 Saal H Author
# 12 Friedman E Author
# 13 Olopade O Author
# 14 Osborne M Author
# 15 Weitzel J Author
# 16 Lynch H Author
# 17 Ghadirian P Author
# 18 Lubinski J Author
# 19 Sun P Author
# 20 Narod SA Author
# 21 Hereditary Breast Cancer Clinical Study Group. CollectiveName
NCBI has search field operators that we can add to queries, in the form `query[search field]`.
For instance, we can find next generation sequence datasets for the (amazing…) ciliate Tetrahymena thermophila by using the organism (‘ORGN’) search field:
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN]",
retmax=0)
# Entrez search result with 253 hits (object contains 0 IDs and no web_history object)
# Search term (as translated): "Tetrahymena thermophila"[Organism]
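We can narrow the search to records added within a range of dates, using a colon to specify the range in the `PDAT` field:
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
retmax=0)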
# Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
# Search term (as translated): "Tetrahymena thermophila"[Organism] AND 2013[PDAT] ...
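Parentheses make the grouping of `AND` and `OR` explicit, for example to match records from either of two species:
entrez_search(db="sra",
term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
retmax=0)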
# Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
# Search term (as translated): ("Tetrahymena thermophila"[Organism] OR "Tetrahyme ...
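Writing longer field-tagged queries by hand is error-prone, so it can be convenient to assemble them from their parts. The sketch below is one possible way to do that in base R; `field_term()` and `any_of()` are hypothetical helpers introduced only for illustration and are not part of `rentrez`.

```r
# Hypothetical helper: tag a value with a search field, e.g. "x[ORGN]".
field_term <- function(value, field) {
  paste0(value, "[", field, "]")
}

# Hypothetical helper: join several terms with OR and wrap them in parentheses.
any_of <- function(terms) {
  paste0("(", paste(terms, collapse = " OR "), ")")
}

organisms <- field_term(c("Tetrahymena thermophila", "Tetrahymena borealis"), "ORGN")
query <- paste(any_of(organisms), "AND", field_term("2013:2015", "PDAT"))
query
```

The resulting string can be passed as the `term` argument of `entrez_search()`.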
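The set of search fields varies between databases; `entrez_db_searchable()` lists the fields available for a given database.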
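PubMed records can also be searched by MeSH (Medical Subject Headings) terms using the `[MeSH]` field:
entrez_search(db = "pubmed",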
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
# Entrez search result with 12 hits (object contains 12 IDs and no web_history object)
# Search term (as translated): "malaria, vivax"[MeSH Terms] AND "folic acid antag ...
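The complete set of MeSH terms is itself available as an NCBI database, so it can be explored using `rentrez`. You can search for specific terms with `entrez_search(db="mesh", term =...)` and learn about the results of your search using the tools described below.
entrez_link()
Records in one NCBI database are often connected to records in other NCBI databases or to external data sources. `entrez_link()` allows users to discover these links between records.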
Let’s find all NCBI data associated with a single gene (in this case the Amyloid Beta Precursor gene, the product of which is associated with the plaques that form in the brains of Alzheimer’s Disease patients).
To use `entrez_link()`, we need to provide an ID (`id`), the database from which this ID comes (`dbfrom`), and the name of a database in which to find linked records (`db`).
If we set this last argument to ‘all’ we can find links in multiple databases:
all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
all_the_links
# elink object with contents:
# $links: IDs for linked records from NCBI
#
all_the_links$links
# elink result with information from 54 databases:
# [1] gene_bioconcepts gene_biosystems
# [3] gene_biosystems_all gene_clinvar
# [5] gene_clinvar_specific gene_dbvar
# [7] gene_genome gene_gtr
# [9] gene_homologene gene_medgen_diseases
# [11] gene_pcassay_alltarget_list gene_pcassay_alltarget_summary
# [13] gene_pcassay_rnai gene_pcassay_target
# [15] gene_probe gene_structure
# [17] gene_bioproject gene_books
# [19] gene_cdd gene_gene_h3k4me3
# [21] gene_gene_neighbors gene_genereviews
# [23] gene_genome2 gene_geoprofiles
# [25] gene_nuccore gene_nuccore_mgc
# [27] gene_nuccore_pos gene_nuccore_refseqgene
# [29] gene_nuccore_refseqrna gene_nucest
# [31] gene_nucest_clust gene_nucleotide
# [33] gene_nucleotide_clust gene_nucleotide_mgc
# [35] gene_nucleotide_mgc_url gene_nucleotide_pos
# [37] gene_omim gene_pcassay_proteintarget
# [39] gene_pccompound gene_pcsubstance
# [41] gene_pmc gene_pmc_nucleotide
# [43] gene_protein gene_protein_refseq
# [45] gene_pubmed gene_pubmed_citedinomim
# [47] gene_pubmed_pmc_nucleotide gene_pubmed_rif
# [49] gene_snp gene_snp_geneview
# [51] gene_sparcle gene_taxonomy
# [53] gene_unigene gene_varview
The names of the list elements are in the format `[source_database]_[linked_database]`, and the elements themselves contain a vector of linked IDs.
If we want to find open access publications associated with this gene, we could get linked records in PubMed Central:
all_the_links$links$gene_pmc[1:10]
# [1] "5561919" "5560349" "5559291" "5548265" "5540602" "5434815" "5395029"
# [8] "5360245" "5233555" "5104494"
all_the_links$links$gene_clinvar
# [1] "397432" "396808" "396332" "396150" "394309" "391496" "369359"
# [8] "339659" "339658" "339657" "339656" "339655" "339654" "339653"
# [15] "339652" "339651" "339650" "339649" "339648" "339647" "339646"
# [22] "339645" "339644" "339643" "339642" "339641" "339640" "339639"
# [29] "339638" "339637" "339636" "339635" "339634" "339633" "339632"
# [36] "339631" "339630" "339629" "339628" "339627" "339626" "339625"
# [43] "339624" "339623" "339622" "339621" "339620" "339619" "253512"
# [50] "253403" "236549" "236548" "236547" "221889" "160886" "155682"
# [57] "155309" "155093" "155053" "154360" "154063" "153438" "152839"
# [64] "151388" "150018" "149551" "149418" "149160" "149035" "148411"
# [71] "148262" "148180" "146125" "145984" "145474" "145468" "145332"
# [78] "145107" "144677" "144194" "127268" "98242" "98241" "98240"
# [85] "98239" "98238" "98237" "98236" "98235" "59247" "59246"
# [92] "59245" "59243" "59226" "59224" "59223" "59222" "59221"
# [99] "59010" "59005" "59004" "37145" "32099" "18106" "18105"
# [106] "18104" "18103" "18102" "18101" "18100" "18099" "18098"
# [113] "18097" "18096" "18095" "18094" "18093" "18092" "18091"
# [120] "18090" "18089" "18088" "18087"
If we know beforehand what sort of links we’d like to find, we can use the `db` argument to narrow the focus of a call to `entrez_link()`.
For instance, say we are interested in knowing about all of the RNA transcripts associated with the Amyloid Beta Precursor gene in humans.
Transcript sequences are stored in the nucleotide database (referred to as nuccore in EUtils), so to find transcripts associated with a given gene we need to set dbfrom=gene and db=nuccore.
nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
nuc_links
# elink object with contents:
# $links: IDs for linked records from NCBI
#
nuc_links$links
# elink result with information from 5 databases:
# [1] gene_nuccore gene_nuccore_mgc gene_nuccore_pos
# [4] gene_nuccore_refseqgene gene_nuccore_refseqrna
The object we get back contains links to the nucleotide database generally, but also to special subsets of that database like `refseq`.
We can take advantage of this narrower set of links to find IDs that match unique transcripts from our gene of interest.
nuc_links$links$gene_nuccore_refseqrna
# [1] "324021747" "324021746" "324021739" "324021737" "324021735"
# [6] "228008405" "228008404" "228008403" "228008402" "228008401"
We can use these IDs in calls to `entrez_fetch()` or `entrez_summary()` to learn more about the transcripts they represent.
In addition to finding data within the NCBI, `entrez_link()` can turn up connections to external databases. Perhaps the most interesting example is finding links to the full text of papers in PubMed.
For example, when I wrote this document the first paper linked to Amyloid Beta Precursor had a unique ID of `25500142`. We can find links to the full text of that paper with `entrez_link()` by setting the `cmd` argument to ‘llinks’:
paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
paper_links
# elink object with contents:
# $linkouts: links to external websites
Each element of the `linkouts` object contains information about an external source of data on this paper:
paper_links$linkouts
# $ID_25500142
# $ID_25500142[[1]]
# Linkout from Elsevier Science
# $Url: https://linkinghub.elsevie ...
#
# $ID_25500142[[2]]
# Linkout from Europe PubMed Central
# $Url: http://europepmc.org/abstr ...
#
# $ID_25500142[[3]]
# Linkout from Ovid Technologies, Inc.
# $Url: http://ovidsp.ovid.com/ovi ...
#
# $ID_25500142[[4]]
# Linkout from PubMed Central
# $Url: https://www.ncbi.nlm.nih.g ...
#
# $ID_25500142[[5]]
# Linkout from PubMed Central Canada
# $Url: http://pubmedcentralcanada ...
#
# $ID_25500142[[6]]
# Linkout from MedlinePlus Health Information
# $Url: https://medlineplus.gov/al ...
#
# $ID_25500142[[7]]
# Linkout from Mouse Genome Informatics (MGI)
# $Url: http://www.informatics.jax ...
Each of those linkout objects contains quite a lot of information, but the URL is probably the most useful.
For that reason, `rentrez` provides the function `linkout_urls()` to make extracting just the URL simple:
linkout_urls(paper_links)
# $ID_25500142
# [1] "https://linkinghub.elsevier.com/retrieve/pii/S0014-4886(14)00393-8"
# [2] "http://europepmc.org/abstract/MED/25500142"
# [3] "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=25500142.ui"
# [4] "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/"
# [5] "http://pubmedcentralcanada.ca/pmcc/articles/pmid/25500142"
# [6] "https://medlineplus.gov/alzheimersdisease.html"
# [7] "http://www.informatics.jax.org/reference/25500142"
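Once the URLs are in a plain list they can be filtered with ordinary base R tools. The sketch below assumes the value returned by `linkout_urls()` is a named list of character vectors, as the printed output suggests; the `urls` object is a stand-in copied from that output so the example runs offline, and `pmc_pattern` is an illustrative pattern rather than an exhaustive list of full-text hosts.

```r
# Stand-in for the value of linkout_urls(paper_links), copied from the output above.
urls <- list(
  ID_25500142 = c(
    "https://linkinghub.elsevier.com/retrieve/pii/S0014-4886(14)00393-8",
    "http://europepmc.org/abstract/MED/25500142",
    "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=25500142.ui",
    "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/",
    "http://pubmedcentralcanada.ca/pmcc/articles/pmid/25500142",
    "https://medlineplus.gov/alzheimersdisease.html",
    "http://www.informatics.jax.org/reference/25500142"
  )
)

# Keep only URLs that point at PubMed Central or one of its mirrors.
pmc_pattern <- "europepmc\\.org|ncbi\\.nlm\\.nih\\.gov/pmc|pubmedcentralcanada"
pmc_urls <- lapply(urls, function(x) grep(pmc_pattern, x, value = TRUE))
pmc_urls
```

Using `lapply()` keeps one element per paper, so the same filter works unchanged when several IDs are queried at once.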
The full list of options for the `cmd` argument is given in the in-line documentation (`?entrez_link`).
If you are interested in finding full text records for a large number of articles, check out the package fulltext, which makes use of multiple sources (including the NCBI) to discover full text articles.
It is possible to pass more than one ID to `entrez_link()`.
By default, doing so will give you a single elink object containing the complete set of links for all of the IDs that you specified.
So, if you were looking for protein IDs related to specific genes you could do:
all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
all_links_together
# elink object with contents:
# $links: IDs for linked records from NCBI
#
all_links_together$links$gene_protein
# [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
# [6] "1034661992" "558472750" "545685826" "194394158" "166221824"
# [11] "154936864" "148697547" "148697546" "122346659" "119602646"
# [16] "119602645" "119602644" "119602643" "119602642" "81899807"
# [21] "74215266" "74186774" "37787317" "37787309" "37787307"
# [26] "37787305" "37589273" "33991172" "31982089" "26339824"
# [31] "26329351" "21619615" "10834676"
Although this behaviour might sometimes be useful, it means we’ve lost track of which `protein` ID is linked to which `gene` ID.
To retain that information we can set `by_id` to TRUE. This gives us a list of elink objects, each one containing links from a single gene ID:
all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id = TRUE)
all_links_sep
# List of 2 elink objects,each containing
# $links: IDs for linked records from NCBI
#
lapply(all_links_sep, function(x) x$links$gene_protein)
# [[1]]
# [1] "1034662002" "1034662000" "1034661998" "1034661996" "1034661994"
# [6] "1034661992" "558472750" "545685826" "194394158" "166221824"
# [11] "154936864" "122346659" "119602646" "119602645" "119602644"
# [16] "119602643" "119602642" "37787309" "37787307" "37787305"
# [21] "33991172" "21619615" "10834676"
#
# [[2]]
# [1] "148697547" "148697546" "81899807" "74215266" "74186774"
# [6] "37787317" "37589273" "31982089" "26339824" "26329351"
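A per-gene list like this is often easier to work with as a two-column table. The following is a rough sketch of one way to reshape it; `links_to_table()` is a hypothetical helper, the data are a truncated stand-in for the output above, and it assumes the list elements come back in the same order as the IDs that were supplied.

```r
# Hypothetical helper: build a long-format table of gene-to-protein links.
# `gene_ids` is the vector passed as `id`; `protein_ids` is a list with one
# character vector of linked IDs per gene, assumed to be in the same order.
links_to_table <- function(gene_ids, protein_ids) {
  data.frame(
    gene    = rep(gene_ids, times = lengths(protein_ids)),
    protein = unlist(protein_ids, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Stand-in data, truncated from the output above so the sketch runs offline.
gene_ids <- c("93100", "223646")
protein_ids <- list(
  c("1034662002", "1034662000", "1034661998"),
  c("148697547", "148697546")
)
link_table <- links_to_table(gene_ids, protein_ids)
link_table
```

With live data, `protein_ids` would be the result of the `lapply()` call shown above.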
Having found the unique IDs for some records via `entrez_search()` or `entrez_link()`, you are probably going to want to learn something about them.
The Eutils API has two ways to get information about a record.
`entrez_fetch()` returns ‘full’ records in varying formats and `entrez_summary()` returns less information about each record, but in a relatively simple format.
Very often the summary records have the information you are after, so `rentrez` provides functions to parse and summarise summary records.
`entrez_summary()` takes a vector of unique IDs for the samples you want to get summary information from.
Let’s start by finding out something about the paper describing [Taxize](https://github.com/ropensci/taxize), using its PubMed ID:
taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
# esummary result with 42 items:
# [1] uid pubdate epubdate
# [4] source authors lastauthor
# [7] title sorttitle volume
# [10] issue pages lang
# [13] nlmuniqueid issn essn
# [16] pubtype recordstatus pubstatus
# [19] articleids history references
# [22] attributes pmcrefcount fulljournalname
# [25] elocationid doctype srccontriblist
# [28] booktitle medium edition
# [31] publisherlocation publishername srcdate
# [34] reportnumber availablefromurl locationlabel
# [37] doccontriblist docdate bookname
# [40] chapter sortpubdate sortfirstauthor
Once again, the object returned by `entrez_summary()` behaves like a list, so you can extract elements using `$`.
For instance, we could convert our PubMed ID to another article identifier…
taxize_summ$articleids
# idtype idtypen value
# 1 pubmed 1 24555091
# 2 doi 3 10.12688/f1000research.2-191.v2
# 3 pmc 8 PMC3901538
# 4 rid 8 24563765
# 5 eid 8 24555091
# 6 version 8 2
# 7 version-id 8 2
# 8 pmcid 5 pmc-id: PMC3901538;
…or see how many times the article has been cited by papers in PubMed Central:
taxize_summ$pmcrefcount
# [1] 13
If you give `entrez_summary()` a vector with more than one ID you’ll get a list of summary records back.
Let’s get those Plasmodium vivax papers we found in the `entrez_search()` section back, and fetch some summary data on each paper:
vivax_search <- entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
`rentrez` provides a helper function, `extract_from_esummary()`, that takes one or more elements from every summary record in one of these lists.
Here it is working with one…
extract_from_esummary(multi_summs, "fulljournalname")
# 24861816
# "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases"
# 24145518
# "Antimicrobial agents and chemotherapy"
# 24007534
# "Malaria journal"
# 23230341
# "The Korean journal of parasitology"
# 23043980
# "Experimental parasitology"
# 20810806
# "The American journal of tropical medicine and hygiene"
# 20412783
# "Acta tropica"
# 19597012
# "Clinical microbiology reviews"
# 17556611
# "The American journal of tropical medicine and hygiene"
# 17519409
# "JAMA"
# 17368986
# "Trends in parasitology"
# 12374849
# "Proceedings of the National Academy of Sciences of the United States of America"
…and here it is working with several:
date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
knitr::kable(head(t(date_and_cite)), row.names = FALSE)
pubdate | pmcrefcount | title |
---|---|---|
2014 Aug | | Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan. |
2014 | 5 | Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India. |
2013 Sep 5 | 2 | Prevalence and patterns of antifolate and chloroquine drug resistance markers in Plasmodium vivax across Pakistan. |
2012 Dec | 13 | Prevalence of drug resistance-associated gene mutations in Plasmodium vivax in Central China. |
2012 Dec | 7 | Novel mutations in the antifolate drug resistance marker genes among Plasmodium vivax isolates exhibiting severe manifestations. |
2010 Sep | 17 | Mutations in the antifolate-resistance-associated genes dihydrofolate reductase and dihydropteroate synthase in Plasmodium vivax isolates from malaria-endemic countries. |
entrez_fetch()
As useful as the summary records are, sometimes they just don’t have the information that you need.
If you want a complete representation of a record you can use `entrez_fetch()`, using the argument `rettype` to specify the format you’d like the record in.
gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transcripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transcripts)
# [1] "1039766414" "1039766413" "1039766411" "1039766410" "1039766409"
# [6] "563317856"
Now we can get our sequences with `entrez_fetch()`, setting `rettype` to “fasta” (the list of formats available for each database is given in this table):
all_recs <- entrez_fetch(db="nuccore", id=linked_transcripts, rettype = "fasta")
class(all_recs)
# [1] "character"
nchar(all_recs)
# [1] 55183
We now have a single, really huge character string!
Rather than printing all those thousands of bases we can take a peek at the top of the file:
cat(strwrap(substr(all_recs, 1, 500)), sep = "\n")
# >XM_006538500.2 PREDICTED: Mus musculus alkaline phosphatase,
# liver/bone/kidney (Alpl), transcript variant X5, mRNA
# GCGCCCGTGGCTTGCGCGACTCCCACGCGCGCGCTCCGCCGGTCCCGCAGTGACTGTCCCAGCCACGGTG
# GGGACACGTGGAAGGTCAGGCTCCCTGGGGACCCACGACCTCCCGCTCCGGACTCCGCGCGCATCTCTTG
# TGGCCTGGCAGGATGATGGACGTGGCGCCCGCTGAGCCGCTACCCAGGACCTCACCCTCGTGCTAAGCAC
# CTGCTCCCGGTGCCCACGCGCCTCCGTAGTCCACAGCTGCGCCCTTCGTGGTCCCTTGGCACTCTGTCCC
# GTTGGTGTCTAAAGTAGTTGGGGAGCAGCAGGAAGAAGGCACGTGCTGCGATCTTTGGCGGGAGAGATCG
# GAGACCGCGTGCTAGTGTCTGTCTGAGAG
write(all_recs, file = "my_transcripts.fasta")
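Since the fetched records arrive as one long string, a quick sanity check is to pull out the FASTA header lines and count them. The sketch below uses only base R; `fasta_text` is a made-up stand-in for `all_recs` so that the example runs offline, and the ID extraction assumes the identifier is the text between `>` and the first space.

```r
# Stand-in for `all_recs`: one string holding two made-up FASTA records.
fasta_text <- paste(
  ">seq1 first example record",
  "ACGTACGT",
  "ACGT",
  ">seq2 second example record",
  "GGCC",
  sep = "\n"
)

# Split the string into lines and keep the header lines (those starting with ">").
fasta_lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
headers <- fasta_lines[startsWith(fasta_lines, ">")]

# Identifier: the text between ">" and the first space.
seq_ids <- sub("^>([^ ]+).*$", "\\1", headers)
length(headers)
seq_ids
```

Applied to `all_recs`, the number of headers should match the number of IDs that were requested.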