reutils is an R package for interfacing with NCBI databases such as PubMed, GenBank, or GEO via the Entrez Programming Utilities (EUtils). It provides access to the nine basic eutils: einfo, esearch, esummary, epost, efetch, elink, egquery, espell, and ecitmatch.
Please check the relevant usage guidelines when using these services. Note that Entrez server requests are subject to frequency limits. Consider obtaining an NCBI API key if you are a heavy user of E-utilities.
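One way to stay under the request limit is to space calls out on the client side. The sketch below is plain base R and not part of reutils: `make_throttled()` is a hypothetical helper, and its default interval assumes NCBI's documented limit of three requests per second for clients without an API key.

```r
# Hypothetical helper (not part of reutils): wrap a function so that
# successive calls are at least `min_interval` seconds apart.
make_throttled <- function(f, min_interval = 1 / 3) {
  last_call <- -Inf  # time of the previous call, in seconds
  function(...) {
    now <- as.numeric(Sys.time())
    wait <- min_interval - (now - last_call)
    if (wait > 0) Sys.sleep(wait)
    last_call <<- as.numeric(Sys.time())
    f(...)
  }
}

# Usage sketch (requires reutils and network access):
# slow_esearch <- make_throttled(reutils::esearch)
# res <- lapply(c("Chlamydia psittaci[titl]", "Chlamydia abortus[titl]"),
#               slow_esearch, db = "pubmed")
```

The wrapper only delays calls made through it, so any other code that queries NCBI in the same session still counts toward the limit.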
With nine E-Utilities, NCBI provides a programmatic interface to the Entrez query and database system for searching and retrieving requested data. Each of these tools corresponds to an R function in the reutils package described below.
esearch
esearch: search and retrieve a list of primary UIDs or the NCBI History Server information (queryKey and webEnv). The objects returned by esearch can be passed on directly to epost, esummary, elink, or efetch.
efetch
efetch: retrieve data records from NCBI in a specified retrieval type and retrieval mode as given in this table. Data are returned as XML or text documents.
esummary
esummary: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (provided as a character vector or as an esearch object).
elink
elink: retrieve a list of UIDs (and relevancy scores) from a target database that are related to a set of UIDs provided by the user. The objects returned by elink can be passed on directly to epost, esummary, or efetch.
einfo
einfo: provide field names, term counts, last update, and available updates for each database.
epost
epost: upload primary UIDs to the user’s Web Environment on the Entrez history server for subsequent use with esummary, elink, or efetch.
esearch: Searching the Entrez databases
Let’s search PubMed for articles with Chlamydia psittaci in the title that were published in 2020 and retrieve a list of PubMed IDs (PMIDs).
pmid <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed")
pmid
#> Object of class 'esearch'
#> List of UIDs from the 'pubmed' database.
#> [1] "33518111" "33463503" "33363353" "33343522" "33126635" "33112195"
#> [7] "32848009" "32830314" "32416138" "32326284" "32316620" "32314307"
#> [13] "32290117" "32183481" "32178660" "32135200" "32071972" "32057555"
#> [19] "32050885" "31951466" "31910921" "31755895" "31436332"
Alternatively we can collect the PMIDs on the history server.
pmid2 <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed", usehistory = TRUE)
pmid2
#> Object of class 'esearch'
#> Web Environment for the 'pubmed' database.
#> Number of UIDs stored on the History server: 23
#> Query Key: 1
#> WebEnv: MCID_67a036e75677871cad001c5d
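At the E-utilities level, a result set on the history server is addressed through two URL parameters, `query_key` and `WebEnv`, which later requests send in place of an explicit ID list. The following base R sketch builds such a follow-up efetch URL by hand to show what is being passed along. `history_url()` is a hypothetical helper, not a reutils function, and the `rettype` and `retmode` values are only an example.

```r
# Hypothetical helper (not part of reutils): compose an E-utilities URL
# that refers to a result set stored on the history server.
history_url <- function(eutil, db, query_key, web_env, ...) {
  params <- c(list(db = db, query_key = query_key, WebEnv = web_env), list(...))
  # Percent-encode each value and join the pairs as key=value&key=value.
  encoded <- vapply(params, function(v) {
    utils::URLencode(as.character(v), reserved = TRUE)
  }, character(1))
  query <- paste(names(encoded), encoded, sep = "=", collapse = "&")
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/", eutil, ".fcgi?", query)
}

history_url("efetch", "pubmed", query_key = 1,
            web_env = "MCID_67a036e75677871cad001c5d",
            rettype = "abstract", retmode = "text")
```

In practice there is no need to build these URLs yourself, since the esearch object can be passed on directly to epost, esummary, elink, or efetch.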
We can also use esearch to search GenBank. Here we do a search for polymorphic membrane proteins (PMPs) in Chlamydiaceae.
cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
cpaf
#> Object of class 'esearch'
#> List of UIDs from the 'nucleotide' database.
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"
Some accessors for esearch objects
getUrl(cpaf)
#> [1] "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=Chlamydiaceae%5Borgn%5D%20AND%20PMP%5Bgene%5D&db=nucleotide&retstart=0&retmax=100&rettype=uilist&retmode=xml&email=gerhard.schofl%40gmail.com&tool=reutils"
Extract a vector of GIs:
uid(cpaf)
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"
Get query key and web environment:
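A minimal sketch, assuming the `querykey()` and `webenv()` accessors exported by reutils and an esearch object created with `usehistory = TRUE`, such as `pmid2` above. The wrapper `history_info()` is a hypothetical convenience, not part of the package.

```r
# Hypothetical wrapper: collect the history-server coordinates of an
# esearch result. Assumes reutils exports querykey() and webenv().
history_info <- function(x) {
  list(query_key = reutils::querykey(x), web_env = reutils::webenv(x))
}

# Usage sketch (requires reutils and network access):
# history_info(pmid2)
```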
Extract the content of an EUtil request as XML.
content(cpaf, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
#> <eSearchResult>
#> <Count>7</Count>
#> <RetMax>7</RetMax>
#> <RetStart>0</RetStart>
#> <IdList>
#> <Id>519865230</Id>
#> <Id>410810883</Id>
#> <Id>313847556</Id>
#> <Id>532821947</Id>
#> <Id>519796743</Id>
#> <Id>532821218</Id>
#> <Id>519794601</Id>
#> </IdList>
#> <TranslationSet>
#> <Translation>
#> <From>Chlamydiaceae[orgn]</From>
#> <To>"Chlamydiaceae"[Organism]</To>
#> </Translation>
...
Or extract parts of the XML data using the reference class method #xmlValue() and an XPath expression:
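For instance, the total hit count sits in the `<Count>` element of the XML shown above. The sketch below assumes that `$xmlValue()` takes an XPath string and returns the text of the matching nodes as a character vector. `hit_count()` is a hypothetical wrapper, not part of reutils.

```r
# Hypothetical wrapper: read the <Count> element of an esearch result.
# Assumes x$xmlValue(xpath) returns the text content of matching nodes.
hit_count <- function(x) {
  as.integer(x$xmlValue("/eSearchResult/Count"))
}

# Usage sketch (requires reutils and network access):
# hit_count(cpaf)
# cpaf$xmlValue("//Id")   # text of every <Id> element
```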
esummary: Retrieving summaries from primary IDs
esummary retrieves document summaries (docsums) from a list of primary IDs. Let’s find out what the first entry for PMP is about:
esum <- esummary(cpaf[1])
#> Warning: HTTPS error: Status 429;
esum
#> Object of class 'esummary'
#> [1] "HTTPS error: Status 429; "
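Status 429 is the HTTP "Too Many Requests" response, so the failure above reflects the request frequency limit rather than a problem with the query. Because the error surfaces as a warning, one option is to retry after a pause. The sketch below is base R only: `retry_on_warning()` is a hypothetical helper, and the attempt count and wait times are arbitrary choices.

```r
# Hypothetical helper (not part of reutils): call f() and, if it raises a
# warning, wait and try again, doubling the pause after each failed attempt.
retry_on_warning <- function(f, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      warning = function(w) list(ok = FALSE, message = conditionMessage(w))
    )
    if (result$ok) return(result$value)
    if (attempt < max_tries) {
      Sys.sleep(wait)
      wait <- wait * 2
    }
  }
  stop("all ", max_tries, " attempts failed; last warning: ", result$message)
}

# Usage sketch (requires reutils and network access):
# esum <- retry_on_warning(function() reutils::esummary(cpaf[1]))
```

Note that this treats any warning as a failure, not only rate-limit errors, so a stricter check on the warning message may be preferable in real use.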
We can also parse docsums into a tibble
efetch: Downloading full records from Entrez
First we search the protein database for sequences of the chlamydial protease activity factor, CPAF:
cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
#> Warning: HTTPS error: Status 429;
cpaf
#> Object of class 'esearch'
#> [1] "HTTPS error: Status 429; "
Let’s fetch the FASTA record for the first protein. To do that, we have to set rettype = "fasta" and retmode = "text":
cpaff <- efetch(cpaf[1], db = "protein", rettype = "fasta", retmode = "text")
#> Warning: HTTPS error: Status 400;
cpaff
#> Object of class 'efetch'
#> [1] "HTTPS error: Status 400; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'text'
Now we can write the sequence to a FASTA file by first extracting the data from the efetch object using content():
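A minimal sketch of that step, assuming `content(cpaff, "text")` returns the FASTA record as a character vector. `write_fasta()` and the file name are hypothetical choices for illustration.

```r
# Hypothetical helper: write FASTA text (one string or several lines)
# to a file and return the path invisibly.
write_fasta <- function(fasta_text, path) {
  # Split on embedded newlines and drop empty lines so that each line
  # of the record ends up on its own line in the file.
  lines <- unlist(strsplit(fasta_text, "\n", fixed = TRUE))
  writeLines(lines[nzchar(lines)], path)
  invisible(path)
}

# Usage sketch (requires a successful efetch call from above):
# write_fasta(content(cpaff, "text"), "cpaf.fna")
```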
We can also fetch the records for all search results as XML by setting retmode = "xml":
cpafx <- efetch(cpaf, db = "protein", rettype = "fasta", retmode = "xml")
#> Warning: HTTPS error: Status 429;
cpafx
#> Object of class 'efetch'
#> [1] "HTTPS error: Status 429; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'xml'
einfo: Information about the Entrez databases
You can use einfo to obtain a list of all database names accessible through the Entrez utilities:
einfo()
#> Object of class 'einfo'
#> List of Entrez databases
#> [1] "pubmed" "protein" "nuccore" "ipg"
#> [5] "nucleotide" "structure" "genome" "annotinfo"
#> [9] "assembly" "bioproject" "biosample" "blastdbinfo"
#> [13] "books" "cdd" "clinvar" "gap"
#> [17] "gapplus" "grasp" "dbvar" "gene"
#> [21] "gds" "geoprofiles" "medgen" "mesh"
#> [25] "nlmcatalog" "omim" "orgtrack" "pmc"
#> [29] "popset" "proteinclusters" "pcassay" "protfam"
#> [33] "pccompound" "pcsubstance" "seqannot" "snp"
#> [37] "sra" "taxonomy" "biocollections" "gtr"
For each of these databases, we can use einfo again to obtain more information: