Introduction to reutils

reutils is an R package for interfacing with NCBI databases such as PubMed, GenBank, or GEO via the Entrez Programming Utilities (EUtils). It provides access to the nine basic eutils: einfo, esearch, esummary, epost, efetch, elink, egquery, espell, and ecitmatch.

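Please check the relevant usage guidelines when using these services. Note that Entrez server requests are subject to frequency limits: at the time of writing, NCBI allows three requests per second without an API key and ten per second with one. Consider obtaining an NCBI API key if you are a heavy user of E-utilities.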
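
One way to stay under a per-second request limit is to space out successive calls. The following is a minimal base R sketch, not a feature of reutils: throttle() and its min_interval argument are hypothetical names chosen for illustration.

```r
# Hypothetical helper (not part of reutils): wrap a function so that
# successive calls are at least `min_interval` seconds apart.
# The default of 0.34 s keeps calls just under three per second.
throttle <- function(f, min_interval = 0.34) {
  last_call <- -Inf  # time of the previous call, in seconds since the epoch
  function(...) {
    wait <- min_interval - (as.numeric(Sys.time()) - last_call)
    if (wait > 0) Sys.sleep(wait)           # pause only if we are too early
    last_call <<- as.numeric(Sys.time())    # record when this call started
    f(...)
  }
}
```

For example, slow_esummary <- throttle(esummary) would give a drop-in replacement for esummary that pauses between calls when it is used in a loop.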

Important functions

With nine E-Utilities, NCBI provides a programmatic interface to the Entrez query and database system for searching and retrieving requested data.

Each of these tools corresponds to an R function in the reutils package described below.

esearch

esearch: search and retrieve a list of primary UIDs or the NCBI History Server information (queryKey and webEnv). The objects returned by esearch can be passed on directly to epost, esummary, elink, or efetch.

efetch

efetch: retrieve data records from NCBI in a specified retrieval type and retrieval mode, as listed in NCBI's table of valid rettype and retmode values for EFetch. Data are returned as XML or text documents.

esummary

esummary: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (provided as a character vector or as an esearch object).

einfo

einfo: provide field names, term counts, last update, and available updates for each database.

epost

epost: upload primary UIDs to the user’s Web Environment on the Entrez history server for subsequent use with esummary, elink, or efetch.

Examples

esearch: Searching the Entrez databases

Let’s search PubMed for articles published in 2020 with Chlamydia psittaci in the title and retrieve a list of their PubMed IDs (PMIDs).

pmid <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed")
pmid
#> Object of class 'esearch' 
#> List of UIDs from the 'pubmed' database.
#>  [1] "33518111" "33463503" "33363353" "33343522" "33126635" "33112195"
#>  [7] "32848009" "32830314" "32416138" "32326284" "32316620" "32314307"
#> [13] "32290117" "32183481" "32178660" "32135200" "32071972" "32057555"
#> [19] "32050885" "31951466" "31910921" "31755895" "31436332"

Alternatively, we can collect the PMIDs on the history server by setting usehistory = TRUE.

pmid2 <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed", usehistory = TRUE)
pmid2
#> Object of class 'esearch' 
#> Web Environment for the 'pubmed' database.
#> Number of UIDs stored on the History server: 23
#> Query Key: 1
#> WebEnv: MCID_67a036e75677871cad001c5d

We can also use esearch to search GenBank. Here we do a search for polymorphic membrane proteins (PMPs) in Chlamydiaceae.

cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
cpaf
#> Object of class 'esearch' 
#> List of UIDs from the 'nucleotide' database.
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Some accessors for esearch objects:

getUrl(cpaf)
#> [1] "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=Chlamydiaceae%5Borgn%5D%20AND%20PMP%5Bgene%5D&db=nucleotide&retstart=0&retmax=100&rettype=uilist&retmode=xml&email=gerhard.schofl%40gmail.com&tool=reutils"
getError(cpaf)
#> No errors
database(cpaf)
#> [1] "nucleotide"
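
The URL reported by getUrl() shows how the query is sent: each argument becomes a percent-encoded name=value pair appended to the esearch.fcgi endpoint. The base R sketch below reproduces the first part of that URL. build_eutil_url() is a hypothetical helper for illustration only and is not how reutils assembles its requests internally.

```r
# Hypothetical helper (not part of reutils): assemble an E-utilities URL
# from a named list of query parameters.
build_eutil_url <- function(eutil, params) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
  # Percent-encode every value, including reserved characters such as
  # "[", "]" and spaces
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(base, eutil, ".fcgi?", query)
}

query_url <- build_eutil_url(
  "esearch",
  list(term = "Chlamydiaceae[orgn] AND PMP[gene]",
       db = "nucleotide", retstart = 0, retmax = 100)
)
query_url
```

The square brackets and spaces of the search term come out as %5B, %5D and %20, just as in the URL printed by getUrl().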

Extract a vector of GIs:

uid(cpaf)
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Get query key and web environment:

querykey(pmid2)
#> [1] 1
webenv(pmid2)
#> [1] "MCID_67a036e75677871cad001c5d"

Extract the content of an EUtil request as XML:

content(cpaf, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
#> <eSearchResult>
#>   <Count>7</Count>
#>   <RetMax>7</RetMax>
#>   <RetStart>0</RetStart>
#>   <IdList>
#>     <Id>519865230</Id>
#>     <Id>410810883</Id>
#>     <Id>313847556</Id>
#>     <Id>532821947</Id>
#>     <Id>519796743</Id>
#>     <Id>532821218</Id>
#>     <Id>519794601</Id>
#>   </IdList>
#>   <TranslationSet>
#>     <Translation>
#>       <From>Chlamydiaceae[orgn]</From>
#>       <To>"Chlamydiaceae"[Organism]</To>
#>     </Translation>
...

Or extract parts of the XML data using the reference class method $xmlValue() and an XPath expression:

cpaf$xmlValue("//Id")
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

esummary: Retrieving summaries from primary IDs

esummary retrieves document summaries (docsums) from a list of primary IDs. Let’s find out what the first entry for PMP is about:

esum <- esummary(cpaf[1])
#> Warning: HTTPS error: Status 429;
esum
#> Object of class 'esummary' 
#> [1] "HTTPS error: Status 429; "
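
Status 429 is the HTTP code for "Too Many Requests", so a request like the one above was rejected by the rate limit rather than because of a problem with the query. One way to cope is to retry after a pause. The sketch below is base R and rests on an assumption: that the failure is signalled as an ordinary R warning whose message contains "Status 429", as the output above suggests. retry_on_429() is a hypothetical helper, not part of reutils.

```r
# Hypothetical helper (not part of reutils): call f() and, if it raised a
# warning mentioning "Status 429", wait and try again with exponential backoff.
retry_on_429 <- function(f, max_tries = 4, base_wait = 1) {
  result <- NULL
  for (attempt in seq_len(max_tries)) {
    rate_limited <- FALSE
    result <- withCallingHandlers(
      f(),
      warning = function(w) {
        if (grepl("Status 429", conditionMessage(w), fixed = TRUE)) {
          rate_limited <<- TRUE            # remember that this attempt failed
          invokeRestart("muffleWarning")   # silence it; we report once at the end
        }
      }
    )
    if (!rate_limited) return(result)
    # Wait 1, 2, 4, ... times base_wait seconds before the next attempt
    if (attempt < max_tries) Sys.sleep(base_wait * 2^(attempt - 1))
  }
  warning("Still rate-limited after ", max_tries, " attempts")
  result
}
```

A call would then be wrapped in a function, for example esum <- retry_on_429(function() esummary(cpaf[1])). Warnings that do not mention Status 429 are passed through unchanged.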

We can also parse docsums into a tibble:

esum <- esummary(cpaf[1:4])
#> Warning: HTTPS error: Status 429;
content(esum, "parsed")
#> Warning: Errors parsing DocumentSummary
#> list()

efetch: Downloading full records from Entrez

First, we search the protein database for sequences of the chlamydial protease activity factor, CPAF:

cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
#> Warning: HTTPS error: Status 429;
cpaf
#> Object of class 'esearch' 
#> [1] "HTTPS error: Status 429; "

Let’s fetch the FASTA record for the first protein. To do that, we have to set rettype = "fasta" and retmode = "text":

cpaff <- efetch(cpaf[1], db = "protein", rettype = "fasta", retmode = "text")
#> Warning: HTTPS error: Status 400;
cpaff
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 400; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'text'

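Now we can write the sequence to a FASTA file by first extracting the data from the efetch object using content():

write(content(cpaff), file = "~/cpaf.faa")

We can also fetch the records as XML (retmode = "xml") and extract the sequences and definition lines with XPath expressions:
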
cpafx <- efetch(cpaf, db = "protein", rettype = "fasta", retmode = "xml")
#> Warning: HTTPS error: Status 429;
cpafx
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 429; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'xml'
aa <- cpafx$xmlValue("//TSeq_sequence")
aa
#> [1] NA
defline <- cpafx$xmlValue("//TSeq_defline")
defline
#> [1] NA

einfo: Information about the Entrez databases

You can use einfo to obtain a list of all database names accessible through the Entrez utilities:

einfo()
#> Object of class 'einfo' 
#> List of Entrez databases
#>  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
#>  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
#>  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
#> [13] "books"           "cdd"             "clinvar"         "gap"            
#> [17] "gapplus"         "grasp"           "dbvar"           "gene"           
#> [21] "gds"             "geoprofiles"     "medgen"          "mesh"           
#> [25] "nlmcatalog"      "omim"            "orgtrack"        "pmc"            
#> [29] "popset"          "proteinclusters" "pcassay"         "protfam"        
#> [33] "pccompound"      "pcsubstance"     "seqannot"        "snp"            
#> [37] "sra"             "taxonomy"        "biocollections"  "gtr"

For each of these databases, we can use einfo again to obtain more information:

einfo("taxonomy")
#> Warning: HTTPS error: Status 429;
#> Object of class 'einfo' 
#> [1] "HTTPS error: Status 429; "