Introduction to reutils

reutils is an R package for interfacing with NCBI databases such as PubMed, GenBank, or GEO via the Entrez Programming Utilities (E-utilities). It provides access to the nine basic E-utilities: einfo, esearch, esummary, epost, efetch, elink, egquery, espell, and ecitmatch.

Please check the relevant usage guidelines when using these services. Note that Entrez server requests are subject to frequency limits. Consider obtaining an NCBI API key if you are a heavy user of E-utilities.
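The frequency limits can also be respected on the client side by spacing out successive calls. The following is a minimal base-R sketch, assuming NCBI's published limits of 3 requests per second without an API key and 10 per second with one. `make_throttled()` is a hypothetical helper written for this illustration; it is not part of reutils.

```r
# Hypothetical helper (not part of reutils): wrap any function so that
# successive calls are at least `min_interval` seconds apart.
make_throttled <- function(f, min_interval = 1 / 3) {
  last_call <- NULL  # time of the previous call, kept in the closure
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)
    }
    last_call <<- Sys.time()
    f(...)
  }
}

# Demonstration with a stand-in function instead of a real Entrez request.
throttled_identity <- make_throttled(function(x) x, min_interval = 0.2)
t0 <- Sys.time()
res <- vapply(1:3, throttled_identity, integer(1))
elapsed_total <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
```

In practice the wrapped function could be any request function, for example `make_throttled(esearch)`, so that loops over many queries stay below the limit.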

Important functions

With nine E-utilities, NCBI provides a programmatic interface to the Entrez query and database system for searching and retrieving data.

Each of these tools corresponds to an R function in the reutils package, as described below.


esearch: search and retrieve a list of primary UIDs or the NCBI History Server information (queryKey and webEnv). The objects returned by esearch can be passed on directly to epost, esummary, elink, or efetch.


efetch: retrieve data records from NCBI in a specified retrieval type and retrieval mode, as listed in NCBI's table of valid rettype and retmode values. Data are returned as XML or text documents.


esummary: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (provided as a character vector or as an esearch object).


einfo: provide field names, term counts, last update, and available links for each database.


epost: upload primary UIDs to the user’s Web Environment on the Entrez History server for subsequent use with esummary, elink, or efetch.


esearch: Searching the Entrez databases

Let’s search PubMed for articles with Chlamydia psittaci in the title that were published in 2020 and retrieve a list of PubMed IDs (PMIDs).

pmid <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed")
#> Object of class 'esearch' 
#> List of UIDs from the 'pubmed' database.
#>  [1] "33518111" "33463503" "33363353" "33343522" "33126635" "33112195"
#>  [7] "32848009" "32830314" "32416138" "32326284" "32316620" "32314307"
#> [13] "32290117" "32183481" "32178660" "32135200" "32071972" "32057555"
#> [19] "32050885" "31951466" "31910921" "31755895" "31436332"
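The search term above combines each value with a field tag in square brackets (`[titl]` for title, `[pdat]` for publication date). As a small sketch of how such terms can be assembled programmatically, the hypothetical helper `entrez_term()` below (not part of reutils) joins tagged values with an uppercase Boolean operator, which is the form NCBI's documentation uses.

```r
# Hypothetical helper (not part of reutils): build an Entrez query string
# from a named character vector of the form c(<field tag> = <search value>).
entrez_term <- function(fields, operator = "AND") {
  stopifnot(is.character(fields),
            !is.null(names(fields)),
            all(nzchar(names(fields))))
  operator <- match.arg(operator, c("AND", "OR", "NOT"))
  # Append the field tag to each value, e.g. "2020" -> "2020[pdat]"
  parts <- paste0(fields, "[", names(fields), "]")
  paste(parts, collapse = paste0(" ", operator, " "))
}

term <- entrez_term(c(titl = "Chlamydia psittaci", pdat = "2020"))
term
#> [1] "Chlamydia psittaci[titl] AND 2020[pdat]"
```

The resulting string can be used as the first argument of esearch, in the same way as the literal term in the example above.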

Alternatively, we can collect the PMIDs on the History server.

pmid2 <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed", usehistory = TRUE)
#> Object of class 'esearch' 
#> Web Environment for the 'pubmed' database.
#> Number of UIDs stored on the History server: 23
#> Query Key: 1
#> WebEnv: MCID_67a036e75677871cad001c5d

We can also use esearch to search GenBank. Here we do a search for polymorphic membrane proteins (PMPs) in Chlamydiaceae.

cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
#> Object of class 'esearch' 
#> List of UIDs from the 'nucleotide' database.
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Some accessors for esearch objects

getUrl(cpaf)
#> [1] ""
getError(cpaf)
#> No errors
database(cpaf)
#> [1] "nucleotide"

Extract a vector of GIs:

uid(cpaf)
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Get query key and web environment:

querykey(pmid2)
#> [1] 1
webenv(pmid2)
#> [1] "MCID_67a036e75677871cad001c5d"
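To illustrate what these two values are for: an E-utilities request can refer to the UIDs stored on the History server through the `query_key` and `WebEnv` URL parameters instead of listing the UIDs explicitly. reutils assembles such requests itself when it is given an esearch object; the base-R sketch below only shows the shape of such a URL. `history_url()` is a hypothetical helper, not a reutils function.

```r
# Hypothetical helper (not part of reutils): assemble an E-utilities URL that
# refers to UIDs on the History server via query_key and WebEnv.
history_url <- function(eutil, db, query_key, web_env, ...) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
  # c() coerces all parameters to character; extra parameters come via `...`
  params <- c(db = db, query_key = query_key, WebEnv = web_env, ...)
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0(base, eutil, ".fcgi?", query)
}

url <- history_url("esummary", "pubmed", 1, "MCID_67a036e75677871cad001c5d")
url
#> [1] "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&query_key=1&WebEnv=MCID_67a036e75677871cad001c5d"
```

Additional parameters such as `retmode = "xml"` can be passed through `...` and are appended to the query string.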

Extract the content of an E-utilities request as XML:

content(cpaf, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "">
#> <eSearchResult>
#>   <Count>7</Count>
#>   <RetMax>7</RetMax>
#>   <RetStart>0</RetStart>
#>   <IdList>
#>     <Id>519865230</Id>
#>     <Id>410810883</Id>
#>     <Id>313847556</Id>
#>     <Id>532821947</Id>
#>     <Id>519796743</Id>
#>     <Id>532821218</Id>
#>     <Id>519794601</Id>
#>   </IdList>
#>   <TranslationSet>
#>     <Translation>
#>       <From>Chlamydiaceae[orgn]</From>
#>       <To>"Chlamydiaceae"[Organism]</To>
#>     </Translation>

Or extract parts of the XML data using the reference class method #xmlValue() and an XPath expression:

cpaf$xmlValue("//Id")
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"
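For readers unfamiliar with XPath, the same kind of extraction can be tried offline on a small XML string. The sketch below assumes that the XML package is installed and uses a shortened, hand-written eSearchResult document; it illustrates the XPath expressions only and makes no claim about how reutils implements its own method.

```r
# A shortened eSearchResult document, written by hand for this example.
xml_txt <- '<eSearchResult>
  <Count>3</Count>
  <IdList>
    <Id>519865230</Id>
    <Id>410810883</Id>
    <Id>313847556</Id>
  </IdList>
</eSearchResult>'

doc <- XML::xmlParse(xml_txt, asText = TRUE)

# "//Id" selects every <Id> node anywhere in the document.
ids <- XML::xpathSApply(doc, "//Id", XML::xmlValue)

# An absolute path selects one specific node.
n <- as.integer(XML::xpathSApply(doc, "/eSearchResult/Count", XML::xmlValue))
```

Here `ids` holds the three UIDs as a character vector and `n` holds the count as an integer.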

esummary: Retrieving summaries from primary IDs

esummary retrieves document summaries (docsums) from a list of primary IDs. Let’s find out what the first entry for PMP is about:

esum <- esummary(cpaf[1])
#> Warning: HTTPS error: Status 429;
#> Object of class 'esummary' 
#> [1] "HTTPS error: Status 429; "

We can also parse docsums into a tibble:

esum <- esummary(cpaf[1:4])
#> Warning: HTTPS error: Status 429;
content(esum, "parsed")
#> Warning: Errors parsing DocumentSummary
#> list()
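The `Status 429` warnings in the output above correspond to the HTTP status "Too Many Requests", which is what the server returns when the frequency limit is exceeded. One common way to cope with this is to retry after a growing pause. The following base-R sketch shows the idea with a stand-in function; `retry_with_backoff()` is a hypothetical helper, and how best to detect a failed reutils request inside the `ok` predicate is left as an assumption.

```r
# Hypothetical helper (not part of reutils): call `f` until `ok(result)` is
# TRUE, pausing base_wait, 2 * base_wait, 4 * base_wait, ... seconds between
# attempts, and giving up with a warning after `max_tries` attempts.
retry_with_backoff <- function(f, ok, max_tries = 4, base_wait = 1) {
  for (i in seq_len(max_tries)) {
    result <- f()
    if (isTRUE(ok(result))) return(result)
    if (i < max_tries) Sys.sleep(base_wait * 2^(i - 1))
  }
  warning("giving up after ", max_tries, " attempts")
  result
}

# Stand-in for a remote call: fails twice, then succeeds.
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) "HTTPS error: Status 429" else "ok"
}

out <- retry_with_backoff(flaky,
                          ok = function(x) !grepl("Status 429", x),
                          base_wait = 0.1)
```

With a real request, `f` could be an anonymous function such as `function() esummary(cpaf[1:4])`, combined with a predicate that inspects the returned object for an error.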

efetch: Downloading full records from Entrez

First, we search the protein database for sequences of the chlamydial protease activity factor, CPAF:

cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
#> Warning: HTTPS error: Status 429;
#> Object of class 'esearch' 
#> [1] "HTTPS error: Status 429; "

Let’s fetch the FASTA record for the first protein. To do that, we have to set rettype = "fasta" and retmode = "text":

cpaff <- efetch(cpaf[1], db = "protein", rettype = "fasta", retmode = "text")
#> Warning: HTTPS error: Status 400;
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 400; "
#> EFetch query using the 'protein' database.
#> Query url: ''
#> Retrieval type: 'fasta', retrieval mode: 'text'

Now we can write the sequence to a fasta file by first extracting the data from the efetch object using content():

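write(content(cpaff), file = "~/cpaf.faa")

Alternatively, we can fetch the records as XML (retmode = "xml") and extract individual elements, such as the sequence and the definition line, with XPath expressions:

cpafx <- efetch(cpaf, db = "protein", rettype = "fasta", retmode = "xml")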
#> Warning: HTTPS error: Status 429;
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 429; "
#> EFetch query using the 'protein' database.
#> Query url: ''
#> Retrieval type: 'fasta', retrieval mode: 'xml'
aa <- cpafx$xmlValue("//TSeq_sequence")
#> [1] NA
defline <- cpafx$xmlValue("//TSeq_defline")
#> [1] NA
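Once definition lines and sequences are available as character vectors (as `defline` and `aa` would be after a successful request), they can be turned back into FASTA text. The sketch below is a base-R illustration with made-up input; `to_fasta()` is a hypothetical helper, not part of reutils, and it assumes non-missing, non-empty sequences.

```r
# Hypothetical helper (not part of reutils): format definition lines and
# sequences as FASTA text, wrapping sequence lines at `width` characters.
to_fasta <- function(defline, sequence, width = 60) {
  stopifnot(length(defline) == length(sequence),
            !anyNA(sequence),
            all(nchar(sequence) > 0))
  records <- mapply(function(d, s) {
    starts <- seq(1, nchar(s), by = width)
    chunks <- substring(s, starts, pmin(starts + width - 1, nchar(s)))
    c(paste0(">", d), chunks)
  }, defline, sequence, SIMPLIFY = FALSE, USE.NAMES = FALSE)
  unlist(records)
}

# Made-up example record, wrapped at 10 characters to keep the output short.
fa <- to_fasta("example protein", "MKTAYIAKQRQISFVKSHFSRQ", width = 10)
writeLines(fa)
#> >example protein
#> MKTAYIAKQR
#> QISFVKSHFS
#> RQ

# Write to a temporary file and read it back.
path <- tempfile(fileext = ".faa")
writeLines(fa, path)
roundtrip <- readLines(path)
```

The explicit checks at the top make the helper stop early on `NA` input, which is what the extraction calls return when the request itself failed.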

einfo: Information about the Entrez databases

You can use einfo to obtain a list of all database names accessible through the Entrez utilities:

einfo()
#> Object of class 'einfo' 
#> List of Entrez databases
#>  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
#>  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
#>  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
#> [13] "books"           "cdd"             "clinvar"         "gap"            
#> [17] "gapplus"         "grasp"           "dbvar"           "gene"           
#> [21] "gds"             "geoprofiles"     "medgen"          "mesh"           
#> [25] "nlmcatalog"      "omim"            "orgtrack"        "pmc"            
#> [29] "popset"          "proteinclusters" "pcassay"         "protfam"        
#> [33] "pccompound"      "pcsubstance"     "seqannot"        "snp"            
#> [37] "sra"             "taxonomy"        "biocollections"  "gtr"

For each of these databases, we can use einfo again to obtain more information:

#> Warning: HTTPS error: Status 429;
#> Object of class 'einfo' 
#> [1] "HTTPS error: Status 429; "