Introduction to reutils

reutils is an R package for interfacing with NCBI databases such as PubMed, GenBank, or GEO via the Entrez Programming Utilities (EUtils). It provides access to the nine basic eutils: einfo, esearch, esummary, epost, efetch, elink, egquery, espell, and ecitmatch.

Please check the relevant usage guidelines when using these services. Note that Entrez server requests are subject to frequency limits. Consider obtaining an NCBI API key if you are a heavy user of E-utilities.
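
To stay under the frequency limits, requests can be spaced out on the client side. The sketch below is a minimal illustration in base R and not part of reutils: `throttle()` is a hypothetical helper, and its default interval assumes NCBI's published limit of three requests per second without an API key. The option name `reutils.api.key` in the commented usage is an assumption that should be checked against `?reutils`.

```r
# Hypothetical helper, not part of reutils: returns a wrapped version of `f`
# whose successive calls start at least `min_interval` seconds apart.
throttle <- function(f, min_interval = 1 / 3) {
  last_call <- -Inf
  function(...) {
    wait <- min_interval - (as.numeric(Sys.time()) - last_call)
    if (wait > 0) Sys.sleep(wait)
    last_call <<- as.numeric(Sys.time())
    f(...)
  }
}

# Usage sketch (needs network access and the reutils package):
# options(reutils.api.key = "YOUR_KEY")   # assumed option name
# slow_esummary <- throttle(reutils::esummary)
# docsums <- lapply(c("33518111", "33463503"), slow_esummary, db = "pubmed")
```

Wrapping the function once keeps the pacing logic out of the analysis code; every call made through the wrapper is delayed only as much as needed.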

Important functions

With nine E-Utilities, NCBI provides a programmatic interface to the Entrez query and database system for searching and retrieving data.

Each of these tools corresponds to an R function in the reutils package described below.

`esearch`

esearch: search and retrieve a list of primary UIDs or the NCBI History Server information (queryKey and webEnv). The objects returned by esearch can be passed on directly to epost, esummary, elink, or efetch.

`efetch`

efetch: retrieve data records from NCBI in a specified retrieval type and retrieval mode; the valid combinations for each database are tabulated in the NCBI E-utilities documentation. Data are returned as XML or text documents.

`esummary`

esummary: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (provided as a character vector or as an esearch object).

`elink`

elink: retrieve a list of UIDs (and relevancy scores) from a target database that are related to a set of UIDs provided by the user. The objects returned by elink can be passed on directly to epost, esummary, or efetch.

`einfo`

einfo: provide field names, term counts, last update, and available links for each database.

`epost`

epost: upload primary UIDs to the user’s Web Environment on the Entrez history server for subsequent use with esummary, elink, or efetch.
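
Because the object returned by one function can be handed directly to the next, these calls chain into short pipelines. The sketch below wraps one such chain in a function so that nothing is sent to NCBI until it is called. `related_pubmed_summaries()` is a hypothetical name, and the `dbTo` argument of `elink()` is used as I recall it from the package help, so it is worth checking `?elink` before relying on it.

```r
# Hypothetical wrapper, not part of reutils. Assumes the reutils package is
# installed; calling the function requires network access.
related_pubmed_summaries <- function(term, db = "protein") {
  hits  <- reutils::esearch(term, db)             # UIDs matching the query
  links <- reutils::elink(hits, dbTo = "pubmed")  # linked PubMed records
  reutils::esummary(links)                        # DocSums for the linked PMIDs
}

# Example call:
# docsums <- related_pubmed_summaries("Chlamydia[orgn] and CPAF")
# reutils::content(docsums, "parsed")
```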

Examples

`esearch`: Searching the Entrez databases

Let’s search PubMed for articles with Chlamydia psittaci in the title that were published in 2020 and retrieve a list of PubMed IDs (PMIDs).

pmid <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed")
pmid
#> Object of class 'esearch' 
#> List of UIDs from the 'pubmed' database.
#>  [1] "33518111" "33463503" "33363353" "33343522" "33126635" "33112195"
#>  [7] "32848009" "32830314" "32416138" "32326284" "32316620" "32314307"
#> [13] "32290117" "32183481" "32178660" "32135200" "32071972" "32057555"
#> [19] "32050885" "31951466" "31910921" "31755895" "31436332"

Alternatively, we can collect the PMIDs on the history server.

pmid2 <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed", usehistory = TRUE)
pmid2
#> Object of class 'esearch' 
#> Web Environment for the 'pubmed' database.
#> Number of UIDs stored on the History server: 23
#> Query Key: 1
#> WebEnv: MCID_67a036e75677871cad001c5d

We can also use esearch to search GenBank. Here we do a search for polymorphic membrane proteins (PMPs) in Chlamydiaceae.

cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
cpaf
#> Object of class 'esearch' 
#> List of UIDs from the 'nucleotide' database.
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Some accessors for esearch objects:

getUrl(cpaf)
#> [1] "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=Chlamydiaceae%5Borgn%5D%20AND%20PMP%5Bgene%5D&db=nucleotide&retstart=0&retmax=100&rettype=uilist&retmode=xml&email=gerhard.schofl%40gmail.com&tool=reutils"

getError(cpaf)
#> No errors

database(cpaf)
#> [1] "nucleotide"

Extract a vector of GIs:

uid(cpaf)
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"

Get query key and web environment:

querykey(pmid2)
#> [1] 1

webenv(pmid2)
#> [1] "MCID_67a036e75677871cad001c5d"

Extract the content of an EUtil request as XML.

content(cpaf, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
#> <eSearchResult>
#>   <Count>7</Count>
#>   <RetMax>7</RetMax>
#>   <RetStart>0</RetStart>
#>   <IdList>
#>     <Id>519865230</Id>
#>     <Id>410810883</Id>
#>     <Id>313847556</Id>
#>     <Id>532821947</Id>
#>     <Id>519796743</Id>
#>     <Id>532821218</Id>
#>     <Id>519794601</Id>
#>   </IdList>
#>   <TranslationSet>
#>     <Translation>
#>       <From>Chlamydiaceae[orgn]</From>
#>       <To>"Chlamydiaceae"[Organism]</To>
#>     </Translation>
...

Or extract parts of the XML data using the reference class method #xmlValue() and an XPath expression:

cpaf$xmlValue("//Id")
#> [1] "519865230" "410810883" "313847556" "532821947" "519796743" "532821218"
#> [7] "519794601"
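
To make the XPath step concrete, the sketch below applies the same expression by hand to a small XML string. It assumes the XML package is installed and uses a shortened eSearchResult document rather than a live response. It illustrates the idea only and does not describe how the `xmlValue()` method is implemented internally.

```r
library(XML)

# A shortened eSearchResult document (illustrative, not a live response)
xml_text <- '<eSearchResult>
  <Count>3</Count>
  <IdList>
    <Id>519865230</Id>
    <Id>410810883</Id>
    <Id>313847556</Id>
  </IdList>
</eSearchResult>'

doc <- xmlParse(xml_text, asText = TRUE)

# "//Id" selects every <Id> node anywhere in the document;
# XML::xmlValue() returns the text content of each selected node.
ids <- xpathSApply(doc, "//Id", xmlValue)
ids

# An absolute path addresses a single element, here the hit count
n_hits <- as.integer(xpathSApply(doc, "/eSearchResult/Count", xmlValue))
n_hits
```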

`esummary`: Retrieving summaries from primary IDs

esummary retrieves document summaries (docsums) from a list of primary IDs. Let’s find out what the first entry for PMP is about:

esum <- esummary(cpaf[1])
#> Warning: HTTPS error: Status 429;
esum
#> Object of class 'esummary' 
#> [1] "HTTPS error: Status 429; "

We can also parse docsums into a tibble:

esum <- esummary(cpaf[1:4])
#> Warning: HTTPS error: Status 429;
content(esum, "parsed")
#> Warning: Errors parsing DocumentSummary
#> list()
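
The `Status 429` warnings in the output above mean that the server rejected the request because too many requests arrived in a short time. One common remedy is to retry after a pause that grows with each attempt. The sketch below is a generic base R helper, not part of reutils: `with_retry()` is a hypothetical name, and it assumes that a failed request shows up as an R warning, as in the output above. Any other warning is treated as a failure too.

```r
# Hypothetical helper, not part of reutils: call `request()` until it
# completes without a warning, waiting base_wait, 2 * base_wait,
# 4 * base_wait, ... seconds between attempts.
with_retry <- function(request, max_tries = 4, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    # A warning aborts the call and is returned as a condition object
    result <- tryCatch(request(), warning = function(w) w)
    if (!inherits(result, "warning")) return(result)
    if (attempt < max_tries) Sys.sleep(base_wait * 2^(attempt - 1))
  }
  stop("request still failing after ", max_tries, " tries: ",
       conditionMessage(result))
}

# Usage sketch (needs network access and the reutils package):
# esum <- with_retry(function() reutils::esummary(cpaf[1:4]))
```

Passing the request as a function means it is re-evaluated on every attempt rather than once up front.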

`efetch`: Downloading full records from Entrez

First, we search the protein database for sequences of the chlamydial protease activity factor, CPAF:

cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
#> Warning: HTTPS error: Status 429;
cpaf
#> Object of class 'esearch' 
#> [1] "HTTPS error: Status 429; "

Let’s fetch the FASTA record for the first protein. To do that, we have to set rettype = "fasta" and retmode = "text":

cpaff <- efetch(cpaf[1], db = "protein", rettype = "fasta", retmode = "text")
#> Warning: HTTPS error: Status 400;
cpaff
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 400; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'text'

Now we can write the sequence to a fasta file by first extracting the data from the efetch object using content():

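write(content(cpaff), file = "~/cpaf.fna")

Alternatively, we can fetch the records as XML by setting retmode = "xml" and extract the sequences and definition lines with XPath expressions: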

cpafx <- efetch(cpaf, db = "protein", rettype = "fasta", retmode = "xml")
#> Warning: HTTPS error: Status 429;
cpafx
#> Object of class 'efetch' 
#> [1] "HTTPS error: Status 429; "
#> EFetch query using the 'protein' database.
#> Query url: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efe...'
#> Retrieval type: 'fasta', retrieval mode: 'xml'

aa <- cpafx$xmlValue("//TSeq_sequence")
aa
#> [1] NA
defline <- cpafx$xmlValue("//TSeq_defline")
defline
#> [1] NA

`einfo`: Information about the Entrez databases

You can use einfo to obtain a list of all database names accessible through the Entrez utilities:

einfo()
#> Object of class 'einfo' 
#> List of Entrez databases
#>  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
#>  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
#>  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
#> [13] "books"           "cdd"             "clinvar"         "gap"            
#> [17] "gapplus"         "grasp"           "dbvar"           "gene"           
#> [21] "gds"             "geoprofiles"     "medgen"          "mesh"           
#> [25] "nlmcatalog"      "omim"            "orgtrack"        "pmc"            
#> [29] "popset"          "proteinclusters" "pcassay"         "protfam"        
#> [33] "pccompound"      "pcsubstance"     "seqannot"        "snp"            
#> [37] "sra"             "taxonomy"        "biocollections"  "gtr"

For each of these databases, we can use einfo again to obtain more information:

einfo("taxonomy")
#> Warning: HTTPS error: Status 429;
#> Object of class 'einfo' 
#> [1] "HTTPS error: Status 429; "
