Question

Retrieving specific subsequences from multiple references on NCBI?

2.6 years ago

cxr • 0

Hello,

I have a set of about a thousand regions in a TSV file (Accession_Number:Strand:Start:End), and I'm struggling to figure out the best way to retrieve them from NCBI's databases. I've tried Batch Entrez, but it only returns the entire record for each entry rather than just my specific regions of interest. Does anyone have insight into how best to retrieve multiple subsequences across multiple references from NCBI's databases?

An example of the data I'm working with for context:

BA000007.3  1   5017083 5018620  
CP053370.1  1   10266   11819  
CP053370.1  1   106369  107922  
CP053370.1  1   112532  114085  
CP053370.1  1   216122  217675

ncbi reference databases dna rna • 829 views

ADD COMMENT • link updated 2.6 years ago by vkkodali_ncbi ★ 3.8k • written 2.6 years ago by cxr • 0

score 0 · Answer 1 · 2022-05-09

2.6 years ago

GenoMax 147k

You can use Entrez Direct (EDirect) for this. Loop over your list and run one efetch call per region, for example:

$ efetch -db nuccore -id CP053370.1 -seq_start 10266 -seq_stop 10300 -format fasta
>CP053370.1:10266-10300 Lysinibacillus sphaericus strain NEU 1003 chromosome
TTTATGGAGAGTTTGATCCTGGCTCAGGACGAACG
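
The loop mentioned above could look roughly like the sketch below. It assumes the regions file is whitespace-delimited with the columns accession, strand, start, stop, and that the strand column already uses the 1 (plus) / 2 (minus) values that efetch's -strand option takes. The function name fetch_regions and the file names are made up for illustration.

```shell
# Fetch one subsequence per line of a regions file.
# Assumed columns: accession, strand (1 = plus, 2 = minus), start, stop.
fetch_regions() {
  local regions_file=$1
  local acc strand start stop
  # "|| [ -n "$acc" ]" also handles a last line with no trailing newline
  while read -r acc strand start stop || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue   # skip blank lines
    # </dev/null stops efetch from consuming the rest of the loop's input
    efetch -db nuccore -id "$acc" -seq_start "$start" -seq_stop "$stop" \
      -strand "$strand" -format fasta < /dev/null
  done < "$regions_file"
}

# Usage (hypothetical file names):
#   fetch_regions regions.tsv > subsequences.fasta
```

This makes one request per region, so for about a thousand regions it may be worth setting an NCBI API key and adding a short sleep between calls.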

ADD COMMENT • link 2.6 years ago by GenoMax 147k

I see that the 5 rows posted comprise only 2 unique seq-ids. The proposed efetch method will download the entire sequence for the seq-id every time. That won't be an issue for a few sequences, but it can become slow for a large number of them. Perhaps something like this will help:

## make a bed-like file with regions
$ cat regions.txt 
BA000007.3      5017083 5018620
CP053370.1      10266   11819
CP053370.1      106369  107922
CP053370.1      112532  114085
CP053370.1      216122  217675
## use efetch to download sequences for each seq-id only once
## then use seqkit subseq to extract sequences 
$ cut -f1 regions.txt | sort -u | epost -db nuccore | efetch -format fasta | seqkit subseq --bed regions.txt
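
One thing to watch with the BED route: standard BED uses 0-based start coordinates, whereas the coordinates in the question look 1-based and inclusive (the same convention efetch's -seq_start uses). Assuming that is the case, and that seqkit follows the usual BED convention, the start column needs to be shifted by one, or every extracted region will begin one base late. A minimal sketch of that conversion, with the hypothetical function name to_bed and hypothetical file names:

```shell
# Convert the original 4-column file (accession, strand, start, stop;
# assumed 1-based, inclusive) into 3-column BED (0-based start, same end).
to_bed() {
  awk -v OFS='\t' 'NF >= 4 { print $1, $3 - 1, $4 }' "$1"
}

# Usage (hypothetical file names):
#   to_bed regions.tsv > regions.bed
#   cut -f1 regions.bed | sort -u | epost -db nuccore | efetch -format fasta \
#     | seqkit subseq --bed regions.bed > subsequences.fasta
```

This sketch drops the strand column, so minus-strand regions would still need to be reverse-complemented afterwards.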

ADD REPLY • link 2.6 years ago by vkkodali_ncbi ★ 3.8k