2.6 years ago
cxr
Hello,
I have a set of about a thousand sequence regions in TSV form (accession number, strand, start, end), and I'm struggling to find the best way to retrieve them from NCBI's databases. I've tried Batch Entrez, but it only returns the entire record for each entry rather than just my regions of interest. Does anyone have insight into how best to retrieve multiple subsequences across multiple reference sequences from NCBI's databases?
An example of the data I'm working with for context:
BA000007.3 1 5017083 5018620
CP053370.1 1 10266 11819
CP053370.1 1 106369 107922
CP053370.1 1 112532 114085
CP053370.1 1 216122 217675
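For rows in this layout, a minimal Python sketch of one possible approach is to group the regions by accession and build an E-utilities efetch URL per region using the `seq_start`, `seq_stop`, and `strand` parameters. The helper names `parse_regions` and `efetch_url` are hypothetical, the `nuccore` database is assumed, and the strand column is assumed to use 1 for the plus strand (any other value is mapped to efetch's `strand=2`, the minus strand). The sketch only builds URLs and makes no network requests.

```python
from collections import defaultdict
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_regions(lines):
    """Group (strand, start, end) tuples by accession (hypothetical helper)."""
    regions = defaultdict(list)
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        accession, strand, start, end = fields
        regions[accession].append((int(strand), int(start), int(end)))
    return dict(regions)


def efetch_url(accession, strand, start, end):
    """Build an efetch URL for one subsequence (hypothetical helper)."""
    params = {
        "db": "nuccore",  # assumption: nucleotide records
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": end,
        # Assumption: 1 in the input means plus strand; efetch uses 2 for minus.
        "strand": 1 if strand == 1 else 2,
    }
    return f"{EFETCH}?{urlencode(params)}"


rows = """\
BA000007.3 1 5017083 5018620
CP053370.1 1 10266 11819
CP053370.1 1 106369 107922
CP053370.1 1 112532 114085
CP053370.1 1 216122 217675
"""

regions = parse_regions(rows.splitlines())
urls = [
    efetch_url(acc, strand, start, end)
    for acc, hits in regions.items()
    for strand, start, end in hits
]
for url in urls:
    print(url)
```

Grouping by accession first makes it easy to see how many distinct records are involved before deciding whether to issue one request per region or to fetch each record once and slice it locally.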
I see that the 5 rows posted comprise only 2 unique seq-ids. The proposed efetch method will download the entire sequence for the seq-id every time. While this won't be an issue for a few sequences, it can become slow for a large number of sequences. Perhaps something like this will help: