Hello! I have been banging my head against a wall for the past 2 days trying to get a fasta file of DNA sequences from a list of protein accessions across a broad range of organisms. I have roamed biostars for questions asking the same thing (it should be a common enough task!) but all of the answers I've found don't get me quite what I want.
Here are a few of the protein accessions for reference (I have ~400 in total):
WP_011071901.1
WP_011071902.1
WP_011071903.1
WP_011789901.1
WP_014610284.1
WP_101053336.1
WP_101053338.1
I would like the DNA coding sequence associated with these proteins. Of all the ways I've tried, these seem to be the 2 most promising approaches thus far.
Approach 1: Drawn directly from this answer on a previous post, I have tried to use eutils to link a protein query with the nuccore database. However, as stated in the linked answer, this returns ALL coding sequences from the nucleotide sequence, not just the CDS for the protein in question. For example, for a single protein accession, I would write:
esearch -query "WP_011071901.1" -db protein |elink -target nuccore|efetch -format fasta_cds_na
but this returns all CDS, not just the CDS for WP_011071901.1. Also, I would want to re-work this in a way where I can do a batch extraction of all my proteins, not just 1 at a time.
Approach 2: First find start and stop positions of protein coding sequence in associated nucleotide, then get DNA sequence from there.
I have successfully extracted the nucleotide accession and the start and stop positions already by making use of the code posted here.
Then, I know I can access the nucleotide sequence one at a time by typing
efetch -db nuccore -id NC_004347.2 -format fasta -seq_start 1859348 -seq_stop 1861363
but this is extremely inefficient given the number of proteins I am working with. I can also see this becoming an issue as some of these protein accessions are associated with more than 1 genome/nucleotide. I know which Nucleotide accession I would want to use for each associated protein if that helps for this approach.
Any guidance or help would be appreciated, thank you!
You are working with
WP*
accessions which are special accessions potentially representing multiple organisms. Where did you get these from? Is there any chance of using non-WP* accessions? Your first solution will work in that case.Unfortunately not, although I DO know which nucleotide records I would want to pull from. Is there a way to specify this in eutils?
Are these genome records or individual genes? Can you provide an example?
Certainly. These are usually whole genomes or contigs/scaffolds from a genome. For example, WP_011071901.1 is found on both NZ_CP053946.1 and NC_004347.2. These are from 2 different genome assemblies of the same organism. However, I know that I'd just like the DNA sequence encoding WP_011071901.1 from NC_004347.2. Let me know if there's other info I could provide. Thank you!
How many genomes do you have? Since you already have the accession numbers and start and stop of each gene, maybe you can download the fasta file for each genome, and use a tool like
FragGeneScan
[It's really fast], and get the gene sequences, and use grep -A 1 $start to extract the genes you're looking for. (grep -A 1 works with FragGeneScan outputs).You can make a bash script to do this these.
Alternatively, you can download both fna and gff file from ncbi and convert the gff to gene sequence, and extract the one you're interested in.
Extract Cds Fastas From A Gff Annotation + Reference Sequence
http://www.metagenomics.wiki/tools/fastq/ncbi-ftp-genome-download/gff-to-ffn
Thank you! I am working with ~160 genomes, so I'd prefer not to download each genome if possible. Also, because I have several proteins per genome (and nucleotide record in some cases), it would also be ideal if I could name each fasta record with the corresponding protein accession, rather than the nucleotide accession and start/stop info. Thank you for both of these options though.
I think you can easily use
sed
to replace them. Something like this: