7.5 years ago
l.souza
Hello,
This is my situation:
I have about 2000 DNA sequences to process, but I only want to work with their coding regions. I have the coordinates of all the CDSs (which I got with Prodigal) in a file with this format:
DEFINITION seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus - type SAT 1 isolate SAT1/NAM01/2010, complete
genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES Location/Qualifiers
CDS 1026..8045
/note="ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.537;conf=99.99;score=1639.83;cscore=1612.89;sscore=26.93;rscore=-13.40;uscore=34.66;tscore=5.68;"
How could I extract the sequences that correspond to these coordinates into a FASTA file?
You'll need to parse out the header and coordinate information from your file, then match to the headers in your fasta, and use the coordinates per header to cut each sequence.
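As a rough illustration of those three steps, here is a minimal Python sketch. It assumes the Prodigal file looks like the excerpt in the question (a DEFINITION line carrying `seqhdr="..."`, followed by `CDS start..end` lines, possibly wrapped in `complement(...)`), and that the first word of each FASTA header matches the first word of `seqhdr`. The function names and the output naming scheme (`ID_start_end`) are made up for this example; they are not part of Prodigal.

```python
import re

# Translation table used to reverse-complement CDSs on the minus strand.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def read_fasta(lines):
    """Return {sequence ID (first word of the header): sequence}."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None and line:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}

def parse_prodigal(lines):
    """Yield (sequence ID, start, end, is_reverse) for every CDS line."""
    seq_id = None
    for line in lines:
        header = re.search(r'seqhdr="([^\s"]+)', line)
        if header:
            seq_id = header.group(1)  # first word of the original header
            continue
        cds = re.match(r'\s*CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)', line)
        if cds and seq_id is not None:
            yield seq_id, int(cds.group(2)), int(cds.group(3)), cds.group(1) is not None

def extract_cds(fasta_lines, prodigal_lines):
    """Yield (name, sequence) for each CDS; coordinates are 1-based, inclusive."""
    seqs = read_fasta(fasta_lines)
    for seq_id, start, end, is_reverse in parse_prodigal(prodigal_lines):
        if seq_id not in seqs:
            continue  # no FASTA record with a matching header
        cds = seqs[seq_id][start - 1:end]
        if is_reverse:
            cds = cds.translate(COMPLEMENT)[::-1]
        yield f"{seq_id}_{start}_{end}", cds
```

To use it, pass two open file handles (the FASTA and the Prodigal output) to `extract_cds` and write each returned pair as `>name` followed by the sequence. If your headers or coordinate lines look different from the excerpt, the two regular expressions are the part to adjust.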
Can you post a few more lines of your file from Prodigal?
The file consists of repetitions like this...
Is this GenBank format? You can convert it to BED (see some discussion here) and get the regions of interest with bedtools or BEDOPS.
Not all of my sequences are in GenBank format!
Which output format did you choose for Prodigal? Do you have a mix of formats?