Question

DNA sequences trimming methods

1

8.1 years ago

l.souza ▴ 80

Hello,

This is my situation:

I have about 2000 DNA sequences to process, but I only want to work with their coding regions. I have the coordinates of all CDSs (predicted with Prodigal) in a file with this format:

DEFINITION  seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus - type SAT 1 isolate SAT1/NAM01/2010, complete genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1026..8045
                 /note="ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.537;conf=99.99;score=1639.83;cscore=1612.89;sscore=26.93;rscore=-13.40;uscore=34.66;tscore=5.68;"

How can I extract the sequences that correspond to these coordinates into a FASTA file?

dna trimming sequence cds • 2.4k views

8.1 years ago by l.souza ▴ 80

0

You'll need to parse the header and coordinate information out of your Prodigal file, match each header to the corresponding header in your FASTA file, and then use the coordinates for that header to cut each sequence.
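The parse, match, and cut steps could be sketched in Python roughly as follows. This is only an illustration, not a tested pipeline: the function names are made up for this example, it assumes the first word of `seqhdr` (the accession) is also the first word of the matching FASTA header, and it assumes reverse-strand genes appear as `complement(start..end)`, as in GenBank-style feature tables.

```python
import re

# Assumption: the accession is the first whitespace-delimited token of seqhdr.
DEF_RE = re.compile(r'seqhdr="([^\s"]+)')
# Accepts plain, partial (< or >) and complement(...) locations.
CDS_RE = re.compile(r"^\s+CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)")
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_prodigal_gbk(text):
    """Map each accession to a list of (start, end, strand), 1-based inclusive."""
    coords = {}
    current = None
    for line in text.splitlines():
        if line.startswith("DEFINITION"):
            m = DEF_RE.search(line)
            current = m.group(1) if m else None
            if current is not None:
                coords.setdefault(current, [])
        elif current is not None:
            m = CDS_RE.match(line)
            if m:
                strand = "-" if m.group(1) else "+"
                coords[current].append((int(m.group(2)), int(m.group(3)), strand))
    return coords


def read_fasta(text):
    """Map the first word of each FASTA header to its full sequence."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif line and name is not None:
            chunks[name].append(line)
    return {key: "".join(value) for key, value in chunks.items()}


def extract_cds(fasta_text, prodigal_text):
    """Return FASTA-formatted text with one record per predicted CDS."""
    seqs = read_fasta(fasta_text)
    records = []
    for acc, features in parse_prodigal_gbk(prodigal_text).items():
        if acc not in seqs:
            continue  # no FASTA record with a matching header
        for i, (start, end, strand) in enumerate(features, 1):
            sub = seqs[acc][start - 1:end]  # 1-based inclusive to Python slice
            if strand == "-":
                sub = sub.translate(COMPLEMENT)[::-1]
            records.append(f">{acc}_cds{i} {start}..{end}({strand})\n{sub}")
    return "\n".join(records) + "\n" if records else ""
```

To use it, read both files into strings (for example with `open(path).read()`) and write the return value of `extract_cds` to a new FASTA file. The partial-gene markers `<` and `>` are simply ignored here, so a location such as `1011..>8009` is cut at 1011 and 8009.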

Can you post a few more lines of your file from Prodigal?

8.1 years ago by st.ph.n ★ 2.7k

0

DEFINITION  seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus - type SAT 1 isolate SAT1/NAM01/2010, complete genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1026..8045
                 /note="ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.537;conf=99.99;score=1639.83;cscore=1612.89;sscore=26.93;rscore=-13.40;uscore=34.66;tscore=5.68;"

DEFINITION  seqnum=2;seqlen=8010;seqhdr="KR108948.1 Foot-and-mouth disease virus - type SAT 1 isolate KNP/196/91/1 polyprotein gene, partial cds";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1011..>8009
                 /note="ID=2_1;partial=01;start_type=ATG;rbs_motif=TTTA;rbs_spacer=14bp;gc_cont=0.537;conf=99.99;score=1624.62;cscore=1579.25;sscore=45.37;rscore=16.14;uscore=23.55;tscore=5.68;"

DEFINITION  seqnum=3;seqlen=8144;seqhdr="JF749860.1 Foot-and-mouth disease virus - type SAT 1 isolate KEN_004/2002, complete genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1018..8037
                 /note="ID=3_1;partial=00;start_type=ATG;rbs_motif=AAA;rbs_spacer=14bp;gc_cont=0.540;conf=99.99;score=1468.42;cscore=1472.82;sscore=-4.40;rscore=0.64;uscore=-10.72;tscore=5.68;"

DEFINITION  seqnum=4;seqlen=8156;seqhdr="KM268899.1 Foot-and-mouth disease virus - type SAT 1 isolate TAN/22/2012, complete genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1006..8025
                 /note="ID=4_1;partial=00;start_type=ATG;rbs_motif=TTTTA;rbs_spacer=14bp;gc_cont=0.537;conf=99.99;score=1462.32;cscore=1401.95;sscore=60.38;rscore=17.40;uscore=37.30;tscore=5.68;"

The whole file consists of repeated records like these.

8.1 years ago by l.souza ▴ 80

0

Is this GenBank format? You can convert it to BED (see some discussion here) and get the regions of interest with bedtools or bedops.
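As a rough sketch of the conversion step, the Python snippet below turns GenBank-style Prodigal records into BED lines. The function name and the `_cds` name suffix are invented for this example, and it assumes the first word of `seqhdr` is the sequence name that the downstream tool sees in the FASTA file. Note that BED starts are 0-based and ends are exclusive, so a CDS at `1026..8045` becomes `1025` and `8045`.

```python
import re

# Assumption: the first word of seqhdr is the name used in the FASTA file.
DEF_RE = re.compile(r'seqhdr="([^\s"]+)')
# Accepts plain, partial (< or >) and complement(...) locations.
CDS_RE = re.compile(r"^\s+CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)")


def prodigal_gbk_to_bed(text):
    """Return a list of 6-column BED lines, one per CDS feature."""
    bed_lines = []
    chrom = None
    for line in text.splitlines():
        if line.startswith("DEFINITION"):
            m = DEF_RE.search(line)
            chrom = m.group(1) if m else None
        elif chrom is not None:
            m = CDS_RE.match(line)
            if m:
                strand = "-" if m.group(1) else "+"
                start, end = int(m.group(2)), int(m.group(3))
                # GenBank: 1-based inclusive; BED: 0-based start, exclusive end.
                bed_lines.append(
                    f"{chrom}\t{start - 1}\t{end}\t{chrom}_cds\t0\t{strand}"
                )
    return bed_lines
```

The resulting lines can be written to a file and passed to something like `bedtools getfasta -fi genomes.fasta -bed cds.bed -s -fo cds.fasta`, where the file names are placeholders and `-s` makes bedtools respect the strand column.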

8.1 years ago by h.mon 35k

0

Not all of my sequences are in GenBank format!

8.1 years ago by l.souza ▴ 80

0

What output format did you choose for Prodigal? Do you have a mix of formats?

8.1 years ago by h.mon 35k

score 1 · Accepted Answer · 2017-06-07

1

8.1 years ago

l.souza ▴ 80

I solved my problem by using the '-d' option of Prodigal, which writes the nucleotide sequences of the predicted genes to a FASTA file.
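For anyone scripting this, a minimal Python wrapper around that call might look like the sketch below. The function names and file names are placeholders chosen for this example; only the `-i` (input), `-o` (coordinate output) and `-d` (gene nucleotide sequences) flags come from Prodigal itself.

```python
import subprocess


def prodigal_command(genome_fasta, genes_fasta, coords_out="coords.gbk"):
    """Build the argument list for a Prodigal run that also writes CDS sequences."""
    return [
        "prodigal",
        "-i", genome_fasta,  # input sequences (FASTA)
        "-o", coords_out,    # gene coordinates, as in the records shown above
        "-d", genes_fasta,   # nucleotide sequences of the predicted genes
    ]


def run_prodigal(genome_fasta, genes_fasta, coords_out="coords.gbk"):
    """Run Prodigal; raises CalledProcessError if it exits with an error."""
    subprocess.run(prodigal_command(genome_fasta, genes_fasta, coords_out), check=True)
```

Calling `run_prodigal("genomes.fasta", "cds.fasta")` would then produce the trimmed coding sequences directly, with no separate coordinate parsing needed, provided `prodigal` is installed and on the PATH.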

8.1 years ago by l.souza ▴ 80