Hi guys,
I have a couple of thousand sequences that I translated using getorf from EMBOSS. I now need to select the longest ORF that getorf reported for each contig. How can I do this? For example, I have translated these two contigs:
Contig: 01003765.1 Contig562 mRNA sequence
and
Contig: 5553765.1_1 Contig562 mRNA sequence
>01003765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGA
>01003765.1_2 [259 - 291] Contig562 mRNA sequence
ATGAGAATCAACTGGACGATGCCTGTGAACATA
>01003765.1_3 [427 - 459] Contig562 mRNA sequence
ATGTCTACGGATACAGTCACGATTCTAGAGGTA
>01003765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAATCAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGA
GCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCA
TCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATAC
AGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
>5553765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGACAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGAGCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCATCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATACAGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
>5553765.1_2 [259 - 291] Contig562 mRNA sequence
ATGAGAATCAACTGGACGATGCCTGTGAACATA
>5553765.1_3 [427 - 459] Contig562 mRNA sequence
ATGTCTACGGATACAGTCACGATTCTAGAGGTA
>5553765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAAT
As a result, I want only the longest sequence for each contig, all written to the same file:
>01003765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAATCAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGA
GCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCA
TCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATAC
AGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
>5553765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGACAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGAGCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCATCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATACAGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
I would really appreciate it if you could suggest a solution to this problem. Thank you.
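One way to approach this without extra dependencies is a short Python script. This is only a sketch under a few assumptions: that getorf names each ORF by appending `_N` to the contig ID (as in the headers above), and that "longest" means the longest sequence actually present in the file rather than the coordinate span in the header. The function names (`read_fasta`, `contig_id`, `longest_per_contig`, `write_fasta`) are made up for illustration.

```python
import re
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_id(header):
    """Assumed convention: the first word is '<contig>_<ORF number>'."""
    first_word = header.split()[0]
    return re.sub(r"_\d+$", "", first_word)


def longest_per_contig(records):
    """Keep the longest record per contig; on a tie, the first one wins."""
    best = {}  # dicts keep insertion order, so contigs stay in input order
    for header, seq in records:
        key = contig_id(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def main(in_path, out_path):
    with open(in_path) as fin:
        kept = longest_per_contig(read_fasta(fin))
    with open(out_path, "w") as fout:
        write_fasta(kept, fout)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice and run with an input and an output file path as its two arguments, this should keep 01003765.1_4 and 5553765.1_1 for the example above, since those are the longest records for their contigs. If your contig IDs themselves end in `_<digits>`, the regular expression in `contig_id` would need adjusting.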
I would like to ask how you extracted the ORFs from the sequences. Also, some ORFs may be incomplete: I am trying to extract both complete and incomplete ORFs from DNA sequences. I hope you can help.
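For the complete versus incomplete distinction, here is a minimal Python sketch of a plain ORF scan. This is not how getorf works internally, just an illustration. It assumes the standard stop codons (TAA, TAG, TGA), an ATG start, and the forward strand only. It flags an ORF as incomplete when it begins with ATG but runs off the end of the sequence without a stop codon. ORFs that are missing their start are not handled. The name `find_orfs` and the `min_len` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=30):
    """Scan the three forward frames for ORFs starting at ATG.

    Returns dicts with 0-based 'start', exclusive 'end', and 'complete'
    (True if a stop codon was found, False if the sequence ended first).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current ORF's ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append({"start": start, "end": i + 3, "complete": True})
                start = None
        if start is not None:
            # No stop codon before the end: trim to whole codons and flag it.
            end = start + ((len(seq) - start) // 3) * 3
            if end - start >= min_len:
                orfs.append({"start": start, "end": end, "complete": False})
    return orfs
```

To cover the reverse strand as well, the same function could be run on the reverse complement of each sequence.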