Question

extract the longest isoform from multi fasta file

0

Entering edit mode

8.9 years ago

Alex ▴ 50

Hi; I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms. Original:

PB.1.1|1:94-3818(+)|c14521/f1p1/240(240 is the sequence length)
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA

PB.1.2|1:1699-3803(+)|c70289/f1p2/124(124 is the sequence length)
TTACAGTATTGAATTTGTTATGAAACCAAAGCTTGAGTTTCTGCTAAGAACCATGAAGAA GCCACTTAAAGCAGTTGTAGAATACCCAAGGTACTTCAGTTATTCACTCGAGGGGAAGATTTAC

ideal Unique isoform:

PB.1.1|1:94-3818(+)|c14521/f1p1/214 
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA

Have any awk scripts or tools to obtain the sequence,Thank you very much

Thanks

Alex

RNA-Seq • 6.2k views

ADD COMMENT • link updated 8.9 years ago by Farbod ★ 3.4k • written 8.9 years ago by Alex ▴ 50

0

Entering edit mode

How to extract the longest isoform from multi fasta file[1]

ADD REPLY • link 8.9 years ago by Sej Modha 5.3k

0

Entering edit mode

Would PB.1.1 be the first isoform, and PB.1.2 be the second for the gene cluster, given your headers? You mentioned awk, but I like to do this with python, and a default dict where the "gene id" is the key, and the values are the sequence lengths. Then you can grab the longest sequence in each value list. Your approach would depend on how big your fasta file is, whether you would store this dict to memory or loop through is sequence and write out.

ADD REPLY • link 8.9 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2016-12-01

1

Entering edit mode

8.9 years ago

Farbod ★ 3.4k

Dear Alex, Hi.

You can have a look at perl Script to get the longest isoform for each 'gene' of Trinity team.

And please be aware that : The longest transcript isn't always the best transcript!

Instead you can collect longest ORF for each gene using tools like Transdecoder.