Hi; I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms. Original:
PB.1.1|1:94-3818(+)|c14521/f1p1/240(240 is the sequence length)
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA
PB.1.2|1:1699-3803(+)|c70289/f1p2/124(124 is the sequence length)
TTACAGTATTGAATTTGTTATGAAACCAAAGCTTGAGTTTCTGCTAAGAACCATGAAGAA GCCACTTAAAGCAGTTGTAGAATACCCAAGGTACTTCAGTTATTCACTCGAGGGGAAGATTTAC
ideal Unique isoform:
PB.1.1|1:94-3818(+)|c14521/f1p1/214
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA
Have any awk scripts or tools to obtain the sequence,Thank you very much
Thanks
Alex
How to extract the longest isoform from multi fasta file[1]
Would PB.1.1 be the first isoform, and PB.1.2 be the second for the gene cluster, given your headers? You mentioned awk, but I like to do this with python, and a default dict where the "gene id" is the key, and the values are the sequence lengths. Then you can grab the longest sequence in each value list. Your approach would depend on how big your fasta file is, whether you would store this dict to memory or loop through is sequence and write out.