Good afternoon,
I have a FASTA file, transcripts.fasta (the output of rnaSPAdes). It looks like this:
>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGC
ATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATAT
TCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT
>NODE_2_length_40_cov_5904.526310_g0_i1
AAAATTGCCGTGAGCAAACATATTAATGACGAGGAACGCT
etc
Here g is the gene number and i is the isoform number.
Could you please tell me how to extract the longest isoform of each gene from it? Thank you in advance!
Best regards, Poecile
Try this with bioawk, datamash:
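For anyone without bioawk or datamash installed, the same idea can be sketched in plain Python. This is only a minimal sketch, not the command from this answer. It assumes every header ends in _g<gene>_i<isoform>, as in the example in the question, and that "longest" means the actual sequence length rather than the length_ field in the header. The function names (read_fasta, longest_isoforms, write_fasta) are made up for illustration.

```python
import re
import sys

# Assumption: headers end in _g<gene>_i<isoform>, as in the question's example.
HEADER_RE = re.compile(r"_g(\d+)_i(\d+)$")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoforms(records):
    """Return {gene_id: (name, sequence)} with the longest isoform per gene.

    Records whose name does not match HEADER_RE are skipped.
    On a tie, the isoform seen first is kept.
    """
    best = {}
    for name, seq in records:
        match = HEADER_RE.search(name)
        if match is None:
            continue
        gene = match.group(1)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best


def write_fasta(records, out, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping sequences at `width`."""
    for name, seq in records:
        out.write(">" + name + "\n")
        for start in range(0, len(seq), width):
            out.write(seq[start:start + width] + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        best = longest_isoforms(read_fasta(handle))
    write_fasta(best.values(), sys.stdout)
```

Saved under a name of your choice (longest_isoforms.py here is just a placeholder), it could be run as: python longest_isoforms.py transcripts.fasta > longest.fasta. In the two-record example from the question both transcripts belong to g0, so only NODE_1 would be kept.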
I can’t thank you enough! It worked for me.