Question

Extract the longest isoform from transcripts.fasta (rnaSPAdes output)

0

Entering edit mode

2.8 years ago

poecile.pal ▴ 50

Good afternoon,

I have fasta file transcripts.fasta (output of rnaSPAdes). It looks like this:

>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGC
ATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATAT
TCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT
>NODE_2_length_40_cov_5904.526310_g0_i1
AAAATTGCCGTGAGCAAACATATTAATGACGAGGAACGCT

etc

Here g is gene number, i is isoform number.

Please tell me how to extract the longest isoforms of each gene from it? Thank you in advance!

Best regards, Poecile

transcriptome isoform assembly rna-seq fasta • 1.4k views

ADD COMMENT • link updated 2.8 years ago by O.rka ▴ 740 • written 2.8 years ago by poecile.pal ▴ 50

2

Entering edit mode

Try this with bioawk, datamash:

$ bioawk -tc fastx '{print $name, $seq, length($seq)}' test.fa | awk  -v OFS="\t" -F '\t|_' '{print $0, $7}' | datamash -sg4 max 3  -f | cut -f1,2 | awk -v OFS="\n" '{print ">"$1,$2}'

>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGCATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATATTCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT

ADD REPLY • link 2.8 years ago by cpad0112 21k

0

Entering edit mode

I can’t thank you enough! It worked for me.

ADD REPLY • link 2.8 years ago by poecile.pal ▴ 50

score 3 · Accepted Answer · 2022-03-03

3

Entering edit mode

2.8 years ago

O.rka ▴ 740

I’m on my phone but will try to code it out in Python:

from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

with open(path_input, “r”) as f:
    d = dict()
    for id, seq in SimpleFastaParser(f):
        d[id] = seq 
    sequences = pd.Series(d)

lengths = sequences.map(len)

index = lengths.groupby(lambda x: x.split(“_”)[-2]).idxmax()

longest_isoforms = sequences[index]

with open(path_output, "w") as f:
    for id, seq in longest_isoforms.items():
         print(">{}\n{}".format(id,seq), file=f)

Sorry for the autocorrects and formatting. Hope this helps. I’ll clean it up and test later but wanted to answer before I forget.

Edit: Cleaned up the code this morning. Let me know if you have any questions.

Dependencies are pandas and biopython which are pretty standard. This assumes you're ok w/ Python.

ADD COMMENT • link 2.8 years ago by O.rka ▴ 740

0

Entering edit mode

It works great too! I just replaced Len with len. Thanks a million!

ADD REPLY • link 2.8 years ago by poecile.pal ▴ 50

0

Entering edit mode

Thanks! My bad, I corrected it. I wrote it on my phone so it was full of syntax errors.

ADD REPLY • link 2.8 years ago by O.rka ▴ 740