Extract the longest isoform from transcripts.fasta (rnaSPAdes output)
1
0
Entering edit mode
2.8 years ago
poecile.pal ▴ 50

Good afternoon,

I have fasta file transcripts.fasta (output of rnaSPAdes). It looks like this:

>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGC
ATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATAT
TCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT
>NODE_2_length_40_cov_5904.526310_g0_i1
AAAATTGCCGTGAGCAAACATATTAATGACGAGGAACGCT

etc

Here g is gene number, i is isoform number.

Please tell me how to extract the longest isoforms of each gene from it? Thank you in advance!

Best regards, Poecile

transcriptome isoform assembly rna-seq fasta • 1.4k views
ADD COMMENT
2
Entering edit mode

Try this with bioawk, datamash:

$ bioawk -tc fastx '{print $name, $seq, length($seq)}' test.fa | awk  -v OFS="\t" -F '\t|_' '{print $0, $7}' | datamash -sg4 max 3  -f | cut -f1,2 | awk -v OFS="\n" '{print ">"$1,$2}'

>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGCATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATATTCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT
ADD REPLY
0
Entering edit mode

I can’t thank you enough! It worked for me.

ADD REPLY
3
Entering edit mode
2.8 years ago
O.rka ▴ 740

I’m on my phone but will try to code it out in Python:

from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

with open(path_input, “r”) as f:
    d = dict()
    for id, seq in SimpleFastaParser(f):
        d[id] = seq 
    sequences = pd.Series(d)

lengths = sequences.map(len)

index = lengths.groupby(lambda x: x.split(“_”)[-2]).idxmax()

longest_isoforms = sequences[index]

with open(path_output, "w") as f:
    for id, seq in longest_isoforms.items():
         print(">{}\n{}".format(id,seq), file=f)

Sorry for the autocorrects and formatting. Hope this helps. I’ll clean it up and test later but wanted to answer before I forget.

Edit: Cleaned up the code this morning. Let me know if you have any questions.

Dependencies are pandas and biopython which are pretty standard. This assumes you're ok w/ Python.

ADD COMMENT
0
Entering edit mode

It works great too! I just replaced Len with len. Thanks a million!

ADD REPLY
0
Entering edit mode

Thanks! My bad, I corrected it. I wrote it on my phone so it was full of syntax errors.

ADD REPLY

Login before adding your answer.

Traffic: 1793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6