Extract the longest isoform from transcripts.fasta (rnaSPAdes output)
1
Good afternoon,
I have fasta file transcripts.fasta
(output of rnaSPAdes). It looks like this:
>NODE_1_length_180_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAAACCCTAGCTATTCTTAGC
ATGATATTGGGGAATCTCATTGCTATGACTCAAACAAGCATGAAACGTATGCTTGCATAT
TCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATAATTGTTGGAGACTCAAATGAT
>NODE_2_length_40_cov_5904.526310_g0_i1
AAAATTGCCGTGAGCAAACATATTAATGACGAGGAACGCT
etc
Here g
is gene number, i
is isoform number.
Please tell me how to extract the longest isoforms of each gene from it?
Thank you in advance!
Best regards,
Poecile
transcriptome
isoform
assembly
rna-seq
fasta
• 1.7k views
I’m on my phone but will try to code it out in Python:
from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd
with open(path_input, “r”) as f:
d = dict()
for id, seq in SimpleFastaParser(f):
d[id] = seq
sequences = pd.Series(d)
lengths = sequences.map(len)
index = lengths.groupby(lambda x: x.split(“_”)[-2]).idxmax()
longest_isoforms = sequences[index]
with open(path_output, "w") as f:
for id, seq in longest_isoforms.items():
print(">{}\n{}".format(id,seq), file=f)
Sorry for the autocorrects and formatting. Hope this helps. I’ll clean it up and test later but wanted to answer before I forget.
Edit:
Cleaned up the code this morning. Let me know if you have any questions.
Dependencies are pandas and biopython which are pretty standard. This assumes you're ok w/ Python.
Login before adding your answer.
Traffic: 2443 users visited in the last hour
Try this with bioawk, datamash:
I can’t thank you enough! It worked for me.