Trinity FASTA entry example:
>TRINITY_DN1000|c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC
ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA
AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC
CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA
Description of Trinity headers: 'TRINITY_DN1000|c115_g5'
corresponds to
'gene id: TRINITY_DN1000|c115_g5' encoding 'isoform id:
TRINITY_DN1000|c115_g5_i1'.
I prefer single line FASTA, borrowed from answer on this post:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Trinity.fasta > Trinity_single.fasta
Here's a python script to get the longest isoform per gene:
#!/usr/bin/env python
import sys
from collections import defaultdict
transcripts = defaultdict(list)
with open(sys.argv[1], 'r') as f:
for line in f:
if line.startswith(">"):
transcripts['_'.join(line.strip().split(' ')[0].split('|')[1].split('_')[:2])].append(next(f).strip()) # Groups isoforms by gene id
with open(sys.argv[1].split(".fasta")[1], 'w') as out:
for t in transcripts:
print >> out, '>' + t, max(transcripts[t], key=len)
Save as get_longest_iso.py
, run as python get_longest_iso.py Trinity_single.fasta
Are you trying to get the longest isoform per gene cluster?
yes, I just want to have one longest isoform. How can I realize that? ZQ