How to extract the longest isoform from multi fasta file
2
1
Entering edit mode
10.4 years ago
jon.brate ▴ 310

Hi,

I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms.

Like this:

Original:

>scitn003313.1
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Unique:

>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA
perl extract fasta python isoform • 15k views
ADD COMMENT
11
Entering edit mode
10.4 years ago
$ cat input.fa |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |\ #linearize fasta
tr "." "\t" |\ #extract version from header
awk -F '   '  '{printf("%s\t%d\n",$0,length($3));}' |\ #extact length
sort -t '    ' -k1,1 -k4,4nr |\ #sort on name, inverse length
sort -t '    ' -k1,1 -u -s |\ #sort on name, unique, stable sort (keep previous order)
sed 's/    /./' |\ #restore name
cut -f 1,2 |\ #cut name, sequence
tr "\t" "\n"  |\ #back to fasta
fold -w 60 #pretty fasta
ADD COMMENT
1
Entering edit mode

Thanks Pierre!

I seem to manage up to the sort part, and I get this message:

sort: multi-character tab `    '

I tried to read about this but don't understand whats wrong.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I see. It works great. Thanks!

edit: I removed the fold command to avoid line breaks in the sequences

ADD REPLY
0
Entering edit mode

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLY
0
Entering edit mode

Hi Pierre:

Thank you for this answer. I have nearly a identical situation with one key difference, which I will explain below.

The annotation I am working with denotes slightly different nomenclature for protein isoforms.

For the sake of clarity I'll illustrate this using Phillippe's example.

>scitn003313.1a
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Where the following represents unique proteins.

>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Thus, for me to similarly retrieve only unique ids from my fasta file would require tweaking the tr expression. However, it appears splitting with a multi-character delimiter with tr is not straightforward.

How could we modify the your script to obtain a unique set of proteins?

Here's a link to the fasta file in question ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS256.protein.fa.gz

Thanks in advance! - Taylor

ADD REPLY
0
Entering edit mode

I don't get the difference with the original question.

it appears splitting with a multi-character delimiter with tr is not straightforward.

what is the multi-character delimiter ?

ADD REPLY
0
Entering edit mode

The difference is the nomenclature in this new file.

Letters following the digit after the period represent different isoforms of the same protein.

e.g.

scitn003313.1a

scitn003313.1b

are different isoforms. Whereas

scitn003313.1a

scitn003313.2

Are entirely different proteins.

I tried to delimit with the period and the number that followed, but that may not be the solution we need here.

ADD REPLY
0
Entering edit mode

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLY
0
Entering edit mode

of course in sort -t ' ' the argument of the option '-t' is a tabulation....

ADD REPLY
0
Entering edit mode

Hi Pierre,

when I type above command It just mention number of filtered sequences not the fasta file.

[root@psgl mapped_bam]# cat stringtie_transcript.fasta |awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |tr "." "\t" |awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' |sort -t ' ' -k1,1 -k4,4nr |sort -t ' ' -k1,1 -u -s |sed 's/ /./' |cut -f 1,2 |tr "\t" "\n" |fold -w 60 > filter_strg.fasta

head filter_strg.fasta

STRG 41115

there were around 77104 sequences including isoforms in the input file.

Please suggest how I can get fasta sequences for these 41115 transcripts.

ADD REPLY
5
Entering edit mode
10.4 years ago
Philippe ★ 1.9k

A small script using the seqinr library in R. Maybe not the most straightforward but I wrote this quickly.

# Loading library seqinr
library(seqinr)

# Loading your input file
fa <- read.fasta("input.fa")

# Getting sequences and gene names from fa object
genes <- sapply(strsplit(names(fa), "\\."), function(v) {return(v[1])})
sequences <- sapply(fa, c2s)

# Extracting longest transcript for each gene
filtered_seq <- tapply(sequences, genes, function(v) {return(v[which(nchar(v)==max(nchar(v)))])})

# Creating an object suitable for the write.fasta function
obj <- tapply(filtered_seq, 1:length(filtered_seq), s2c)

# Writing output file
write.fasta(obj , names(filtered_seq), file="output.fa")
ADD COMMENT
0
Entering edit mode

Hi,

The above code works fine. If I want to retain only those sequences which starts with "atg" and ends with "tag or tga ot taa". How can I achieve that with above command?

ADD REPLY
0
Entering edit mode

It doesn't work for me at the line

obj <- tapply(filtered_seq, 1:length(filtered_seq), s2c)
ADD REPLY
0
Entering edit mode

I had a problem at the same point. This was my inelegant but effective workaround:

# Creating an object suitable for the write.fasta function
obj <- tapply(filtered_seq, 1:length(filtered_seq), c2s)
obj <- tapply(obj, 1:length(obj), s2c)
names(obj) <- names(filtered_seq)
# Writing output file
write.fasta(obj, names(obj), file="./longest.fa")
ADD REPLY
0
Entering edit mode

Hi Philippe, I wonder if there is any way of indicate second point and not first one in this part of the script genes <- sapply(strsplit(names(fa), "\."), function(v) {return(v[1])}) The names of my transcript have this format MSTRG.5.1 And I need my gene names like this: MSTRG.5 Best regards!

ADD REPLY

Login before adding your answer.

Traffic: 1602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6