Hi,
I have a FASTA file of transcript sequences, and some of the transcripts have multiple isoforms. I want to make a unique list of the transcripts, keeping only the longest sequence where a transcript has several isoforms.
Like this:
Original:
>scitn003313.1
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA
Unique:
>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA
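One way to get from "Original" to "Unique" is a single awk pass. This is only a minimal sketch, assuming every header has the form >name.N (N being the isoform number) and that the whole file fits in memory; the function name longest_isoform is made up for illustration.

```shell
# Sketch: print the longest isoform of each transcript, in input order.
# Assumes every header is ">name.N" (N = isoform number) and that the
# whole file fits in memory. longest_isoform is a hypothetical helper name.
longest_isoform() {
    awk '
        /^>/ {
            id = substr($1, 2)                 # ID without the leading ">"
            if (!(id in seq)) { ids[++n] = id; seq[id] = "" }
            next
        }
        { seq[id] = seq[id] $0 }               # join wrapped sequence lines
        END {
            for (i = 1; i <= n; i++) {
                id = ids[i]
                key = id
                sub(/\.[0-9]+$/, "", key)      # strip the ".N" isoform suffix
                if (!(key in best)) {
                    keys[++m] = key            # first time this transcript is seen
                    best[key] = id
                } else if (length(seq[id]) > length(seq[best[key]])) {
                    best[key] = id             # a longer isoform replaces the old one
                }
            }
            for (j = 1; j <= m; j++)
                printf(">%s\n%s\n", best[keys[j]], seq[best[keys[j]]])
        }
    ' "$@"
}
```

It could be called as longest_isoform transcripts.fasta > unique.fasta (both file names are placeholders). If two isoforms have the same length, the one that appears first in the file is kept.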
Thanks Pierre!
I can get as far as the sort step, but then I get this message:
I tried to read about this but don't understand what's wrong.
It's a tab, not spaces: http://stackoverflow.com/questions/10627989/how-do-i-insert-a-tab-character-in-iterm
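If typing a literal tab at the prompt is awkward, the tab can be stored in a variable instead. The snippet below is a small sketch on made-up data, not part of the original command; the variable name TAB is arbitrary.

```shell
# Sketch: hand sort a real tab as its field separator.
# printf '\t' works in any POSIX shell; in bash, zsh or ksh the
# shorter form  sort -t $'\t'  does the same thing.
TAB=$(printf '\t')

# Made-up two-column input: ID, then a length.
# Sort by ID, then by length in descending numeric order.
printf 'b\t2\na\t10\na\t3\n' | sort -t "$TAB" -k1,1 -k2,2nr
```

Pasting a command from a web page usually turns the tab into one or more spaces, which is what triggers the "multi-character tab" complaint from sort.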
I see. It works great. Thanks!
edit: I removed the fold command to avoid line breaks in the sequences
Hi Pierre:
Thank you for this answer. I have a nearly identical situation with one key difference, which I will explain below.
The annotation I am working with denotes slightly different nomenclature for protein isoforms.
For the sake of clarity I'll illustrate this using Phillippe's example.
Where the following represents unique proteins.
So, to retrieve only unique IDs from my FASTA file in the same way, I would need to tweak the tr expression. However, splitting on a multi-character delimiter with tr does not appear to be straightforward.
How could we modify your script to obtain a unique set of proteins?
Here's a link to the FASTA file in question: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS256.protein.fa.gz
Thanks in advance! - Taylor
I don't get the difference with the original question.
What is the "multi-character delimiter"?
The difference is the nomenclature in this new file.
Letters following the digit after the period represent different isoforms of the same protein.
e.g.
are different isoforms. Whereas
Are entirely different proteins.
I tried to delimit with the period and the number that followed, but that may not be the solution we need here.
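For this naming scheme, one option is to skip tr altogether and build the grouping key inside awk. The following is a minimal sketch under stated assumptions: the ID is the first whitespace-delimited token of the header, isoforms differ only by lowercase letters after the final ".digits", and the file fits in memory. The IDs in the comments and the function name longest_protein_isoform are invented for illustration and are not taken from the WormBase file.

```shell
# Sketch: keep the longest isoform when isoforms are marked by trailing
# letters, e.g. hypothetical IDs "x.1a" and "x.1b" (same protein) versus
# "x.2" (a different protein). longest_protein_isoform is a made-up name.
longest_protein_isoform() {
    awk '
        /^>/ {
            id = substr($1, 2)                 # first token, without ">"
            if (!(id in seq)) { ids[++n] = id; seq[id] = "" }
            hdr[id] = $0                       # keep the full header line
            next
        }
        { seq[id] = seq[id] $0 }               # join wrapped sequence lines
        END {
            for (i = 1; i <= n; i++) {
                id = ids[i]
                key = id
                # "x.1a" and "x.1b" share the key "x.1"; "x.2" stays as is
                if (key ~ /\.[0-9]+[a-z]+$/) sub(/[a-z]+$/, "", key)
                if (!(key in best)) {
                    keys[++m] = key
                    best[key] = id
                } else if (length(seq[id]) > length(seq[best[key]])) {
                    best[key] = id
                }
            }
            for (j = 1; j <= m; j++)
                printf("%s\n%s\n", hdr[best[keys[j]]], seq[best[keys[j]]])
        }
    ' "$@"
}
```

Since the file is gzipped, it could be fed in as gunzip -c proteins.fa.gz | longest_protein_isoform > unique.fa (placeholder file names). It is worth checking a few real headers first to confirm they match the assumed pattern.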
When I type the above commands one by one, I also get this message: sort: multi-character tab ` '.
Please share how you resolved it. I could not work it out from the link above.
Of course: in sort -t ' ' the character between the quotes (the argument of the -t option) is a tab, not a space.
Hi Pierre,
When I run the command above, the output only gives the number of filtered sequences, not the FASTA records themselves.
[root@psgl mapped_bam]# cat stringtie_transcript.fasta |awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |tr "." "\t" |awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' |sort -t ' ' -k1,1 -k4,4nr |sort -t ' ' -k1,1 -u -s |sed 's/ /./' |cut -f 1,2 |tr "\t" "\n" |fold -w 60 > filter_strg.fasta
head filter_strg.fasta
There were around 77104 sequences, including isoforms, in the input file.
Please suggest how I can get the FASTA sequences for these 41115 transcripts.
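In the command as pasted, every place that should hold a tab (awk -F, both sort -t options and the sed pattern) shows a space, so the fields are probably not being split as intended. Below is a hedged sketch of the same linearize, sort and de-duplicate idea with every tab written explicitly. It assumes headers of the form >name.N with no tabs in them; filter_longest and TAB are illustrative names, and the second sort is swapped for an awk filter that keeps the first line per key.

```shell
# Sketch: linearize the FASTA, sort isoforms by length, keep the longest.
# Assumes headers look like ">name.N" and contain no tab characters.
TAB=$(printf '\t')

filter_longest() {
    # 1. One record per line: header, tab, sequence.
    awk '/^>/ { if (N > 0) printf("\n"); printf("%s\t", $0); N++; next }
         { printf("%s", $0) }
         END { if (N > 0) printf("\n") }' "$@" |
    # 2. Prepend a grouping key (ID without ".N") and the sequence length.
    awk -F '\t' '{
        split($1, h, " ")              # h[1] is the ID token, e.g. ">name.3"
        key = h[1]
        sub(/\.[0-9]+$/, "", key)
        printf("%s\t%d\t%s\t%s\n", key, length($2), $1, $2)
    }' |
    # 3. Group by key, longest sequence first within each group.
    LC_ALL=C sort -t "$TAB" -k1,1 -k2,2nr |
    # 4. Keep only the first (longest) line of each group.
    awk -F '\t' '!seen[$1]++' |
    # 5. Back to FASTA: header on one line, sequence on the next.
    cut -f 3,4 |
    tr '\t' '\n'
}
```

For example, filter_longest stringtie_transcript.fasta > filter_strg.fasta, after which grep -c '^>' filter_strg.fasta should report the number of transcripts kept. The sequences are written unfolded, one line each.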