How do you extract longest transcript from fasta file using length and header IDs?
1
0
Entering edit mode
5.2 years ago
DNAngel ▴ 250

Here is an example file that I made up to practise and it is called Medaka_TEST.fa where the header is GENE_ID | TRANSCRIPT_ID:

$ cat Medaka_TEST.fa
>ENSORLG000000000AA|ENSORLT000000000AD   # This gene has two transcripts (next sequence)
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT
TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG  # Only one occurrence of this gene
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G   # This gene has two transcripts (next sequence)
IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A
SHORTAGAIN

I have searched a lot and found a post (How to extract the longest isoform from multi fasta file) that created a command that worked nicely to extract the longest transcript when there were multiple transcripts belonging to the same GENE. However, I cannot manipulate it properly when my headers have a two-part name separated by "|". I run into a lot of weird issues.

Here is my modified script after so many trials, but I cannot get it to include IDs:

$ cat Medaka_TEST.fa | \
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} \
END {if(N>0) printf("\n");}' | \ # put sequences on same line as headers
tr "|" "\t" | \ # separate GENE_ID and TRANSCRIPT_ID
awk -F '      ' '{printf("%s\t%d\n",$0,length($3));}' | \ # get the length of the sequence should be third column?
sort -t '     ' -k1,1 -k4,4nr | \ # sort by name, reverse length
sort -t '     ' -k1,1 -u -s | \ # sort on name, unique ID, and keeping it in original order
sed 's/       /|/' | \ # return to original name
cut -f 1,3 | \ # cut header and sequence
tr "\t" "\n" # convert back to fasta

Output:

>ENSORLG000000000AA|ENSORLT000000000AD   # correct!
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG    # correct! 
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000A  #wrong! It is grabbing this sequence based on alphabetical order of the second ID (the transcript ID)
SHORTAGAIN

I cannot get it to exclude returning the sequence based on alphabetical order of the TRANSCRIPT_ID. I tested this by changing this third sequence's TRANSCRIPT_ID from ENSORLT0000000000A to ENSORLT0000000000Z, and then it worked by returning the other ENSORLG000000000FF with the longer transcript but this should not happen.

I also think the issue is here in my first awk call:

awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;}

In the second printf statement I have to use multiple "\t" in order to properly separate my ID name and the sequence, otherwise it just overwrites my ID names.

Example, if I use the original poster's command with just one "\t" my names are grossly overwritten (why??):

$ cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLGKEEPTHISONETHISISTHELONGEST0AD
>ENSORLGTOOSHORT0AA|ENSORLT000000000AT
>ENSORLGTHISISFINENODUPLICATES000000AG
>ENSORLGIWANTTHISONEKEEPMEPLEASE00000G
>ENSORLGSHORTAGAINF|ENSORLT0000000000A

But after trial and error for some reason if I put 6 "\t" the name and sequence is fine:

$cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t\t\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLG000000000AA|ENSORLT000000000AD  KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT  TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG  THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G  IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A  SHORTAGAIN

I've tried different combinations as I think the way my names are split are causing the script to not obtain proper lengths?

Here is my desired output if I can fix my commands to obtain longest transcripts when GENE_IDs have multiple occurrences.

>ENSORLG000000000AA|ENSORLT000000000AD
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G
IWANTTHISONEKEEPMEPLEASE
awk fasta • 2.6k views
ADD COMMENT
2
Entering edit mode
5.2 years ago
JC 13k

The commands are complex to control, imho it's better if you put that logic in a script:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs =();
my ($g, $t);
while (<>) {
    chomp;
    if (/>/) {
        ($g, $t) = split (/\|/, $_);
    }
    else {
        $seqs{$g}{$t} .= $_;
    }
}

foreach $g (sort keys %seqs) { # iterate over genes
    my $longest = -1;
    my $fasta = '';
    foreach $t (keys %{ $seqs{$g} }) { # iterate over transcripts
        my $seq = $seqs{$g}{$t};
        if (length $seq > $longest) {
           $fasta = "$g|$t\n$seq\n";
        }
    }
    print $fasta;
}

Output example

$ perl getLongest.pl < file.fa
>ENSORLG000000000AA|ENSORLT000000000AD
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G
IWANTTHISONEKEEPMEPLEASE
ADD COMMENT
0
Entering edit mode

It worked! Thank you :)

ADD REPLY

Login before adding your answer.

Traffic: 1280 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6