Here is an example file that I made up to practise and it is called Medaka_TEST.fa where the header is GENE_ID | TRANSCRIPT_ID:
$ cat Medaka_TEST.fa
>ENSORLG000000000AA|ENSORLT000000000AD # This gene has two transcripts (next sequence)
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT
TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG # Only one occurrence of this gene
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G # This gene has two transcripts (next sequence)
IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A
SHORTAGAIN
I have searched a lot and found a post (How to extract the longest isoform from multi fasta file) that created a command that worked nicely to extract the longest transcript when there were multiple transcripts belonging to the same GENE. However, I cannot manipulate it properly when my headers have a two-part name separated by "|". I run into a lot of weird issues.
Here is my modified script after so many trials, but I cannot get it to include IDs:
$ cat Medaka_TEST.fa | \
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} \
END {if(N>0) printf("\n");}' | \ # put sequences on same line as headers
tr "|" "\t" | \ # separate GENE_ID and TRANSCRIPT_ID
awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' | \ # get the length of the sequence should be third column?
sort -t ' ' -k1,1 -k4,4nr | \ # sort by name, reverse length
sort -t ' ' -k1,1 -u -s | \ # sort on name, unique ID, and keeping it in original order
sed 's/ /|/' | \ # return to original name
cut -f 1,3 | \ # cut header and sequence
tr "\t" "\n" # convert back to fasta
Output:
>ENSORLG000000000AA|ENSORLT000000000AD # correct!
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG # correct!
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000A #wrong! It is grabbing this sequence based on alphabetical order of the second ID (the transcript ID)
SHORTAGAIN
I cannot get it to exclude returning the sequence based on alphabetical order of the TRANSCRIPT_ID. I tested this by changing this third sequence's TRANSCRIPT_ID from ENSORLT0000000000A to ENSORLT0000000000Z, and then it worked by returning the other ENSORLG000000000FF with the longer transcript but this should not happen.
I also think the issue is here in my first awk call:
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;}
In the second printf statement I have to use multiple "\t" in order to properly separate my ID name and the sequence, otherwise it just overwrites my ID names.
Example, if I use the original poster's command with just one "\t" my names are grossly overwritten (why??):
$ cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLGKEEPTHISONETHISISTHELONGEST0AD
>ENSORLGTOOSHORT0AA|ENSORLT000000000AT
>ENSORLGTHISISFINENODUPLICATES000000AG
>ENSORLGIWANTTHISONEKEEPMEPLEASE00000G
>ENSORLGSHORTAGAINF|ENSORLT0000000000A
But after trial and error for some reason if I put 6 "\t" the name and sequence is fine:
$cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t\t\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLG000000000AA|ENSORLT000000000AD KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A SHORTAGAIN
I've tried different combinations as I think the way my names are split are causing the script to not obtain proper lengths?
Here is my desired output if I can fix my commands to obtain longest transcripts when GENE_IDs have multiple occurrences.
>ENSORLG000000000AA|ENSORLT000000000AD
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G
IWANTTHISONEKEEPMEPLEASE
It worked! Thank you :)