I need to rename a fasta file according to a blast output file where my fasta was a query. Not all sequences had a hit.
my.fasta:
>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT
blast.out (one hit per query and just first two columns):
ASSI-3_2 scaf0270669_20068.102
ASSI-4_3 scaf0189112_70083.538
ASSI-5_4 scaf0083789_70072.963
ASSI-8_7 scaf0423760_50193.589
ASSI-11_10 scaf0285971_60192.428
ASSI-12_11 scaf0409557_70062.641
What I need to get is this:
>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT
The way I went about this is this:
Pull out fasta headers and get rid of ">":
grep "^>" my.fasta | sed -e 's/>//g' > headers.fasta
Cut first two columns out of blast output and only leave one hit per query:
awk '!x[$1]++' blast.out | cut -f1,2 > headers.blast
Then I merge two header files but with
tac
so then I can use myawk
command to get rid of duplicates, and then I usetac
again to restore name order:tac headers.blast headers.fasta | sort -r -V | awk '!x[$1]++' | tac > headers.clean
headers.clean:
ASSI-1_0
ASSI-2_1
ASSI-3_2 scaf0270669_20068.102
ASSI-4_3 scaf0189112_70083.538
ASSI-5_4 scaf0083789_70072.963
Then replace the original headers with new headers (only works for fasta formats where sequences are in one-line):
awk 'NR%2==0' my.fasta | paste -d'\n' headers.clean -> my_renamed.fasta
The third step is what's giving me trouble. It's a very clumsy approach and I want to see if anybody has a better way?
Would this command work when not all my sequences in the my.fasta file had a blast hit?
yes that's the option -a of join
I'm getting this when I run it:
I wonder if it's because some blast outputs are like so:
Rather than:
I just realised why it didn't work. Because I used blast output file that included other columns apart from the first two.
I'm wondering, how do I order my sequences to be
>ASSI-1_0
then>ASSI-2_1
rather than>ASSI-1_0
and then>STE-100000_218031
?