Renaming fasta file according to a name list (blast output)
3
0
Entering edit mode
8.2 years ago
san.san ▴ 190

I need to rename a fasta file according to a blast output file where my fasta was a query. Not all sequences had a hit.

my.fasta:

>ASSI-1_0 
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1 
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

blast.out (one hit per query and just first two columns):

ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963
ASSI-8_7    scaf0423760_50193.589
ASSI-11_10  scaf0285971_60192.428
ASSI-12_11  scaf0409557_70062.641

What I need to get is this:

>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

The way I went about this is this:

  1. Pull out fasta headers and get rid of ">":

    grep "^>" my.fasta | sed -e 's/>//g' > headers.fasta

  2. Cut first two columns out of blast output and only leave one hit per query:

    awk '!x[$1]++' blast.out | cut -f1,2 > headers.blast

  3. Then I merge two header files but with tac so then I can use my awk command to get rid of duplicates, and then I use tac again to restore name order:

    tac headers.blast headers.fasta | sort -r -V | awk '!x[$1]++' | tac > headers.clean

headers.clean:

ASSI-1_0
ASSI-2_1
ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963
  1. Then replace the original headers with new headers (only works for fasta formats where sequences are in one-line):

    awk 'NR%2==0' my.fasta | paste -d'\n' headers.clean -> my_renamed.fasta

The third step is what's giving me trouble. It's a very clumsy approach and I want to see if anybody has a better way?

fasta sequence command-line • 3.5k views
ADD COMMENT
3
Entering edit mode
8.2 years ago

use the join command

$ join -t '    ' -a 2 -1 1 -2 1 <( cat input.blast | sort -t ' '  -k1,1 )  <(cat input.fasta | paste - - | cut -c2- | sort -t '        ' -k1,1) | awk -F '      ' '{printf(">%s",$1);i=2;if(NF==3) {printf(" %s",$2);i++;} printf("\n%s\n",$i);}'

the options -t (join , sort) and -F (awk) require a tabulation. (Ctrl-v tab)

>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT
ADD COMMENT
0
Entering edit mode

Would this command work when not all my sequences in the my.fasta file had a blast hit?

ADD REPLY
1
Entering edit mode

yes that's the option -a of join

       -a FILENUM
              also  print unpairable lines from file FILENUM, where FILENUM is
              1 or 2, corresponding to FILE1 or FILE2



ADD REPLY
0
Entering edit mode

I'm getting this when I run it:

>STE-1_0
TCTTAACTCATTGTGTGTGTATGCC
>STE-100000_218031
scaf0293423_20071.471
>STE-10000_10255
TCTGGAAAATAAAGCGGTCCACCCC
>STE-100001_218032
scaf0391581

I wonder if it's because some blast outputs are like so:

STE-121254_274055   scaf7351783

Rather than:

STE-121254_274056   scaf0143876_10040.799
ADD REPLY
0
Entering edit mode

I just realised why it didn't work. Because I used blast output file that included other columns apart from the first two.

ADD REPLY
0
Entering edit mode

I'm wondering, how do I order my sequences to be >ASSI-1_0 then >ASSI-2_1 rather than >ASSI-1_0 and then >STE-100000_218031?

ADD REPLY
1
Entering edit mode
8.2 years ago

I'd like to use SeqKit, only one single command.

Just download the executable binaries for your operating system (Windows/Linux/Mac OS X) and run:

$ seqkit replace -p '^([^ ]+)' -r '$1 {kv}' -k blast.out my.fasta
[INFO] read key-value file: blast.out
[INFO] 6 pairs of key-value loaded
>ASSI-1_0  
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1  
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102 
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538 
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963 
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT
ADD COMMENT
0
Entering edit mode

Awesome! I needed to do the same and this worked perfectly!

ADD REPLY
0
Entering edit mode
8.2 years ago

I would write a biopython script to get it done, create a dictionary out of the blast output and loop over your fasta file

ADD COMMENT

Login before adding your answer.

Traffic: 1564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6