Extract flanking region of -500 nt upstream and downstream of BLAST result on genome using perl scripts
1
1
Entering edit mode
5.4 years ago

I have performed BLAST on a set of 380 sequences on a genome using BLAST+. I need to extract the aligned region along with (-/+)500 nt flanks for each hit from the genome. Can anyone provide me a script in perl or guide a way to do it. I know it can be done but have not found any reliable sources. I am quite new to this field.

alignment genome sequence assembly • 3.4k views
ADD COMMENT
1
Entering edit mode
5.4 years ago
jean.elbers ★ 1.7k

Not tested

makeblastdb -in reference.fa -dbtype nucl
samtools faidx reference.fa
cut -f 1,2 reference.fa.fai > chrom.sizes
blastn -db reference -query query.fasta -outfmt 6 -out query.fasta-against-reference.blast
cut -f 2,9,10 query.fasta-against-reference.blast | awk '{if($2 > $3) {t = $2; $2 = $3; $3 = t; print;} else if($2 < $3) print; }' OFS='\t' |awk '{a=$2-1;print $1,a,$3;}' OFS='\t'|bedtools sort > query.fasta-against-reference.blast.bed
bedtools slop -i query.fasta-against-reference.blast.bed -g chrom.sizes -b 500 > query.fasta-against-reference.blast.500.bed
bedtools getfasta -fi reference.fa -bed query.fasta-against-reference.blast.500.bed -fo query.fasta-against-reference.blast.500.fasta

explanation

makeblastdb -in reference.fa -dbtype nucl # make a BLAST database
samtools faidx reference.fa # make a FASTA index
cut -f 1,2 reference.fa.fai > chrom.sizes # get the chromosome sizes
blastn -db reference -query query.fasta -outfmt 6 -out query.fasta-against-reference.blast # perform BLAST (note use your desired settings)
cut -f 2,9,10 query.fasta-against-reference.blast | awk '{if($2 > $3) {t = $2; $2 = $3; $3 = t; print;} else if($2 < $3) print; }' OFS='\t' |awk '{a=$2-1;print $1,a,$3;}' OFS='\t'|bedtools sort > query.fasta-against-reference.blast.bed # convert BLAST 6 format to BED
bedtools slop -i query.fasta-against-reference.blast.bed -g chrom.sizes -b 500 > query.fasta-against-reference.blast.500.bed # add on 500 bases upstream and downstream
bedtools getfasta -fi reference.fa -bed query.fasta-against-reference.blast.500.bed -fo query.fasta-against-reference.blast.500.fasta # get the accompanying bases for BLAST coordinates but 500 bases upstream and downstream
ADD COMMENT
0
Entering edit mode

Thank you so much. I'll use it and let you know if any problems

ADD REPLY
0
Entering edit mode

I can't seem to get this to work for every sequence. The sequences in the .fna file only match the if, in the oufmt 6 blast file, the s_start is greater than S_end. If S_end is greater than S_start, it prints the wrong sequence. Is there a way to fix this? I have tried altering the awk commands, but bedtools doesn't like it.

Thanks

ADD REPLY
0
Entering edit mode

It is possible there are errors. I said "not tested" above the code, but I will look at this.

ADD REPLY
1
Entering edit mode

Does this fix the problem? (still untested)

makeblastdb -in reference.fa -dbtype nucl 
samtools faidx reference.fa
cut -f 1,2 reference.fa.fai > chrom.sizes
blastn -db reference -query query.fasta -outfmt 6 -out query.fasta-against-reference.blast
cut -f 2,9,10 query.fasta-against-reference.blast | awk '{if($2 > $3) {print $1,$3,$2;} else if($2 < $3) print $1,$2,$3;}' OFS='\t' |awk '{a=$2-1;print $1,a,$3;}' OFS='\t'|bedtools sort > query.fasta-against-reference.blast.bed
bedtools slop -i query.fasta-against-reference.blast.bed -g chrom.sizes -b 500 > query.fasta-against-reference.blast.500.bed
bedtools getfasta -fi reference.fa -bed query.fasta-against-reference.blast.500.bed -fo query.fasta-against-reference.blast.500.fasta
ADD REPLY
0
Entering edit mode

Thank you for your help with this, I appreciate the time you are taking to help. I have been trying to tweak it over the weekend, but unfortunately, I am still having the same problem.

When the condition if($2 > $3) is met and it prints $1,$3,$2 the incorrect sequence is printed in the output fasta, 'query.fasta-against-reference.blast.500.fasta.

I cannot think how to solve this problem, because I cannot work out how to show the position of the sequence when $2 > $3.

I have attached an image to help explain.

Example of s_start > s_end not matching

ADD REPLY
0
Entering edit mode

Does this help?

makeblastdb -in reference.fa -dbtype nucl 
samtools faidx reference.fa
cut -f 1,2 reference.fa.fai > chrom.sizes
blastn -db reference -query query.fasta -outfmt 6 -out query.fasta-against-reference.blast
cut -f 2,9,10 query.fasta-against-reference.blast | awk '{if ($2>$3)print $1,$3,$2;else print $1,$2,$3;}' OFS='\t' |awk '{a=$2-1;print $1,a,$3;}' OFS='\t'|bedtools sort > query.fasta-against-reference.blast.bed
bedtools slop -i query.fasta-against-reference.blast.bed -g chrom.sizes -b 500 > query.fasta-against-reference.blast.500.bed
bedtools getfasta -fi reference.fa -bed query.fasta-against-reference.blast.500.bed -fo query.fasta-against-reference.blast.500.fasta
ADD REPLY
0
Entering edit mode

Still having the same problem - I've had a look online and I can't see instances where this has been a problem.

In theory, the if the S_start > S_end, then the sequence goes from the S_end to the S_start, so the "coordinates" provided by blast should be correct, and swapping $2 and $3 should work, right? So do you think there is an issue with my reference.fa? Or the coordinates from the blast are incorrect in some instances?

ADD REPLY
0
Entering edit mode

I will be honest. I am not sure what the problem is, can you post not an image but some results by using the code insertion functions?

ADD REPLY
0
Entering edit mode

I have been manually checking to see if the 'query.fasta-against-reference.blast.500.fasta' matches the blastn standard outfmt. When the sequence start is greater than the sequence end, the output in the fasta does not match the blastn standard output.

Blastn output (not outfmt 6 option)

   >Sequence1

    Query  1       TACAGTGGGGGGCAATAAGTATGAATACCCTTTGATGTACTGACACACACCTCTTAGCCT  60
                   |||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||
    Sbjct  484980  TACAGTGGGGGGCAATAAGTATGAATACCCTTTGGTGTACTGACACACACCTCTTAGCCT  484921

    Query  61      AAAACGAGCTATGTTAACTCCTAATCCTGGTTAGAGGTTTTCCCTAACCAATTAGTATAA  120
                   ||||  ||||||||| ||||||||||||||||||||||||||||||||| ||||||||| 
    Sbjct  484920  AAAAGAAGCTATGTTGACTCCTAATCCTGGTTAGAGGTTTTCCCTAACCGATTAGTATAT  484861

    Query  121     GAGATAGCTATGTGTGAGGTTAATATACACACTCTAGCTGTTAAAATGGGCTAGCGGCAG  180
                    ||||||||||||||||||||||||| ||||||||| || ||||||| ||||||||||||
    Sbjct  484860  AAGATAGCTATGTGTGAGGTTAATATGCACACTCTAACTATTAAAATAGGCTAGCGGCAG  484801

    Query  181     AAAA-CAATACACGCCCGGTATTCATACTTATTGCCTCCCACTGTA  225
                   |||| |||||||||||||||||||||||||||||||||||||||||
    Sbjct  484800  AAAAACAATACACGCCCGGTATTCATACTTATTGCCTCCCACTGTA  484755

query.fasta-against-reference.blast.500.fasta output

 >Sequence1:484754-484980
    TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

The subject sequence in the blastn output should match the 'query.fasta-against-reference.blast.500.fasta' output, but if you look the sequences don't match, for instance the first 11 bases :Blastn TACAGTGGGGGG, query.fasta TACAGTGGGAGG.

I don't understand why they don't match. Can you provide any insight?

ADD REPLY
0
Entering edit mode

This is very strange indeed. I am wondering if it has something to do with a bug in Bedtools (by the way, what version of bedtools are you using?)

Just for kicks, can you check the following:

samtools faidx reference.fa Sequence1:484754-484980

It should produce

>Sequence1:484754-484980
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

Unless there is something wrong with blastn (what version are you using?)

EDIT

So, the first 11 bases from the Query in your BLAST output match the reverse complement of the last 11 bases of the output from Bedtools.

>Last_11_bases
CCCCCACTGTA

>Last_11_bases_reverse_complement
TACAGTGGGGG
ADD REPLY
0
Entering edit mode

Try the following updated code:

makeblastdb -in reference.fa -dbtype nucl 
samtools faidx reference.fa
cut -f 1,2 reference.fa.fai > chrom.sizes
blastn -db reference -query query.fasta -outfmt 6 -out query.fasta-against-reference.blast
cut -f 2,9,10 query.fasta-against-reference.blast | awk '{if ($2>$3)print $1,$3,$2,".",".","-";else print $1,$2,$3,".",".","+";}' OFS='\t' |awk '{a=$2-1;print $1,a,$3,$4,$5,$6;}' OFS='\t'|bedtools sort > query.fasta-against-reference.blast.bed
bedtools slop -i query.fasta-against-reference.blast.bed -g chrom.sizes -b 500 > query.fasta-against-reference.blast.500.bed
bedtools getfasta -s -fi reference.fa -bed query.fasta-against-reference.blast.500.bed -fo query.fasta-against-reference.blast.500.fasta

EDIT: added -s to bedtools getfasta

ADD REPLY
0
Entering edit mode

Well spotted - the query BLAST matching the bedtools reverse complement.

The versions are as follows: bedtools v2.29.2 Nucleotide-Nucleotide BLAST 2.9.0+

The samtools faidx reference.fa Sequence1:484754-484980 output is

>Sequence1:484754-484980
ATACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCT
ATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGG
TTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTG
TGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

Which is exactly the same as you predicted it would produce only with an additional A at the front, although that may have been my fault when I initially copied the sequence.

I tried the updated code - but had the same problem;

>Sequence1:484754-484980
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

Thank you for all of your help

ADD REPLY
0
Entering edit mode

Try it one more time. I forgot to change the bedtools getfasta command to force strandness with the -s option.

ADD REPLY
0
Entering edit mode

SOLVED IT|!!!! Great, thank you very much for your help. I appreciate this has taken a lot of time. I would have been very stuck without it.

ADD REPLY
0
Entering edit mode

You are welcome! Glad to help!

ADD REPLY

Login before adding your answer.

Traffic: 984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6