I have performed BLAST on a set of 380 sequences on a genome using BLAST+. I need to extract the aligned region along with (-/+)500 nt flanks for each hit from the genome. Can anyone provide me a script in perl or guide a way to do it. I know it can be done but have not found any reliable sources. I am quite new to this field.
I can't seem to get this to work for every sequence. The sequences in the .fna file only match the if, in the oufmt 6 blast file, the s_start is greater than S_end. If S_end is greater than S_start, it prints the wrong sequence. Is there a way to fix this? I have tried altering the awk commands, but bedtools doesn't like it.
Thank you for your help with this, I appreciate the time you are taking to help. I have been trying to tweak it over the weekend, but unfortunately, I am still having the same problem.
When the condition if($2 > $3) is met and it prints $1,$3,$2 the incorrect sequence is printed in the output fasta, 'query.fasta-against-reference.blast.500.fasta.
I cannot think how to solve this problem, because I cannot work out how to show the position of the sequence when $2 > $3.
Still having the same problem - I've had a look online and I can't see instances where this has been a problem.
In theory, the if the S_start > S_end, then the sequence goes from the S_end to the S_start, so the "coordinates" provided by blast should be correct, and swapping $2 and $3 should work, right? So do you think there is an issue with my reference.fa? Or the coordinates from the blast are incorrect in some instances?
I have been manually checking to see if the 'query.fasta-against-reference.blast.500.fasta' matches the blastn standard outfmt. When the sequence start is greater than the sequence end, the output in the fasta does not match the blastn standard output.
The subject sequence in the blastn output should match the 'query.fasta-against-reference.blast.500.fasta' output, but if you look the sequences don't match, for instance the first 11 bases :Blastn TACAGTGGGGGG, query.fasta TACAGTGGGAGG.
I don't understand why they don't match. Can you provide any insight?
Which is exactly the same as you predicted it would produce only with an additional A at the front, although that may have been my fault when I initially copied the sequence.
I tried the updated code - but had the same problem;
Thank you so much. I'll use it and let you know if any problems
I can't seem to get this to work for every sequence. The sequences in the .fna file only match the if, in the oufmt 6 blast file, the s_start is greater than S_end. If S_end is greater than S_start, it prints the wrong sequence. Is there a way to fix this? I have tried altering the awk commands, but bedtools doesn't like it.
Thanks
It is possible there are errors. I said "not tested" above the code, but I will look at this.
Does this fix the problem? (still untested)
Thank you for your help with this, I appreciate the time you are taking to help. I have been trying to tweak it over the weekend, but unfortunately, I am still having the same problem.
When the condition
if($2 > $3)
is met and it prints $1,$3,$2 the incorrect sequence is printed in the output fasta, 'query.fasta-against-reference.blast.500.fasta.I cannot think how to solve this problem, because I cannot work out how to show the position of the sequence when $2 > $3.
I have attached an image to help explain.
Does this help?
Still having the same problem - I've had a look online and I can't see instances where this has been a problem.
In theory, the if the S_start > S_end, then the sequence goes from the S_end to the S_start, so the "coordinates" provided by blast should be correct, and swapping $2 and $3 should work, right? So do you think there is an issue with my reference.fa? Or the coordinates from the blast are incorrect in some instances?
I will be honest. I am not sure what the problem is, can you post not an image but some results by using the code insertion functions?
I have been manually checking to see if the 'query.fasta-against-reference.blast.500.fasta' matches the blastn standard outfmt. When the sequence start is greater than the sequence end, the output in the fasta does not match the blastn standard output.
Blastn output (not outfmt 6 option)
query.fasta-against-reference.blast.500.fasta output
The subject sequence in the blastn output should match the 'query.fasta-against-reference.blast.500.fasta' output, but if you look the sequences don't match, for instance the first 11 bases :Blastn TACAGTGGGGGG, query.fasta TACAGTGGGAGG.
I don't understand why they don't match. Can you provide any insight?
This is very strange indeed. I am wondering if it has something to do with a bug in Bedtools (by the way, what version of bedtools are you using?)
Just for kicks, can you check the following:
It should produce
Unless there is something wrong with blastn (what version are you using?)
EDIT
So, the first 11 bases from the Query in your BLAST output match the reverse complement of the last 11 bases of the output from Bedtools.
Try the following updated code:
EDIT: added
-s
tobedtools getfasta
Well spotted - the query BLAST matching the bedtools reverse complement.
The versions are as follows: bedtools v2.29.2 Nucleotide-Nucleotide BLAST 2.9.0+
The samtools faidx reference.fa Sequence1:484754-484980 output is
Which is exactly the same as you predicted it would produce only with an additional A at the front, although that may have been my fault when I initially copied the sequence.
I tried the updated code - but had the same problem;
Thank you for all of your help
Try it one more time. I forgot to change the
bedtools getfasta
command to force strandness with the-s
option.SOLVED IT|!!!! Great, thank you very much for your help. I appreciate this has taken a lot of time. I would have been very stuck without it.
You are welcome! Glad to help!