I have the output from BLASTN searches and want to extract 2500 bases upstream and downstream of each BLASTN hit from an assembled genome.
I have generated fastas containing each BLASTN sequence, and have a fasta for the assembled genome.
I have been trying to use pcregrep for this:
pcregrep -i -A0 -B0 -M -f Blastn_hit.fna Assembled_genome.fna > Blastn_hit_+_bases.fna
However, there is no output.
I believe this is because the Blastn_hit.fna lines are longer than those in Assembled_genome.fna, so I have to indicate a new line using (\n|.) in the BLASTN file. The only problem is I don’t know where the new lines are, and so don’t know where to enter (\n|.) in Blastn_hit.fna.
Is there a way to use pcregrep without indicating where new lines are, or is there an alternative tool or script I can use that will find the BLASTN hit and print 2500 bases upstream and downstream?
I am very new to this and have very limited knowledge, so answers with more of a ‘for dummies’ approach would be appreciated.
(I know that -A and -B will print lines, not characters, but I can work out how many characters there are to a line and so know how many lines should be printed)
I'm not sure which BLAST command you ran, but if you haven't already, you should work with the tabular output format (-outfmt 6).
From that format you can easily get the columns giving the start and end of each hit on the subject sequence, then use e.g. awk to subtract X from the start and add X to the end to get the coordinates of the region you want.
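If awk is unfamiliar, the same idea can be sketched in Python. This is only a rough illustration, not a tested pipeline: it assumes the default 12-column -outfmt 6 layout (subject ID in column 2, sstart in column 9, send in column 10), and the function names are made up for this example. Hits on the minus strand have sstart greater than send, so the coordinates are sorted before the flank is added, and the region is clipped at the ends of the contig.

```python
def read_fasta(path):
    """Read a FASTA file into a dict of {sequence ID: sequence}, joining wrapped lines."""
    sequences = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    sequences[name] = "".join(chunks)
                # Keep only the ID (text up to the first space), as BLAST reports it.
                name = line[1:].split()[0]
                chunks = []
            elif line:
                chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def flank_coordinates(blast_line, seq_lengths, flank=2500):
    """Return (subject ID, start, end, strand) for one outfmt 6 line.

    Coordinates are 1-based and inclusive, like BLAST's own.
    Assumes the default outfmt 6 column order.
    """
    fields = blast_line.rstrip("\n").split("\t")
    subject_id = fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    # Minus-strand hits are reported with sstart > send.
    strand = "+" if sstart <= send else "-"
    low, high = min(sstart, send), max(sstart, send)
    # Clip so the region never runs off either end of the contig.
    start = max(1, low - flank)
    end = min(seq_lengths[subject_id], high + flank)
    return subject_id, start, end, strand


def extract_regions(blast_lines, sequences, flank=2500):
    """Yield (header, sequence) for each hit plus its flanking bases.

    The sequence is always taken from the forward strand of the contig;
    minus-strand hits are only labelled, not reverse-complemented.
    """
    lengths = {name: len(seq) for name, seq in sequences.items()}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, start, end, strand = flank_coordinates(line, lengths, flank)
        header = f"{subject_id}:{start}-{end}({strand})"
        # Convert 1-based inclusive coordinates to a Python slice.
        yield header, sequences[subject_id][start - 1:end]
```

To use it, load the genome with read_fasta("Assembled_genome.fna"), open the tabular BLAST output file, pass its lines to extract_regions, and write each result out as ">" plus the header followed by the sequence. If a hit sits closer than 2500 bases to the end of a contig, the extracted region will simply be shorter on that side.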
Thank you - I have now used blast outfmt 6 and managed to create the fastas required.