Extracting upstream and downstream bases from a Blastn hit.
4.7 years ago
jamie.pike ▴ 80

I have the output from BLASTN searches and want to extract 2500 bases upstream and downstream of each BLASTN hit from an assembled genome.

I have generated fastas containing each BLASTN sequence, and have a fasta for the assembled genome.

I have been trying to use pcregrep for this:

pcregrep -i -A0 -B0 -M -f Blastn_hit.fna Assembled_genome.fna > Blastn_hit_+_bases.fna

However, there is no output.

I believe this is because the Blastn_hit.fna lines are longer than those in Assembled_genome.fna, so I have to indicate a new line using (\n|.) in the BLASTN file. The only problem is I don’t know where the new lines are, and so don’t know where to enter (\n|.) in Blastn_hit.fna. Is there a way to use pcregrep without indicating where new lines are, or is there an alternative tool or script I can use that will find the BLASTN hit and print 2500 bases upstream and downstream?

I am very new to this and have very limited knowledge, so answers with more of a ‘for dummies’ approach would be appreciated.

(I know that -A and -B will print lines, not characters, but I can work out how many characters there are to a line and so know how many lines should be printed)

blastn extracting bases pcregrep • 1.6k views

I'm not sure which BLAST command you ran, but if you have not already done so, you should work with the tabular output format.

From that format you can easily get the columns giving the start and stop of each hit; then, using e.g. awk, add or subtract X to get the coordinates of the region you want.
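As a rough illustration of the add/subtract step, here is a minimal Python sketch. It assumes the default 12-column -outfmt 6 layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the assembled genome was the BLAST subject. The function name `flank_coords` and the example line are made up for illustration.

```python
def flank_coords(line, flank=2500):
    """Return (subject_id, start, end) for a hit padded by `flank` bases.

    Coordinates are 1-based and inclusive, as in BLAST tabular output.
    Assumes the default 12-column -outfmt 6 layout.
    """
    fields = line.rstrip("\n").split("\t")
    sseqid = fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    # On minus-strand hits sstart is larger than send, so sort them first.
    lo, hi = min(sstart, send), max(sstart, send)
    # Clip at the start of the contig; the end is NOT clipped here because
    # the contig length is not part of the BLAST line.
    return sseqid, max(1, lo - flank), hi + flank


# Hypothetical hit line (minus strand: sstart > send).
example = "query1\tcontig_7\t98.5\t300\t4\t0\t1\t300\t10500\t10201\t1e-150\t540"
print(flank_coords(example))
```

The end coordinate can run past the end of the contig, so it would still need clipping against the contig length before extracting the sequence.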


Thank you - I have now used BLAST with -outfmt 6 and managed to create the FASTA files required.

4.7 years ago
GenoMax 147k

One reliable way of doing this is the bedtools approach, using -outfmt 6 output from blastn: Finding upstream or downstream sequences on BLAST on linux

This thread adds more detail on how to do this: A: Extract flanking region of -500 nt upstream and downstream of BLAST result on ge
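I have not reproduced the linked posts here, but the general idea can be sketched as follows: turn each -outfmt 6 line into a BED interval padded by the flank size, then hand the BED file to bedtools. The sketch assumes the default 12-column -outfmt 6 layout with the genome as the subject; `blast_hit_to_bed` and the example line are hypothetical names and data.

```python
def blast_hit_to_bed(line, flank=2500):
    """Convert one -outfmt 6 line into a 6-column BED line padded by `flank`.

    BLAST coordinates are 1-based inclusive; BED is 0-based, half-open,
    so the start is shifted down by one and the end is left as it is.
    """
    f = line.rstrip("\n").split("\t")
    sstart, send = int(f[8]), int(f[9])
    lo, hi = min(sstart, send), max(sstart, send)
    strand = "+" if sstart <= send else "-"
    start0 = max(0, lo - 1 - flank)  # clip at the start of the contig
    end = hi + flank                 # not clipped: contig length unknown here
    return "\t".join([f[1], str(start0), str(end), f[0], "0", strand])


# Hypothetical minus-strand hit.
example = "query1\tcontig_7\t98.5\t300\t4\t0\t1\t300\t10500\t10201\t1e-150\t540"
print(blast_hit_to_bed(example))
```

With the lines written to a file (say hits.bed, a made-up name), something like `bedtools getfasta -fi Assembled_genome.fna -bed hits.bed -fo Blastn_hit_+_bases.fna` should then pull out the sequences; adding `-s` would reverse-complement the minus-strand hits.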


Thank you for your advice - the links were very useful.

4.7 years ago
gayachit ▴ 200

You could also try this Python code: Extracting An Up Stream Or Downstream Sequence From Given Position

If you need, I can tweak the code to get what you want.


Thank you - I have since managed to get what I needed. But I have a question about the Python code, which I don't fully understand as I am very new to Python. In the linked code you provided, I assume that est_fasta_file and est_mirna_file would be the files I have generated. If so, how do I know which BLAST format is correct?

Thank you


Your est_fasta_file is the FASTA file that you are blasting, and est_mirna_file is the BLAST output generated. The BLAST output format used is tab-separated (-outfmt 6).
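To make the two inputs a bit more concrete, here is a small standalone sketch (not the linked script) of the FASTA side: it reads the records into memory, joins the wrapped sequence lines so that line breaks no longer matter, and slices out a region by 1-based coordinates such as those computed from the -outfmt 6 columns. The names `read_fasta` and `extract_region` and the demo data are made up for illustration.

```python
import io


def read_fasta(handle):
    """Return {record_id: sequence}, joining wrapped sequence lines."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the record ID.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_region(seqs, seq_id, start, end):
    """Return bases start..end (1-based, inclusive), clipped to the sequence."""
    seq = seqs[seq_id]
    start = max(1, start)
    end = min(len(seq), end)
    return seq[start - 1:end]


# Tiny made-up genome with a wrapped sequence; a real run would pass
# open("Assembled_genome.fna") instead of the StringIO object.
demo = io.StringIO(">contig_1 demo\nACGTAC\nGTTTGA\n")
genome = read_fasta(demo)
print(extract_region(genome, "contig_1", 5, 8))
```

This holds the whole genome in memory, which is usually fine for a single assembly; for very large genomes an indexed approach such as bedtools would be the safer choice.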
