Hi everyone.
This question has been asked in various forms before, however am new to the field and trying to learn and the answers posted don't seem to be working. Sorry, I need to specify that I'm trying to extract the sequence corresponding to the Subject start and subject end coordinates for each hit from a multifasta file of concatenated sequences. Thanks for the useful command though, I will make note of that for other purposes.
I ran blastall search that returned a file of hits that look like this with up to 500 hits ( -b 500) to the database of concatenated fasta files. I'm looking for a method to extract the sequence using the s start and s end for each hit along with the corresponding subject ID from the database of concatenated sequences.
Thanking you in advance.
Chandly
# BLASTN 2.2.26 [Sep-21-2011]
# Query: querysequence.fasta
# Database: blast_db.fna
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
NCTC11168 NZ_FPEE01000064.1 100.00 960 0 0 1 960 16113 15154 0.0 1861
NCTC11168 NZ_PSND01000007.1 100.00 960 0 0 1 960 49840 50799 0.0 1861
NCTC11168 NZ_PIBG01000017.1 100.00 960 0 0 1 960 12161 13120 0.0 1861
NCTC11168 NZ_PHZN01000005.1 100.00 960 0 0 1 960 49939 50898 0.0 1861
NCTC11168 NZ_NFPZ01000033.1 100.00 960 0 0 1 960 16585 15626 0.0 1861
Something like
awk 'BEGIN{OFS="\t"}{print $2,$9,$10}' your_blast_file
will get you the start and stops. Are the subject sequences in a local multi-fasta file? You can find many threads here to extract sequences from that point on.Thank you genomax for trying to help. I've been looking though answers posted and am not finding an answer that works and that I can understand. Yes, the sequences are in a local multi-fasta file created by running the cat command on sequence data ( complete, contigs/scaffold) from NCBI assembly.
BioPythons slice notation will be want you need here.
I have a gist which employs a blastfile to slice up genbanks, but you could do the same with a Rasta read in to Biopython as a SeqRecord:
Thank you, I've downloaded the script from github, but by any chance would you be able to provide a usage statement example? Also, the assemblies were grabbed from refseq in case that matters. I have a database consisting of multifasta sequences called something like mydb.fasta and blastall hits as in the original post.
I’m not in front of a computer right now to put together the full solution.
The script will explain how it works if you run
python script.py -h
. But it wont work for fast as as it is. You’ll need to write something a little bespoke yourself.The line you’ll be most interested in is 89, the slice function
Thanks all, but I'm starting from the very beginning here and have not made it to any kind of scripting so it's all greek.