Sequence extract from multifasta using blastall coordinates

Entering edit mode

6.8 years ago

chland • 0

Hi everyone.

This question has been asked in various forms before, however am new to the field and trying to learn and the answers posted don't seem to be working. Sorry, I need to specify that I'm trying to extract the sequence corresponding to the Subject start and subject end coordinates for each hit from a multifasta file of concatenated sequences. Thanks for the useful command though, I will make note of that for other purposes.

I ran blastall search that returned a file of hits that look like this with up to 500 hits ( -b 500) to the database of concatenated fasta files. I'm looking for a method to extract the sequence using the s start and s end for each hit along with the corresponding subject ID from the database of concatenated sequences.
Thanking you in advance. Chandly

# BLASTN 2.2.26 [Sep-21-2011]
# Query: querysequence.fasta
# Database: blast_db.fna
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
NCTC11168   NZ_FPEE01000064.1   100.00  960 0   0   1   960 16113   15154   0.0 1861
NCTC11168   NZ_PSND01000007.1   100.00  960 0   0   1   960 49840   50799   0.0 1861
NCTC11168   NZ_PIBG01000017.1   100.00  960 0   0   1   960 12161   13120   0.0 1861
NCTC11168   NZ_PHZN01000005.1   100.00  960 0   0   1   960 49939   50898   0.0 1861
NCTC11168   NZ_NFPZ01000033.1   100.00  960 0   0   1   960 16585   15626   0.0 1861

sequence blastall extract • 2.6k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.8 years ago by chland • 0

Entering edit mode

Something like awk 'BEGIN{OFS="\t"}{print $2,$9,$10}' your_blast_file will get you the start and stops. Are the subject sequences in a local multi-fasta file? You can find many threads here to extract sequences from that point on.

ADD REPLY • link 6.8 years ago by GenoMax 148k

Entering edit mode

Thank you genomax for trying to help. I've been looking though answers posted and am not finding an answer that works and that I can understand. Yes, the sequences are in a local multi-fasta file created by running the cat command on sequence data ( complete, contigs/scaffold) from NCBI assembly.

ADD REPLY • link 6.8 years ago by chland • 0

Entering edit mode

BioPythons slice notation will be want you need here.

I have a gist which employs a blastfile to slice up genbanks, but you could do the same with a Rasta read in to Biopython as a SeqRecord:

	#!/usr/bin/python

	# This script is designed to take a genbank file and 'slice out'/'subset'
	# regions (genes/operons etc.) and produce a separate file. This can be
	# done explicitly by telling the script which base sites to use, or can
	# 'decide' for itself by blasting a fasta of the sequence you're inter-
	# ed in against the Genbank you want to slice a record out of.

	# Note, the script (obviously) does not preseve the index number of the
	# bases from the original

	# Based upon the tutorial at:
	# http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc44

	# This script depends on BLAST and having the makeblastdb functionality
	# available if BLAST_MODE is active. It also depends on Biopython.

	# Set up and handle arguments:

	import warnings
	import sys
	import subprocess
	import os
	import time
	from Bio import SeqIO

	__author__ = "Joe R. J. Healey"
	__version__ = "1.1.0"
	__title__ = "Genbank_slicer"
	__license__ = "GPLv3"
	__author_email__ = "J.R.J.Healey@warwick.ac.uk"

	# Import SearchIO and suppress experimental warning
	from Bio import BiopythonExperimentalWarning
	with warnings.catch_warnings():
	warnings.simplefilter('ignore', BiopythonExperimentalWarning)
	from Bio import SearchIO


	def convert(basename, genbank):
	'''Convert the provided genbank to a fasta to BLAST.'''

	refFasta = "{}.fasta.tmp".format(basename)
	SeqIO.convert(genbank, 'genbank', refFasta, 'fasta')

	return refFasta

	def runBlast(basename, refFasta, fasta, verbose):
	'''Synthesise BLAST commands and make system calls'''

	resultHandle = "{}.blastout.tmp".format(basename)
	blastdb_cmd = 'makeblastdb -in {0} -dbtype nucl -title temp_blastdb'.format(refFasta)
	blastn_cmd = 'blastn -query {0} -strand both -task blastn -db {1} -perc_identity 100 -outfmt 6 -out {2} -max_target_seqs 1'.format(fasta, refFasta, resultHandle)

	print("Constructing BLAST Database: " + '\n' + blastdb_cmd)
	print("BLASTing: " + '\n' + blastn_cmd)
	DB_process = subprocess.Popen(blastdb_cmd,
	shell=True,
	stdin=subprocess.PIPE,
	stdout=subprocess.PIPE,
	stderr=subprocess.PIPE)
	DB_process.wait()
	BLAST_process = subprocess.Popen(blastn_cmd,
	shell=True,
	stdin=subprocess.PIPE,
	stdout=subprocess.PIPE,
	stderr=subprocess.PIPE)
	BLAST_process.wait()

	return resultHandle

	def getIndices(resultHandle):
	'''If not provided directly by the user, this function retrieves the best BLAST hit's indices.'''

	blast_result = SearchIO.read(resultHandle, 'blast-tab')

	print(blast_result[0][0])
	start = blast_result[0][0].hit_start
	end = blast_result[0][0].hit_end

	return start, end

	def slice(start, end, genbank, FPoffset, TPoffset):
	'''Subset the provided genbank to return the sub record.'''

	seqObj = SeqIO.read(genbank, 'genbank')
	subRecord = seqObj[start-FPoffset:end+TPoffset]

	return subRecord


	def main():
	###################################################################################################
	# Parse command line arguments
	import argparse
	try:
	parser = argparse.ArgumentParser(
	description='This script slices entries such as genes or operons out of a genbank, subsetting them as their own file.')
	parser.add_argument(
	'-g',
	'--genbank',
	action='store',
	required=True,
	help='The genbank file you wish to subset.')
	parser.add_argument(
	'-o',
	'--outfile',
	action='store',
	help='If specifed, the script will write a file, otherwise redirect STDOUT for pipelines.')
	parser.add_argument(
	'-s',
	'--start',
	type=int,
	help='Integer base index to slice from.')
	parser.add_argument(
	'-e',
	'--end',
	type=int,
	default=0,
	help='Integer base index to slice to.')
	parser.add_argument(
	'-F',
	'--FPoffset',
	action='store',
	type=int,
	default=0,
	help='If you want to capture regions around the target site, specify the 5\' offset.')
	parser.add_argument(
	'-T',
	'--TPoffset',
	action='store',
	type=int,
	default=0,
	help='If you want to capture regions around the target site, specify the 3\' offset.')
	parser.add_argument(
	'-b',
	'--blastmode',
	action='store_true',
	help='If flag is set, provide an input fasta (-f \| --fasta), and BLAST will be called to find your sequence indices in the original genbank.')
	parser.add_argument(
	'-f',
	'--fasta',
	action='store',
	help='The operon fasta to pull annotations from the provided genbank.')
	parser.add_argument(
	'-t',
	'--tidy',
	action='store_true',
	help='Tell the script whether or not to remove the temporary files it generated during processing. Off by default. WARNING: removes files based on the "tmp" string.')
	parser.add_argument(
	'-m',
	'--meta',
	action='store',
	default=None,
	help='Specify a string to use as the Genbank meta-data fields if the parent one doesn\'t contain anything. Else inherits from parent sequence.')
	parser.add_argument(
	'-v',
	'--verbose',
	action='store_true',
	help='Verbose behaviour. Shows progress of BLAST etc. if appropriate.')


	args = parser.parse_args()

	except NameError:
	print "An exception occured with argument parsing. Check your provided options."

	genbank = args.genbank
	fasta = args.fasta
	split = os.path.splitext(args.genbank)
	basename = os.path.basename(split[0])
	start = args.start
	end = args.end
	FPoffset = args.FPoffset
	TPoffset = args.TPoffset
	blastMode = args.blastmode
	outfile = args.outfile
	fasta = args.fasta
	tidy = args.tidy
	meta = args.meta
	verbose = args.verbose

	# Main code:
	if blastMode is not False and fasta is not None:
	refFasta = convert(basename,genbank)
	resultHandle = runBlast(basename, refFasta, fasta, verbose)
	start, end = getIndices(resultHandle)
	else:
	if fasta is None:
	print("No fasta was provided so BLAST mode cannot be used.")

	if start is None or end is None:
	print('No slice indices have been specified or retrieved from blastout')
	sys.exit(1)


	subRecord = slice(start, end, genbank, FPoffset, TPoffset)
	# Populate the metadata of the genbank
	if meta is not None:
	subRecord.id = meta
	subRecord.locus = meta + 'subregion'
	subRecord.accession = meta
	subRecord.name = meta
	subRecord.annotations["date"] = time.strftime("%d-%b-%Y").upper()
	# Other field options include:
	# subRecord.annotations["source"]
	# ["taxonomy"]
	# ["keywords"]
	# ["references"]
	# ["accessions"]
	# ["data_file_division"] e.g. BCT, PLN, UNK
	# ["organism"]
	# ["topology"]

	if outfile is not None:
	SeqIO.write(subRecord, outfile, "genbank")
	else:
	print(subRecord.format('genbank'))

	if tidy is True:
	if verbose is True:
	rm="rm -v ./tmp"
	else:
	rm="rm ./tmp"

	subprocess.Popen(rm,shell=True)

	if __name__ == "__main__":
	main()

view raw Genbank_slicer.py hosted with ❤ by GitHub

ADD REPLY • link 6.8 years ago by Joe 22k

Entering edit mode

Thank you, I've downloaded the script from github, but by any chance would you be able to provide a usage statement example? Also, the assemblies were grabbed from refseq in case that matters. I have a database consisting of multifasta sequences called something like mydb.fasta and blastall hits as in the original post.

ADD REPLY • link 6.8 years ago by chland • 0

Entering edit mode

I’m not in front of a computer right now to put together the full solution.

The script will explain how it works if you run python script.py -h. But it wont work for fast as as it is. You’ll need to write something a little bespoke yourself.

The line you’ll be most interested in is 89, the slice function

ADD REPLY • link 6.8 years ago by Joe 22k

Entering edit mode

Thanks all, but I'm starting from the very beginning here and have not made it to any kind of scripting so it's all greek.

ADD REPLY • link 6.8 years ago by chland • 0

Entering edit mode

6.8 years ago

h.mon 35k

My experience (but I never seriously benchmarked) is that, for big fasta files, method 2 bellow is faster than method 1.

Method 1:

a) index the fasta file containing the original sequences:

samtools faidx file.fasta

b) extract sequences interval of interest:

samtools faidx file.fasta $(cat blast.txt | grep -v "^#" | awk 'BEGIN{OFS="\t"}{print $2":"$9"-"$10}')

Method 2:

a) extract the subject hit accessions:

cat blast.txt | grep -v "^#" | cut -f2 | sort | uniq > accessions.txt

b) extract the subject hit sequences with formatdb (I will illustrate here with blastdbcmd, which is the equivalent command for the Blast+ suite):

blastdbcmd -db nt -entry_batch accessions.txt -out subjects.fasta

c) index the subjects fasta:

samtools faidx subjects.fasta

d) extract sequences interval of interest:

samtools faidx subjects.fasta $(cat blast.txt | grep -v "^#" | awk 'BEGIN{OFS="\t"}{print $2":"$9"-"$10}')

ADD COMMENT • link 6.8 years ago by h.mon 35k

Entering edit mode

Spoke too soon. Thanks @h.mon. I've used the blast suite of tools and will try this method. The python script is too advanced for a complete beginner

ADD REPLY • link 6.8 years ago by chland • 0

Entering edit mode

Hi, Thank you for this useful method @h.mon

what if the $9>$10. As you can see above blast out. when i tried with my data i am getting

665:1927-1645 [faidx] Zero length sequence: 665:1927-1645

when there is $9>$10.

ADD REPLY • link 5.3 years ago by baurumon ▴ 30