I have a set of ~200 known sequences of a particular domain that I want to use to fish out similar sequences from NCBI-NT. As you can imagine, my BLAST results for this dataset are redundant in the sense that multiple sequences from my start set will pick up some of the same sequences. Furthermore, as I was intending to use the GI or Accession number as the FASTA name, there is the issue that multiple domains may be found for a single GI or Accession, so simply reducing by GI number will not work. Therefore, I want to develop a parsing strategy that returns a FASTA file with all the significant sequences labelled by 1) Accession number, 2) start site and 3) end site, with no redundant sequences, keeping the longest alignment in the case that a sequence is returned by multiple BLAST hits.
In strategizing the best approach, I can think of a few ways to achieve this result:
- Iterate through the BLAST XML and build a list/dict of accession numbers with start/end locations. Use it to find the longest sequence for each domain per accession number, then re-iterate through the BLAST output to pick the sequences to keep. (What I was planning on doing.)
- Write out a FASTA file for each accession number, align the sequences and keep only the longest sequence for each domain, then recombine. (Also not too difficult.)
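The grouping step of the first strategy could be sketched roughly as follows, operating on hits already parsed out of the XML. The `(accession, start, end, seq)` tuple layout and the 500 bp `min_gap` cut-off are assumptions for illustration, not Biopython output:

```python
from collections import defaultdict

def longest_per_region(hits, min_gap=500):
    """Group hits on the same accession whose start positions fall within
    min_gap of one another, and keep only the longest hit in each group.
    hits: iterable of (accession, start, end, seq) tuples (assumed format)."""
    by_acc = defaultdict(list)
    for acc, start, end, seq in hits:
        by_acc[acc].append((start, end, seq))
    keep = []
    for acc, group in by_acc.items():
        group.sort(key=lambda h: h[0])          # sort by start position
        clusters = []
        for start, end, seq in group:
            # same cluster if this start is near the previous start
            if clusters and start - clusters[-1][-1][0] < min_gap:
                clusters[-1].append((start, end, seq))
            else:
                clusters.append([(start, end, seq)])
        for cluster in clusters:
            start, end, seq = max(cluster, key=lambda h: h[1] - h[0])
            keep.append((acc, start, end, seq))
    return keep
```

Re-iterating through the BLAST output then only needs to check membership in the returned list.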
Okay... so maybe there are not so many ways when I try to write them down. In any event, I am wondering how some of you would approach this problem and whether I am overlooking a more obvious way (split/apply/recombine tables?). I am using Biopython to parse the XML output, but a command-line approach would be great as well.
thanks, zach cp
edit: I ended up using the following function, which looks at the Accession number and the alignment start positions. I keep only sequences that have a unique Accession number or whose alignment start is at least 500 base pairs from all other kept alignments. Although I do not get the full-length sequence, the representative sequences are largely the same (+/- only a few base pairs on each end).
from collections import defaultdict
from Bio.Blast import NCBIXML

def write_unique_fasta(xmlfile, length=200):
    startdict = defaultdict(list)
    convert = 0
    total = 0
    with open(xmlfile) as xml_handle, \
         open('{}.fasta'.format(xmlfile.split('.')[0]), 'w') as output_handle:
        for blast_record in NCBIXML.parse(xml_handle):
            for alignment in blast_record.alignments:
                title = alignment.title.split('|')[3].strip()
                for hsp in alignment.hsps:
                    if len(hsp.sbjct) <= length:
                        continue
                    total += 1
                    # check whether this Accession number already exists in the
                    # dictionary; if so, only write a record when its start site
                    # is at least 500 bp from every previously kept start
                    if title not in startdict:
                        print("New Accession: {} Add It".format(title))
                        output_handle.write('> {} {} {} \n'.format(title, hsp.sbjct_start, hsp.sbjct_end))
                        output_handle.write('{} \n'.format(hsp.sbjct))
                        startdict[title].append(hsp.sbjct_start)
                        convert += 1
                    else:
                        print("Accession {} Found. Scan for Similar Sequence....".format(title))
                        if all(abs(start - hsp.sbjct_start) >= 500
                               for start in set(startdict[title])):
                            output_handle.write('> {} {} {} \n'.format(title, hsp.sbjct_start, hsp.sbjct_end))
                            output_handle.write('{} \n'.format(hsp.sbjct))
                            startdict[title].append(hsp.sbjct_start)
                            convert += 1
    print("Converted = {}".format(convert))
    print("Total = {}".format(total))
I might be remembering wrong, but I thought that splice variants had their own unique GI numbers and accession codes. Also, it seems to me (but I might be wrong) that you are misinterpreting HSPs. If you're getting the same accession numbers with different alignment results, they could be different HSPs found on the same subject, in which case you shouldn't take the longest alignment but the whole subject sequence.
Thanks Niek. Indeed I can have multiple HSPs per sequence. I was going to distinguish them based on their start location. However, for a given HSP from a given Accession number, the alignment site might vary a little based on the query sequence. By non-redundant I meant each unique HSP, and by longest I meant the longest version of the HSP as seen by any of my query sequences.
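One way to make "each unique HSP, longest version seen by any query" concrete, while tolerating small start-site shifts between queries, is to bin start positions into fixed windows and keep the longest alignment per (accession, window). A sketch, where the tuple format and the 500 bp bin size are assumptions:

```python
def best_hsps(hsps, bin_size=500):
    """Keep one representative per (accession, approximate start):
    the longest alignment seen for that region across all queries.
    hsps: iterable of (accession, start, end) tuples (assumed format,
    i.e. values already parsed out of the BLAST XML)."""
    best = {}
    for acc, start, end in hsps:
        key = (acc, start // bin_size)   # bin starts into 500 bp windows
        if key not in best or end - start > best[key][1] - best[key][0]:
            best[key] = (start, end)
    return best
```

A caveat of fixed bins is that two nearby starts can straddle a window boundary; the clustering approach in the question's first strategy avoids that at the cost of a sort.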
Why do you want to distinguish between the HSPs?
The more general result I want to obtain is all unique sequences in NCBI-NT that are related to my start set. I was planning on BLASTing iteratively until I stop turning up new sequences, and therefore wanted to keep track of HSPs from round to round. To do this I was tracking HSPs by their accession number and their location in that sequence (start/end). If you have a better method, I am all ears.
If you just want to BLAST until you don't get new sequences, you only have to save the accession numbers; there is no need to know the start position of the HSP.
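The accession-only bookkeeping for the iterative rounds could look like this sketch, where `accessions` is assumed to be an iterable of accession strings already pulled out of one round's XML, and the loop's helper names are hypothetical:

```python
def new_accessions(accessions, seen):
    """Return the accessions from this BLAST round that were not seen
    in any earlier round, and record them in the `seen` set."""
    fresh = {acc for acc in accessions if acc not in seen}
    seen.update(fresh)
    return fresh

# iterate until a round turns up nothing new
# (run_blast and build_next_queries are hypothetical helpers):
# seen = set()
# while new_accessions(run_blast(queries), seen):
#     queries = build_next_queries(seen)
```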