Hi there! This is my first post on your forum and I hope you'll be able to help me.
This is my first intership in bioinformatics and I'm kind of learning everything by myself. I learnt how to use python 2 days ago. I'm actually working with a large dataset (200Go) of metagenomics sequences. My boss asked me to blast them against the 16SMicrobial NCBI database and then "copy/paste" in a new file the sequences with hits, assuming they are 16S related. Easier to say than to do when you blast about 413000 sequences.
So I decided to try to make a script with Python that would do it. I also installed Biopython because it seemed to be a very good tool.
My first idea is to use a script i found on the Biopython website. this script is used to retrieve sequences with no blast hits. This would at least help to separate sequences i don't from sequences i need.
from Bio import SeqIO
from Bio.Blast import NCBIXML
q_dict = SeqIO.to_dict(SeqIO.parse(open("queries.fasta"), "fasta"))
hits = []
for record in NCBIXML.parse(open("BLAST_RESULTS.xml")):
# As of BLAST 2.2.19 the xml output for multiple-query blast searches
# skips queries with no hits so we could just append the ID of every blast
# record to our 'hit list'. It's possible that the NCBI will change this
# behaviour in the future so let's make the appending conditional on there
# being some hits (ie, a non-empty alignments list) recorded in the blast record
if record.alignments:
# The blast record's 'query' contains the sequences description as a
# string. We used the ID as the key in our dictionary so we'll need to
# split the string and get the first field to remove the right entries
hits.append(record.query.split()[0])
misses = set(q_dict.keys()) - set(hits)
orphan_records = [q_dict[name] for name in misses]
This is my first and only hint, but I'm sure there is a better way to do it.
Can someone help me please?
I think we need a bit more clarification on what you are trying to accomplish. So you've parsed out the IDs you want to keep and now you want to get the fasta sequences that correspond to those ids?
If all you need are the IDs with and without a match, the tabular output is enough and easier to work with than the XML (and smaller on disk too), but you seem to have everything done and working. What are you asking for? How to save these as a FASTA file?
If you want just the records with this, something like this would work:
The only other refinement I'd suggest is using a generator expression rather than a list expression - it would avoid having everything in memory. i.e.
and